At 04:25 UTC the imgix rendering service began exhibiting elevated error rates.
On-call engineers responded by 04:29 UTC and quickly identified the issue as correlated with scheduled network maintenance that was ongoing at the time. The maintenance steps were rolled back by 04:34 UTC.
Normal service was fully restored by 04:35 UTC. Engineers continued to monitor the service until 04:58 UTC, at which point the incident was resolved.
Image requests which had already been cached were not impacted by this incident.
Between 04:25 UTC and 04:35 UTC, approximately 60% of requests to the imgix rendering service failed, and some requests that did succeed experienced elevated response times.
imgix on-call engineers rapidly identified the cause of the incident and implemented remediation steps without significant delay.
The network maintenance being performed was not expected to cause an impact of the severity observed during this incident, which slightly delayed the engineering team in taking the appropriate remediation steps. The maintenance also extended beyond its originally scheduled end time, and this extension was only partially communicated to the team.
Additionally, the imgix service degraded more severely than expected in this scenario, resulting in a greater impact than should have been possible.
This incident exposed a gap in our network monitoring and automation tools; closing that gap would likely have prevented the incident from occurring. We will also revise our procedures to better communicate planned maintenance to the team.
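For illustration only, the sketch below shows the general shape of an automated error-rate check that this kind of monitoring tooling could include. The metrics endpoint, field names, and threshold are hypothetical placeholders and do not describe imgix's actual monitoring stack.

    # Illustrative sketch: periodically compare the rendering service's
    # error rate against a threshold and raise an alert when it is exceeded.
    # The URL and response fields below are hypothetical examples.
    import json
    import time
    import urllib.request

    METRICS_URL = "https://metrics.example.com/render/errors"  # hypothetical endpoint
    ERROR_RATE_THRESHOLD = 0.05   # alert if more than 5% of requests fail
    CHECK_INTERVAL_SECONDS = 30

    def fetch_error_rate() -> float:
        """Return the fraction of failed requests reported for the last interval."""
        with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
            stats = json.load(resp)
        total = stats["total_requests"]
        failed = stats["failed_requests"]
        return failed / total if total else 0.0

    def alert(rate: float) -> None:
        """Placeholder for paging or automated rollback; prints for illustration."""
        print(f"ALERT: render error rate {rate:.1%} exceeds threshold")

    if __name__ == "__main__":
        while True:
            rate = fetch_error_rate()
            if rate > ERROR_RATE_THRESHOLD:
                alert(rate)
            time.sleep(CHECK_INTERVAL_SECONDS)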
Finally, we will review the impact to the service and look for ways to improve its architecture to better handle incidents of this type.