Elevated 500 Errors.
Incident Report for imgix
Postmortem

What happened?

At 04:25 UTC the imgix rendering service began exhibiting elevated error rates.

On-call engineers responded by 04:29 UTC and quickly identified the issue as correlated with scheduled network maintenance that was ongoing. The maintenance steps were rolled back by 04:34 UTC.

Normal service was fully restored by 04:35 UTC. Engineers continued to monitor the service until 04:58 UTC, at which point the incident was resolved.

How were customers impacted?

Image requests which had already been cached were not impacted by this incident.

Between 04:25 UTC and 04:35 UTC: Approximately 60% of requests to the imgix rendering service failed. Some successful requests had elevated response times.

What went wrong during the incident?

imgix on-call engineers were able to rapidly identify the cause of the incident, and did not require substantial time to implement remediation steps.

The network maintenance being performed was not expected to have an impact as severe as the one we observed, which slightly delayed the engineering team in taking the appropriate remediation steps. The maintenance also extended beyond its originally scheduled end time, and this extension was only partially communicated to the team.

Additionally, the imgix service degraded more aggressively than expected in this scenario. This resulted in a greater impact than should have been necessary.

What will imgix do to prevent this in the future?

This incident exposed a gap in our network monitoring and automation tools. Had that gap not existed, the incident would likely have been prevented. We will also revise our procedures to better communicate planned maintenance to the team.

We will also review the service impact and look for ways to improve service architecture to better handle these types of incidents.

Posted Sep 20, 2018 - 17:40 PDT

Resolved
The incident has been resolved.
Posted Sep 19, 2018 - 21:58 PDT
Monitoring
Error rates are recovering and we are continuing to monitor the situation.
Posted Sep 19, 2018 - 21:43 PDT
Investigating
We are currently investigating elevated render error rates. We will provide an update once we obtain more information.
Posted Sep 19, 2018 - 21:38 PDT
This incident affected: Rendering Infrastructure.