Elevated 500 Errors.

Incident Report for imgix

Postmortem

What happened?

At 04:25 UTC the imgix rendering service began exhibiting elevated error rates.

On-call engineers responded by 04:29 UTC and quickly identified the issue as correlated with scheduled network maintenance that was ongoing. The maintenance steps were rolled back by 04:34 UTC.

Normal service was fully restored by 04:35 UTC. Engineers continued to monitor the service until 04:58 UTC, at which point the incident was resolved.

How were customers impacted?

Requests for images that had already been cached were not impacted by this incident.

Between 04:25 UTC and 04:35 UTC, approximately 60% of requests to the imgix rendering service failed. Some successful requests had elevated response times.

What went wrong during the incident?

imgix on-call engineers rapidly identified the cause of the incident and did not require substantial time to implement remediation steps.

The network maintenance being performed was not expected to have an impact as severe as the one we observed, which slightly delayed the engineering team in taking the appropriate remediation steps. The maintenance also extended beyond its originally scheduled end time, and that extension was only partially communicated to the team.

Additionally, the imgix service degraded more aggressively than expected in this scenario. This made the impact greater than it should have been.

What will imgix do to prevent this in the future?

This incident exposed a gap in our network monitoring and automation tools; without that gap, the incident would likely have been prevented. We will revise our procedures to better communicate planned maintenance to the team.

We will also review the service impact and look for ways to improve service architecture to better handle these types of incidents.

Posted Sep 20, 2018 - 17:40 PDT

Resolved

The incident has been resolved.

Posted Sep 19, 2018 - 21:58 PDT

Monitoring

Error rates are recovering and we are continuing to monitor the situation.

Posted Sep 19, 2018 - 21:43 PDT

Investigating

We are currently investigating elevated render error rates. We will post an update once we obtain more information.

Posted Sep 19, 2018 - 21:38 PDT

This incident affected: Rendering Infrastructure.