Elevated 500 Errors.

Incident Report for imgix

Postmortem

What happened?

At 16:09 UTC the imgix rendering service began experiencing elevated error rates when rendering some customer images. Normal service was fully restored at 18:12 UTC. imgix engineers continued to monitor the service behavior until 18:35 UTC before closing the incident.

How were customers impacted?

Image requests which had already been cached were not impacted by this incident.

Between 16:09 UTC and 17:55 UTC: Approximately 30% of rendering requests failed, partially impacting many imgix customers.

Between 17:55 UTC and 18:12 UTC: 100% of rendering requests failed, partially impacting all imgix customers. During this period, previously rendered images continued to be served.

What went wrong during the incident?

imgix engineers were able to quickly identify the cause of service degradation, but encountered difficulties in implementing the necessary remediation due to internal tooling and monitoring issues.

These difficulties both slowed the time to resolution and contributed to the further elevation of render error rates between 17:55 UTC and 18:12 UTC.

What will imgix do to prevent this in the future?

As a result of this incident, imgix engineering has identified revised procedures that are expected, in the immediate term, to reduce the severity and duration of any similar future incidents.

The team is also continuing to work on future iterations of the imgix service architecture and internal tooling. These changes are focused solely on reducing the likelihood and severity of similar incidents.

Posted Sep 20, 2018 - 17:02 PDT

Resolved

This incident has been resolved.

Posted Sep 20, 2018 - 11:35 PDT

Monitoring

Error rates are returning to normal and we will continue to monitor the results.

Posted Sep 20, 2018 - 11:20 PDT

Identified

We've identified the issue and we're currently working on returning success rates to normal.

Posted Sep 20, 2018 - 10:15 PDT

Investigating

We are currently investigating elevated render error rates. We will update once we obtain more information.

Posted Sep 20, 2018 - 09:47 PDT

This incident affected: Rendering Infrastructure.