Elevated 504 Errors

Incident Report for imgix

Postmortem

What happened?

On March 18, 2019 at 01:23 UTC, the imgix service began to experience elevated error rates for approximately 10% of requests. Impacted customers continued to experience elevated error rates for approximately 50 minutes, until 02:11 UTC. During this period, the rate of errors declined to less than 10% of requests, but did not return to normal. The incident was resolved at 03:04 UTC, after further remediation and a period of observation by the engineering team.

How were customers impacted?

Image URLs which had previously been rendered and were cached by the imgix CDN were not impacted by this incident.

During the period of customer impact (01:23 to 02:11 UTC), requests for imgix-hosted image URLs may have been returned with a 5xx HTTP status code. These responses are not placed into long-term cache by the imgix CDN, and subsequent requests for these URLs would have succeeded shortly after 02:11 UTC.
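Because 5xx responses are not cached long term, a client that retries a failed image request after a short delay can pick up a successful render once service recovers. The following is a minimal sketch of that idea, not imgix tooling: the function `fetch_with_retry`, its parameters, and the default delays are hypothetical, and the caller is assumed to supply its own `fetch` function that returns a status code and a response body.

```python
import time
from typing import Callable, Tuple


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, bytes]],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch(url), retrying on 5xx responses with exponential backoff.

    All names and defaults here are illustrative assumptions.
    Returns the first non-5xx response, or the last 5xx response
    if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        status, body = fetch(url)
        # Return on any non-5xx status, or when no attempts remain.
        if status < 500 or attempt == max_attempts - 1:
            return status, body
        # Wait base_delay, then 2x, 4x, ... before the next attempt.
        sleep(base_delay * (2 ** attempt))

    raise AssertionError("unreachable")
```

The `fetch` and `sleep` arguments are injectable so that the retry logic can be exercised without network access or real delays. A production client would likely also cap the total wait time and add jitter to the delays.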

What went wrong during the incident?

Our service monitoring identified elevated errors on image rendering and paged the on-call engineer. This engineer responded and began troubleshooting the issue immediately. Unfortunately, it took us longer than expected to identify and mitigate the underlying cause of the failures.

Despite attempts to mitigate the issue, behavior did not return to normal. Additional members of the engineering team were brought in to resolve the incident and together the team was able to identify and take the failed service component out of production. This returned service behavior to normal, with a slight degradation to capacity and redundancy.

What will imgix do to prevent this in the future?

We have identified deficiencies with our internal processes and our adherence to documented procedure that contributed to the prolonged nature of this incident. An action plan has been developed to implement improvements in the related engineering teams.

Posted Mar 20, 2019 - 11:39 PDT

Resolved

504 error rates continue to be normal and images are rendering as expected.

Posted Mar 17, 2019 - 20:04 PDT

Monitoring

A fix has been implemented and 504 error rates are falling. We will continue to monitor the situation.

Posted Mar 17, 2019 - 19:15 PDT

Investigating

We are currently investigating elevated render error rates on new renders. We will update once we obtain more information.

Posted Mar 17, 2019 - 18:59 PDT

This incident affected: Rendering Infrastructure.