On March 18, 2019 at 01:23 UTC, the imgix service began to experience elevated error rates for approximately 10% of requests. Impacted customers continued to experience elevated error rates for approximately 50 minutes, until 02:11 UTC. After that point, the rate of errors declined to less than 10% of requests, but did not return to normal. The incident was resolved at 03:04 UTC, after further remediation and a period of observation by the engineering team.
Image URLs which had previously been rendered and were cached by the imgix CDN were not impacted by this incident.
During the period of customer impact (01:23 to 02:11 UTC), requests for imgix-hosted image URLs may have returned a 5xx HTTP status code. These responses are not placed into long-term cache by the imgix CDN, and subsequent requests for these URLs would have succeeded shortly after 02:11 UTC.
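Because errored responses are not cached long-term, clients that retried failed image requests would have recovered automatically once the incident was mitigated. As an illustrative sketch only (not imgix-provided code, and the helper name is hypothetical), a client-side retry with exponential backoff might look like:

```python
import time

def fetch_with_retry(fetch, max_attempts=4, base_delay=0.5):
    """Retry a fetch on 5xx status codes with exponential backoff.

    `fetch` is any zero-argument callable returning an HTTP status code.
    Hypothetical helper for illustration; not part of the imgix API.
    """
    for attempt in range(max_attempts):
        status = fetch()
        if status < 500:
            return status  # success or client error: do not retry
        if attempt < max_attempts - 1:
            # Wait 0.5s, 1s, 2s, ... before the next attempt
            time.sleep(base_delay * (2 ** attempt))
    return status  # all attempts exhausted; surface the last 5xx
```

Since 5xx responses were never placed into long-term cache, a retry issued after remediation would have fetched a freshly rendered image rather than a cached error.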
Our service monitoring identified elevated errors on image rendering and paged the on-call engineer. This engineer responded and began troubleshooting the issue immediately. Unfortunately, it took us longer than expected to identify and mitigate the underlying cause of the failures.
Despite attempts to mitigate the issue, behavior did not return to normal. Additional members of the engineering team were brought in to resolve the incident, and together the team identified the failed service component and removed it from production. This returned service behavior to normal, with a slight degradation to capacity and redundancy.
We have identified deficiencies in our internal processes, and in our adherence to documented procedure, that contributed to the prolonged nature of this incident. An action plan has been developed to implement improvements in the related engineering teams.