On June 4th, 2020, at 09:50 UTC, one of our network service providers suffered a loss of power at their facility. This affected all inbound traffic from our edge and shield caches to our rendering service, which prevented new derivative images from being created. Due to the nature of the network outage, we did not automatically fail over to an alternate service provider. By 10:25 UTC, network connectivity had been restored and derivative images were again being processed successfully.
Unlike other service-impacting events we have seen in recent months, this one manifested as a complete inability for our shield cache to reach our rendering service. While the net effect was the same (a 503 for new derivative images), the specific error messages customers saw may have varied, and custom error fallback images were not displayed.
The most impactful failure was the lack of automatic failover to an alternate network service provider. This, combined with improperly configured external synthetic checks, delayed our overall response. We also deviated from our established incident remediation policy more than we would like, which delayed customer outreach and escalations.
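To illustrate the role the external synthetic checks should have played, here is a minimal sketch of such a probe. The URL, cache-busting parameter, and status handling are hypothetical and are not our actual tooling; the key idea is that the check must exercise the full edge-to-rendering path and must treat both a 503 and a network-level failure as unhealthy.

```python
import urllib.request
import urllib.error

# Hypothetical derivative-image URL. A useful synthetic check adds a
# cache-busting parameter so the request reaches the rendering service
# instead of being answered from an edge or shield cache.
CHECK_URL = "https://example.com/photo.jpg?width=120&fresh=1"

def is_healthy(status: int) -> bool:
    """A 200 means the full render path worked; a 503 (or no response,
    represented here as 0) means new derivatives cannot be created."""
    return status == 200

def probe(url: str, timeout: float = 10.0) -> int:
    """Issue one synthetic request and return the HTTP status code,
    or 0 on a network-level failure (no HTTP response at all)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0
```

An incident like this one would surface either as `probe` returning 503 or, for a hard connectivity loss, returning 0; a correctly configured check alerts on both, not just on explicit error statuses.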
Previously identified monitoring changes will be implemented, along with additional near-term work to increase the availability of our network infrastructure. Longer-term global resiliency work that could have helped mitigate this incident is already planned and will continue.