Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On September 30, 2021, between 22:08 UTC and 22:22 UTC, imgix experienced a major rendering outage affecting non-cached derivative images.

How were customers impacted?

Starting at 22:08 UTC, our rendering error rates began to increase, with requests to our rendering service receiving 502 error responses for some non-cached assets. At the brief peak of the incident, 9% of requests returned an error. The peak lasted only about a minute, and error rates dropped sharply back to normal at 22:22 UTC.

What went wrong during the incident?

At 22:03 UTC, alerts indicated a connectivity issue with a service provider. Error rates were still normal, though our backup servers were showing rapidly increasing network load.

At the same time, external issues affected our database services tooling, so our team had to use other methods to investigate the sudden server downtime. While the investigation was underway, our backup infrastructure began showing signs of stress under the increasing load, which manifested as rising error rates from our service starting at 22:08 UTC.

Our engineers quickly discovered that a datacenter technician had inadvertently powered off equipment during preliminary work for a capacity expansion. While a failover device existed, traffic exceeded the capacity available on that device.

Once the change was discovered, it was quickly reversed, and the service recovered immediately.

What will imgix do to prevent this in the future?

We will work with our service provider to eliminate scenarios in which unexpected modifications can be made to our hardware configurations. We will also obtain additional safeguards from the provider so that we can speed up remediation of issues related to offsite infrastructure hardware.

We will also be expanding our backup capacity in the near future.

Posted Oct 14, 2021 - 12:17 PDT

Resolved
Service has been completely restored.
Posted Sep 30, 2021 - 15:45 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 30, 2021 - 15:27 PDT
Investigating
We are currently investigating elevated render error rates for non-cached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Sep 30, 2021 - 15:19 PDT
This incident affected: Rendering Infrastructure.