On September 9, 2021, between 22:08 UTC and 22:22 UTC, imgix experienced a major rendering outage affecting non-cached derivative images.
Starting at 22:08 UTC, our rendering error rates began to increase, with requests to our rendering service receiving 502 error responses for some non-cached assets. At the brief peak of the incident, 9% of requests returned an error, though this lasted only a minute before error rates dropped sharply back to normal at 22:22 UTC.
At 22:03 UTC, alerts indicated a connectivity issue with a service provider. Error rates were still normal at that point, though our backup servers were showing rapidly increasing network load.
At the same time, external issues prevented us from using our database services tooling, so our team was forced to use other methods to investigate the sudden server downtime. While investigations were underway, our backup infrastructure began showing signs of stress under the increasing load, which manifested as rising error rates from our service starting at 22:08 UTC.
Our engineers quickly discovered that a datacenter technician had inadvertently powered off equipment during preliminary work for a capacity expansion. While a failover device existed, traffic exceeded its available capacity.
Once the change was discovered, it was quickly reversed, allowing the service to recover immediately.
We will be working with our service provider to eliminate scenarios in which unexpected modifications can be made to our hardware configurations. We will also be obtaining additional safeguards from the provider so that we can speed up remediation of issues involving offsite infrastructure hardware.
We will also be expanding our backup capacity in the near future.