On December 17, 2021, at 05:06 UTC, some uncached requests to the imgix service began to return a 503 error.
By 05:36 UTC the issue had been completely resolved.
Between 05:01 UTC and 05:36 UTC, some requests for non-cached derivative images returned a
503 error, with the error rate peaking at 10% during parts of the incident.
At 05:07 UTC, error rates began to decrease slowly, though a 5% error rate persisted until a fix was pushed at 05:36 UTC, which completely restored the service.
Large unexpected traffic patterns triggered a problematic interaction with a newly built internal automation, causing the initial incident.
Our team pushed mitigations early in the incident, but these had further unexpected interactions with the newly built automation. The service did begin to recover, but more slowly than expected because of these interactions.
Once the interaction was identified, we made another manual change, which completely restored the service.
We will be adding tooling that enables us to identify proximate causes more quickly during incidents. We will also internally document the interactions and behaviors of our existing automation and mitigation runbooks to ensure smoother recoveries in the future. We also identified improvement opportunities in some of our existing automation, which has since been fine-tuned.