On March 2 at 19:50 UTC, the imgix rendering service experienced network instability, which triggered an outage affecting some uncached image renders. Mitigations were implemented that enabled the service to begin recovery by 20:10 UTC.
During this incident, requests for some uncached derivative images received error responses. Approximately 3.5% of requests returned an error during the peak of the incident, between 19:50 UTC and 20:10 UTC, and service was completely restored for the majority of customers by 20:11 UTC. The incident was marked as fully resolved at 22:47 UTC.
Our engineers were alerted to an increasing number of errors generated by our rendering stack. The cause was a brief spate of network instability, which eventually culminated in cascading failures across our origin cache. Our engineers then identified the cause of the failures and applied mitigations using new tooling, which minimized the duration and effect of system failures on imgix traffic.
While our recovery was swift thanks to newly implemented tooling, we will be making a few improvements to our incident runbooks and processes to shorten our response times to incident alerts. We will also improve monitoring of network connectivity and implement tooling that lets us rapidly shift traffic to alternate paths in the event of network instability.