On June 8, 2021, between 09:58 UTC and 10:36 UTC, our CDN provider experienced a global disruption that severely impacted the imgix service.
At 10:36 UTC, our CDN provider applied a fix that allowed us to begin restoring service, and we began processing a large backlog of render requests that had accumulated from the start of the outage. Our CDN provider marked their incident as resolved at 12:41 UTC.
Our engineers later identified an issue affecting less than 1% of image requests to the service and applied a fix at 20:22 UTC.
The incident was marked fully resolved at 21:26 UTC, although most images were being served successfully by 12:41 UTC.
Between 09:58 UTC and 10:50 UTC, a significant percentage of requests to the imgix service returned an error. The CDN provider outage also prevented logs from being sent from the CDN to imgix, so imgix customers do not have analytics for this time period.
After the CDN outage was marked as resolved by our provider (12:41 UTC), approximately 94% of all requests to imgix resulted in a successful response. This proportion gradually increased over the course of several hours.
By 17:14 UTC, 99% of all requests resulted in a successful response, and the service was mostly restored. Due to an issue affecting a small number of images, the service was not marked as fully resolved until 21:19 UTC.
This outage was unprecedented in both its scope and its impact on our service.
By 10:36 UTC, a fix had been applied that restored the CDN service. However, the sheer volume of incoming render traffic, combined with the increased origin load from a much lower-than-normal cache-hit ratio, prevented our service from recovering immediately. By 11:00 UTC, traffic was being served with an 89% success rate.
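To illustrate why a lower cache-hit ratio slows recovery, the arithmetic below (with hypothetical ratios, not imgix's actual figures) shows how a drop in the hit ratio multiplies the traffic that reaches the rendering origin:

```python
def origin_load_multiplier(normal_hit_ratio: float, degraded_hit_ratio: float) -> float:
    """Factor by which origin (cache-miss) traffic grows when the
    cache-hit ratio falls, assuming constant incoming request volume.

    Only misses reach the origin, so origin traffic is proportional
    to (1 - hit_ratio).
    """
    return (1 - degraded_hit_ratio) / (1 - normal_hit_ratio)

# Hypothetical example: a cache that normally absorbs 95% of requests
# but only 50% after an outage sends 10x the usual load to the origin.
print(round(origin_load_multiplier(0.95, 0.50), 2))  # → 10.0
```

This is why even a modest dip in hit ratio can overwhelm an origin sized for steady-state miss traffic: origin load scales with the miss rate, not the hit rate.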
Several recovery strategies were implemented to handle the fallout from the initial outage.
As mitigations were implemented, the service reached a delivery success rate above 99% by 17:14 UTC; in hindsight, our status page should have been updated at that point to reflect the recovery. Due to a smaller issue affecting a small portion of renders, we did not consider the incident fully resolved until 20:22 UTC, when all error rates had returned to normal levels.
We will be evaluating possible fallbacks in the case of a total CDN failure, along with investigating our caching capacity, to prevent events of a similar scale from having such an impact on our rendering services in the future. Projects are already underway to dramatically increase render capacity under both sustained load and unexpected spikes.
At the same time, we will be working closely with our CDN partner to discuss their own remediation steps and how we can best interact with them moving forward.
We will also be revisiting our processes to ensure that status updates are more frequent, as the majority of the service was restored far sooner than we indicated on our status page.