On August 10, 2021, at 19:05 UTC, our CDN provider experienced a brief outage that resulted in elevated rendering error rates from the imgix service.
Error rates returned to almost normal levels by 19:30 UTC, with a small percentage of errors continuing to occur in imgix. By 19:58 UTC, error rates had fully returned to normal levels, though errors that did not affect users continued to appear in our stack. Our team continued to apply mitigations and fixes, and the incident was marked as fully resolved on August 11 at 02:15 UTC.
Between 19:05 UTC and 19:30 UTC, users experienced elevated render error rates on requests to the imgix service for non-cached images. At the height of the incident (19:12 UTC), 11% of requests to imgix received a 503 response. After 19:12 UTC, the error rate dropped sharply to 5% and continued to fall, returning to almost normal levels by 19:30 UTC. By this time, only a small percentage of requests (<1%) continued to result in errors. Ongoing work fully restored the rendering service by 19:58 UTC.
From this time until the incident was resolved on August 11 at 02:15 UTC, backend errors continued to occur, though these errors did not affect image deliverability.
At 19:05 UTC, our CDN provider posted a status update about a performance impact to their CDN services. This impact subsequently raised error rates for imgix services. Our monitoring tools alerted our engineering team to the rising error rates, which allowed us to apply mitigations quickly and limit the growth of errors.
Our own status page was updated at 19:16 UTC. Thanks to the mitigations applied by both our CDN provider and our engineering team, the service began to recover at 19:30 UTC, with only a small percentage of errors persisting. Our team continued to apply changes to sustain the mitigations, and error rates were restored to normal levels by 19:58 UTC.
Though rendering had been restored, errors that were not visible to end users continued to surface within our infrastructure. Our team continued to investigate and apply fixes, but the erratic behavior caused by the initial outage persisted for much longer than anticipated. The incident was marked as resolved on August 11 at 02:15 UTC.
This incident exposed an issue in which brief CDN service outages can cause lengthy incidents for our rendering service. We will tune our infrastructure and investigate further opportunities to mitigate the aftereffects of CDN outages on imgix services.