Today beginning at 19:13 UTC, imgix experienced a sudden and unexpected purge of its entire edge cache. The result was broken images for many customers. We are deeply sorry for the difficulties this may have caused you and your business. Our team takes service reliability very seriously, and we are implementing new safeguards to ensure this cannot happen again.
As of 22:02 UTC, we are considering the issue resolved, but we are continuing to monitor carefully.
An engineer working on a deployment inadvertently incremented a critical cache token, creating a new global cache space for all image content on imgix. As a result, every previously cached image was forced to re-render in our systems. Within 30 seconds of the change, every image request was traversing the full path through the imgix infrastructure without any caching to reduce the load. This cache stampede quickly exposed bottlenecks in areas of our infrastructure that had never been exercised at that level, causing many requests to queue and subsequently error, which produced 502s and 503s (broken images) for end users. Our ops and engineering teams were notified immediately.
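To illustrate the failure mode, here is a toy model of a version-token cache namespace. All names here are hypothetical and purely illustrative; this is not imgix's actual code, just a sketch of the general technique of prefixing cache keys with a global version.

```python
# Hypothetical sketch: a global version token namespaces every cache key.
# Bumping the token instantly orphans every existing entry.

CACHE_TOKEN = 7  # illustrative global version


def cache_key(image_path: str, token: int = CACHE_TOKEN) -> str:
    """Every cached object lives under the current token's namespace."""
    return f"v{token}/{image_path}"


# A cache populated under the current token.
cache = {cache_key("photos/cat.jpg"): b"rendered bytes"}

# Incrementing the token changes every key at once, so every lookup
# misses and every image must re-render: an instantaneous global purge.
assert cache_key("photos/cat.jpg", token=7) in cache
assert cache_key("photos/cat.jpg", token=8) not in cache
```

Because the old entries are not deleted, only orphaned, a later rollback of the token can restore them if the cache has not evicted them in the meantime.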
Unsure of the operating state of our previous global cache space, our team split into two efforts: One team got to work identifying the critical bottlenecks, balancing in more servers, and adjusting limits in the systems themselves to facilitate more connections and higher throughput. The goal was to handle the influx of traffic should recovery of the previous global cache space prove unlikely. The other team got to work analyzing the likelihood of recovering our previous global cache space and putting together the necessary configuration changes to revert the global cache key successfully.
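One common safeguard against a stampede like this, shown here only as a generic sketch and not as a description of what imgix deployed, is request coalescing (sometimes called single-flight): when many concurrent requests miss the cache for the same key, only one performs the expensive render and the rest wait for its result.

```python
import threading

# Minimal single-flight sketch (generic stampede mitigation, not imgix's
# actual implementation): one render per key, no matter how many misses.

_results: dict = {}
_locks: dict = {}
_guard = threading.Lock()


def get_or_render(key, render):
    # Atomically obtain the per-key lock.
    with _guard:
        lock = _locks.setdefault(key, threading.Lock())
    # Only one caller renders; later callers find the stored result.
    with lock:
        if key not in _results:
            _results[key] = render()
        return _results[key]
```

With this in place, N simultaneous misses for one image trigger a single render instead of N, which keeps a cold cache from multiplying load on the origin.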
Working in parallel, the two teams were able to make progress by 20:27 UTC. The first team had successfully added more servers and eliminated several chokepoints in our load balancing and distribution, which dropped error rates by over 50% and improved stability. However, both teams determined that rebuilding the entire cache from scratch would still take hours. The second team had built and tested a configuration change to roll back to the previous global cache space. Given the estimated time it would otherwise take to rebuild the cache, a decision was made to attempt the rollback.
At 21:10 UTC, we deployed the configuration change to recover the original cache space. It was unclear how much of the previous cache had been retained versus evicted by the time we were able to deploy the rollback, but the cache proved to be almost entirely intact. By 21:12 UTC, error rates had returned to normal operating levels near zero.
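The rollback can be pictured with a small versioned-key model (hypothetical names; not imgix's actual code): reverting the global token points every lookup back at the old namespace, so any entry that survived eviction becomes an instant hit again.

```python
# Hypothetical sketch of the rollback: reverting the version token
# re-exposes the old cache namespace.

def cache_key(path: str, token: int) -> str:
    return f"v{token}/{path}"


# An entry cached under the original token.
cache = {cache_key("photos/cat.jpg", 7): b"rendered"}

token = 8  # the accidental increment: every lookup misses
assert cache_key("photos/cat.jpg", token) not in cache

token = 7  # the rollback: surviving entries hit again immediately
assert cache_key("photos/cat.jpg", token) in cache
```

This is why error rates could fall back to normal within minutes of the deploy: no re-rendering was needed for entries still present in the old namespace.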
The team continued to monitor the situation, and declared the issue resolved as of 22:02 UTC.
To prevent this problem and problems like it from happening in the future, our team is implementing the following changes:
Images are a critical piece of operating a site or app, and imgix knows this well. We do not take today's performance degradation lightly, and we sincerely apologize to our customers for any problems it may have caused.
If you would like to discuss this outage or the post-mortem further, we will be more than happy to answer any questions via support@imgix.com.
Special thanks to Austin Spires and the team at Fastly for helping us debug and monitor this incident.