Degraded Rendering Performance
Incident Report for imgix
Postmortem

Today, beginning at 19:13 UTC, imgix experienced a sudden and unexpected purge of its entire edge cache. The result was broken images for many customers. We are incredibly sorry for the difficulties this may have caused you and your business. Our team takes service reliability very seriously, and we are implementing new safeguards to ensure this can never happen again.

As of 22:02 UTC, we consider the issue resolved, but we are continuing to monitor carefully.

What Happened

An engineer working on a deployment inadvertently incremented a critical cache token, creating a new global cache space for all image content on imgix. This forced every previously cached image to re-render in our systems. Within 30 seconds of the change, every image request was traversing the full path through the imgix infrastructure with no caching to absorb the load. The resulting cache stampede quickly exposed bottlenecks in areas of our infrastructure that had never been exercised at that level. Many requests queued and subsequently errored, returning 502s and 503s (broken images) to end users. Our ops and engineering teams were notified immediately.
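For readers less familiar with this pattern, the short Python sketch below shows how a global cache token is typically folded into every cache key. It is purely illustrative (the names and hashing scheme are assumptions for the example, not our production code), but it captures why incrementing the token is functionally a full purge: every existing entry becomes unreachable at once.

```python
# Illustrative sketch only -- not imgix's actual implementation.
import hashlib

GLOBAL_CACHE_TOKEN = 7  # bumping this value effectively starts an empty cache


def cache_key(image_url: str, params: str, token: int = GLOBAL_CACHE_TOKEN) -> str:
    """Derive the edge-cache key for a rendered image.

    The token is part of the key, so every previously cached entry becomes
    unreachable the moment the token changes -- equivalent to a full purge.
    """
    raw = f"v{token}:{image_url}?{params}"
    return hashlib.sha256(raw.encode()).hexdigest()


# With token 7, this key points at the existing cached render:
old_key = cache_key("https://example.imgix.net/photo.jpg", "w=800", token=7)

# After an accidental increment to 8, the same request maps to a new key,
# misses the cache, and falls through to the rendering infrastructure:
new_key = cache_key("https://example.imgix.net/photo.jpg", "w=800", token=8)

assert old_key != new_key  # every request now misses at the edge
```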

Unsure of the operating state of our previous global cache space, our team split into two efforts. One team got to work identifying the critical bottlenecks, bringing additional servers into the load-balancing pool, and raising limits in the systems themselves to allow more connections and higher throughput; the goal was to handle the influx of traffic should recovery of the previous global cache space prove unlikely. The other team analyzed the likelihood of recovering our previous global cache space and put together the configuration changes needed to revert the global cache key successfully.
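To give a concrete sense of the "limits" involved: per-process caps such as the open file descriptor limit directly bound how many simultaneous connections a server will accept. The Python sketch below is a simplified illustration of auditing and raising that limit; the threshold and function name are assumptions for the example, not our actual settings.

```python
# Simplified illustration of auditing a per-process connection-related limit.
# The 65536 threshold is an example value, not a recommendation.
import resource


def audit_fd_limit(required: int = 65536) -> None:
    """Check the open-file-descriptor limit, raising the soft limit if possible."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"RLIMIT_NOFILE: soft={soft}, hard={hard}")

    if soft >= required:
        return

    # The soft limit can be raised up to the hard limit without privileges;
    # raising the hard limit itself requires a system-level change.
    new_soft = min(required, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    print(f"raised soft limit to {new_soft}")

    if new_soft < required:
        print("hard limit too low -- needs an operating-system-level change")


if __name__ == "__main__":
    audit_fd_limit()
```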

Working in parallel, the two teams had made meaningful progress by 20:27 UTC. The first team had added more servers and eliminated several chokepoints in our load balancing and distribution, which dropped error rates by over 50% and improved stability. The second team had built and tested a configuration change to roll back to the previous global cache space. Since both teams estimated that rebuilding the entire cache from scratch would still take hours, we decided to attempt the rollback.

At 21:10 UTC, we deployed the configuration change to recover the original cache space. It was unclear how much of the previous cache had been retained versus evicted by the time we were able to deploy the rollback, but it proved to be nearly entirely intact. By 21:12 UTC, error rates had returned to normal operating levels near zero.

The team continued to monitor the situation, and declared the issue resolved as of 22:02 UTC.

Preventing Similar Incidents

To prevent this and similar problems in the future, our team is implementing the following changes:

  • Prevent cache tokens from being dynamically incremented. These tokens are critical to maintaining our global cache space and should not be programmatically accessible. As of the publication of this post, our systems have been patched so that this is no longer possible. (A sketch of this kind of guard appears after this list.)
  • Perform a system-wide audit of operating system limits to ensure that we are optimally positioned for increased traffic. We should not put ourselves in a position where system resources are constrained by an operating system default or an improperly set limit.
  • Finalize the rollout of our new caching and work distribution infrastructure. Currently, this technology operates only on secondary image content (e.g., watermarks and blends); the production rollout is on hold during the holidays. Many of the challenges we faced during this incident would have been naturally mitigated by the approaches we are employing in this forthcoming infrastructure push.
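To make the first item above more concrete, the sketch below illustrates one way a cache token can be made immutable at runtime, so that only an explicit, reviewed configuration deploy can change it. It is an illustrative example in Python, not a description of our actual patch.

```python
# Illustrative sketch of a runtime guard on the global cache token.
class CacheNamespace:
    """Holds the global cache token as a read-only, deploy-time value."""

    def __init__(self, token: int):
        object.__setattr__(self, "_token", token)

    @property
    def token(self) -> int:
        return self._token

    def __setattr__(self, name, value):
        # Any attempt to mutate the token programmatically is rejected;
        # changing it requires an explicit, reviewed configuration deploy.
        raise AttributeError(
            "cache token is immutable at runtime; change it via a deploy"
        )


namespace = CacheNamespace(token=7)
print(namespace.token)  # 7

try:
    namespace.token = 8  # the kind of accidental increment behind this incident
except AttributeError as err:
    print(err)
```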

Images are a critical piece of operating a site or app, and imgix knows this very well. We do not take today's performance degradation lightly, and we sincerely apologize to our customers for any problems it may have caused.

If you would like to discuss this outage or the post-mortem further, we will be more than happy to answer any questions via support@imgix.com.

Special thanks to Austin Spires and the team at Fastly for helping us debug and monitor this incident.

Posted Dec 28, 2015 - 15:40 PST

Resolved
The cache incident has been resolved, and we are writing a post-mortem that will be available at http://status.imgix.com/ shortly.
Posted Dec 28, 2015 - 14:02 PST
Monitoring
The affected caches have been restored to a previously known good state, and service should be back to normal. We are continuing to monitor the situation.
Posted Dec 28, 2015 - 13:12 PST
Update
The affected caches are continuing to repopulate. Rendering performance is beginning to return to normal for images in these caches. Our engineers are still working to make sure the repopulation continues as quickly as possible.
Posted Dec 28, 2015 - 12:27 PST
Identified
A temporary cache fluctuation has degraded rendering performance. Our team is working on it, and the issue should be resolved shortly.
Posted Dec 28, 2015 - 11:20 PST