On April 10, 2020 at 14:24 UTC, the imgix service began seeing elevated latency when retrieving images via our origin cache. This caused an increase in error rates for uncached derivative images.
The imgix engineering team tried several remediation measures and ultimately restored service to normal levels by 19:47 UTC.
During the incident, customers would have noticed some uncached derivative images either taking longer than normal to return successfully or returning an error.
Cached derivative images were not impacted and continued to be served as normal.
On April 9th, the day before the incident, we completed an important milestone in a long-running infrastructure upgrade project designed to increase capacity across the imgix rendering service. Because the incident occurred so soon after the upgrade was completed, the team's first priority was to roll back to the previous version. Unfortunately, the previous version exhibited the same behavior, so the time spent reconfiguring the old version and shifting traffic to it was wasted.
The root cause of the incident turned out to be that some origin cache hosts got into a bad state and slowed down the processing pipeline, resulting in a long queue of items to be processed. The long processing times caused upstream hosts to treat these slow-to-respond hosts as non-responsive, which shifted traffic to the healthier hosts. Unfortunately, the healthier hosts were unable to clear the processing queue quickly enough to avoid being marked non-responsive themselves.
This caused a feedback loop: healthy hosts were marked as non-responsive and removed from the pool, the remaining hosts were then swamped, marked non-responsive, and removed in turn, and so on.
The ultimate fix was to increase the time a host had to respond before it would be considered non-responsive. Once this change was tested and deployed, we saw a sharp decrease in errors and, soon after, full recovery of the imgix service.
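
To illustrate the feedback loop and why a longer response timeout can break it, here is a minimal toy model in Python. It is a sketch under stated assumptions, not a description of imgix's actual health-check logic: the function name `surviving_hosts`, the host counts, the latencies, and the timeout values are all hypothetical, and the model assumes that a host's response time grows in proportion to the share of traffic it receives.

```python
def surviving_hosts(base_latencies, timeout_s):
    """Return how many hosts remain in the pool once health checks settle.

    base_latencies: hypothetical per-host response time in seconds when
    traffic is spread evenly across the full pool.
    timeout_s: how long a host may take before it is marked non-responsive.
    """
    total = len(base_latencies)
    pool = list(range(total))
    while pool:
        # Assumption: as hosts are removed, each remaining host receives a
        # proportionally larger share of traffic and slows down accordingly.
        load_factor = total / len(pool)
        still_ok = [
            host for host in pool
            if base_latencies[host] * load_factor <= timeout_s
        ]
        if len(still_ok) == len(pool):
            break  # no host was removed this round, so the pool is stable
        pool = still_ok
    return len(pool)


# Eight healthy hosts (1.0 s) and two hosts in a bad state (5.0 s).
latencies = [1.0] * 8 + [5.0] * 2

print(surviving_hosts(latencies, timeout_s=1.2))  # 0: the whole pool collapses
print(surviving_hosts(latencies, timeout_s=2.0))  # 8: only the bad hosts are removed
```

With the tight timeout, removing the two bad hosts pushes the remaining eight just over the threshold, so they are removed as well. With the looser timeout, the healthy hosts absorb the extra traffic and stay in the pool.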
What will imgix do to prevent this in the future?
Since previous incidents, we have increased capacity and improved tooling, which has made the service more reliable and able to support more throughput overall. However, this most recent incident uncovered new gaps in our monitoring and internal tooling.
We will be putting into place new monitoring that clearly shows which hosts are healthy versus unresponsive and how often they change state, so that we can detect and remediate issues of this nature earlier.
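
As a rough illustration of what such monitoring could track, the sketch below records each host's health state and counts how often it changed within a recent time window. The class name `FlapTracker`, the host names, and the window size are hypothetical; this is one possible shape for the idea, not the tooling imgix built.

```python
from collections import defaultdict, deque


class FlapTracker:
    """Track per-host health state and recent state changes ("flaps")."""

    def __init__(self, window_s=300.0):
        self.window_s = window_s           # how far back to count changes
        self.state = {}                    # host -> last known health (bool)
        self.changes = defaultdict(deque)  # host -> timestamps of changes

    def record(self, host, healthy, now):
        """Record a health-check result observed at time `now` (seconds)."""
        previous = self.state.get(host)
        self.state[host] = healthy
        if previous is not None and previous != healthy:
            self.changes[host].append(now)

    def flap_count(self, host, now):
        """Number of state changes for `host` within the last window."""
        timestamps = self.changes[host]
        while timestamps and now - timestamps[0] > self.window_s:
            timestamps.popleft()
        return len(timestamps)

    def unresponsive(self):
        """Sorted list of hosts whose latest check was unhealthy."""
        return sorted(h for h, ok in self.state.items() if not ok)


tracker = FlapTracker(window_s=60.0)
for t, healthy in [(0, True), (10, False), (20, True), (30, False)]:
    tracker.record("cache-1", healthy, now=t)
tracker.record("cache-2", True, now=30)

print(tracker.unresponsive())                # ['cache-1']
print(tracker.flap_count("cache-1", now=30))  # 3
```

A host that flips state many times in a short window is a signal worth alerting on, since it suggests hosts are cycling in and out of the pool rather than failing outright.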
We will also be building additional tooling that will enable us to do more targeted content filtering and detection when origin hosts exhibit new and unexpected behavior.