On September 13, 2021, between 14:53 UTC and 15:44 UTC, the imgix service experienced increased rendering latency, primarily for non-cached derivative images. Error rates remained low, though a very small percentage of images returned a timeout error during the peak of the incident.
Between 14:53 UTC and 15:27 UTC, some customers experienced dramatically increased latency for requests to both cached and non-cached assets served by imgix. At the peak of the incident, cached requests took an average of 1 second to complete, while non-cached requests took an average of 40 seconds.
The majority of requests to the service returned a 200 response, with a small percentage of images (<2%) returning a timeout error during the peak of the incident.
By 15:03 UTC, response times began to recover gradually, though the average response time remained higher than normal, especially for non-cached derivatives.
By 15:27 UTC, the service had recovered to the point where timeouts were no longer occurring. Although our monitoring still reported higher-than-normal response times, the service was considered mostly recovered at this point.
By 15:44 UTC, the service had completely recovered.
At 14:53 UTC, we started receiving reports from customers regarding increased latency in the rendering service. At the time of these reports, and throughout the incident, our monitoring had not observed any behavior indicating service degradation. Because of this, alarms had to be raised manually, which slowed our initial response and investigation.
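Because the error rate stayed low while latency climbed, an alerting rule keyed only to errors would not fire in a situation like this. As a rough illustration (this report does not describe imgix's actual monitoring stack, and the threshold below is hypothetical), a percentile-based latency check of the following shape would catch this class of degradation:

```python
from statistics import quantiles

# Hypothetical threshold: the report does not disclose imgix's real
# alerting rules or values.
P95_RENDER_LATENCY_THRESHOLD_S = 5.0

def should_alert(latency_samples_s: list[float]) -> bool:
    """Fire when p95 render latency exceeds the threshold, independent
    of the error rate (which stayed low throughout this incident)."""
    if len(latency_samples_s) < 20:
        # Too few samples for a stable percentile estimate.
        return False
    p95 = quantiles(latency_samples_s, n=20)[-1]  # 95th percentile
    return p95 > P95_RENDER_LATENCY_THRESHOLD_S

# During the peak, non-cached renders averaged ~40s, so a window like
# this would trigger the alert even with zero errors:
print(should_alert([0.2] * 50 + [40.0] * 10))  # True
```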
Our engineers identified that, while the service's reported error rate was very low, rendering latency was rapidly increasing. After verifying the issue, our team began tuning our rendering infrastructure to improve rendering performance.
After additional investigation, our team correlated the increased latency with the enablement of a new feature that generated a higher volume of render requests than expected. The feature interacted with our caching and rendering infrastructure by causing many requests to be immediately re-rendered and re-cached. The sudden increase in caching activity and volume created a bottleneck, which eventually resolved itself once the cache had been mostly rebuilt.
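The failure mode described here, where many requests for the same uncached derivatives each trigger an expensive re-render, is the classic cache-stampede pattern. As a minimal sketch (imgix's actual rendering infrastructure is not described in this report, and the class below is purely illustrative), request coalescing is one common way to soften such a spike by letting concurrent requests for the same uncached derivative share a single render:

```python
import threading
from typing import Any, Callable

class RenderCoalescer:
    """Collapse concurrent renders of the same derivative into one.

    Purely illustrative: if N requests arrive for the same uncached
    image, one "leader" renders it and the other N-1 wait for the
    result instead of each triggering a render. Error handling and
    cache eviction are omitted for brevity.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight: dict[str, threading.Event] = {}
        self._results: dict[str, Any] = {}

    def render(self, key: str, render_fn: Callable[[], Any]) -> Any:
        with self._lock:
            if key in self._results:      # already cached
                return self._results[key]
            event = self._in_flight.get(key)
            if event is None:             # first request: lead
                event = threading.Event()
                self._in_flight[key] = event
                is_leader = True
            else:                         # render in progress: follow
                is_leader = False
        if is_leader:
            result = render_fn()          # the expensive render
            with self._lock:
                self._results[key] = result
                del self._in_flight[key]
            event.set()                   # wake the followers
            return result
        event.wait()
        return self._results[key]
```

With coalescing in place, enabling a feature that invalidates a large slice of the cache still produces a burst of renders, but at most one render per unique derivative rather than one per request.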
We will be re-visiting our procedures for rolling out new features.
This incident also exposed an error in our rendering performance monitoring, which we have since fixed.