On May 7, 23:55 UTC, the imgix service saw elevated latency when retrieving images via our origin cache. This caused an increase in error rates for uncached derivative images. The imgix engineering team implemented remediations that restored normal service by May 8, 01:06 UTC.
We saw similarly elevated rendering error rates again starting on May 8 at 15:05 UTC. Additional remediations were put into place, and service was restored by 16:13 UTC.
The issue occurred again on May 12 from 02:18 to 02:52 UTC and from 14:16 to 15:07 UTC.
During these incidents, customers may have noticed some uncached derivative images returning errors. At the peak of the impact, we saw 5%-15% of requests fail to return successfully. The May 12 incidents were shorter and less severe.
Cached derivative images were not impacted and continued to be served as normal.
The May 7 and May 8 issues were quickly identified as being caused by slow origins, and a fix to decrease the origin timeout was prepared and approved. Full rollout, however, was delayed by the resource contention those slow origins created.
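For illustration only, the sketch below shows what bounding an origin fetch with a tighter timeout looks like in Go. It is not imgix's code; the URL, timeout values, and function names are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchFromOrigin retrieves a source image with a hard deadline so a slow
// origin cannot hold a render worker indefinitely. The timeout value is
// illustrative, not imgix's actual setting.
func fetchFromOrigin(ctx context.Context, url string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Includes context.DeadlineExceeded when the origin is too slow.
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("origin returned %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Tightening the timeout (e.g. 10s -> 3s) frees render workers sooner
	// when an origin is slow, at the cost of failing that origin's requests faster.
	img, err := fetchFromOrigin(context.Background(), "https://origin.example.com/photo.jpg", 3*time.Second)
	if err != nil {
		fmt.Println("origin fetch failed:", err)
		return
	}
	fmt.Printf("fetched %d bytes from origin\n", len(img))
}
```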
While the May 12 degradations were ultimately traced back to slow origins, they manifested slightly differently. They were not only less severe, but the remediations from the earlier incidents also helped our team resolve them more quickly.
Additional tooling has already been built and deployed that gives engineers the ability to toggle more configurations quickly without needing a full deployment.
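As a rough sketch of this pattern (hypothetical Go, not the actual tooling), a configuration snapshot can be swapped atomically so that settings take effect at runtime without a redeploy. The field names and values are assumptions for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// RenderConfig holds knobs engineers may need to change during an incident.
// Fields and defaults here are illustrative.
type RenderConfig struct {
	OriginTimeout      time.Duration
	MaxRetriesPerImage int
}

// currentConfig is swapped atomically, so request handlers always read a
// consistent snapshot and updates take effect without a full deployment.
var currentConfig atomic.Pointer[RenderConfig]

func init() {
	currentConfig.Store(&RenderConfig{
		OriginTimeout:      10 * time.Second,
		MaxRetriesPerImage: 2,
	})
}

// UpdateConfig would be wired to an internal admin endpoint or a config
// watcher in practice; here it is called directly for demonstration.
func UpdateConfig(cfg RenderConfig) {
	currentConfig.Store(&cfg)
}

func main() {
	fmt.Println("before:", *currentConfig.Load())

	// During an incident, tighten the origin timeout on the fly.
	UpdateConfig(RenderConfig{OriginTimeout: 3 * time.Second, MaxRetriesPerImage: 1})

	fmt.Println("after:", *currentConfig.Load())
}
```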
Furthermore, we will be changing our stance toward slow or misbehaving origins. We have always tried to do our best to serve every image and every request, even for slow, error-prone, non-RFC-compliant, or otherwise misbehaving origins. During the past few incidents we have become increasingly aggressive in rate limiting and isolating these origins, and we will continue to automate systems that block or isolate such origins further in order to ensure reliable performance for the rest of our customers.
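As a simplified illustration of per-origin rate limiting, the sketch below keeps a token-bucket limiter per origin host using the golang.org/x/time/rate package. The hosts, rates, and structure are hypothetical and are not imgix's implementation.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// originLimiters tracks a token-bucket limiter per origin host so that one
// slow or misbehaving origin cannot consume a disproportionate share of
// rendering capacity. The 50 req/s rate and burst of 100 are illustrative.
type originLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func newOriginLimiters() *originLimiters {
	return &originLimiters{limiters: make(map[string]*rate.Limiter)}
}

// allow reports whether a request to the given origin host should proceed.
func (o *originLimiters) allow(host string) bool {
	o.mu.Lock()
	lim, ok := o.limiters[host]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(50), 100)
		o.limiters[host] = lim
	}
	o.mu.Unlock()
	return lim.Allow()
}

func main() {
	limits := newOriginLimiters()
	for i := 0; i < 3; i++ {
		if limits.allow("slow-origin.example.com") {
			fmt.Println("request", i, "forwarded to origin")
		} else {
			fmt.Println("request", i, "rejected: origin is rate limited")
		}
	}
}
```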