On May 18, 2021, at 12:54 UTC, the imgix service experienced a disruption caused by long-running processes within our origin cache. Once our engineers identified the issue, they began implementing remediation at 13:07 UTC. Error rates began subsiding by 13:27 UTC, and the service was fully restored by 14:20 UTC.
Hours later, on May 19, 2021, at 2:24 UTC, imgix experienced a second issue involving slow renders and timeouts. Ongoing work from the earlier incident interrupted the service’s typical automatic recovery and required manual intervention. Recovery began at 3:15 UTC, and the service was fully restored by 3:55 UTC.
During the first incident, customers may have noticed some uncached derivative images returning errors.
During the subsequent incident, some uncached derivatives took longer than normal to render, with some requests timing out.
In both incidents, cached derivative images were not impacted and continued to be served normally.
Our engineers were alerted to an elevated rate of error responses from our service. While investigating, they identified a bottleneck in our origin cache. They isolated the issue and implemented limits to prevent the bottleneck from stalling the service again after recovery.
After this incident subsided, we detected slowness in image rendering that eventually culminated in timeouts for some requests. This later incident required manual intervention to restart some components. A combination of rate limiting and component restarts aided service recovery.
We will continue to fine-tune our tooling to detect and isolate problems before they can trigger larger failures. We will also implement changes targeting the unexpected origin behavior observed during the incident.