On February 19 at 17:20 UTC, the imgix rendering service experienced a major outage affecting uncached image renders. Once our engineers were alerted, we immediately began implementing mitigations, and the service was fully restored at 17:50 UTC.
During this incident, requests for some uncached derivative images received a 503 response. Approximately 3% of all requests to the imgix service returned a 503 error during this period.
Our service experienced an unexpected issue retrieving assets from origins behind certain CDNs. On its own, this issue was initially not severe enough to cause a service disruption, but it exposed gaps in our monitoring tools that prevented alerts from reaching our site reliability team. By the time the alarm was raised manually, the issue had escalated to the point where it had begun to affect a broader range of traffic coming into the service.
We will be correcting our monitoring patterns to ensure that similar retrieval issues alert our engineers in the future. We will also be modifying our retrieval behavior to place limits on the conditions that could cause an outage of this kind.