On May 1st, 2023, between 08:23 UTC and 15:08 UTC, imgix experienced intermittent errors affecting a small percentage of non-cached renders.
During the affected period, a small percentage of non-cached requests to the Rendering API returned a 502 or 503 error. The error rate rose gradually, with fewer than 0.5% of requests returning an error at the height of the incident.
Our upstream provider experienced communication issues between CDN POPs, causing intermittent 502/503 responses in a small percentage of requests to our Rendering API. The increase in errors was so minor that it did not meet our monitoring thresholds for triggering alerts. One of our engineers observed a slow increase in errors and alerted other team members to a potential issue with our service.
After tracing the issue to our upstream provider, we pushed a patch to mitigate intermittent connectivity issues, resolving the incident.
We have refined our alerting to better catch slowly increasing error rates. We have also confirmed that our upstream provider has fixed the root cause of this incident. In addition, we are updating our traffic routing in case the upstream issue occurs again.
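As an illustration of why a slow rise can slip past a fixed threshold, the sketch below pairs a static threshold check with a simple trend check. It is a minimal example under stated assumptions, not a description of our actual alerting: the function names, the sample interval, and both threshold values are hypothetical, and error rates are expressed as fractions (0.005 means 0.5%).

```python
from statistics import mean


def trend_slope(samples):
    """Least-squares slope of evenly spaced samples (change per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(samples)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def should_alert(error_rates, static_threshold=0.01, slope_threshold=0.0002):
    """Alert on a static breach OR a sustained upward trend.

    Both thresholds are hypothetical values chosen for illustration.
    """
    if not error_rates:
        return False
    # Classic check: the latest error rate crosses a fixed threshold.
    if error_rates[-1] >= static_threshold:
        return True
    # Trend check: the rate is climbing steadily while still below the threshold.
    return trend_slope(error_rates) >= slope_threshold
```

With these assumed values, a series that climbs from 0.05% to 0.45% over nine intervals never reaches the 1% static threshold but still triggers the trend check, while a flat series at 0.1% does not.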