Intermittent 5xx errors
Incident Report for imgix
Postmortem

What happened

On May 1, 2023, between 08:23 UTC and 15:08 UTC, imgix experienced intermittent errors affecting a small percentage of non-cached renders.

How were customers impacted?

During the affected period, a small percentage of non-cached requests to the Rendering API returned a 502 or 503 error. The error rate increased gradually, with fewer than 0.5% of requests returning an error at the height of the incident.

What went wrong during the incident?

Our upstream provider experienced communication issues between CDN POPs, causing intermittent 502/503 responses for a small percentage of requests to our Rendering API. The increase in errors was small enough that it stayed below our monitoring thresholds for triggering alerts. One of our engineers observed the slow increase in errors and alerted other team members to a potential issue with our service.

After tracing the issue to our upstream provider, we pushed a patch to mitigate the intermittent connectivity issues, which resolved the incident.

What will imgix do to prevent this in the future?

We have refined our alerting to better catch slowly increasing error rates. We have also confirmed that our upstream provider has fixed the root cause of this incident. In addition, we are updating our traffic routing in case the upstream issue occurs again.
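
To illustrate why a slow rise in errors can slip past a fixed threshold, here is a minimal sketch of a trend-based check. It is not imgix's actual alerting logic: the function names, the window size, the ratio, the floor, and the sample error rates are all hypothetical values chosen for illustration.

```python
from statistics import mean


def static_threshold_alert(error_rates, threshold=0.01):
    """Fire only when the latest error rate crosses a fixed threshold (hypothetical 1%)."""
    return error_rates[-1] >= threshold


def trend_alert(error_rates, window=6, min_ratio=3.0, floor=0.0005):
    """Fire when the recent average error rate is several times the earlier baseline.

    All parameters are illustrative assumptions, not production settings.
    """
    if len(error_rates) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = mean(error_rates[-2 * window:-window])
    recent = mean(error_rates[-window:])
    # The floor keeps a near-zero baseline from turning tiny blips into large ratios.
    return recent >= max(baseline, floor) * min_ratio


# Hypothetical per-interval error rates (fraction of requests returning 5xx).
# Every value stays below 0.5%, so the fixed 1% threshold never fires.
rates = [0.0002] * 6 + [0.0008, 0.0012, 0.0018, 0.0025, 0.0035, 0.0045]

print(static_threshold_alert(rates))  # False
print(trend_alert(rates))             # True
```

Comparing a recent window against an earlier baseline is only one way to catch gradual drift; a rate-of-change or anomaly-detection rule in a monitoring system would serve the same purpose.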

Posted May 11, 2023 - 13:59 PDT

Resolved
This incident has been resolved.
Posted May 01, 2023 - 08:35 PDT
Monitoring
A fix has been implemented, and we are monitoring the results.
Posted May 01, 2023 - 08:08 PDT
Identified
The issue has been identified, and a fix is being implemented.
Posted May 01, 2023 - 07:47 PDT
Investigating
We are currently investigating reports of some images intermittently returning a 5xx error on the initial request.
Posted May 01, 2023 - 07:20 PDT
This incident affected: Rendering Infrastructure.