Intermittent elevated 503 errors
Incident Report for imgix
Postmortem

What happened?

On November 17, at 08:15 UTC, the imgix rendering service was affected by packet loss stemming from one of our network providers.

How were customers impacted?

During brief periods between 08:15 UTC and 10:40 UTC, a small percentage of requests (1.7%) returned the error message 503 No Healthy Backends. These periods lasted between one and three minutes and recurred several times. The incident was completely resolved by 10:40 UTC.

What went wrong during the incident?

We began experiencing packet loss stemming from one of our network providers, which caused some image requests to return a 503 response code. Because the impact was transient and limited, our escalation process stalled and it was unclear which remediation path applied. It also kept the status page from being updated, since each occurrence disappeared as quickly as it had started.

What will imgix do to prevent this in the future?

We are redefining our escalation conditions for recurring, self-resolving incidents. We are also updating our tooling both to improve monitoring of transient issues and to provide resilience against packet loss between transit providers.

Posted Nov 17, 2020 - 20:31 PST

Resolved
Packet loss from a network provider affected a small percentage of image requests. The issue was intermittent, manifesting in brief periods over 2 hours. The issue was completely resolved by 10:29 UTC.
Posted Nov 17, 2020 - 00:00 PST