Elevated fetch errors for some Sources
Incident Report for imgix
Postmortem

What happened?

Starting at 13:54 UTC on April 21, 2020, the imgix service began experiencing an elevated rate of origin fetch errors, concentrated among some Source types.

At 15:49 UTC, the imgix engineering team identified the root cause of the errors as a service problem with one of our internet providers. Traffic was rerouted away from the affected provider onto alternate providers, and the service fully recovered by 15:52 UTC.

How were customers impacted?

During the incident, some customers may have seen errors returned for some uncached derivative images. Only Google Cloud Storage, Web Folders, and Web Proxy Sources were impacted. Errors during this time accounted for less than 0.2% of total requests served.

Cached derivative images were not impacted and continued to be served as normal.

What went wrong during the incident?

One of our internet providers suffered simultaneous fiber optic cable cuts near Fort Worth, TX, and Milwaukee, WI, resulting in increased packet loss and latency within their network. This caused timeouts and errors for a portion of imgix’s fetch requests from customer origins.

Because the overall error rate across the service was relatively low, our monitoring did not catch the issue immediately, which delayed routing traffic away from the problematic provider.

What will imgix do to prevent this in the future?

We will add monitoring to detect issues of this nature more quickly.
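
As an illustration only, and not a description of imgix's actual monitoring, the sketch below shows one way a failure concentrated in a few segments can be caught even when the service-wide error rate stays low: compute the error rate per group (for example per Source type or per upstream provider) and alert on each group separately. The group labels, request counts, thresholds, and function names are all hypothetical.

```python
# Minimal sketch of segmented error-rate alerting.
# All names, counts, and thresholds below are hypothetical examples.

def aggregate_error_rate(counts):
    """Return the overall error rate across every group.

    `counts` maps a group label to a (requests, errors) tuple.
    """
    total_requests = sum(requests for requests, _ in counts.values())
    total_errors = sum(errors for _, errors in counts.values())
    return total_errors / total_requests if total_requests else 0.0


def groups_over_threshold(counts, threshold, min_requests=50):
    """Return the sorted labels of groups whose own error rate exceeds `threshold`.

    Groups with fewer than `min_requests` requests are skipped so that a
    handful of failures on a nearly idle group does not trigger an alert.
    """
    flagged = []
    for group, (requests, errors) in counts.items():
        if requests >= min_requests and errors / requests > threshold:
            flagged.append(group)
    return sorted(flagged)


# Hypothetical one-minute window: errors are confined to three groups.
sample_counts = {
    "other": (9500, 0),
    "gcs": (200, 6),
    "web_folder": (200, 6),
    "web_proxy": (100, 4),
}

overall = aggregate_error_rate(sample_counts)
flagged = groups_over_threshold(sample_counts, threshold=0.01)
```

With these made-up numbers, the overall error rate is 0.16%, which stays under a 1% service-wide alert threshold, while each of the three affected groups is individually at 3% or higher and would be flagged. The same grouping could be keyed by network provider instead of Source type.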

We note that the tools to quickly route traffic to alternate providers were already in place and well documented.

Posted Apr 23, 2020 - 11:29 PDT

Resolved
We've confirmed that our services have been restored to normal. Note that one of our providers is still experiencing service issues due to a cut fiber cable, though our service is not impacted. We will continue monitoring to ensure that our service remains unaffected.
Posted Apr 21, 2020 - 11:15 PDT
Monitoring
We have applied a fix and services have returned to normal. We are continuing to monitor the situation.
Posted Apr 21, 2020 - 09:06 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 21, 2020 - 08:57 PDT
Investigating
Our team is currently investigating higher than normal error rates when fetching a small number of images from certain Sources. Previously cached derivatives are not impacted.
Posted Apr 21, 2020 - 08:15 PDT
This incident affected: Rendering Infrastructure.