On April 21, 2020, beginning at 13:54 UTC, the imgix service saw an elevated rate of origin errors, concentrated among certain Source types.
At 15:49 UTC, the imgix engineering team identified the root cause as a service problem at one of our internet providers. Traffic was rerouted away from the affected provider onto alternate providers, and the service fully recovered by 15:52 UTC.
During the incident, some customers may have seen errors returned for uncached derivative images. Only Google Cloud Storage, Web Folder, and Web Proxy Sources were impacted. Errors during this period accounted for less than 0.2% of total requests served.
Cached derivative images were not impacted and continued to be served as normal.
One of our internet providers suffered simultaneous fiber optic cable cuts near Fort Worth, TX, and Milwaukee, WI, resulting in increased packet loss and latency within their network. This caused timeouts and errors for a portion of imgix’s fetch requests from customer origins.
Because the overall error rate across the service remained relatively low, our monitoring did not catch the issue immediately, which delayed routing traffic away from the problematic provider.
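To illustrate why a low service-wide error rate can hide a concentrated problem, here is a minimal sketch in Python. It is not imgix's monitoring code: the function names, the group labels, the 5% threshold, and the traffic numbers are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict


def error_rates(requests):
    """Return (overall_rate, per_group_rates) for (group, is_error) records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, is_error in requests:
        totals[group] += 1
        if is_error:
            errors[group] += 1
    overall = sum(errors.values()) / sum(totals.values())
    per_group = {g: errors[g] / totals[g] for g in totals}
    return overall, per_group


def groups_over_threshold(per_group, threshold):
    """Return the sorted names of groups whose error rate exceeds the threshold."""
    return sorted(g for g, rate in per_group.items() if rate > threshold)


# Hypothetical traffic: all errors fall in one small group of requests.
sample = (
    [("other", False)] * 9000
    + [("web_folder", False)] * 900
    + [("web_folder", True)] * 100
)

THRESHOLD = 0.05  # hypothetical alerting threshold
overall, per_group = error_rates(sample)
alerts = groups_over_threshold(per_group, THRESHOLD)

print(f"overall error rate: {overall:.3f}")
print(f"groups over threshold: {alerts}")
```

In this made-up sample, the overall rate stays under the threshold while the single affected group exceeds it, so only a per-group check would raise an alert.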
What will imgix do to prevent this in the future?
We will add monitoring to detect issues of this nature more quickly.
We note that the tools to quickly route traffic to alternate providers were already in place and well documented.