Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On April 13, 2023, between 17:09 UTC and 17:32 UTC, imgix experienced a partial outage affecting non-cached renders. During this time, requests for cached assets continued to return a 200 response, while requests for non-cached assets returned a server error.

A fix was implemented at 17:32 UTC, restoring service.

How were customers impacted?

Between 17:09 UTC and 17:32 UTC, requests to the Rendering API for non-cached renders returned a server error, with 9% of all requests to the Rendering API returning an error at the height of the incident.

What went wrong during the incident?

We identified an error in one of our connections to customer origins. This error led to a significant slowdown in the retrieval of new assets from customer origins. The errors grew rapidly, causing our Rendering API to return 5xx errors.

To restore the service, our engineers redirected some of our network traffic. The service was fully restored by 17:32 UTC, but some error responses remained in the cache and continued to be served until they were completely cleared at 17:35 UTC.

What will imgix do to prevent this in the future?

We have taken the following steps to prevent this issue from recurring:

  • Fixed the misconfigured alert so that our monitoring triggers and identifies potential issues before they become critical.
  • Removed the connection from our routing, replacing it with a new connection that will not experience the same errors.

We are in the process of implementing the following:

  • Conducting a review of our current tooling to increase our traffic and network configuration capabilities.
  • Reviewing our current configuration to limit the affected services should a similar incident happen in the future.
Posted Apr 28, 2023 - 14:54 PDT

Resolved
This incident has been resolved.
Posted Apr 13, 2023 - 11:10 PDT
Monitoring
Our engineering team has applied a fix, restoring services to normal. We are currently monitoring the situation.
Posted Apr 13, 2023 - 10:57 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 13, 2023 - 10:39 PDT
Investigating
We are currently investigating elevated render error rates for uncached derivative images. We will post an update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Apr 13, 2023 - 10:20 PDT
This incident affected: Rendering Infrastructure.