Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On November 22, 2021, at 17:20 UTC, the imgix service experienced a disruption affecting non-cached image derivatives. A fix was pushed at 18:40 UTC, and the service was fully restored by 18:50 UTC.

How were customers impacted?

Between 17:00 UTC and 18:40 UTC, 6% of all requests to the imgix service returned a 503 error. During this time, errors were returned only for new derivative images that had not yet been cached by imgix.

At 18:40 UTC, a fix was pushed out, which began restoration of the service. By 18:50 UTC, the service was marked as fully restored.

What went wrong during the incident?

At the start of the outage, our team identified the network behaviors that caused the initial incident and pushed configuration changes to begin restoring the service. Despite these mitigations, recovery stalled. We continued to investigate; however, our logs did not reveal any additional information about the root cause of the issue. This prevented us from pushing out further mitigations.

We eventually traced the issue to incorrect traffic configurations for a major region. Once we verified the issue, we implemented a mitigation, which restored the service.

What will imgix do to prevent this in the future?

This incident exposed gaps in our logging. Without those gaps, we may have been able to initiate a swifter recovery of the rendering service. As a result, we’ll be analyzing and closing gaps in our logging to prevent similar roadblocks in the future.

We will also be adjusting and adding configurations to identify major traffic patterns that would otherwise affect our rendering service and to automate the handling of them.

Posted Nov 29, 2021 - 08:12 PST

Resolved
This incident has been resolved.
Posted Nov 22, 2021 - 11:26 PST
Monitoring
The service has returned to normal. We will continue to monitor the service for any changes.
Posted Nov 22, 2021 - 11:14 PST
Update
A fix has been implemented and error rates have returned to normal. We are continuing to investigate to ensure that error rates remain normal.
Posted Nov 22, 2021 - 11:03 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 22, 2021 - 10:52 PST
Update
We are continuing to investigate the issue.
Posted Nov 22, 2021 - 10:45 PST
Update
We are continuing to investigate the issue.
Posted Nov 22, 2021 - 09:54 PST
Investigating
We are currently investigating elevated render error rates for uncached derivative images. We will post an update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Nov 22, 2021 - 09:24 PST
This incident affected: Rendering Infrastructure.