On November 22, 2021, at 17:20 UTC, the imgix service experienced a disruption affecting non-cached image derivatives. A fix was pushed at 18:40 UTC, fully restoring the service by 18:50 UTC.
Between 17:00 UTC and 18:40 UTC, 6% of all requests to the imgix service returned a 503 error. During this time, errors were returned only for new derivative images that had not yet been cached by imgix.
At 18:40 UTC, a fix was pushed out, which began restoration of the service. By 18:50 UTC, the service was marked as fully restored.
At the start of the outage, our team identified network behaviors that caused the initial incident and pushed configuration changes to begin restoring the service. Despite these mitigations, recovery stalled. We continued to investigate, but our logs did not reveal any additional information about the root cause of the issue, which prevented us from pushing out further mitigations.
We eventually traced the issue to incorrect traffic configurations for a major region. Once the issue was verified, we implemented a mitigation, which restored the service.
This incident exposed gaps in our logging. More complete logs might have allowed us to recover the rendering service sooner. As a result, we’ll be analyzing and closing these gaps to prevent similar roadblocks in the future.
We will also be adjusting and adding configurations to identify, and automatically handle, major traffic patterns that would otherwise affect our rendering service.