Elevated rendering errors

Incident Report for imgix

Postmortem

What happened?

On June 8, 2021, between 09:58 UTC and 10:36 UTC, our CDN provider experienced a global disruption that severely impacted the imgix service.

At 10:36 UTC, our CDN provider applied a fix that allowed us to begin restoring service, and we began processing the large backlog of render requests that had accumulated since the start of the outage. Our CDN provider marked their incident as resolved at 12:41 UTC.

Our engineers later identified an issue affecting <1% of image requests to the service and applied a fix at 20:22 UTC.

The incident was marked fully resolved at 21:26 UTC, although most images were being served successfully by 12:41 UTC.

How were customers impacted?

Between 09:58 UTC and 10:50 UTC, a significant percentage of requests to the imgix service returned an error. The CDN provider outage also prevented logs from being sent from the CDN to imgix, so imgix customers will not have analytics for this time period.

After the CDN outage was marked as resolved by our provider (12:41 UTC), approximately 94% of all requests to imgix resulted in a successful response. This number gradually increased over the course of several hours.

By 17:14 UTC, 99% of all requests resulted in a successful response, and the service was mostly restored. Due to an issue affecting a small number of images, the incident was not marked as fully resolved until 21:26 UTC.

What went wrong during the incident?

This outage was unprecedented in a few ways:

Our CDN provider experienced a major outage that lasted longer than any incident we have previously experienced with them.
A large volume of rendering traffic queued up during the longer-than-average downtime.
The CDN provider outage also caused a large number of previously rendered derivatives to be removed from cache, which, combined with the render backlog, contributed to our systems seeing sustained loads of up to 4x our typical peak traffic levels.

At 10:36 UTC, a fix was applied that restored the CDN service. However, the sheer volume of incoming rendering traffic, combined with the increased origin load from a much lower-than-normal cache-hit ratio, prevented our service from recovering immediately. By 11:00 UTC, traffic was being served with an 89% success rate.
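To illustrate why a lower cache-hit ratio raises origin load so sharply, the relationship can be sketched as below. This is a simplified model, not a description of our actual systems: it assumes every cache miss becomes exactly one origin request, and the function names and the hit ratios used in the usage note are hypothetical.

```python
def origin_requests(total_requests: float, cache_hit_ratio: float) -> float:
    """Requests that miss the cache and reach the origin.

    Simplifying assumption: each miss results in exactly one origin request.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_requests * (1.0 - cache_hit_ratio)


def origin_load_multiplier(normal_hit_ratio: float, degraded_hit_ratio: float) -> float:
    """How many times more origin traffic arrives at the same request volume
    when the hit ratio falls from normal_hit_ratio to degraded_hit_ratio."""
    for ratio in (normal_hit_ratio, degraded_hit_ratio):
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("hit ratios must be between 0 and 1")
    normal_miss = 1.0 - normal_hit_ratio
    if normal_miss == 0.0:
        raise ValueError("normal_hit_ratio of 1.0 leaves no baseline origin load")
    return (1.0 - degraded_hit_ratio) / normal_miss
```

With purely illustrative figures, a drop from a 0.9 to a 0.7 hit ratio triples the miss rate, so origin load roughly triples even before any backlog of queued requests is added.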

Several recovery strategies were implemented to handle the fallout caused by the initial outage:

We repurposed rendering capacity from other environments to process the dramatic surge of render requests
Our engineers tuned the rendering stack to handle a much higher load than normal
Tapered load shedding was implemented to reduce stress on the network
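As a rough illustration of the last strategy, tapered load shedding can be sketched as rejecting a fraction of incoming requests that shrinks over a recovery window. This is a minimal sketch under assumptions, not the mechanism we deployed: the linear taper, the function names, and the parameters (`ramp_s`, `start_fraction`) are all hypothetical.

```python
import random
from typing import Callable


def shed_fraction(elapsed_s: float, ramp_s: float, start_fraction: float) -> float:
    """Fraction of requests to reject at elapsed_s seconds into recovery.

    Assumed policy: taper linearly from start_fraction down to 0 over ramp_s seconds.
    """
    if ramp_s <= 0:
        return 0.0
    # Share of the ramp still remaining, clamped to the range [0, 1].
    remaining = min(1.0, max(0.0, 1.0 - elapsed_s / ramp_s))
    return start_fraction * remaining


def should_shed(
    elapsed_s: float,
    ramp_s: float,
    start_fraction: float,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to reject one request; rng returns a float in [0, 1)."""
    return rng() < shed_fraction(elapsed_s, ramp_s, start_fraction)
```

Under these assumptions, a service that starts by shedding half of its requests would shed a quarter of them at the midpoint of the ramp and none once the ramp ends, which lets load return gradually rather than all at once.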

As mitigations were implemented, the service reached a delivery success rate above 99% by 17:14 UTC. Our status page should have been updated at that point. Due to a smaller issue affecting a small portion of renders, we did not consider the incident fully resolved until 20:22 UTC, when all errors had returned to normal levels.

What will imgix do to prevent this in the future?

We will be evaluating possible fallbacks in the case of total CDN failure, along with investigating our caching capacities to prevent events of a similar scale from having such an impact on our rendering services in the future. Projects are already underway to dramatically increase render capacity under sustained loads as well as unexpected spikes.

At the same time, we will be working closely with our CDN partner to discuss their own remediation steps and how we can best interact with them moving forward.

We will also be revisiting our processes to ensure that status updates are more frequent, as the majority of the service was restored far sooner than we indicated on our status page.

Posted Jun 11, 2021 - 09:55 PDT

Resolved

The incident has been completely resolved.

Posted Jun 08, 2021 - 14:26 PDT

Monitoring

The fix has resolved the rest of the rendering issues we identified, and the service is back to normal. We will continue to monitor the situation.

Posted Jun 08, 2021 - 14:19 PDT

Identified

The service has dramatically recovered since this morning, though we have identified a rendering issue affecting a small percentage of images. A fix is currently being rolled out.

Posted Jun 08, 2021 - 13:22 PDT

Update

We are monitoring the gradual recovery of the service. We are seeing improved render times along with slightly improved delivery rates.

Posted Jun 08, 2021 - 09:04 PDT

Update

We are still seeing increased origin load as a result of the earlier outage from our service provider.

Recovery of the service is gradual as our systems continue to catch up to the increased load post outage.

Posted Jun 08, 2021 - 06:19 PDT

Update

A fix was implemented by our service provider at 10:36 UTC. As a result, users may experience increased origin load times as services return.

We are seeing gradual service recovery and will continue to monitor the service closely.

Posted Jun 08, 2021 - 05:12 PDT

Monitoring

The issue has been identified and our service provider is implementing a fix.

We are seeing recovery in the service and are currently monitoring the situation.

Posted Jun 08, 2021 - 04:02 PDT

Identified

The issue has been identified, and we are working with our service providers to investigate it further.

Posted Jun 08, 2021 - 03:42 PDT

Update

We are currently working with service providers to investigate this issue further.

Posted Jun 08, 2021 - 03:22 PDT

Investigating

We are currently investigating elevated render error rates for images. We will post an update once we obtain more information.

Posted Jun 08, 2021 - 03:12 PDT

This incident affected: Rendering Infrastructure.