Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On August 10, 2021, at 19:05 UTC, our CDN provider experienced a brief outage, which resulted in elevated rendering error rates from the imgix service.

Error rates returned to almost normal levels by 19:30 UTC, with a small percentage of errors continuing to occur in imgix. By 19:58 UTC, error rates were restored to completely normal levels, though errors that did not affect users continued to appear in our stack. Our team continued to apply mitigations and fixes, and the incident was marked as fully resolved on August 11 at 02:15 UTC.

How were customers impacted?

Between 19:05 UTC and 19:30 UTC, users experienced elevated render error rates for requests to the imgix service for non-cached images. At the height of the incident (19:12 UTC), 11% of requests to imgix received a 503 response. After 19:12 UTC, the error rate dropped sharply to 5% and continued to fall until it returned to almost normal levels by 19:30 UTC. By this time, only a small percentage of requests (<1%) continued to return errors. Ongoing work fully restored the rendering service by 19:58 UTC.

From 19:58 UTC until the incident was resolved on August 11 at 02:15 UTC, backend errors continued to occur, though these errors did not affect image deliverability.

What went wrong during the incident?

At 19:05 UTC, our CDN provider posted a status update about a performance impact to their CDN services, which in turn raised error rates for imgix services. Our monitoring tools alerted our engineering team to the rising error rates, which allowed us to quickly apply mitigations to control the growth of errors.

Our own status page was updated at 19:16 UTC. Thanks to the mitigations applied by both our CDN provider and our engineering team, the service began to recover at 19:30 UTC, with only a small percentage of errors persisting. Our team continued to apply changes to sustain the mitigations, and error rates were restored to normal levels by 19:58 UTC.

Though rendering had been restored, errors that did not affect end users continued to surface within our infrastructure. Our team continued to investigate and apply fixes, though the erratic behavior resulting from the initial outage continued for much longer than anticipated. The incident was marked as resolved on August 11 at 02:15 UTC.

What will imgix do to prevent this in the future?

This incident exposed an issue in which brief CDN service outages cause lengthy incidents for our rendering service. We will tune our infrastructure and investigate further ways to mitigate the aftereffects of CDN outages on imgix services.

Posted Sep 03, 2021 - 12:17 PDT

Resolved
This incident has been completely resolved.
Posted Aug 10, 2021 - 19:15 PDT
Monitoring
Mitigation work for this issue is complete. We will continue to monitor the results.

Error rates are at normal levels.
Posted Aug 10, 2021 - 18:52 PDT
Update
We are continuing mitigation work for this incident.

Error rates have returned to normal.
Posted Aug 10, 2021 - 18:32 PDT
Update
We are continuing mitigation work for this incident.

Error rates have returned to normal.
Posted Aug 10, 2021 - 17:59 PDT
Update
We are continuing mitigation work for this incident.

Error rates have returned to normal.
Posted Aug 10, 2021 - 17:26 PDT
Update
We are continuing mitigation work for this incident.

Error rates have returned to normal.
Posted Aug 10, 2021 - 16:53 PDT
Update
We are continuing mitigation work for this incident.
Posted Aug 10, 2021 - 16:21 PDT
Update
We are continuing mitigation work for this incident.
Posted Aug 10, 2021 - 15:51 PDT
Update
We are continuing mitigation work for this incident.
Posted Aug 10, 2021 - 15:11 PDT
Update
We are continuing to work on mitigating this issue.
Posted Aug 10, 2021 - 14:41 PDT
Update
We are continuing mitigation work for this incident.
Posted Aug 10, 2021 - 14:09 PDT
Update
We are continuing to work on mitigating this issue.
Posted Aug 10, 2021 - 13:32 PDT
Update
Error rates have returned to normal levels. Our engineers are continuing mitigation work for this incident.
Posted Aug 10, 2021 - 12:58 PDT
Identified
The issue has been identified. We are seeing some recovery, though our engineers are applying changes to continue mitigating upstream issues.

Previously cached derivatives are not impacted.
Posted Aug 10, 2021 - 12:42 PDT
Investigating
We are currently investigating elevated render error rates caused by upstream issues from our CDN provider. We will post an update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Aug 10, 2021 - 12:16 PDT
This incident affected: Rendering Infrastructure.