Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On June 17, 2024, at 00:00 UTC, imgix experienced an extreme spike in requests to our render stack. This unexpected surge caused a failure in our auto-scaling infrastructure, leaving the render stack unable to handle all incoming traffic. A fix was implemented at 00:38 UTC, and the issue was resolved by 01:06 UTC.

How were customers impacted?

Between 00:00 and 01:06 UTC, customers may have experienced failures when requesting new renders. Previously cached assets continued to be served successfully during this time.

What went wrong during the incident?

The incident was triggered by a significant increase in requests that our automated systems did not handle properly. Although the system started to auto-scale as expected, the surge caused issues with the health checks used for auto-scaling. The combination of extra traffic and health check failure left the system unable to render new images, and manual intervention was required to resolve the issue.

What will imgix do to prevent this in the future?

To avoid similar incidents in the future, imgix is taking the following actions:

  1. Health Check Enhancement: We have investigated and implemented updated health checks to support increased traffic volumes.
  2. Rate Limiting: Further rate limits will be applied to manage traffic spikes and minimize their impact.
  3. Traffic Routing: Traffic will be rerouted as necessary to distribute the load and reduce the risk of system overloads.
  4. Automated Alerts Improvement: We will enhance our automated alert systems to respond more effectively to traffic surges and potential issues, including health check failures.
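
As an illustration of the rate limiting mentioned in item 2, one common approach is a token bucket, which permits short bursts while capping the sustained request rate. The sketch below is a generic example under stated assumptions, not a description of imgix's implementation; the class name, parameters, and injectable clock are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Generic token-bucket limiter (illustrative; all names are hypothetical)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate          # tokens added per second (sustained rate)
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable time source, useful for testing
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With a limiter like this placed in front of a render path, requests beyond the configured burst would be rejected early (for example with an HTTP 429 response) rather than queuing up behind already saturated workers.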

By addressing these areas, we aim to further improve our system's resilience and ensure a smoother customer experience during periods of high demand.

Posted Jun 21, 2024 - 16:55 PDT

Resolved
This incident has been resolved.
Posted Jun 16, 2024 - 18:19 PDT
Monitoring
A fix has been implemented and error rates are returning to normal. We are continuing to monitor the service.
Posted Jun 16, 2024 - 18:06 PDT
Update
We are continuing to work on a fix for this issue.
Posted Jun 16, 2024 - 17:51 PDT
Identified
The issue has been identified and our engineering team is developing a fix.
Posted Jun 16, 2024 - 17:50 PDT
Investigating
We are currently investigating elevated render error rates for uncached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Jun 16, 2024 - 17:21 PDT
This incident affected: Rendering Infrastructure.