On June 17, 2024, at 00:00 UTC, imgix experienced an extreme spike in requests to our render stack. The surge caused our auto-scaling infrastructure to fail, leaving us unable to handle all incoming traffic. A fix was implemented at 00:38 UTC, and the issue was resolved by 01:06 UTC.
Between 00:00 and 01:06 UTC, customers may have experienced failures when requesting new renders. Previously cached assets continued to be served successfully during this time.
The incident was triggered by a significant increase in requests that our automated systems did not handle properly. Although the system began to auto-scale as expected, the surge caused issues with the health checks used for auto-scaling. The combination of extra traffic and failing health checks left the system unable to render new images until we intervened manually.
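As a generic illustration of how extra traffic and health checks can interact badly, the toy simulation below may help. It is a minimal sketch and not a description of imgix's actual infrastructure: the function `simulate`, its parameters, and all the numbers are hypothetical. It assumes that overloaded instances fail their health checks and are taken out of rotation faster than the auto-scaler can add new ones.

```python
import math


def simulate(incoming_rps, instances, capacity_rps, ticks,
             add_per_tick=2, evict_per_tick=3):
    """Toy model of an auto-scaled pool (all names and values are hypothetical).

    Each tick:
      1. If per-instance load exceeds capacity, health checks are assumed to
         time out and up to `evict_per_tick` instances are removed.
      2. The auto-scaler adds up to `add_per_tick` instances until the pool is
         large enough to serve `incoming_rps`.
    Returns the pool size after each tick.
    """
    history = []
    needed = math.ceil(incoming_rps / capacity_rps)
    for _ in range(ticks):
        per_instance = incoming_rps / instances if instances else float("inf")
        if per_instance > capacity_rps:
            # Overloaded instances fail health checks and leave the pool.
            instances = max(0, instances - evict_per_tick)
        if instances < needed:
            # Scale up, but never beyond what the traffic requires.
            instances += min(add_per_tick, needed - instances)
        history.append(instances)
    return history


# Scaling alone catches up with the spike.
print(simulate(1200, 10, 100, ticks=5, evict_per_tick=0))  # [12, 12, 12, 12, 12]

# Scaling plus health-check evictions: the pool shrinks instead of growing.
print(simulate(1200, 10, 100, ticks=5))  # [9, 8, 7, 6, 5]
```

In this model the pool never recovers on its own once evictions outpace scale-up, which is consistent with the kind of situation where manual intervention is needed.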
To avoid similar incidents in the future, imgix is taking the following actions:
By addressing these areas, we aim to further improve our system's resilience and ensure a smoother customer experience during periods of high demand.