Summary
Between October 21 and October 29, customers in Europe experienced three separate periods of increased latency for rendering requests. In a small number of cases, requests temporarily failed with “429 – concurrency limit reached” responses.
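As a general note for API clients, 429 responses of this kind are safe to retry. The sketch below is a minimal, illustrative example of retrying with exponential backoff; the endpoint URL, payload shape, and the `render_with_backoff` helper are placeholders for this write-up, not part of our actual API.

```python
import random
import time

import requests  # assumed available; any HTTP client works the same way


def render_with_backoff(url: str, payload: dict, max_attempts: int = 5) -> requests.Response:
    """Submit a rendering request, retrying on 429 (concurrency limit) responses."""
    for attempt in range(max_attempts):
        response = requests.post(url, json=payload, timeout=30)
        if response.status_code != 429:
            return response  # success, or an error that should not be retried

        # Honour Retry-After if present (assumed numeric here); otherwise back off
        # exponentially with a little jitter to avoid synchronized retries.
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else (2 ** attempt) + random.random()
        time.sleep(delay)

    raise RuntimeError(f"Rendering request still rate-limited after {max_attempts} attempts")
```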
What Went Wrong
The incident was traced to a GPU scaling issue at one of our upstream infrastructure providers. This led to temporary slowdowns under higher-than-usual load.
Timeline
- October 21: Increased rendering latency in EU region, self-resolved. Investigation traced issue to GPU scaling in upstream infrastructure. Mitigation prepared.
- October 27: Issue recurred. Mitigation deployed manually to stabilize rendering, with work started to automate handling of future occurrences.
- October 29: Latency alert triggered again. Previous fix reduced the impact, but intermittent latency persisted; additional configuration changes implemented to fully restore service and prevent recurrence.
What We Will Do to Prevent This in the Future
While the new configurations will prevent a recurrence of these incidents, we are making further improvements to rendering resiliency and recovery speed:
- Added more GPU hardware types to reduce the risk of scaling delays during peak demand.
- Testing and evaluating additional hardware configurations to improve resiliency.
- Finalizing fine-tuning of current configurations and exploring cross-regional load-balancing capabilities to further strengthen reliability.
- Adjusted alerting thresholds to provide earlier notification of emerging issues.