On February 1, 2023, at 14:07 UTC, the imgix service began experiencing intermittent latency spikes in web administration services, such as the imgix Dashboard and Management API. The incident was resolved at 20:03 UTC the same day.
Customers may have experienced issues using the Dashboard and the Management API. Actions such as logging in, loading pages, and making requests to the Management API resulted in intermittent timeouts.
The Rendering API was not affected by this incident.
After our engineers identified the first latency spike, we deployed a workaround that temporarily resolved the issue. After monitoring the results, we closed the incident, but latency spiked again shortly afterward. This second spike was sustained, and requests to the Web Administration parts of our service started to show long response times.
The issues we identified were similar to those in a recent incident that had occurred due to upstream providers. Our engineers applied similar mitigation steps, though they were less effective this time.
Upon further discussion, our engineering team identified a path to resolution: fast-tracking a planned infrastructure change that reduced connections between our internal services. This change immediately resolved the latency in our Web Administration services.
Internal documentation and tooling allowed our team to easily apply configuration changes and quickly push the needed architecture updates. We have updated the documentation and tooling covering communication between our internal services to make similar deployments easier in the future. We have also updated our diagnostic steps and our active monitoring and alerting.
Additionally, we have completed an infrastructure upgrade that is designed to prevent this issue from recurring. As we gather more data on performance under the upgraded infrastructure, we will continue to proactively tune our configurations to ensure future stability.