Intermittent Dashboard and Management API Issues

Incident Report for imgix

Postmortem

What happened?

On February 1, 2023, at 14:07 UTC, the imgix service began experiencing intermittent latency spikes in its web administration services, such as the imgix Dashboard and Management API. The incident was resolved the same day at 20:03 UTC.

How were customers impacted?

Customers may have experienced issues using the Dashboard and the Management API. Actions such as logging in, loading pages, and making requests to the Management API intermittently timed out.

The Rendering API was not affected by this incident.

What went wrong during the incident?

After our engineers identified the first latency spike, we deployed a workaround that appeared to resolve the issue. After monitoring the results, we closed the incident, but latency spiked again shortly afterward. This spike was sustained, and requests to the Web Administration parts of our service began to show long response times.

The issues we identified resembled those of a recent incident caused by upstream providers. Our engineers applied similar mitigation steps, but they were less effective this time.

After further discussion, our engineering team identified a path to resolution: fast-tracking a planned infrastructure change that reduced connections between our internal services. This change immediately resolved the latency in our Web Administration services.

What will imgix do to prevent this in the future?

Internal documentation and tooling allowed our team to apply configuration changes easily and push the needed architecture updates quickly. We have updated the documentation and tooling that cover communication between our internal services to make similar deployments easier in the future. We have also updated our diagnostic steps and our active monitoring and alerting.

Additionally, we have completed an infrastructure upgrade designed to prevent this issue from recurring. As we gather more data on the new performance metrics, we will continue to tune our configurations to ensure future stability.

Posted Feb 02, 2023 - 16:08 PST

Resolved

This incident has been resolved.

Posted Feb 01, 2023 - 00:02 PST

Monitoring

We are currently monitoring Dashboard and Management API performance.

Posted Jan 31, 2023 - 19:22 PST

This incident affected: Web Administration Tools.