Elevated rendering errors

Incident Report for imgix

Postmortem

Incident Summary

Between 17:55 and 20:22 UTC on June 12, 2025, Imgix services experienced major service disruptions across several key interfaces:

  • Dashboard and Asset Manager: These interfaces were inaccessible, preventing users from managing their assets or viewing account information.
  • Management API: Requests to the Management API consistently returned errors, affecting workflows reliant on programmatic updates or asset administration.
  • Rendering API: Requests for uncached assets through the Rendering API failed at a high rate (~80%), which amounted to approximately 8% of all Imgix service requests. In the EU, uncached requests saw a lower failure rate (~50%) and a faster recovery (45 minutes).
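
As a rough reading of the Rendering API figures above, the two percentages together suggest what share of traffic was uncached during the incident. The sketch below assumes that cached requests did not fail at all and that the ~80% figure is a global average; the function name is hypothetical and the result is an estimate, not a figure reported by Imgix.

```python
def implied_uncached_share(overall_failure_rate: float, uncached_error_rate: float) -> float:
    """Estimate the fraction of requests that were uncached.

    Assumes only uncached requests failed, so:
        overall_failure_rate = uncached_share * uncached_error_rate
    """
    return overall_failure_rate / uncached_error_rate


# Figures from the summary: ~8% of all requests failed, ~80% of uncached requests failed.
share = implied_uncached_share(0.08, 0.80)
print(f"Implied uncached share of traffic: {share:.0%}")
```

Under those assumptions, roughly one request in ten was for an uncached asset, which is why most traffic continued to be served.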

What caused it

The incident was triggered by a global outage within Google Cloud, which serves as a core infrastructure provider for Imgix. The outage affected most services in all regions simultaneously.

You can read more about the Google Cloud outage here.

What happened

  • 17:55 UTC: Internal alerts triggered due to a spike in rendering errors and service timeouts.
  • 18:01 UTC: A short investigation uncovered several timeouts and increased error rates from Google Cloud.
  • 18:09 UTC: We updated our status page.
  • 18:53 UTC: A major Google Cloud outage was confirmed, after which we updated our status page.
  • 18:00–20:16 UTC: Mitigation efforts were hampered by the far-reaching effects of the outage, preventing us from redirecting traffic or applying configuration changes.
  • Throughout: We confirmed that cached images were not affected, though the downtime of several data sources prevented us from evaluating the full scope and effect of the outage.
  • 20:16 UTC: Google reported recovery in all regions except us-central1. This allowed us to verify significantly lower error rates for EU traffic.
  • 20:47 UTC: Imgix systems achieved full recovery.
  • 20:52 UTC: The incident was officially resolved on our status page.
  • 21:23 UTC: Google confirmed a full service recovery.
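
The timeline above is in UTC, while the status updates further down are stamped in PDT (UTC-7). A minimal sketch for lining the two up, assuming the stamp format shown on this page; the helper name is hypothetical:

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset is sufficient for stamps from a single day in June.
PDT = timezone(timedelta(hours=-7), "PDT")


def pdt_to_utc(stamp: str) -> datetime:
    """Parse a stamp such as 'Jun 12, 2025 - 13:52' (PDT) and return it in UTC."""
    local = datetime.strptime(stamp, "%b %d, %Y - %H:%M").replace(tzinfo=PDT)
    return local.astimezone(timezone.utc)


first_post = pdt_to_utc("Jun 12, 2025 - 11:09")
resolved = pdt_to_utc("Jun 12, 2025 - 13:52")
print(first_post.strftime("%H:%M UTC"))
print(resolved.strftime("%H:%M UTC"))
```

The 11:09 PDT and 13:52 PDT posts correspond to the 18:09 UTC and 20:52 UTC timeline entries.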

What went wrong

  • Google Cloud experienced an outage that simultaneously affected nearly every service in every region worldwide, negating our multi-region redundancy for the image rendering service.
  • Third-party services (such as our CDN) were also affected by the outage, which removed some of our options for redirecting traffic across regions based on performance.
  • The outage included the control planes that Google provides to its customers, which removed additional options for redirecting traffic and implementing mitigations.

What we will do to prevent this in the future

  • Continue our internal discussions and evaluations of a multi-cloud render stack to enable failover in the event of a provider-wide outage.
  • Continue evaluating and improving tools to automatically and manually shift traffic as necessary at each layer of the stack.
  • Review and enhance incident communication protocols, focusing on faster root cause disclosure and update frequency.
Posted Jun 20, 2025 - 12:05 PDT

Resolved

The service is completely restored.
Posted Jun 12, 2025 - 13:52 PDT

Monitoring

The Rendering API is fully recovered.

Web Administration tooling (logins and the Management API) is recovering. We are monitoring the results.
Posted Jun 12, 2025 - 13:47 PDT

Identified

The Rendering API has fully recovered.

We are continuing to investigate Web Administration issues (login and Management API) related to the incident.
Posted Jun 12, 2025 - 13:38 PDT

Monitoring

The service is restored. We are monitoring the situation.
Posted Jun 12, 2025 - 13:26 PDT

Update

The service is experiencing elevated error rates due to a major Google Cloud outage affecting services downstream.

Previously cached derivatives are not impacted.

We are investigating ways to mitigate this issue.
Posted Jun 12, 2025 - 11:53 PDT

Identified

The issue has been identified and we are investigating a solution.
Posted Jun 12, 2025 - 11:15 PDT

Investigating

We are investigating elevated render error rates for the service.

Previously cached derivatives are not impacted.
Posted Jun 12, 2025 - 11:09 PDT
This incident affected: Rendering Infrastructure, Web Administration Tools, and API Service.