Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On September 30, 2021, at 14:02 UTC, an improper configuration prevented imgix servers from connecting to some Web Folder and Web Proxy origins. As a result, non-cached derivative image requests for affected Web Folder and Web Proxy customer origins returned a 503 error.

How were customers impacted?

The impact of this incident was isolated to some Web Folder and Web Proxy customers sharing a common configuration setting.

Between 14:02 UTC and 18:56 UTC, affected Web Folder and Web Proxy customers experienced a variable increase in errors for non-cached derivative images.

At the height of the incident, a small percentage of Web Folder and Web Proxy requests returned a 503 error, which amounted to 0.16% of all imgix requests.

At 18:56 UTC, a fix was applied, allowing the service to be completely restored.

What went wrong during the incident?

At 14:20 UTC, our team was alerted to a small increase in fetch errors to some Web Folder and Web Proxy origins. Because our monitoring service reported only a small number of errors, it was unclear whether this was the result of some customer origins misbehaving or an issue with our service’s ability to fetch images.

Eventually, our engineering team tracked down the change to a specific service provider, which we correlated to the increase in errors for some Web Folder / Web Proxy customers.

As our team looked into solutions, several external factors severely slowed remediation efforts:

  • Our internal communication platform was experiencing connectivity issues
  • Some critical database services were unavailable during the incident
  • Service error messaging was ambiguous as to the cause of the issue
  • We experienced discrepancies between applied system changes and running processes

Eventually, the imgix team deployed a fix that enabled our servers to successfully talk to all Web Folder and Web Proxy origins.

What will imgix do to prevent this in the future?

We will be updating our configurations for fetching assets from customer origins to prevent similar issues from occurring, along with updating our service runbooks to include rolling restarts for some types of configuration updates.

We will also be migrating some of our database tooling to mitigate connectivity limitations, along with updating our internal processes to address cases where communication outages occur.

Posted Oct 11, 2021 - 13:07 PDT

Resolved
This incident has been resolved.
Posted Sep 30, 2021 - 13:36 PDT
Monitoring
Error rates have returned to normal levels for all Source types. We are continuing to monitor the results.
Posted Sep 30, 2021 - 13:21 PDT
Update
Our engineers have applied a temporary fix for Web Folder and Web Proxy Sources. Error rates are back to normal levels for all Source types.

Error rates are expected to remain at normal levels as we continue to work on a permanent solution for this incident.
Posted Sep 30, 2021 - 12:20 PDT
Update
We are continuing to work on a fix for this issue.
Posted Sep 30, 2021 - 11:20 PDT
Identified
The issue has been identified, with the impact being isolated to Web Folder and Web Proxy Sources. Our engineers are working on resolving the issue.

Other Source types (S3, GCS, and Azure) are working as expected.
Posted Sep 30, 2021 - 09:11 PDT
Investigating
We are currently investigating elevated render error rates for a small percentage of uncached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Sep 30, 2021 - 08:16 PDT
This incident affected: Rendering Infrastructure.