Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On May 15, 2020 at 02:25 UTC, the imgix service saw elevated latency when retrieving images via our origin cache. This caused an increase in error rates for uncached derivative images. The imgix engineering team implemented remediations that restored normal service by 03:05 UTC.

How were customers impacted?

During the incident, customers may have noticed errors returned for some uncached derivative images. We saw up to 15% of requests fail to return successfully.

Cached derivative images were not impacted and continued to be served as normal.

What went wrong during the incident?

The issue was quickly identified as slow origins impacting the service. Remediations that had already been put into place were expected to enable the system to recover on its own, as we have seen it do in similar circumstances.

This time, however, an older configuration had been loaded onto some servers, which left those servers unable to recover on their own. The engineering team had to manually intervene to restore each server to a good state and allow the system to recover.

What will imgix do to prevent this in the future?

This incident exposed a part of our infrastructure that did not yet support instant rollbacks. We will be addressing this immediately.

Automated slow origin detection and rate limiting have already been deployed to isolate the impact, and additional capacity has been added to accommodate general traffic increases.

Posted May 22, 2020 - 11:21 PDT

Resolved
Service has been completely restored.
Posted May 14, 2020 - 20:42 PDT
Monitoring
Our engineering team has applied a fix, restoring services to normal. We are currently monitoring the situation.
Posted May 14, 2020 - 20:31 PDT
Identified
The issue has been identified and a fix is being implemented. Error rates are recovering.
Posted May 14, 2020 - 20:11 PDT
Update
Our engineering team is investigating elevated render error rates for uncached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.
Posted May 14, 2020 - 20:03 PDT
Investigating
We are currently investigating elevated render error rates for uncached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.
Posted May 14, 2020 - 19:25 PDT
This incident affected: Rendering Infrastructure.