Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On May 7, 23:55 UTC, the imgix service saw elevated latency when retrieving images via our origin cache. This caused an increase in error rates for uncached derivative images. The imgix engineering team implemented remediations that restored normal service by May 8, 01:06 UTC.

We saw similarly elevated rendering errors again starting on May 8 at 15:05 UTC. Additional remediations were put into place with service restored by 16:13 UTC.

The issue occurred again on May 12 from 02:18 to 02:52 UTC and from 14:16 to 15:07 UTC.

How were customers impacted?

During the incidents, customers may have noticed some uncached derivative images returning an error. At peak, between 5% and 15% of requests failed to return successfully. The May 12 incidents were shorter and less severe.

Cached derivative images were not impacted and continued to be served as normal.

What went wrong during the incident?

The May 7 and May 8 issues were quickly identified as being caused by slow origins, and a fix to decrease the origin timeout was prepared and approved. However, full rollout of the fix was delayed by the resource contention that the slow origins were causing.

While the May 12 degradations were ultimately traced back to slow origins, they manifested slightly differently. They were less severe, and remediations from the earlier incidents helped our team resolve them more quickly.

What will imgix do to prevent this in the future?

Additional tooling has already been built and deployed that lets our engineers toggle more configuration settings quickly, without needing a full deployment.

Furthermore, we will be changing our stance toward slow or misbehaving origins. We have always tried to do our best to serve every image and every request, even for slow, error-prone, non-RFC-compliant, or otherwise misbehaving origins. During the past few incidents, we have become increasingly aggressive in rate limiting and isolating these origins, and we will continue to automate systems that block or isolate them further in order to ensure reliable performance for the rest of our customers.
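
As an illustration only, automated isolation of slow origins could look something like the sketch below. This is a minimal Python example and not imgix's actual implementation: the class name `OriginIsolator`, the thresholds, and the sliding-window policy are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class OriginIsolator:
    """Hypothetical sketch: isolate origins whose recent fetches are mostly slow.

    All names and default values here are illustrative assumptions.
    """

    def __init__(self, slow_threshold_s=2.0, window=20, max_slow_ratio=0.5,
                 cooldown_s=60.0, clock=time.monotonic):
        self.slow_threshold_s = slow_threshold_s  # latency that counts as "slow"
        self.window = window                      # samples kept per origin
        self.max_slow_ratio = max_slow_ratio      # tolerated share of slow samples
        self.cooldown_s = cooldown_s              # how long an origin stays isolated
        self.clock = clock                        # injectable for testing
        self._samples = defaultdict(lambda: deque(maxlen=window))
        self._isolated_until = {}

    def record(self, origin, latency_s, ok=True):
        """Record one fetch outcome; failed fetches count as slow."""
        slow = (not ok) or latency_s >= self.slow_threshold_s
        samples = self._samples[origin]
        samples.append(slow)
        # Only judge an origin once a full window of samples is available.
        if len(samples) == self.window:
            if sum(samples) / self.window > self.max_slow_ratio:
                self._isolated_until[origin] = self.clock() + self.cooldown_s
                samples.clear()  # start fresh after the cooldown

    def is_isolated(self, origin):
        """Return True while the origin is inside its cooldown period."""
        until = self._isolated_until.get(origin)
        if until is None:
            return False
        if self.clock() >= until:
            del self._isolated_until[origin]
            return False
        return True
```

In a design like this, a fetcher would call `is_isolated(origin)` before each request and either fail fast or route the request to a separate, resource-limited pool, so that one slow origin cannot exhaust capacity shared with other customers.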

Posted May 13, 2020 - 18:09 PDT

Resolved
Service has been completely restored.
Posted May 11, 2020 - 20:05 PDT
Update
We are continuing to monitor for any further issues.
Posted May 11, 2020 - 19:59 PDT
Monitoring
Our engineering team has applied a fix, restoring services to normal. We are currently monitoring the situation.
Posted May 11, 2020 - 19:57 PDT
Identified
The issue has been identified and our engineering team is developing a fix.
Posted May 11, 2020 - 19:37 PDT
Investigating
We are currently investigating elevated render error rates for uncached derivative images. We will update once we have more information.

Previously cached derivatives are not impacted.
Posted May 11, 2020 - 19:28 PDT
This incident affected: Rendering Infrastructure.