Elevated rendering errors

Incident Report for imgix

Postmortem

What happened?

On May 7 at 23:55 UTC, the imgix service saw elevated latency when retrieving images via our origin cache. This caused an increase in error rates for uncached derivative images. The imgix engineering team implemented remediations that restored normal service by May 8 at 01:06 UTC.

We saw similarly elevated rendering errors again starting on May 8 at 15:05 UTC. We put additional remediations in place, and service was restored by 16:13 UTC.

The issue occurred again on May 12 from 02:18 to 02:52 UTC and from 14:16 to 15:07 UTC.

How were customers impacted?

During the incidents, customers may have noticed some uncached derivative images returning an error. At peak, between 5% and 15% of requests failed to return successfully. The May 12 incidents were shorter and less severe.

Cached derivative images were not impacted and continued to be served as normal.

What went wrong during the incident?

The May 7 and May 8 issues were quickly identified as being caused by slow origins, and a fix to decrease the origin timeout was prepared and approved. Full rollout of that fix was delayed by the resource contention that the slow origins were causing.

The May 12 degradations were also ultimately traced back to slow origins, although they manifested slightly differently. They were less severe, and the remediations from the earlier incidents helped our team resolve them more quickly.
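To illustrate why a shorter origin timeout relieves resource contention, here is a rough sketch based on Little's law (average in-flight requests = arrival rate times the time each request is held). The function names, request rates, timeouts, and pool size are all hypothetical values chosen for illustration; they are not imgix's actual figures or code.

```python
def slots_held_by_slow_origin(requests_per_second: float, timeout_s: float) -> float:
    """Average number of fetch slots tied up by an origin that never answers.

    Each request to a hung origin occupies a slot until the timeout fires,
    so by Little's law the average occupancy is rate * timeout.
    """
    return requests_per_second * timeout_s


def pool_fraction(requests_per_second: float, timeout_s: float, pool_size: int) -> float:
    """Share of a fetch pool consumed by the slow origin, capped at 100%."""
    held = slots_held_by_slow_origin(requests_per_second, timeout_s)
    return min(1.0, held / pool_size)


# Hypothetical numbers: 20 req/s to a hung origin, a pool of 1000 fetch slots.
before = pool_fraction(20, timeout_s=30, pool_size=1000)  # long timeout
after = pool_fraction(20, timeout_s=5, pool_size=1000)    # shorter timeout
print(f"pool used with 30s timeout: {before:.0%}, with 5s timeout: {after:.0%}")
```

Under these assumed numbers, cutting the timeout from 30 seconds to 5 seconds reduces the slots held by the slow origin by the same factor of six, which leaves more capacity for requests to healthy origins.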

What will imgix do to prevent this in the future?

Additional tooling has already been built and deployed to give our engineers the ability to toggle more configurations quickly without needing a full deployment.

Furthermore, we will be changing our stance toward slow or misbehaving origins. We have always tried to do our best to serve every image and every request, even for slow, error-prone, non-RFC-compliant, or otherwise misbehaving origins. During the past few incidents, we have become increasingly aggressive in rate limiting and isolating these origins, and we will continue to automate systems that block or isolate these origins further in order to ensure reliable performance for the rest of our customers.
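As one way to picture automated isolation of slow origins, the following is a minimal sketch in the style of a circuit breaker. It is not imgix's implementation: the class name `OriginIsolator`, its thresholds, and the example hostnames are all hypothetical, and a production system would likely also rate limit rather than only block.

```python
import time
from collections import defaultdict, deque


class OriginIsolator:
    """Tracks recent fetch outcomes per origin and isolates slow ones.

    Hypothetical policy: if more than `max_slow_ratio` of the last `window`
    fetches were slow or failed, skip that origin for `cooldown_s` seconds.
    """

    def __init__(self, slow_threshold_s=2.0, window=20, max_slow_ratio=0.5,
                 cooldown_s=60.0, clock=time.monotonic):
        self.slow_threshold_s = slow_threshold_s
        self.window = window
        self.max_slow_ratio = max_slow_ratio
        self.cooldown_s = cooldown_s
        self.clock = clock
        # Per-origin ring buffer of booleans: True means slow or failed.
        self._samples = defaultdict(lambda: deque(maxlen=window))
        self._isolated_until = {}

    def record(self, origin, latency_s, ok=True):
        """Record one fetch result and isolate the origin if it looks unhealthy."""
        samples = self._samples[origin]
        samples.append((not ok) or latency_s >= self.slow_threshold_s)
        if len(samples) == self.window and sum(samples) / self.window > self.max_slow_ratio:
            self._isolated_until[origin] = self.clock() + self.cooldown_s
            samples.clear()  # start fresh once the cooldown ends

    def is_isolated(self, origin):
        """Return True while the origin is inside its cooldown period."""
        until = self._isolated_until.get(origin)
        if until is None:
            return False
        if self.clock() >= until:
            del self._isolated_until[origin]
            return False
        return True
```

A caller would check `is_isolated(origin)` before fetching and fail fast (or route to a separate, smaller pool) when it returns True, so that one misbehaving origin cannot exhaust capacity shared with other customers.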

Posted May 13, 2020 - 18:09 PDT

Resolved

Service has been completely restored.

Posted May 12, 2020 - 08:23 PDT

Monitoring

Our engineering team has applied a fix, restoring services to normal. We are currently monitoring the situation.

Posted May 12, 2020 - 08:11 PDT

Update

We have rolled out some changes to restore service and are seeing improvements. Our engineers are continuing to apply fixes.

Posted May 12, 2020 - 08:03 PDT

Identified

We have identified an issue causing elevated rendering errors. Our engineering team is applying a fix.

Posted May 12, 2020 - 07:28 PDT

This incident affected: Rendering Infrastructure.