Elevated rendering errors

Incident Report for imgix

Postmortem

What happened?

On April 10, 2020 14:24 UTC, the imgix service started seeing elevated latency when retrieving images via our origin cache. This caused an increase in error rates for uncached derivative images.

The imgix engineering team attempted several mitigations and ultimately restored service to normal levels by 19:47 UTC.

How were customers impacted?

During the incident, customers would have noticed some uncached derivative images either taking longer than normal to return successfully or returning an error.

Cached derivative images were not impacted and continued to be served as normal.

What went wrong during the incident?

On April 9th, the day before the incident, we completed an important milestone in a long-running infrastructure upgrade project designed to increase capacity across the imgix rendering service. Because the incident began so soon after the upgrade was completed, the team's first priority was to roll back to the previous version. Unfortunately, the previous version exhibited the same behavior, so the time spent reconfiguring the old version and shifting traffic to it was wasted.

The root cause of the incident turned out to be that some origin cache hosts got into a bad state and slowed down the processing pipeline, resulting in a long queue of items to be processed. The long processing times caused upstream hosts to treat these slow-to-respond hosts as non-responsive and to shift traffic to the healthier hosts. Unfortunately, the healthier hosts were unable to clear the processing queue quickly enough to avoid being marked as non-responsive themselves.

This caused a feedback loop: healthy hosts were marked as non-responsive and removed from the pool, the remaining hosts were then swamped, marked as non-responsive, and removed in turn, and so on.

The ultimate fix was to increase the time a host had to respond before it would be considered non-responsive. Once this change was tested and deployed, we saw a sharp decrease in errors and, soon after, full recovery of the imgix service.
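The feedback loop, and the effect of a longer response timeout, can be illustrated with a toy model. This is a simplified sketch and not a description of the actual imgix infrastructure: the host capacities, the load figure, the even traffic split, the M/M/1 latency approximation, and the timeout values are all assumptions chosen for illustration, and the model does not re-admit hosts once they are removed.

```python
import math


def mean_latency_s(capacity_rps, arrival_rps):
    """Mean time in system for an M/M/1 queue; infinite once the host is saturated."""
    if arrival_rps >= capacity_rps:
        return math.inf
    return 1.0 / (capacity_rps - arrival_rps)


def simulate_pool(capacities_rps, load_rps, timeout_s):
    """Return the pool size after each health-check round that removed a host."""
    pool = list(capacities_rps)
    sizes = []
    while pool:
        # Assumption: traffic is split evenly across the hosts still in the pool.
        share = load_rps / len(pool)
        survivors = [c for c in pool if mean_latency_s(c, share) <= timeout_s]
        if len(survivors) == len(pool):
            break  # no host exceeded the timeout, so the pool is stable
        pool = survivors
        sizes.append(len(pool))
    return sizes


# Hypothetical pool: three hosts in a bad state (30 req/s) and seven healthy ones.
CAPACITIES = [30, 30, 30, 88, 90, 92, 94, 96, 98, 100]
LOAD = 525.0  # hypothetical total requests per second

short_timeout = simulate_pool(CAPACITIES, LOAD, timeout_s=0.07)
long_timeout = simulate_pool(CAPACITIES, LOAD, timeout_s=0.5)
print("short timeout:", short_timeout)  # [7, 6, 0]
print("long timeout:", long_timeout)    # [7]
```

With the short timeout, removing the degraded hosts pushes the slowest healthy host over the limit, and each removal increases the load on the rest until the pool is empty. With the longer timeout, the same seven healthy hosts absorb the shifted traffic and the pool stabilizes.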

What will imgix do to prevent this in the future?

Since previous incidents, we have increased capacity and improved tooling, which has made the service more reliable and able to support more throughput overall. However, this most recent incident uncovered new gaps in our monitoring and internal tooling.

We will be putting in place new monitoring that clearly shows which hosts are in a healthy or unresponsive state and how often they change state, so that we can detect and remediate issues of this nature earlier.
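As a rough illustration of what this kind of monitoring could compute, the sketch below tracks each host's latest state and counts its state changes within a sliding time window. It is not imgix's tooling: the `HostStateTracker` class, the host names, the state labels, the window length, and the flapping threshold are all hypothetical.

```python
from collections import defaultdict, deque


class HostStateTracker:
    """Track each host's current health state and its recent state changes."""

    def __init__(self, window_s=300.0):
        self.window_s = window_s
        self.state = {}                    # host -> last reported state
        self.changes = defaultdict(deque)  # host -> timestamps of state changes

    def record(self, host, state, ts):
        """Record one health-check result; only transitions are counted."""
        previous = self.state.get(host)
        if previous is not None and previous != state:
            self.changes[host].append(ts)
        self.state[host] = state

    def flap_count(self, host, now):
        """Number of state changes for a host within the last window_s seconds."""
        timestamps = self.changes[host]
        while timestamps and timestamps[0] < now - self.window_s:
            timestamps.popleft()  # drop changes that fell out of the window
        return len(timestamps)

    def flapping_hosts(self, now, threshold=3):
        """Hosts whose state changed at least `threshold` times in the window."""
        return sorted(h for h in self.state if self.flap_count(h, now) >= threshold)


# Hypothetical health-check events: (host, state, timestamp in seconds).
events = [
    ("cache-1", "healthy", 0), ("cache-2", "healthy", 0),
    ("cache-1", "unresponsive", 10), ("cache-1", "healthy", 20),
    ("cache-1", "unresponsive", 30), ("cache-2", "healthy", 30),
]
tracker = HostStateTracker(window_s=60.0)
for host, state, ts in events:
    tracker.record(host, state, ts)

print(tracker.state)                   # latest state per host
print(tracker.flapping_hosts(now=40))  # ['cache-1']
```

Counting transitions separately from the current state makes a host that keeps leaving and rejoining the pool stand out, even if it happens to be healthy at the moment it is observed.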

We will also be building additional tooling that will enable us to do more targeted content filtering and detection when origin hosts exhibit new and unexpected behavior.

Posted Apr 17, 2020 - 11:49 PDT

Resolved

This incident has been resolved.

Posted Apr 10, 2020 - 14:51 PDT

Update

Image rendering has returned to normal but we are continuing to monitor all health metrics out of an abundance of caution.

Posted Apr 10, 2020 - 14:16 PDT

Monitoring

We have implemented a fix. Error rates are returning to normal levels. We will continue monitoring the situation.

Posted Apr 10, 2020 - 12:56 PDT

Update

We are continuing to evaluate possible fixes for this issue.

Posted Apr 10, 2020 - 11:57 PDT

Update

We are continuing to investigate possible fixes for this issue.

Posted Apr 10, 2020 - 10:47 PDT

Identified

The issue has been identified and a fix is being implemented.

Posted Apr 10, 2020 - 09:29 PDT

Update

Our engineering team is putting mitigations in place and is continuing to investigate the current incident.

Posted Apr 10, 2020 - 08:52 PDT

Investigating

We are investigating elevated rendering errors on uncached derivative images. We will update once we have more information.

Posted Apr 10, 2020 - 07:46 PDT

This incident affected: Rendering Infrastructure.