Elevated rendering errors

Incident Report for imgix

Postmortem

What happened?

On September 24th, 2019 at 15:04 UTC, the imgix service saw elevated latency when retrieving images via our origin cache. This caused an increase in error rates for newly rendered derivative images. The increase in errors correlated with a moderate increase in overall traffic that was still within normal levels.

While the initial outage appeared to be mitigated by 17:29 UTC, the same symptoms recurred the following morning. On September 25th, 2019, between 15:10 UTC and 21:36 UTC, the imgix service saw severely elevated rendering times with sporadic request errors.

How were customers impacted?

On both September 24th and 25th we observed the same symptom: elevated time spent requesting original images before they were modified for the first time via the imgix rendering stack. The underlying causes ultimately proved to be unrelated, which resulted in different patterns of error rates between the two incidents.

On September 24th we saw elevated rendering errors throughout the duration of the incident. On September 25th, while we did see elevated rendering errors, their nature changed (to timeouts) after approximately the first half hour.

What went wrong during the incident?

During both of these incidents we ran into the limits of our internal tooling for request tracing and metrics visualization. This slowed both our analysis and the identification of a resolution.

Part of our remediation also helped us realize that our tooling around highly targeted content filtering is lacking. As a result, certain changes took longer to push out than we would like.

A conclusive investigation after the outages were resolved was slowed by the combination of back-to-back incidents with strikingly similar symptoms. This led us to treat the two incidents as related; however, once we began analyzing collected metrics it became clear that the underlying causes were in fact different.

What will imgix do to prevent this in the future?

We have already tuned configurations so that the service performs better during origin slowdowns, and have rolled out additional logging and dashboards that will help us identify rendering imbalances during an incident.

We are revisiting the majority of our metric visualizations used internally, beginning with the components of our rendering stack which were affected most severely on September 24th and 25th. Our hope is to gain a better understanding of the ideal operating characteristics of the imgix rendering stack as we continue to extend and improve it.

We will also be embarking on longer-term projects to provide a faster way of enabling certain forms of content filtering, and to increase fault isolation when origin hosts behave in unexpected ways.

Posted Oct 04, 2019 - 17:19 PDT

Resolved

Rendering performance has returned to normal.

Posted Sep 24, 2019 - 11:57 PDT

Monitoring

We have identified the issue and have implemented a fix. Error rates are returning to normal levels. We will continue monitoring the situation.

Posted Sep 24, 2019 - 10:29 PDT

Update

We are applying some fixes and are continuing to track down the root cause of the issue.

Posted Sep 24, 2019 - 10:08 PDT

Investigating

We are investigating elevated rendering errors on uncached derivative images. Previously rendered and cached images are not impacted. We will update once we have more information.

Posted Sep 24, 2019 - 08:39 PDT

This incident affected: Rendering Infrastructure.