Elevated rendering errors

Incident Report for imgix

Postmortem

What happened?

On September 24th, 2019 at 15:04 UTC, the imgix service saw elevated latency when retrieving images via our origin cache. This caused an increase in error rates for newly rendered derivative images. The increase in errors correlated with a moderate (but by no means outside the normal range) increase in overall traffic.

While the initial outage appeared to be mitigated by 17:29 UTC, the same symptoms recurred the following morning. On September 25th, 2019, between 15:10 UTC and 21:36 UTC, the imgix service saw severely elevated rendering times with sporadic request errors.

How were customers impacted?

On both September 24th and 25th we observed the same symptom: elevated time spent requesting original images before they were modified for the first time via the imgix rendering stack. The underlying causes ultimately proved to be unrelated, which resulted in different patterns of error rates between the two incidents.

On September 24th we saw elevated rendering errors throughout the duration of the incident. On September 25th, while we did see elevated rendering errors, their nature changed (to timeouts) after approximately the first half hour.

What went wrong during the incident?

During both of these incidents we ran into the limits of our internal tooling for request tracing and metrics visualization. This slowed our analysis and our identification of a resolution.

Part of our remediation also helped us realize that our tooling around highly targeted content filtering is lacking. This made certain changes slower to push out than we would like.

A conclusive investigation after resolution of the outages was slowed by the combination of back-to-back incidents with strikingly similar symptoms. This led us to treat the two incidents as related; however, once we began analyzing collected metrics it became clear that the underlying causes were in fact different.

What will imgix do to prevent this in the future?

We have already tuned configurations to allow the service to function better during origin slowdowns, and have rolled out additional logging and dashboards that will help us identify rendering imbalances during an incident.

We are revisiting the majority of the metric visualizations we use internally, beginning with the components of our rendering stack that were affected most severely on September 24th and 25th. Our hope is to gain a better understanding of the ideal operating characteristics of the imgix rendering stack as we continue to extend and improve it.

We will also be embarking on longer-term projects to speed up the rollout of certain forms of content filtering, and to increase fault isolation when origin hosts behave in unexpected ways.

Posted Oct 04, 2019 - 17:20 PDT

Resolved

All rendering is back to normal. We will be conducting and publishing a post-mortem at a later date. Thank you for your patience.

Posted Sep 25, 2019 - 13:59 PDT

Update

Render times have returned to normal. We will continue to monitor the service.

Posted Sep 25, 2019 - 13:36 PDT

Update

Render times have dropped significantly but may still be very slightly elevated. We are continuing to make changes to restore service to normal speeds.

Posted Sep 25, 2019 - 13:04 PDT

Update

We are still actively working on resolving the slow renders and are doing everything in our power to restore normal service.

Posted Sep 25, 2019 - 12:30 PDT

Update

We are working on a configuration update to ease the rendering slowdown and will have more updates shortly.

Posted Sep 25, 2019 - 10:59 PDT

Update

Render times have not fallen as quickly as we expected. We are continuing to monitor and see what other corrective measures we can take. While renders are no longer producing errors, some are still taking longer than anticipated.

Posted Sep 25, 2019 - 09:45 PDT

Monitoring

Error rates have returned to normal and render times are also returning to normal. We will continue to monitor the service before resolving the incident.

Posted Sep 25, 2019 - 09:23 PDT

Identified

We have identified the issue. Error rates are decreasing but render times are still elevated. We are continuing to investigate and update configurations.

Posted Sep 25, 2019 - 09:13 PDT

Investigating

We are investigating elevated rendering errors on uncached derivative images. Previously rendered and cached images are not impacted. We will update once we have more information.

Posted Sep 25, 2019 - 08:31 PDT

This incident affected: Rendering Infrastructure.