Elevated rendering times

Incident Report for imgix

Postmortem

What happened?

On September 13, 2021, between 14:53 UTC and 15:44 UTC, the imgix service experienced increased rendering latency, primarily for non-cached derivative images. Error rates remained low, though a very small percentage of images returned a timeout error during the peak of the incident.

How were customers impacted?

Between 14:53 UTC and 15:27 UTC, some customers experienced dramatically increased latency for requests to both non-cached and cached assets served by imgix. At the peak of the incident, cached requests took an average of 1 second to complete, while non-cached requests took an average of 40 seconds.

The majority of requests to the service returned a 200 response, with a small percentage of images (<2%) returning a timeout error during the peak of the incident.

By 15:03 UTC, response times began to gradually recover, though the average response time was still higher than normal, especially for non-cached derivatives.

By 15:27 UTC, the service had recovered to the point where timeouts were no longer occurring. Although our monitoring still reported higher-than-normal response times, the service was considered mostly recovered by this time.

By 15:44 UTC, the service had completely recovered.

What went wrong during the incident?

At 14:53 UTC, we started receiving reports from customers of increased latency from the rendering service. At the time of these reports, and throughout the incident, our monitoring did not detect any behavior indicating service degradation. Because of this, alarms had to be raised manually, which slowed our initial response and investigation.

Our engineers identified that, while our service’s reported errors were very low, our rendering latency was rapidly increasing. After verifying the issue, our team began tuning our rendering infrastructure in order to improve rendering performance.

After additional investigation, our team correlated the increased latency with the enablement of a new feature that generated a higher volume of render requests than expected. This feature interacted with our caching and rendering infrastructure by causing many requests to be immediately re-cached and re-rendered. The sudden increase in caching activity and volume triggered a bottleneck, which eventually resolved itself once the cache had been mostly rebuilt.

What will imgix do to prevent this in the future?

We will be re-visiting our procedures for rolling out new features, which will include:

Implementing traffic configurations for controlling the flow of feature roll-outs
Improving our internal documentation and processes so that our teams are synchronized across feature roll-outs
Analyzing the caching and rendering impact of each newly released feature more thoroughly
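
As an illustration of what controlling the flow of a feature roll-out can look like, here is a minimal sketch of percentage-based gating. It is not a description of imgix's actual system: the function names, the `source_id` identifier, and the choice of hashing scheme are all assumptions made for the example. Each source is mapped to a stable bucket from 0 to 99, and the feature is on only for buckets below the current roll-out percentage, so raising the percentage in small steps limits how much of the cache is rebuilt at once.

```python
import hashlib


def rollout_bucket(source_id: str, feature: str) -> int:
    """Map a (feature, source) pair to a stable bucket in the range 0..99.

    Both arguments are hypothetical identifiers used only for this sketch.
    """
    digest = hashlib.sha256(f"{feature}:{source_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(source_id: str, feature: str, percent: int) -> bool:
    """Return True if the feature is on for this source at the given percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(source_id, feature) < percent
```

Because the bucket depends only on the feature and source names, a source that is enabled at 10% stays enabled at 25%, and setting the percentage back to 0 turns the feature off for everyone.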

This incident also exposed an error with our rendering performance monitoring, which we have now fixed.
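
This report does not describe what the monitoring error was or how it was fixed. Purely as an illustration of why latency needs its own alert when error rates stay low, the sketch below tracks a rolling window of render times and flags the window when its 95th percentile exceeds a threshold. The class name, window size, and threshold are hypothetical values chosen for the example.

```python
import math
from collections import deque


class LatencyMonitor:
    """Rolling-window latency check (illustrative; all defaults are assumptions)."""

    def __init__(self, window: int = 100, p95_threshold_s: float = 5.0):
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)
        self.p95_threshold_s = p95_threshold_s

    def record(self, latency_s: float) -> None:
        """Add one request's render time, in seconds."""
        self.samples.append(latency_s)

    def p95(self) -> float:
        """95th percentile of the current window, using the nearest-rank method."""
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(0.95 * len(ordered)))
        return ordered[rank - 1]

    def should_alert(self) -> bool:
        """True when the window's p95 latency is above the threshold."""
        return bool(self.samples) and self.p95() > self.p95_threshold_s
```

With these defaults, a window in which 10 of the last 100 requests took 40 seconds would trigger an alert even if every one of those requests eventually returned a 200 response.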

Posted Sep 24, 2021 - 13:35 PDT

Resolved

This incident has been resolved.

Posted Sep 13, 2021 - 09:49 PDT

Monitoring

Rendering performance has completely recovered. We are continuing to monitor the situation.

Posted Sep 13, 2021 - 09:34 PDT

Identified

Rendering performance has almost returned to normal, though we are still observing slightly slower-than-normal rendering. We are continuing to investigate the issue and apply mitigations.

Posted Sep 13, 2021 - 08:59 PDT

Update

Investigating further, we're observing degraded rendering performance. Non-cached image derivatives may take longer than normal to render or may time out.

We are continuing to investigate.

Posted Sep 13, 2021 - 08:45 PDT

Investigating

We are currently investigating elevated render error rates for uncached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.

Posted Sep 13, 2021 - 08:29 PDT

This incident affected: Rendering Infrastructure.