Elevated Rendering Errors

Incident Report for imgix

Postmortem

What happened?

On October 18th, 2019, at 13:30 UTC, the imgix service experienced elevated error rates on requests to render some new derivative images. While the impact had decreased by 14:15 UTC, it did not completely dissipate until 15:46 UTC. During this time we also saw varying levels of elevated rendering times. Previously cached derivatives were not affected.

How were customers impacted?

During the period of impacted service, affected requests either took longer to complete (with the image eventually being displayed by the browser) or resulted in an error.

What went wrong during the incident?

While we have made good progress on improving some of our internal metric visualizations, that work is not yet complete. During the response to the incident on October 18th, we found that the components of our rendering stack whose metrics have been overhauled were easier to diagnose. This allowed our team to diagnose and fix the issue more quickly, which shortened the incident. At the same time, the fact that not all components have been overhauled continues to be a hindrance during incident response.

What will imgix do to prevent this in the future?

We are continuing our work around how we leverage and visualize performance metrics throughout the rendering system. This will not only enable a more reliable service but also allow us to ultimately expose more detailed information to our customers. As part of the remediation on October 18th, a new feature set of one of our tools was enabled. We believe it will mitigate this particular unpleasant set of symptoms going forward.

We acknowledge that error rates have been higher than normal over the past few months, and preventing incidents remains a top priority. Some work to address the underlying causes of widespread origin issues has been completed, and more will continue, including ongoing work on fault isolation.

Posted Oct 24, 2019 - 14:37 PDT

Resolved

Both rendering times and rendering errors have returned to normal.

Posted Oct 18, 2019 - 09:21 PDT

Monitoring

We have rolled out additional configuration updates and rendering times are returning to normal. We will continue to monitor and apply fixes as necessary.

Posted Oct 18, 2019 - 08:57 PDT

Update

While we have resolved the errors, rendering times are still high as our systems work through a backlog of images. We will continue to monitor and make updates as necessary to clear the queue as soon as possible.

Posted Oct 18, 2019 - 07:51 PDT

Identified

We have implemented some configuration updates and error rates are beginning to fall. We are continuing to monitor the situation and make changes as necessary.

Posted Oct 18, 2019 - 07:19 PDT

Investigating

We are investigating elevated rendering errors on uncached derivative images. Previously rendered and cached images are not impacted. We will update once we have more information.

Posted Oct 18, 2019 - 06:40 PDT

This incident affected: Rendering Infrastructure.