Sporadic rendering errors
Incident Report for imgix
Postmortem

What happened?

On May 29th, 2020 at 19:33 UTC, the rendering stack ceased rendering new derivative images. While there were brief periods during which error rates were less elevated (19:46-19:51 UTC, 20:45-20:51 UTC, 20:56-21:05 UTC), the service was not fully restored until 21:14 UTC. The elevated errors were due to a network misconfiguration within a service provider’s environment.

How were customers impacted?

Previously rendered and unexpired images were still being served during this time. For images that were not cached, we saw sustained error rates of up to 100%. Error rates varied at times as a result of our investigation and remediation activities.

What went wrong during the incident?

Our initial internal analysis quickly isolated the problem to a network misconfiguration within the service provider’s environment. However, there were delays in working with our service provider to properly identify the issue. This, combined with a newly discovered gap in our change management process, delayed the full restoration of service. We also discovered that aspects of our monitoring had not been updated to reflect recent changes to our relationship with external service providers.

What will imgix do to prevent this in the future?

We will continue our ongoing work to increase the global resiliency of our service, but this is a long-term project. In the immediate term, we will make changes to our monitoring implementation and work with our service provider to gain transparency into our respective change management processes.

Posted Jun 09, 2020 - 11:29 PDT

Resolved
This incident has been completely resolved.
Posted May 29, 2020 - 15:06 PDT
Update
Error rates have returned to normal. We are continuing to work with our service provider to completely resolve the issue.
Posted May 29, 2020 - 14:51 PDT
Update
Some changes have been applied by one of our service providers. We are seeing a decrease in errors and will continue to work with our service provider to resolve this incident.
Posted May 29, 2020 - 14:11 PDT
Identified
We have identified the issue and are working with one of our service providers on a fix.
Posted May 29, 2020 - 13:39 PDT
Investigating
We are currently investigating sporadic render error rates for uncached derivative images. We will update once we obtain more information.
Posted May 29, 2020 - 12:58 PDT
This incident affected: Rendering Infrastructure.