On May 29th 2020 at 19:33 UTC the rendering stack ceased rendering new derivative images. While there were brief periods where we saw less elevated errors (19:46-19:51 UTC, 20:45-20:51 UTC, 20:56-21:05 UTC) the service was not fully restored until 21:14 UTC. The elevated errors were due to a network misconfiguration within a service provider’s environment.
Previously rendered and unexpired images were still being served during this time. For images which were not cached, we saw sustained error rates up to 100%. The error rates did vary at times due to our investigation and remediation activities.
Our initial internal analysis quickly isolated the problem to the service provider’s environment. While we were able to initially isolate the problem to a network misconfiguration, there were delays working with our service provider to properly identify the issue. This, combined with a newly discovered gap in our change management process, resulted in a delay in properly restoring service. We also discovered aspects of our monitoring had not been updated given recent changes to our relationship with external service providers.
While we will be continuing our ongoing work to increase global resiliency of our service, this is a long-term project. We will be immediately making changes to our monitoring implementation, and will be working with our service provider for transparency into our respective change management processes.