On June 4th, 2020, at 09:50 UTC, one of our network service providers suffered a loss of power at their facility. This affected all inbound traffic from our edge and shield caches to our rendering service, which prevented new derivative images from being created. Due to the nature of the network outage, we did not automatically fail over to an alternate service provider. By 10:25 UTC, network connectivity had been restored and derivative images were again being processed successfully.
Unlike other service-impacting events we have seen in recent months, this one manifested as a complete inability for our shield cache to reach our rendering service. While the net effect was the same (a 503 for new derivative images), the specific error messages customers saw may have varied, and custom error fallback images were not displayed.
The most impactful failure was the lack of automatic failover to an alternate network service provider. This, combined with improperly configured external synthetic checks, delayed our overall response. We also deviated from our established incident remediation policy more than we would like, which delayed customer outreach and escalations.
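To illustrate the role the external synthetic checks should have played, here is a minimal sketch of such a probe. The URL, cache-busting parameter, and status handling are hypothetical and are not our actual tooling; the key idea is that the check must exercise the full edge-to-rendering path and must treat both a 503 and a network-level failure as unhealthy.

```python
import urllib.request
import urllib.error

# Hypothetical derivative-image URL. A useful synthetic check adds a
# cache-busting parameter so the request reaches the rendering service
# instead of being answered from an edge or shield cache.
CHECK_URL = "https://example.com/photo.jpg?width=120&fresh=1"

def is_healthy(status: int) -> bool:
    """A 200 means the full render path worked; a 503 (or no response,
    represented here as 0) means new derivatives cannot be created."""
    return status == 200

def probe(url: str, timeout: float = 10.0) -> int:
    """Issue one synthetic request and return the HTTP status code,
    or 0 on a network-level failure (no HTTP response at all)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0
```

An incident like this one would surface either as `probe` returning 503 or, for a hard connectivity loss, returning 0; a correctly configured check alerts on both, not just on explicit error statuses.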
Previously identified monitoring changes will be implemented, along with additional near-term work to increase the availability of our network infrastructure. Longer-term global resiliency work that could have helped mitigate this incident is already planned and will continue.