On October 18th, 2019 at 13:30 UTC, the imgix service experienced elevated error rates on requests to render some new derivative images. While the impact had decreased by 14:15 UTC, it did not completely dissipate until 15:46 UTC. During this time we also saw rendering times elevated to varying degrees. Previously cached derivatives were not affected.
During the period of impacted service, affected requests either took longer than usual (and were eventually displayed by the browser) or resulted in an error.
While we have made good progress on improving some of our internal metric visualizations, this work takes time. During the response to the October 18th incident, we found that problems were easier to diagnose in the components of our rendering stack whose metrics have already been overhauled. This allowed our team to identify and fix the issue more quickly, which shortened the incident. At the same time, the fact that not all components have been overhauled remains a hindrance during incident response.
We are continuing our work on how we leverage and visualize performance metrics throughout the rendering system. This will not only enable a more reliable service but also allow us to ultimately expose more detailed information to our customers. As part of the remediation on October 18th, we enabled a new feature set in one of our tools. We believe it will mitigate this particular set of symptoms going forward.
We acknowledge that error rates have been higher than normal during the past few months, and preventing incidents remains a top priority. We have completed some work to address the underlying causes of widespread origin issues, and more is under way, including continued work on fault isolation.