On September 24th 2019 at 15:04 UTC, the imgix service saw elevated latency when retrieving images via our origin cache. This caused an increase in error rates for newly rendered derivative images. The increase in errors correlated with a moderate (but by no means abnormal) increase in overall traffic.
While the initial outage appeared to be mitigated by 17:29 UTC, the same symptoms recurred the following morning. On September 25th 2019, between 15:10 UTC and 21:36 UTC, the imgix service saw severely elevated rendering times with sporadic request errors.
On both September 24th and 25th we observed the same symptom: elevated time spent requesting original images before they were modified for the first time by the imgix rendering stack. The underlying causes ultimately proved to be unrelated, which resulted in different patterns of error rates between the two incidents.
On September 24th we saw elevated rendering errors for the duration of the incident. On September 25th we also saw elevated rendering errors, but their nature changed (to timeouts) after approximately the first half hour.
During both of these incidents we ran into the limits of our internal tooling for request tracing and metrics visualization. This slowed both our analysis and our identification of a resolution.
Part of our remediation also helped us realize that our tooling around highly targeted content filtering is lacking. As a result, certain changes took longer to push out than we would like.
A conclusive investigation after resolution of the outages was slowed by the combination of back-to-back incidents with strikingly similar symptoms. This had led us to treat the two incidents as related; however, once we began analyzing the collected metrics it became clear that the underlying causes were in fact different.
We have already tuned configurations so that the service functions better during origin slowdowns, and have rolled out additional logging and dashboards that will help us identify rendering imbalances during an incident.
We are revisiting the majority of the metric visualizations we use internally, beginning with the components of our rendering stack that were affected most severely on September 24th and 25th. Our hope is to gain a better understanding of the ideal operating characteristics of the imgix rendering stack as we continue to extend and improve it.
We will also be embarking on longer-term projects to provide a faster means of enabling certain forms of content filtering, and to increase fault isolation when origin hosts behave in unexpected ways.