At 14:53 UTC, our team was alerted to increased rendering errors and latency. The root cause was a brief disruption in our distributed configuration service, which caused parts of our infrastructure to reload their configurations in rapid succession. These rapid reloads left some of our core services in a state where orphaned processes were attempting to handle production traffic, causing those requests to fail for our customers. We tracked down and removed these orphaned processes, restoring normal rates of successful requests by 16:09 UTC. The underlying issue was fully remediated by 17:10 UTC.
At the peak of the service interruption (a period of roughly 8 minutes), up to 15% of image requests were slow or returned an error; averaged over the full duration of the incident, fewer than 8% of image requests were impacted. Images that were fresh in our cache or had previously been requested through the imgix CDN were not affected. The degree of impact varied by use case: customers serving large amounts of new content, or content with short cache TTLs, would have felt the most impact.
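To illustrate why short cache TTLs amplified exposure, here is a minimal sketch (not imgix's implementation; the function and timing values are hypothetical): a request is served from cache only if the asset was cached within the TTL, and every cache miss during the incident meant a render that could fail.

```python
# Illustrative model: count how many requests for a single asset must go
# to the render layer, given a stream of request times and a cache TTL.
def misses(request_times, ttl):
    cached_at = None
    miss_count = 0
    for t in request_times:
        if cached_at is None or t - cached_at >= ttl:
            miss_count += 1   # cache entry missing or expired: must re-render
            cached_at = t     # asset is re-rendered and cached at time t
    return miss_count

times = list(range(0, 60, 10))   # one request every 10 s for a minute
short = misses(times, ttl=5)     # TTL shorter than the request interval: every request re-renders
long_ = misses(times, ttl=3600)  # TTL longer than the window: only the first request re-renders
```

With a 5-second TTL, all six requests hit the render layer; with a one-hour TTL, only the first does, which is why previously cached content rode out the incident unaffected.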
At 14:53 UTC, on-call staff were alerted to increased error rates. Two on-call members responded immediately and raised other critical team members within minutes of the initial alert. We initially determined that our configuration service had experienced a momentary, unexpected change in state, triggering various services to reload their configurations. Our team set to work diagnosing which services were malfunctioning and ensuring they reloaded properly.
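One common way to blunt this kind of reload storm is to debounce configuration-change events so that a burst of rapid state changes produces a single reload rather than many. The sketch below is a generic illustration of that pattern, not imgix's actual code; the class name and quiet-period value are assumptions.

```python
import time

class DebouncedReloader:
    """Coalesce a burst of config-change events into one reload."""

    def __init__(self, reload_fn, quiet_period=2.0):
        self.reload_fn = reload_fn
        self.quiet_period = quiet_period   # seconds of silence required before reloading
        self._last_event = None

    def on_change(self, now=None):
        # Record the event; the reload itself is deferred.
        self._last_event = time.monotonic() if now is None else now

    def tick(self, now=None):
        # Fire the reload only once the event stream has been quiet long enough.
        now = time.monotonic() if now is None else now
        if self._last_event is not None and now - self._last_event >= self.quiet_period:
            self._last_event = None
            self.reload_fn()

reloads = []
r = DebouncedReloader(lambda: reloads.append(1), quiet_period=2.0)
r.on_change(now=0.0)
r.on_change(now=0.5)
r.on_change(now=1.0)   # burst of three rapid state changes
r.tick(now=1.5)        # still inside the quiet period: no reload yet
r.tick(now=3.5)        # quiet period elapsed: exactly one reload fires
```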
By 15:15 UTC, our initial actions had treated the symptoms of the underlying configuration problem and error rates began to decline steadily, but we had not yet located the root cause of many of the errors. A pattern in the behavior of some of our services ultimately led us to determine that orphaned processes were broadly impacting render success rates. At 15:39 UTC, we began rolling out fixes that detected and corrected this specific issue, and error rates continued to decline rapidly as the fixes rolled out. By 16:09 UTC, our service was determined to be operating within normal success rate thresholds.
By 17:10 UTC, all orphaned processes had been eliminated and errors related to this issue had ceased entirely.
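A detection fix like the one described above can be sketched as follows. This is a hypothetical illustration, not imgix's tooling: on Linux, a process whose parent dies is reparented to PID 1, so a worker reporting PID 1 as its parent while its supervisor is still running is a strong orphan signal. The `ProcInfo` record and `render-worker` name are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class ProcInfo:
    pid: int
    ppid: int   # parent process ID; orphans are reparented to PID 1 on Linux
    name: str

def find_orphans(procs, worker_name="render-worker"):
    """Return worker processes whose original parent has died."""
    return [p for p in procs if p.name == worker_name and p.ppid == 1]

# Example snapshot: two healthy workers under a supervisor (PID 200),
# plus one orphan left behind by a reload.
snapshot = [
    ProcInfo(pid=201, ppid=200, name="render-worker"),
    ProcInfo(pid=202, ppid=200, name="render-worker"),
    ProcInfo(pid=315, ppid=1, name="render-worker"),   # orphaned
]
orphans = find_orphans(snapshot)
```

In practice such a snapshot would be gathered from the process table (for example by walking `/proc`), and flagged processes would be drained and terminated rather than left serving production traffic.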
Through our internal review process, imgix has decided to take the following courses of action to mitigate the impact of, or entirely prevent, future incidents of this nature.