Increased error rates
Incident Report for imgix
Postmortem
What happened?

At 14:53 UTC, our team was alerted to an issue manifesting as degraded rendering performance and elevated error rates. The root cause was a hiccup in our distributed configuration service, which caused parts of our infrastructure to reload their configurations in rapid succession. These rapid reloads put some of our core services into a state where orphaned processes were attempting to handle production traffic, causing those requests to error out for our customers. We tracked down and removed these orphaned processes, restoring success rates to normal operating levels by 16:09 UTC. The underlying issue was fully remedied by 17:10 UTC.
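To make the failure mode concrete, below is a deliberately simplified, hypothetical sketch (in Python, and not imgix's actual architecture or code) of how a reload that replaces a supervisor process without reaping its forked workers can leave an orphaned worker still accepting traffic on a shared socket.

    # Hypothetical illustration only: a supervisor forks a worker that inherits
    # a listening socket. If the supervisor exits during a reload without
    # terminating the worker, the worker is reparented to PID 1 and keeps
    # serving requests with whatever state it had at fork time.
    import os
    import signal
    import socket

    def start_worker(listener: socket.socket) -> int:
        """Fork a worker that answers requests on the inherited listener."""
        pid = os.fork()
        if pid == 0:  # child: serve on the shared socket until terminated
            while True:
                conn, _ = listener.accept()
                conn.sendall(b"rendered\n")
                conn.close()
        return pid  # parent: keep the PID so the worker can be reaped on reload

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    worker_pid = start_worker(listener)

    # A well-behaved reload terminates the old worker before handing traffic to
    # its replacement. Skipping this step is what produces an orphan.
    os.kill(worker_pid, signal.SIGTERM)
    os.waitpid(worker_pid, 0)

In our case, orphans of this kind continued to receive production traffic, which is why affected requests returned errors until the processes were removed.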

How were customers impacted?

At the peak of the service interruption, which lasted roughly 8 minutes, up to 15% of image requests were slow or returned an error. On average, less than 8% of image requests were impacted over the duration of the service interruption. Images that were fresh in our cache or had previously been requested through the imgix CDN were not affected. Depending on the nature of specific use cases, some customers were impacted more than others: customers with large amounts of new content, or content with short cache TTLs, would have felt the most impact.

What went wrong during the incident?

At 14:53 UTC, on-call staff were alerted to increased error rates. Two on-call team members responded immediately and raised the other critical team members within minutes of the initial alert. We initially determined that our configuration service had experienced a momentary, unexpected change in state, triggering various services to reload their configurations. Our team set about diagnosing which services were malfunctioning and ensuring they were reloaded properly.

By 15:15 UTC, our initial actions had successfully treated the symptoms of the underlying configuration problem and error rates began to decrease steadily, but we had still not located the root cause of many of the errors. We then detected a pattern in how some of our services were behaving that ultimately led us to determine that orphaned processes were broadly impacting render success rates. At 15:39 UTC, we began rolling out fixes that detected and corrected this specific issue, and error rates continued to decline rapidly as the fixes rolled out. At 16:09 UTC, our service was determined to be operating within normal success rate thresholds.

By 17:10 UTC, all orphaned processes had been eliminated and any errors related to this issue were no longer occurring.

What will imgix do to prevent this in the future?

Through our internal review process, imgix has decided to take the following courses of action to mitigate the impact of future incidents of this nature, or to eliminate them entirely.

  1. We will introduce alerts that specifically target the detection of orphaned processes. This will enable us to detect an event like this sooner and take immediate action, whatever the magnitude of impact (a rough sketch of such a check appears after this list).
  2. We will continue to tune and improve automated self-management of our processes so that orphaned processes are detected and handled immediately.
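As a purely illustrative example, the following is a minimal sketch of the kind of orphaned-process check described in item 1, assuming a psutil-based monitoring agent and a hypothetical worker process name; it is not imgix's actual tooling.

    # Minimal sketch of an orphaned-process check. The worker name and the
    # "reparented to PID 1" heuristic are assumptions for illustration only.
    import psutil

    WORKER_NAME = "render-worker"  # hypothetical name of a service worker process

    def find_orphaned_workers():
        """Return PIDs of worker processes that have been reparented to PID 1,
        which on a typical Linux host means their supervisor is gone."""
        orphans = []
        for proc in psutil.process_iter(["pid", "ppid", "name"]):
            info = proc.info
            if info["name"] == WORKER_NAME and info["ppid"] == 1:
                orphans.append(info["pid"])
        return orphans

    if __name__ == "__main__":
        orphaned = find_orphaned_workers()
        if orphaned:
            # A real deployment would emit an alert here rather than print.
            print(f"ALERT: orphaned worker PIDs: {orphaned}")

In practice, a check like this would feed an alerting pipeline that pages on-call staff, or a supervisor that terminates the orphan automatically, in line with item 2.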
Posted Mar 14, 2018 - 18:36 PDT

Resolved
This incident has been resolved.
Posted Mar 14, 2018 - 11:13 PDT
Monitoring
Error rates have returned to normal and a fix for the issue has been implemented. We will continue to monitor the situation.
Posted Mar 14, 2018 - 10:25 PDT
Identified
The issue has been identified and error rates continue to recover.
Posted Mar 14, 2018 - 10:10 PDT
Update
Error rates are currently recovering. We are continuing to investigate the issue.
Posted Mar 14, 2018 - 09:14 PDT
Investigating
We're currently investigating reports of elevated error rates. We'll update when we obtain more information.
Posted Mar 14, 2018 - 08:04 PDT