On May 18, 2021, at 12:54 UTC, the imgix service experienced a disruption caused by long-running processes within our origin cache. Once our engineers identified the issue, they began implementing remediation at 13:07 UTC. Error rates began subsiding by 13:27 UTC, and the service was fully restored by 14:20 UTC.
Hours later, on May 19, 2021, at 2:24 UTC, imgix experienced a second issue involving slow renders and timeouts. Ongoing work from the earlier incident interrupted the service’s typical automatic recovery and required manual intervention. Recovery began at 3:15 UTC, and the service was fully restored by 3:55 UTC.
During the first incident, customers may have noticed some uncached derivative images returning errors.
During the subsequent incident, some uncached derivatives took longer than normal to render, with some requests timing out.
In both incidents, cached derivative images were not impacted and continued to be served normally.
Our engineers were alerted to an elevated rate of error responses from our service. While investigating, they identified a bottleneck in our origin cache. They isolated the issue and implemented limits to prevent the bottleneck from stalling the service again after recovery.
After this incident subsided, we detected slowness in image rendering that eventually culminated in timeouts for some requests. This later incident required manual intervention to restart some components. A combination of rate limiting and component restarts aided service recovery.
We will continue to fine-tune our tooling to detect and isolate problems before they can trigger larger failures. We will also implement changes targeting the unexpected origin behavior observed during the incident.