Elevated rendering errors
Incident Report for imgix
Postmortem

What happened?

On August 12, 2021, between 14:10 UTC and 14:37 UTC, our rendering API returned a significant number of errors for non-cached derivative images. The issue was identified and a fix was implemented by 14:37 UTC.

Investigation of non-user-affecting behavior continued until 17:44 UTC, and the incident was marked as fully resolved at 17:55 UTC.

How were customers impacted?

On August 12, between 14:10 UTC and 14:37 UTC, a significant number of requests for non-cached derivative images returned 503 errors. At the peak of the incident (14:21 UTC), 11.59% of all requests returned an error.

Rollout of a fix began at 14:37 UTC, when the error rate was 6.11%, and was complete by 15:58 UTC. Error rates returned to normal once the fix was in place. Internal investigation of background processes, which did not affect users, continued until 17:44 UTC. The incident was fully resolved at 17:55 UTC.

What went wrong during the incident?

Our engineers were alerted to an elevated rate of error responses from an internal service in our infrastructure. While investigating, they identified a spike in traffic from one internal service to another, which dramatically increased memory and thread usage. This eventually prevented the rendering service from serving uncached image renders.

After we identified the internal service affecting the rendering stack, we temporarily paused it to reduce load. This reduced the number of requests and helped the service recover faster.

What will imgix do to prevent this in the future?

We will evaluate our current workflow for general mitigation and preventative measures. Adjustments will be made to our infrastructure to reject new connections and to increase our internal service capacities. Tooling limits will be evaluated to see what fallback measures can be taken when some internal services reach maximum capacity.
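
As a rough illustration of the "reject new connections" idea above, the following Python sketch shows one common form of load shedding: a service admits only a fixed number of in-flight requests and answers the rest with a 503 immediately, instead of queueing them and consuming more memory and threads. This is a minimal sketch under stated assumptions, not a description of imgix's infrastructure. The class name `CapacityLimiter`, the `max_in_flight` parameter, and the `(status, body)` return shape are all hypothetical.

```python
import threading


class CapacityLimiter:
    """Hypothetical limiter: admit at most `max_in_flight` concurrent requests."""

    def __init__(self, max_in_flight: int) -> None:
        # Each admitted request holds one semaphore slot until it finishes.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work):
        """Run `work()` if a slot is free; otherwise reject with a 503."""
        # Non-blocking acquire: when the service is at capacity, shed the
        # request right away rather than letting it wait and hold a thread.
        if not self._slots.acquire(blocking=False):
            return 503, None
        try:
            return 200, work()
        finally:
            # Always free the slot, even if `work()` raises.
            self._slots.release()


# Usage: a limiter with room for two concurrent renders (illustrative value).
limiter = CapacityLimiter(max_in_flight=2)
status, body = limiter.handle(lambda: "rendered image bytes")
```

The design choice here is to fail fast: a caller that receives a 503 can retry or fall back, while the overloaded service keeps its memory and thread usage bounded.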

Posted Sep 15, 2021 - 12:56 PDT

Resolved
This incident has been resolved.
Posted Aug 12, 2021 - 10:55 PDT
Monitoring
We have finished investigating background behavior within our infrastructure and everything is working as expected. The rendering service continues to be 100% operational. We are currently monitoring the situation.
Posted Aug 12, 2021 - 10:44 PDT
Update
The rendering service continues to be 100% operational. We are continuing to investigate non-user-affecting behavior in our infrastructure.
Posted Aug 12, 2021 - 10:27 PDT
Update
The rendering service continues to be 100% operational. We are continuing to investigate non-user-affecting behavior in our infrastructure.
Posted Aug 12, 2021 - 09:40 PDT
Update
The rendering service has been restored to completely normal levels. We are continuing to investigate some background behavior, though the rendering service is completely operational.
Posted Aug 12, 2021 - 08:59 PDT
Update
We are seeing elevated rendering errors for a small percentage of image requests. We are continuing to investigate.
Posted Aug 12, 2021 - 08:28 PDT
Identified
The issue has been identified and a fix has been implemented. Errors are returning to normal levels, though we are continuing to investigate.
Posted Aug 12, 2021 - 07:38 PDT
Investigating
We are currently investigating elevated render error rates for uncached derivative images. We will update once we obtain more information.

Previously cached derivatives are not impacted.
Posted Aug 12, 2021 - 07:19 PDT
This incident affected: Rendering Infrastructure.