On August 12, 2021, between 14:10 UTC and 14:37 UTC, our rendering API returned a significant number of errors for non-cached derivative images. We identified the issue and implemented a fix by 14:37 UTC.
We continued to investigate non-user-affecting behavior until 15:58 UTC, when the incident was marked as fully resolved.
On August 12, between 14:10 UTC and 14:37 UTC, a significant number of requests for non-cached derivative images returned 503 errors. At the peak of the incident (14:21 UTC), 11.59% of all requests returned an error.
We began rolling out a fix at 14:37 UTC, when the error rate was 6.11%, and the rollout was complete by 15:58 UTC. Error rates returned to normal after the fix. Internal investigation of background processes, which did not affect users, continued until 17:44 UTC. The incident was fully resolved at 17:55 UTC.
Our engineers were alerted to an elevated rate of error responses from an internal service in our infrastructure. While investigating, they identified a spike in traffic from one internal service to another, which dramatically increased memory and thread usage. This eventually prevented the rendering service from serving uncached image renders.
After we identified the internal service affecting the rendering stack, we temporarily paused it to reduce load. This reduced the number of requests and helped the service recover faster.
We will evaluate our current workflow for general mitigation and preventative measures. We will adjust our infrastructure to reject new connections and to increase our internal service capacities. We will also evaluate tooling limits to determine what fallback measures can be taken when internal services reach maximum capacity.
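To illustrate what rejecting new connections at capacity can look like, here is a minimal load-shedding sketch in Python. It is not our production implementation. The names `LoadShedder`, `handle_request`, and `max_in_flight` are hypothetical, and the assumption is that failing fast with a 503 is preferable to letting memory and thread usage grow without bound.

```python
import threading
from typing import Callable, Tuple


class LoadShedder:
    """Hypothetical limiter: admit at most `max_in_flight` concurrent requests."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot if one is free; never block the caller."""
        with self._lock:
            if self._in_flight >= self._max_in_flight:
                return False  # at capacity: the caller should reject the request
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot once a request has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle_request(
    shedder: LoadShedder, render: Callable[[], bytes]
) -> Tuple[int, bytes]:
    """Return (status, body); shed load with a 503 instead of queueing."""
    if not shedder.try_acquire():
        return 503, b"over capacity"
    try:
        return 200, render()
    finally:
        # Always free the slot, even if rendering raises.
        shedder.release()
```

In this sketch a rejected request costs almost nothing, so an upstream traffic spike results in fast 503 responses rather than exhausted threads. A real deployment would also need to pick the limit from measured capacity and decide how callers should retry.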