On June 10, 2021, between 1:50 UTC and 2:15 UTC, the rendering API returned a significant number of errors for uncached derivative images. The issue was identified and fixed, though a small percentage (<0.01%) of renders continued to return errors until a second fix was pushed out at 2:54 UTC.
The incident was marked as fully resolved at 4:10 UTC.
On June 10 between 1:50 UTC and 2:15 UTC, a significant number of requests for uncached derivative images returned 503 errors. At its peak, 6% of all requests to imgix returned an error.
Rollout of a fix began at 2:10 UTC and was complete by 2:15 UTC. After the fix, errors returned to near-normal rates (<0.01%). A later patch fully restored the service at 2:54 UTC.
Our engineers were alerted to a rising number of error responses from an internal service. Investigating the issue, they identified that a misconfiguration during routine network maintenance had caused a DNS-related failure within our infrastructure. During the investigation, we also found that our failover systems had not mitigated the issue as expected.
Our engineers immediately corrected the misconfiguration and restored DNS, which brought most of the service back. They then detected rendering instability affecting a very small percentage of images. Our engineering team continued to investigate and pushed out a fix by 2:54 UTC.
We will revisit our current workflows and standard operating procedures and perform an architectural review of system dependencies. In addition, imgix plans to improve coordination around scheduled maintenance to avoid service disruptions related to network changes.