On August 26, 2021, at 15:00 UTC, the imgix service experienced a disruption caused by long-running processes within our origin cache. After our engineers identified the issue, they applied remediation changes at 15:09 UTC. Once those changes were rolled out, the service sharply recovered at 15:20 UTC.
Starting at 15:00 UTC, requests for non-cached derivative images returned 503 responses. These errors accounted for about 5% of all requests to the rendering service and continued until 15:20 UTC, when the service sharply recovered.
While investigating the cause of the incident, our engineers identified a scenario in which origin connections were misbehaving because of certain customer configuration settings. On its own, this would not normally cause a problem. However, additional origin activity severely degraded the performance of the origin cache, which eventually affected rendering.
We will modify our infrastructure’s configuration to eliminate scenarios in which customer configurations can cause origin connection issues. We will also work with existing customers to optimize their configurations so that they are not affected by these infrastructure changes.