Proxy Outage
Incident Report for imgix
Postmortem

We sincerely apologize for any issues this may have caused our customers. If you have any questions, please do not hesitate to contact us at support@imgix.com.

What Happened

At 21:42 UTC, a hiccup in an upstream vendor's network caused all of our main proxy servers to reboot simultaneously. Normally such a hiccup would have no lasting effect, but this time it sent our proxy servers into a continuous reboot loop.

We have backup proxy servers that are designed to take over if our main proxy servers have issues. However, because we are preparing to transition these backup servers into our new data center, cutting over to them was not automatic. With the main proxy servers down and the backup proxies out of automatic rotation, no traffic could reach our image painters and we were unable to produce new rendered images for the better part of 20 minutes. Existing cached images were unaffected.
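The failure mode described above can be illustrated with a minimal sketch. It is not our actual load balancing configuration: the `Proxy` class, the `in_rotation` flag, and the `pick_targets` function are hypothetical names used only to show why healthy backups that are out of automatic rotation leave no path for traffic.

```python
from dataclasses import dataclass


@dataclass
class Proxy:
    name: str
    healthy: bool
    in_rotation: bool  # hypothetical flag: eligible for automatic failover


def pick_targets(main, backup):
    """Return the proxies that should receive traffic."""
    live_main = [p for p in main if p.healthy]
    if live_main:
        return live_main
    # Main pool is down: only backups in automatic rotation can take over.
    return [p for p in backup if p.healthy and p.in_rotation]


# Main proxies are stuck rebooting; the backup is healthy but not in rotation.
main = [
    Proxy("main-1", healthy=False, in_rotation=True),
    Proxy("main-2", healthy=False, in_rotation=True),
]
backup = [Proxy("backup-1", healthy=True, in_rotation=False)]

during_incident = pick_targets(main, backup)  # no eligible proxies

# Manual cutover: put the backup into rotation, then select again.
for p in backup:
    p.in_rotation = True
after_cutover = pick_targets(main, backup)
```

In this sketch the first selection comes back empty even though a healthy backup exists, which mirrors the gap that had to be closed by hand during the incident.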

By 21:45 UTC, we were investigating the issue with the main proxies. By 21:49 UTC, we had begun pushing out configuration to move traffic over to our backup proxies. By 22:03 UTC, the new load balancing configuration was pushed live and activated. It took a few minutes to build a quorum before traffic began to recover. As traffic began to flow again, there was an initial spike of image render requests, which led to slower response times. At 22:08 UTC, the network issues that originally triggered the main proxies to reboot had resolved and the main proxies came back online. Since then, the service has remained running with both main and backup proxies balanced in.
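For readers who want the timeline as elapsed time, the short sketch below computes minutes since the initial reboot for each timestamp mentioned above. The event labels are paraphrased from this report, the helper name `minutes_between` is illustrative, and the code assumes all timestamps fall on the same UTC day.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps (UTC) taken from the timeline above; labels are paraphrased.
timeline = [
    ("21:42", "main proxies begin rebooting"),
    ("21:45", "investigation under way"),
    ("21:49", "backup configuration push begins"),
    ("22:03", "new load balancing configuration live"),
    ("22:08", "main proxies back online"),
]

outage_start = timeline[0][0]
elapsed = {label: minutes_between(outage_start, t) for t, label in timeline}

for label, minutes in elapsed.items():
    print(f"+{minutes:2d} min  {label}")
```

The gap between the start of the configuration push and the configuration going live accounts for most of the elapsed time in this breakdown.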

Steps To Correct

  • Move our current backup proxies to our new data center and transition them into our main proxy pool. By bringing these proxies into our new data center, we will have more control over the hardware and networking, enabling us to construct a more redundant set of services that fail over smoothly and gracefully.
  • Establish an emergency protocol so that intra-team communication is more fluid. There was initial confusion about the root cause of the outage and about exactly where to communicate updates. All of this cost us precious minutes.


Posted Jun 03, 2014 - 19:33 PDT

Resolved
Service is fully restored.
Posted Jun 03, 2014 - 15:31 PDT
Monitoring
Our main proxy servers are restored. Service is fully restored. Maintaining backup proxies in parallel to main proxies for the time being. Will continue monitoring.
Posted Jun 03, 2014 - 15:19 PDT
Identified
We have had an outage in our main proxy servers. Falling back on our backup proxy servers has restored service, but service is still slow. Working on recovering our main proxy servers.
Posted Jun 03, 2014 - 15:08 PDT
Investigating
We are experiencing an outage on our rendering infrastructure due to an outage in our proxies. We are currently investigating and will follow up with an update.
Posted Jun 03, 2014 - 14:54 PDT