Elevated error rates

Incident Report for imgix

Postmortem

In two separate but related incidents occurring on August 29th and August 30th, imgix experienced outages involving our rendering infrastructure and a corrupted database. This caused degraded service for a number of customers and their users. We are deeply sorry for this interruption in service. As of August 30th at 15:56 PDT, we have successfully deployed the necessary fixes to stabilize the service and return it to normal operation. Our team is reviewing both incidents and will share a detailed postmortem and follow-up with you within the next few days.

What happened?

The imgix URL API service did not reliably serve new image render requests starting at 21:36 GMT on August 29, 2017, and again at 14:05 GMT on August 30, 2017. This was caused by a combination of architectural design issues and an internal service failure, which led to a backlog of requests that in turn overwhelmed portions of the internal service.

The service was temporarily restored to full functionality at 00:42 GMT on August 30, 2017, and permanently restored at 18:45 GMT on August 30, 2017, after the imgix engineering staff identified and resolved the underlying causes of the issue.

How were customers impacted?

Derivative images which had previously been rendered by the imgix service, and which were still present in the imgix CDN caching layer, remained available and were served without any impact. Such cached requests typically account for more than 80% of imgix’s total request volume on a normal day.

The majority of imgix customers were either not impacted or not significantly impacted. However, many customers did experience some degree of impact as a result of this incident, depending on several factors, including:

• where the customer’s origin server is located

• the usage pattern of the customer’s service

• whether a given request happened to occupy the imgix rendering queue for long enough to trigger protective timeouts

Customers with a UGC (user-generated content) usage pattern were especially impacted, because their traffic includes a high proportion of newly uploaded images that have never been rendered before and therefore cannot be served from cache.

What went wrong during the incident?

Database issue

At the beginning of this incident, one of the primary database servers responsible for internally tracking customer origin objects and their metadata began experiencing heavy connection volume. This prevented new objects from being added to the origin object store, as connections above the configured threshold were refused and existing connections were unable to complete queries in a reasonable amount of time.

The service is designed to operate in a degraded fashion when this database is unavailable or under maintenance, but performance was sufficiently reduced under peak load that a backlog of requests began to build up, leading to issues in other areas of the imgix service.
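For illustration, the following is a minimal sketch of the cache-as-optimization pattern described above, assuming a lookup-then-fallback design. All names (OriginObjectCache, fetch_from_customer_origin, and so on) are hypothetical stand-ins, not imgix’s actual implementation.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("render")

class CacheUnavailable(Exception):
    """Raised when the origin object cache database cannot be queried."""

class OriginObjectCache:
    """Stand-in for the origin object cache database (illustrative only)."""
    def __init__(self):
        self._rows = {}
        self.healthy = True

    def lookup(self, key):
        if not self.healthy:
            raise CacheUnavailable()
        return self._rows.get(key)

    def store(self, key, obj):
        if not self.healthy:
            raise CacheUnavailable()
        self._rows[key] = obj

def fetch_from_customer_origin(url):
    """Stand-in for an HTTP fetch against the customer's origin server."""
    return ("...image bytes from %s..." % url).encode()

def fetch_origin_object(cache, key, origin_url):
    """Prefer the cache database; fall back to the origin when it is down."""
    try:
        cached = cache.lookup(key)
        if cached is not None:
            return cached                       # fast path
    except CacheUnavailable:
        log.warning("origin object cache unavailable; using degraded path")

    # Degraded path: correct but slower; sustained peak load can outrun it
    # and build up a request backlog, as happened in this incident.
    obj = fetch_from_customer_origin(origin_url)
    try:
        cache.store(key, obj)                   # best-effort write-back
    except CacheUnavailable:
        pass
    return obj
```

The key property is that the degraded path is correct but slower: each cache miss becomes a full origin fetch, so sustained peak load can outrun this path and build the request backlog described above.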

A monitoring gap was discovered in this service: a corrupt index error was not explicitly monitored for or alerted on. This lengthened the time it took to determine the root cause during the incident.
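One way to close a gap like this is to scan database logs for corruption signatures and page on-call when they appear. The sketch below is illustrative only; the exact patterns depend on the database engine in use, which the report does not name, and page() is a hypothetical stand-in for a paging integration.

```python
import re

# Hypothetical corruption signatures; adjust to the actual database engine.
CORRUPT_INDEX_PATTERNS = [
    re.compile(r"index .* is corrupted", re.IGNORECASE),
    re.compile(r"could not read block", re.IGNORECASE),
]

def scan_for_corruption(log_lines):
    """Return log lines that look like index corruption, for alerting."""
    return [line for line in log_lines
            if any(p.search(line) for p in CORRUPT_INDEX_PATTERNS)]

def check_and_alert(log_lines, page):
    """page() is a stand-in for the on-call paging integration."""
    hits = scan_for_corruption(log_lines)
    if hits:
        page("possible corrupt index on origin object database", hits[:5])
```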

On-call runbook was insufficiently detailed and out of date

The documentation available to the on-call engineer who received the alert for this issue was insufficient to properly diagnose and resolve it. This was compounded by the fact that the service’s owner was traveling outside of the country and was largely unable to reach a computer.

While the on-call engineer escalated the issue and worked diligently throughout to resolve it, the time to resolution would undoubtedly have been shorter with better documentation and training processes.

Additionally, the database replica promotion mechanism, which would have permitted a much faster resolution, was not documented in the on-call runbook.
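To make the omission concrete, a runbook entry for replica promotion might look like the sketch below. This assumes a PostgreSQL 9.x-era streaming replication setup purely for illustration; the report does not name the database engine, and the paths and hostnames are placeholders.

```python
import subprocess

DATA_DIR = "/var/lib/postgresql/9.6/main"  # placeholder data directory

def promote_replica():
    """Promote this standby to be the new primary."""
    subprocess.run(["pg_ctl", "promote", "-D", DATA_DIR], check=True)

def reparent_replica(new_primary_host):
    """Re-point a remaining replica at the newly promoted primary.

    On PostgreSQL versions before 12, this means rewriting
    primary_conninfo in recovery.conf and restarting the standby.
    """
    with open(f"{DATA_DIR}/recovery.conf", "w") as f:
        f.write("standby_mode = 'on'\n")
        f.write(f"primary_conninfo = 'host={new_primary_host} port=5432'\n")
    subprocess.run(["pg_ctl", "restart", "-D", DATA_DIR], check=True)
```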

Communication procedures were not adequately followed

Internal communication practices during a customer-impacting incident are outlined in a document which all engineers and on-call personnel are required to familiarize themselves with. These practices were not entirely followed by all parties, which contributed to confusion in communicating the current status across the team.

External communication during the incident was also not up to imgix standards. While the responsible team members did correctly notify customers via the imgix status page and responded to customers who contacted imgix via our support channels, we did not provide sufficiently detailed information at a regular enough cadence. This was partly caused by our internal communication failures.

Service architecture flaws exacerbated this failure scenario

We permitted a component intended as a performance optimization to become a necessary one: the service inadvertently became reliant on the origin cache database in order to handle normal traffic volumes. This exposed a weakness in our ongoing system test plans, where too much emphasis had been placed on steady-state operation rather than on failure scenarios at normal or elevated load.
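A lightweight way to exercise that failure scenario is to disable the cache dependency deliberately and assert that the degraded path still meets a latency budget at representative load. The sketch below reuses the hypothetical names from the earlier example; the budget and request volumes are placeholders, not imgix’s real SLOs.

```python
import time

def degraded_mode_load_test(cache, fetch, requests, budget_s=0.5):
    """Run requests with the cache disabled and check worst-case latency.

    `cache` and `fetch` are the hypothetical objects from the earlier
    sketch; `budget_s` is a placeholder latency budget, not a real SLO.
    """
    cache.healthy = False                      # inject the failure under test
    worst = 0.0
    for key, origin_url in requests:
        start = time.monotonic()
        fetch(cache, key, origin_url)
        worst = max(worst, time.monotonic() - start)
    assert worst <= budget_s, f"degraded path too slow: {worst:.3f}s"
```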

Our service automations were also revealed not to handle flapping gracefully. While not a substantial contributor to the incident, downstream services frequently going in and out of rotation as a result of the growing backlog caused excessive configuration rebuilding and reloading, which added to the difficulty of diagnosing and troubleshooting the issue.
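A common mitigation is to add hysteresis (flap damping) to health checks, so a backend must fail or succeed several consecutive probes before it changes state and triggers a configuration rebuild. The sketch below is a generic illustration with placeholder thresholds, not imgix’s actual automation.

```python
class DampedHealthCheck:
    """Require a streak of consistent probe results before changing state."""

    def __init__(self, down_after=3, up_after=5):
        self.down_after = down_after  # consecutive failures before DOWN
        self.up_after = up_after      # consecutive successes before UP
        self.state = "up"
        self._streak = 0

    def observe(self, ok):
        """Feed one probe result; return True only when the state flips."""
        if ok == (self.state == "up"):
            self._streak = 0          # result agrees with current state
            return False
        self._streak += 1
        needed = self.down_after if self.state == "up" else self.up_after
        if self._streak >= needed:
            self.state = "down" if self.state == "up" else "up"
            self._streak = 0
            return True               # rebuild/reload config only now
        return False
```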

imgix is also fundamentally not a stateful service, and should not operate a database as a source of truth required to perform image rendering operations.

The initial fix turned out to be insufficient

After our initial investigation, the service was placed into a reduced-performance mode. This enabled us to remove the impacted database from service, without interrupting customer traffic, while we repaired the corrupted index and its associated table. In this mode, all customer requests were handled, with some additional latency for newly fetched content (as it was not held in our origin object cache).

The engineering team felt that this mitigation was sufficient until we were able to fully restore the database to service, and that it was the option least likely to cause unintended behavior. Work continued throughout the night to permanently resolve the underlying issue, but it was not yet complete by the next day’s peak traffic period.

During this period of peak traffic, we discovered that the degraded mode of operation could not handle the request volume, and the same issue as the previous day recurred. Because traffic was higher than on the previous day, the request backlog filled faster and was harder to resolve.

imgix engineers were eventually able to return the service to stable operation by throttling traffic through the imgix system, performing a database replica promotion, and re-parenting the impacted database replicas to the new database primary. Once that work was complete, the traffic throttling control was removed.
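For reference, the kind of traffic throttle described here can be as simple as a token bucket that caps the rate of render requests admitted to the backend while the backlog drains. This is a generic sketch with placeholder rates, not the actual imgix throttling control.

```python
import time

class TokenBucket:
    """Admit requests at a capped steady rate with a small burst allowance."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # shed or queue the request instead

throttle = TokenBucket(rate_per_s=1000, burst=200)  # placeholder limits
```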

Posted Aug 30, 2017 - 16:18 PDT

Resolved

This incident has been resolved.
Posted Aug 30, 2017 - 15:56 PDT

Monitoring

We're currently implementing a solution, and requests that previously returned errors are starting to return correct responses. We're monitoring traffic as the situation progresses.
Posted Aug 30, 2017 - 13:21 PDT

Update

We're continuing to work on implementing a fix for the issue.
Posted Aug 30, 2017 - 11:40 PDT

Update

We're continuing to work on a fix for this issue.
Posted Aug 30, 2017 - 10:29 PDT

Update

We're continuing to work on implementing a fix for the issue. This is affecting new content and newly deployed sources, which may return errors or missing images.
Posted Aug 30, 2017 - 09:25 PDT

Update

We're continuing to work on a solution. We'll update once it gets rolled out.
Posted Aug 30, 2017 - 08:26 PDT

Identified

We've identified the cause of the reported errors and we're working on implementing a solution.
Posted Aug 30, 2017 - 07:24 PDT

Investigating

We're investigating reports of elevated error rates.
Posted Aug 30, 2017 - 06:24 PDT