
Outage Tales: The Thanksgiving Kinesis Outage of 2020

In the latest installment of our Outage Tales series, CEO Doug Neumann recalls the AWS Kinesis outage that occurred two years ago this week, just ahead of the US Thanksgiving holiday.

The sprawling outage began with a small, routine maintenance change by an Amazon engineer and went on to last over 17 hours. It not only brought down Kinesis, but also had a cascading effect across many other services in the region.

Over the past few years, the holiday season has also meant outage season at AWS. Are you prepared for the next AWS outage? Watch as Doug breaks down what happened two years ago and what you can do to protect your critical workloads.

 

Transcript:

Hi, I’m Doug from Arpio, and today I want to tell you the story of the AWS Kinesis outage of November 2020.

Kinesis is one of the many services in Amazon Web Services. It’s a data streaming service that’s a critical part of many data processing applications. It’s also used for data processing by many other services in AWS.

On November 25th, 2020, two days before Black Friday shopping, Amazon was adding capacity to ensure that Kinesis was ready for increased loads during the holidays. They made a small adjustment in the number of servers in Kinesis, and pretty quickly the entire system started failing.

The root cause of the failure is very well captured in Amazon’s post-mortem of the event. Essentially, every front-end server for Kinesis is connected to every other front-end server. And every connection gets a dedicated processing thread. When they added more servers, they added more connections and more threads. And on that day, they ran into an operating system limit for how many threads could be allocated.
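
To make the arithmetic concrete, here is a minimal Python sketch of the thread-per-connection pattern described above. The thread limit, the baseline thread count, and the fleet sizes are hypothetical numbers chosen for illustration; they are not figures from Amazon's post-mortem, and the function names are invented for this example.

```python
def peer_threads_per_server(fleet_size: int) -> int:
    """Threads one front-end server needs for its peers in a full mesh.

    Every server holds one connection (and one dedicated thread) per
    *other* server, so the count is fleet_size - 1.
    """
    return max(fleet_size - 1, 0)


def exceeds_thread_limit(fleet_size: int, thread_limit: int,
                         baseline_threads: int = 0) -> bool:
    """True if a server in a fleet of this size would pass the OS limit.

    baseline_threads is a hypothetical allowance for threads the server
    uses for work other than peer connections.
    """
    return baseline_threads + peer_threads_per_server(fleet_size) > thread_limit


def max_safe_fleet_size(thread_limit: int, baseline_threads: int = 0) -> int:
    """Largest fleet size that keeps every server at or under the limit.

    Solves baseline_threads + (n - 1) <= thread_limit for n.
    """
    return max(thread_limit - baseline_threads + 1, 0)


# Hypothetical values, for illustration only.
THREAD_LIMIT = 10_000
BASELINE_THREADS = 500

fits_before = not exceeds_thread_limit(9_000, THREAD_LIMIT, BASELINE_THREADS)
fails_after = exceeds_thread_limit(9_600, THREAD_LIMIT, BASELINE_THREADS)
```

With these made-up numbers, a fleet of 9,000 servers fits under the limit, while growing it to 9,600 pushes every server over the limit at the same time. That is why a modest capacity increase in a fully connected fleet can affect the whole fleet at once rather than just the newly added servers.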

This outage began around 5:15 in the morning, and it wasn’t resolved fully until 17 hours later. And during that time, it wasn’t just Kinesis that was down. A host of other AWS services that depend upon Kinesis were also down, amplifying the overall impact of the outage.

So, what should we learn from this tale?

Well, the biggest lesson here is about redundancy and how you achieve it in AWS. Kinesis is an example of a regional service in AWS. It runs across multiple availability zones so that you don’t have to worry about a data center failure. But when a regional service like Kinesis goes down, it’s down for the entire region. And a dependent workload in that region goes down with it.

The way you protect against this type of outage is with a multi-region architecture. Amazon builds each region of AWS to be independent, so that an outage of one region won’t cascade to another. To ensure your workload can recover from any outage in AWS, it’s not sufficient to take advantage of multiple availability zones within a region. You need to use multiple regions.
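
As a rough illustration of the multi-region idea, the sketch below picks the first healthy region from an ordered preference list. The region names are example values, and `is_healthy` is a hypothetical callback standing in for whatever health check a real workload would use. It is a simplified sketch of the concept, not a description of how Arpio or AWS implement failover.

```python
from typing import Callable, Optional, Sequence


def pick_region(regions: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region, in preference order, that passes the check.

    Returns None when no region is healthy, so the caller can decide
    how to handle a total outage.
    """
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Example preference order: primary region first, recovery region second.
PREFERRED_REGIONS = ["us-east-1", "us-west-2"]


def primary_is_down(region: str) -> bool:
    """Hypothetical health check simulating an outage of the primary region."""
    return region != "us-east-1"


active_region = pick_region(PREFERRED_REGIONS, primary_is_down)
```

In this simulated outage the workload would be served from the second region in the list. Spreading the workload across availability zones alone would not help here, because every zone belongs to the same failed region.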

At Arpio, our product gives you multi-region redundancy that you don’t have to build yourself. If you’re evaluating your own resiliency strategy, reach out and we’d be happy to tell you more.