File this one under #RegionWideOutage…
On January 23, 2020, AWS suffered a 7.5-hour region-wide outage of the VPC subsystem in the ap-southeast-2 (Sydney) region. In this blog post, we’ll explore what we know about the outage and its impact on customers in the region.
The detail provided here comes from the Amazon Personal Health Dashboard, press articles, and stories told to us by real AWS customers impacted by the event.
To summarize, something bad happened with the data store that underlies the VPC subsystem within AWS, which required them to put that data store into read-only mode for a while. This cascaded into numerous other services within AWS that depend on writes to the VPC subsystem.
Ultimately, many users were completely unscathed. But users who operate sophisticated workloads within AWS, and users who rely on higher-level services that are themselves sophisticated AWS workloads, had a really bad 7 hours that day.
The only way to avoid this one would have been to be multi-region. And, Arpio can make that happen for you in minutes.
- 11:07 a.m. AEDT, The issue begins…
- 11:41 a.m. AEDT, First reports of “increased API error rates and latencies” show on the AWS health dashboard. “Connectivity to existing instances is not impacted.”
- 12:18 p.m. AEDT, EC2 status indicates that root cause has been identified, and they are working towards resolution. “This issue mainly affects EC2 RunInstances and VPC related API requests”.
- 12:29 p.m. – 2:31 p.m. AEDT, Additional status updates indicate cascading impact to the ELB, RDS, ElastiCache, AppStream, WorkSpaces, and Lambda services in the region.
- 3:49 p.m. AEDT, EC2 status report explains that “a data store used by a subsystem responsible for the configuration of Virtual Private Cloud (VPC) networks is currently offline and the engineering teams are working to restore it.” They expect the restoration process to be complete within 2 hours.
- 6:45 p.m. AEDT, EC2 status report confirms that everything has returned to normal.
We spoke with several AWS customers who use this region, and the impact was quite varied. Some customers were totally untouched and had only heard there were issues. Others spent the day fighting their environments. One customer we spoke with said, “We are managing a number of prod customer loads in AWS and we had issues in pure DevOps environments and autoscaling. The impact was big.”
But as you know, Twitter is the best place to figure out what’s really going on:
In the end, it seems that activities that modified infrastructure were impacted. If you didn’t need to launch new instances, invoke VPC-connected Lambda functions, deploy new load balancers, etc., you were probably fine. But if your workload is elastic and wasn’t already scaled up, you struggled.
So What Happened?
From the timeline above, we know that the datastore underlying the VPC subsystem was impaired. But what else can we learn from their messaging?
Midway through the outage, AWS posted:
We determined that the data store needed to be restored to a point before the issue began.
Alright, data loss or data corruption occurred, and they need to restore from a backup. Or maybe they had a hardware failure and they need to rebuild the server. We’ve all been there. Credit to AWS for having backups 🙂
In order to do this restore, we needed to disable writes.
So, they’re doing some kind of in-place restore while the data store is online for reads? If you can read from it, maybe it’s not data loss/corruption. But it seems more likely that they’re operating the service off of read replicas that weren’t impacted while they rebuild the primary datastore. And if that’s the case, the data loss/corruption didn’t get replicated to other nodes, so maybe it was actually hardware failure.
When the event was over, they posted more detail:
By 10:50 PM PST, we had fully restored the primary node in the affected data store. At this stage, we began to see recovery in instance launches within the AP-SOUTHEAST-2 Region.
So, there’s a “primary node” in this data store. It’s not some kind of multi-master data store. And either this primary node is a single point of failure, or they had a really unfortunate compound event that impacted its redundancy as well.
The truth is, what happened doesn’t really matter. The important thing is how you engineer your system so that you’re not impacted the next time.
Okay, Then How Do You Avoid It?
This was a region-wide impairment. To avoid it, you need to be able to operate from another region. Your options here are to build a multi-region application (which for many workloads is a significant re-architecture) or to be able to fail over your application to another region (a process often called disaster recovery).
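To make the failover option a little more concrete: one common building block is DNS failover, where a health check on your primary region automatically redirects traffic to a standby region when the primary fails. Below is a minimal sketch of what a Route 53 failover record pair could look like; the domain, IPs, zone ID, and health check ID are all hypothetical placeholders, and a real DR plan also has to cover data replication and standby infrastructure.

```python
# Sketch: building a Route 53 PRIMARY/SECONDARY failover record pair
# for cross-region disaster recovery. All identifiers are hypothetical.

def failover_record(domain, set_id, role, ip, health_check_id=None):
    """Build one half of a failover pair for a
    Route 53 ChangeResourceRecordSets request."""
    record = {
        "Name": domain,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,          # "PRIMARY" or "SECONDARY"
        "TTL": 60,                 # short TTL so a failover takes effect quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        # Route 53 shifts traffic to SECONDARY when this health check fails
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

change_batch = {
    "Changes": [
        failover_record("app.example.com.", "sydney", "PRIMARY",
                        "203.0.113.10", health_check_id="hc-hypothetical"),
        failover_record("app.example.com.", "singapore", "SECONDARY",
                        "198.51.100.20"),
    ]
}

# With boto3, this batch would be submitted as:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z-HYPOTHETICAL", ChangeBatch=change_batch)
```

The key design point is that failover is driven by an automated health check, not by a human editing DNS mid-incident, and the short TTL keeps resolvers from caching the failed primary for long.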
How Can Arpio Help?
Arpio is disaster recovery as a service for AWS workloads. In 15 minutes, Arpio can fully automate the disaster recovery process for your existing applications that run in AWS. And when the next region-wide outage occurs, Arpio can restore your service in an alternate region even faster.
Oh, and we also protect you from cyber attacks on your AWS account.
Check it out at arpio.io, and let us know if you’d like to give it a spin!