When we tell people about how Arpio fully automates disaster recovery for the AWS cloud, we’re commonly asked the same question.

“Wait.  Isn’t that built into the cloud?”

You can be forgiven if you asked it too.  Cloud platforms provide a lot of built-in redundancy, and it can be easy to assume they’ve got it all covered.  But at the end of the day, the major cloud providers leave it to you to mitigate the most critical disasters.

Understanding your disaster recovery (DR) needs for cloud-based workloads requires assessing normal IT risks through a cloud-specific lens.  In this blog post, we’ll do just that.

Let’s revisit some cloud disasters from the recent past…

Luckily, disasters in the cloud (just like disasters in life) aren’t a daily occurrence.  But they happen frequently enough that most cloud practitioners can tell you a story about one they’ve experienced themselves.  Here are a couple of stories that we like to tell.

AZURE SOUTH CENTRAL REGION OUTAGE

The South Central region of Microsoft Azure is located in the vicinity of San Antonio, TX.  On September 4th, 2018, they had some pretty impressive weather in San Antonio, and a lightning strike took out power to this Azure region.

Normally, power outages at data centers aren’t the end of the world – they have backup generators that automatically kick in.  But the lightning strike sent a power surge through the data center’s cooling systems, and they were fried. The data center got hot.  Too hot. Hardware failures started happening. Microsoft decided they needed to shut everything down.

It took 24 hours for Microsoft to repair the damage and to start bringing services back online.  Some services weren’t restored for 3 days.

AWS NORTHERN VIRGINIA REGION OUTAGE

On February 28, 2017, an Amazon employee was performing a routine maintenance operation when he accidentally typo’d a command.  With that one mistake, he took down a massive number of servers supporting Amazon’s Simple Storage Service (S3).

S3 is a foundational service in AWS.  When S3 doesn’t work, a lot of other services don’t work either.  This outage cascaded throughout AWS in that region, significantly impacting thousands of workloads.  It was so bad that Amazon’s status page (where they report service outages) didn’t work for several hours.  You had to go to Twitter to learn what was going on.

Restoring service required a complete reboot of the S3 system, which takes hours.  Altogether, AWS’s Northern Virginia Region was significantly impaired for 5 hours. And since this is the largest region of AWS, much of the internet was broken for that time.  Estimates after the event claimed $250 million in economic damage.

CODE SPACES

Code Spaces was once an up-and-coming competitor to GitHub, until an attacker gained access to their AWS account and demanded a ransom.  When Code Spaces decided to play hardball, the attacker deleted all of their data.  And all of their backups.

It turns out, it’s pretty hard to have business continuity when you have lost all of your data.  Code Spaces closed up shop the very next week.

The architecture of the cloud

It’s tempting to think about the cloud as some ephemeral thing that exists in the ether.  But in actuality, it’s just servers that live in real data centers in physical locations on our planet.

All 3 of the major public cloud providers (Amazon, Microsoft, and Google) operate services in geographic areas they call “regions.”  Most of these regions consist of multiple “availability zones,” which you can roughly think of as distinct data centers in the region.

The services provided by these clouds are either zonal, regional, or global.

A zonal service is hosted within a given availability zone, and is therefore vulnerable to an outage within that availability zone (e.g. a fire in the data center).

A regional service spans availability zones within a region, and is therefore not vulnerable to an outage of a single availability zone.  But it is vulnerable to an outage of multiple availability zones in the region (e.g. an earthquake or a regional power grid outage).

And as you’ve undoubtedly surmised, a global service is not vulnerable to a single regional outage.  You’d need multiple regions to fail to bring it down.

In general, low-level infrastructure services (like virtual machines and virtual hard drives) are zonal.  Higher-level platform services (like hosted databases) are often regional. And global services are pretty rare (they’re really hard to engineer).

So, to understand what kinds of disasters you’ve mitigated, you need to understand the architecture of the cloud services you’re consuming and what they’ve already mitigated for you.  If you’re consuming a zonal service (for example), it doesn’t mean your application can’t be resilient to an availability zone outage. It just means you have to build that resilience yourself – it’s not automatic.
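
To make that concrete, here is a minimal sketch of the zonal / regional / global model described above.  It is only an illustration: the names (Scope, Failure, survives, uncovered) and the example workload are hypothetical, and the scopes assigned to each service follow the general tendencies mentioned earlier, not any specific provider’s documentation.

```python
from enum import Enum


class Scope(Enum):
    """How widely the provider spreads a service (hypothetical model)."""
    ZONAL = 1
    REGIONAL = 2
    GLOBAL = 3


class Failure(Enum):
    SINGLE_AZ_OUTAGE = 1
    REGION_OUTAGE = 2  # multiple availability zones, or the whole region


# Smallest scope a service needs to ride out each failure without your help.
_MIN_SCOPE = {
    Failure.SINGLE_AZ_OUTAGE: Scope.REGIONAL,
    Failure.REGION_OUTAGE: Scope.GLOBAL,
}


def survives(scope, failure):
    """True if a service with this scope is already protected by the provider."""
    return scope.value >= _MIN_SCOPE[failure].value


def uncovered(workload):
    """Map each failure type to the services you must protect yourself."""
    gaps = {}
    for failure in Failure:
        names = [name for name, scope in workload.items()
                 if not survives(scope, failure)]
        if names:
            gaps[failure] = names
    return gaps


# Assumed example workload; check the real scope of every service you use.
workload = {
    "virtual machine": Scope.ZONAL,
    "virtual hard drive": Scope.ZONAL,
    "hosted database": Scope.REGIONAL,
}

gaps = uncovered(workload)
for failure, names in gaps.items():
    print(f"{failure.name}: build your own resilience for {', '.join(names)}")
```

In this example the hosted database rides out a single availability zone outage on its own, but nothing in the workload is protected from a regional outage unless you build that protection yourself.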

Don’t forget about the cyber threats

Remember the story about Code Spaces?  It didn’t matter whether they were deployed across multiple availability zones or regions when the attacker decided to delete all of their data.

The disaster wasn’t an outage — it was catastrophic data loss. And since their backups were part of the data loss, they had no path to recovery.

Most organizations work hard to ensure that they aren’t vulnerable to an attacker gaining access to their accounts.  But search online for “aws account hacked” and you’ll find countless stories where it has happened.  A defense-in-depth approach to security requires that you always lock down your backups just in case.

How do you know if you’re covered?

Assessing your readiness for a cloud disaster boils down to answering 3 simple questions.

  1. If our cloud provider suffers an extended outage of an availability zone, are we still available? If not, how would we recover?
  2. If our cloud provider suffers an extended outage of multiple availability zones (or an entire region), are we still available? If not, how would we recover?
  3. If an attacker gains access to our cloud account, could they corrupt (e.g. with ransomware) or delete our data and our backups?

If your answers to these questions are satisfactory for your business needs, you’re mostly good.  You just need to answer one more question…

Have we tested it?
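
If it helps to track the answers, the three questions and the follow-up can be captured as a simple checklist.  This is only a sketch: the Readiness fields and the open_items helper are hypothetical names, and “covered” here means either that you stay available or that you have a recovery plan you consider acceptable.

```python
from dataclasses import dataclass


@dataclass
class Readiness:
    """One team's answers to the questions above (hypothetical structure)."""
    az_outage_covered: bool          # question 1
    region_outage_covered: bool      # question 2
    backups_safe_from_attacker: bool  # question 3
    recovery_tested: bool            # "Have we tested it?"


def open_items(r):
    """Return the follow-up actions for every unsatisfactory answer."""
    checks = [
        (r.az_outage_covered,
         "Plan recovery from an extended availability zone outage"),
        (r.region_outage_covered,
         "Plan recovery from a multi-zone or regional outage"),
        (r.backups_safe_from_attacker,
         "Lock down backups against a compromised account"),
        (r.recovery_tested,
         "Test the recovery plan"),
    ]
    return [action for ok, action in checks if not ok]


# Assumed example: resilient to a zone outage, but nothing else is in place.
answers = Readiness(
    az_outage_covered=True,
    region_outage_covered=False,
    backups_safe_from_attacker=False,
    recovery_tested=False,
)
for action in open_items(answers):
    print("TODO:", action)
```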

About Arpio

Arpio is comprehensive disaster recovery for AWS so that you don’t have to build it yourself.  We pick up where Amazon leaves off, making it easy and fast to recover from any AWS outage or from any cyber attack on your AWS infrastructure.

Learn more at www.arpio.io.