The AWS Well-Architected Framework is an incredible resource for IT architects looking to build secure, high-performing, resilient, and efficient solutions, both on-premises and in the cloud. It captures the wisdom derived from person-centuries of work at AWS, both inside Amazon and in collaboration with customers.
The framework is structured as 5 pillars, and the “reliability” pillar represents Amazon’s most concise guidance on how we should accommodate disaster recovery in our architectures. In this blog post, we’ll unpack these best practices, and talk about how Arpio implements these for you in AWS.
TLDR: The AWS Well-Architected Framework has a LOT to say about disaster recovery for AWS (which is why this post got so long). And we’ve built Arpio to check all of the boxes for you.
Identify and back up all data that needs to be backed up
We’re starting basic here. I think you can figure this part out. If you’re using managed data services (RDS, DynamoDB), backup is built-in. Just make sure you’ve got it turned on. If you’re running your own database on EC2 instances, you’re on your own. But you can use EBS snapshots to take quick/frequent crash-consistent backups of your database server without impacting server performance.
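If you’re in the run-your-own-database camp, here is a minimal sketch of what that snapshot step could look like in Python. It assumes you hand it a boto3 EC2 client and uses the EC2 `create_snapshots` call, which snapshots every volume attached to an instance at the same point in time. The function name `snapshot_instance` and the `backup-label` tag are made up for illustration, and this is not a description of how Arpio does it.

```python
from datetime import datetime, timezone


def snapshot_instance(ec2, instance_id, label="db-backup"):
    """Request crash-consistent snapshots of all volumes on one instance.

    `ec2` is expected to behave like a boto3 EC2 client. The `label` value
    and the "backup-label" tag key are illustrative assumptions.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    response = ec2.create_snapshots(
        Description=f"{label} {instance_id} {stamp}",
        # Snapshot every attached volume together, boot volume included.
        InstanceSpecification={
            "InstanceId": instance_id,
            "ExcludeBootVolume": False,
        },
        TagSpecifications=[
            {
                "ResourceType": "snapshot",
                "Tags": [{"Key": "backup-label", "Value": label}],
            }
        ],
    )
    # One snapshot is returned per attached volume.
    return [snapshot["SnapshotId"] for snapshot in response["Snapshots"]]
```

You would call it with something like `snapshot_instance(boto3.client("ec2"), "<your-instance-id>")`. The call returns as soon as the snapshots are requested, so they may still be pending when you get the IDs back.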
Arpio manages backup for you so you don’t have to worry about it. You simply select the servers and databases you need to protect, and Arpio configures and executes the backups. But a lot of tools do that; it’s not what makes us special.
Secure and encrypt backup
Suddenly, we’re getting advanced – most people don’t think through this. You probably know that AWS stores snapshots and backups in S3, where they benefit from 99.999999999% durability. And if you’re taking your own backups, you’re probably storing them in S3 as well. So, those backups are never going to disappear unintentionally.
That’s partly true. All of those 9s of durability refer to physical failures, like hard drive crashes. The problem is the “logical” deletion problem. If somebody accidentally deletes a production RDS server (because they mistook it for a test server), it’s really easy to inadvertently delete the backups too. And if a bad actor gets into your account with malicious intent, they’ll probably go after your backups first.
Your backups are your last line of defense against catastrophic data loss. You need to make sure they’re secure.
Arpio does this by copying your backups into a locked-down AWS account. It’s your own account – you control the access – but your team doesn’t need access, and it’s not running any compute. The attack vectors that let a bad guy into your production account won’t exist here.
Perform data backup automatically
There are so many easy ways to do this that hopefully you aren’t doing it manually. Just make sure that your automatic backups are being stored securely. We just talked about that.
Arpio performs backups automatically, on a schedule dictated by your recovery point objective (your RPO, more on that below). If you have a 30-minute RPO, and it’s taking AWS 10 minutes to create backups and copy them to the locked-down (and cross-region) location, Arpio starts creating the next backup when the current one is 20 minutes old. That way, when the current one is expiring, the next one is becoming available.
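The arithmetic in that example is simple enough to sketch in a few lines of Python. This is only an illustration of the timing rule described above, not Arpio’s actual scheduler; the function name `next_backup_start` and its parameters are assumptions of mine.

```python
from datetime import datetime, timedelta


def next_backup_start(current_backup_time, rpo, backup_duration):
    """Return when the next backup should start.

    The next backup has to finish (create + copy) by the time the current
    one reaches the RPO, so it starts `backup_duration` before that moment.
    """
    if backup_duration >= rpo:
        raise ValueError("backups take longer than the RPO allows")
    return current_backup_time + (rpo - backup_duration)


current = datetime(2024, 1, 1, 12, 0)
start = next_backup_start(
    current,
    rpo=timedelta(minutes=30),
    backup_duration=timedelta(minutes=10),
)
print(start.time())  # 12:20:00, i.e. when the current backup is 20 minutes old
```

If backups ever take as long as the RPO itself, no schedule can work, which is why the sketch raises an error in that case.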
Perform periodic recovery to verify backups
I bet you aren’t doing this quite like you ought to. It’s easy to let it slip through the cracks.
The gist here is that you can’t trust a backup process that isn’t tested. So you need to test it. You don’t test every backup (well, I guess you could), but you want to occasionally make sure it’s working.
Arpio makes recovery testing easy. One button in the UI recovers your entire workload, including backups of all relevant systems. In a few minutes, everything is up and ready for a smoke test. And when you’re done, another button tears it all down for you.
Deploy the workload to multiple locations
As we all know, servers occasionally die, datacenters sometimes lose power, and region-wide services periodically have problems. Diversity of locations is your friend here. You can do this by deploying across multiple availability zones, or across multiple regions. You can even do both to ensure maximum availability.
Arpio makes multi-region deployments easy; it’s what we were built for. Arpio continuously replicates your environment to an alternate region, making it quick and painless to fail over in the event of an outage. And because AWS regions are engineered to be completely independent, this gives you the most robust defense against outages.
Automate disaster recovery for single-location components
Effectively, if you don’t have an HA configuration (multiple components clustered to behave as one), you have a single point of failure. You should make recovery as quick and scripted as possible.
Arpio gives you this automation. Whether your component is single-location, or multi-location, Arpio automates the recovery so you’re back up in minutes.
Failover to healthy resources
This is starting to feel a bit redundant. I think that’s a redundancy pun.
This is what Arpio does. Arpio launches healthy resources just-in-time when you need them so that you can fail over and continue operating.
Test resiliency using chaos engineering
So, this is a different Arpio use-case, but it’s really interesting. For a lot of applications, especially those that run at scale, there is no test environment that comes close to production. This undermines a lot of testing activities like load testing, penetration testing, and chaos testing.
Arpio creates clones of production environments. In minutes. And it cleans them up for you when you’re done with them. It’s a great way to set up an environment to do some chaos testing, and keep that activity out of production.
Conduct game days regularly
A game day is a practice of outage resolution. Oftentimes, it’s a walkthrough of the disaster recovery plan. It’s not as much fun as a football “game day.” Or if you’re a cricket fan, a “match 3-day.”
Arpio automates the full recovery process, which means you can bring up the environment for testing in just a few minutes. It doesn’t take a day to practice recovery. Maybe we should call it a “game hour” instead.
Define recovery objectives for downtime and data loss
By “recovery objectives” they’re talking about the recovery point objective (RPO) and recovery time objective (RTO). These are metrics that respectively define how fresh your recovery data is when you need it, and how quickly you can restore service.
Arpio helps you measure and optimize both of these.
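To make those two metrics concrete, here is a rough Python sketch of how you might check them yourself from a list of backup timestamps and the times recorded during a recovery test. It assumes the gap between consecutive backups is a fair stand-in for worst-case data loss, which ignores the time a backup takes to become usable. All of the function names are hypothetical, and this is not how Arpio measures them.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_times):
    """Largest gap between consecutive backups (an approximation of achieved RPO)."""
    ordered = sorted(backup_times)
    if len(ordered) < 2:
        raise ValueError("need at least two backups to measure a gap")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def measured_recovery_time(outage_start, service_restored):
    """Elapsed time from the start of the outage to restored service (achieved RTO)."""
    return service_restored - outage_start


def meets_objectives(backup_times, outage_start, service_restored, rpo, rto):
    """True only if both the measured data-loss window and recovery time are within target."""
    return (
        worst_case_data_loss(backup_times) <= rpo
        and measured_recovery_time(outage_start, service_restored) <= rto
    )


backups = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 20),
    datetime(2024, 1, 1, 12, 45),
]
ok = meets_objectives(
    backups,
    outage_start=datetime(2024, 1, 1, 13, 0),
    service_restored=datetime(2024, 1, 1, 13, 12),
    rpo=timedelta(minutes=30),
    rto=timedelta(minutes=15),
)
print(ok)  # True: 25-minute worst gap and 12-minute recovery are both within target
```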
Use defined disaster recovery strategies to meet the recovery objectives
This is essentially saying you have a DR plan that meets your RPO and RTO requirements.
If you use Arpio, we are that plan.
Test disaster recovery to validate the implementation
We already talked about testing above, but that was focused on backups. This best practice is talking about your entire environment, and in particular proving that your RPO and RTO are met.
The good news is that Arpio is a recovery solution for your entire AWS environment. And we’ve made testing quick and easy (you’ve probably heard me say that already). And we measure your RPO and RTO for you.
Manage configuration drift at the DR site
Your DR environment is supposed to look a lot like your production environment. Maybe even exactly like your production environment.
That’s what Arpio does. Our model is to continuously clone your production environment to your DR environment for you, so you don’t have to do anything yourself. We “curate” the DR environment for you, and there is no configuration drift.
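If you are managing a DR site by hand, drift detection boils down to comparing what production looks like against what the DR site looks like. Here is a minimal Python sketch of that comparison, assuming each environment’s settings have already been collected into a flat dictionary. The setting names, the `find_drift` function, and the `MISSING` marker are all invented for illustration.

```python
MISSING = "<missing>"  # placeholder for a setting that exists on one side only


def find_drift(production, dr_site, ignore=()):
    """Return {setting: (production_value, dr_value)} for every setting that differs."""
    drift = {}
    for key in sorted(set(production) | set(dr_site)):
        if key in ignore:
            continue  # some settings, like the region, are supposed to differ
        prod_value = production.get(key, MISSING)
        dr_value = dr_site.get(key, MISSING)
        if prod_value != dr_value:
            drift[key] = (prod_value, dr_value)
    return drift


production = {
    "instance_type": "m5.large",
    "engine_version": "14.7",
    "region": "us-east-1",
}
dr_site = {
    "instance_type": "m5.large",
    "engine_version": "13.4",
    "region": "us-west-2",
}
print(find_drift(production, dr_site, ignore=("region",)))
# {'engine_version': ('14.7', '13.4')}
```

An empty result means the two sides match on everything you chose to compare.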
Automate recovery
Self-explanatory at this point, right?
Fundamentally, this is what Arpio is.
If you’ve made it this far, congratulations! I hope you’ve learned something about how Amazon prescribes that you build disaster recovery into your AWS workloads. And if you’d like to implement these best practices for your environment, I hope you’ll take a look at Arpio. We’d love to give you a tour.