
… and How to Recover From Them and Prevent Them.

Consider yourself one of the lucky ones if you’ve never experienced a SaaS Disaster. Whether it’s a simple hack, a network failure, or even a natural disaster, having a recovery strategy is critical! Here are 11 Different SaaS Disaster Scenarios, ways to recover, and our thoughts on how to prevent them from affecting your systems to begin with.

Malware / ransomware attack

Scenario:
Malicious code makes it onto your server, gains root access and effectively takes control of your system.

Severity:
Worst case scenario. Often very hard (or impossible, unless payments are made) to recover from.

Recovery Strategy:
Recover in a completely new, clean environment.

Prevention Best Practice:
Strict security patch policies, clean-recovery software (like Arpio) to start ‘from scratch’, and a backup trail long enough to get back to ‘pre-infection’ status.
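
For the backup-trail piece, it’s worth verifying retention settings rather than assuming them. Here’s a minimal boto3 sketch that flags RDS instances whose automated-backup retention falls short of a hypothetical 30-day target; the threshold and the choice of service are assumptions, so adapt them to your environment:

```python
import boto3

MIN_RETENTION_DAYS = 30  # hypothetical target; long enough to reach a pre-infection restore point

rds = boto3.client("rds")

# Walk every RDS instance and flag the ones whose automated-backup window is too short.
for page in rds.get_paginator("describe_db_instances").paginate():
    for db in page["DBInstances"]:
        retention = db["BackupRetentionPeriod"]
        if retention < MIN_RETENTION_DAYS:
            print(f"{db['DBInstanceIdentifier']}: retention is {retention} days, want >= {MIN_RETENTION_DAYS}")
```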

Regional AWS service failure

Scenario:
One of the many AWS regional services has an outage.  Your application depends on that service (or depends on a service that depends on that service) and their outage has become your outage.

Severity:
Moderate.  You may be dead in the water until AWS fixes the root cause, but usually you’re only offline for a few hours.  Worst case scenario, you’re offline for a few days.

Recovery Strategy:
Recover in an alternate region.

Prevention Best Practice:
Avoiding this outage requires a multi-region strategy.  Multi-region-active architectures are best here, but they’re very hard to build.  For most organizations, cross-region failover is more attainable.
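
One common way to wire up cross-region failover is DNS-level failover with Route 53 health checks. The sketch below assumes you already have a hosted zone, a health check on the primary region, and a load balancer in each region; every identifier here is hypothetical:

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000000000"                            # hypothetical hosted zone
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"   # hypothetical health check

def upsert_failover_record(identifier, role, target_dns, health_check_id=None):
    """Create or refresh one half of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": "app.example.com.",            # hypothetical application domain
        "Type": "CNAME",
        "SetIdentifier": identifier,
        "Failover": role,                      # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": target_dns}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

# Traffic goes to the primary region while its health check passes, otherwise to the standby region.
upsert_failover_record("primary", "PRIMARY", "app-primary.us-east-1.elb.amazonaws.com", PRIMARY_HEALTH_CHECK_ID)
upsert_failover_record("secondary", "SECONDARY", "app-standby.us-west-2.elb.amazonaws.com")
```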

AWS access tokens committed to public source repo

Scenario:
You’re using IAM users to access AWS from your runtime compute environment, and you’ve got the access tokens stored in a file on disk.  And then you commit that file to source control.  Oops.

Severity:
Moderate to bad.  Attackers monitor public repos for this very scenario.  If your IAM user has privileged access, you could be in hot water.

Recovery Strategy:
Discard those access tokens pronto.  Then audit for possible damage or unauthorized access, including injection of malicious code, depending on the access levels that were exposed.
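
A sketch of the “discard those tokens” step with boto3, assuming you know the IAM user and the leaked access key ID (both values below are hypothetical). Deactivating first lets you confirm nothing legitimate breaks before you delete the key outright:

```python
import boto3

iam = boto3.client("iam")

USER_NAME = "ci-deploy-user"            # hypothetical IAM user whose key leaked
LEAKED_KEY_ID = "AKIAEXAMPLEEXAMPLE"    # hypothetical access key ID found in the repo

# Immediately disable the leaked key so it can no longer authenticate.
iam.update_access_key(UserName=USER_NAME, AccessKeyId=LEAKED_KEY_ID, Status="Inactive")

# Issue a replacement key for legitimate workloads; store the secret in a secrets manager, not the repo.
new_key = iam.create_access_key(UserName=USER_NAME)["AccessKey"]
print("New key id:", new_key["AccessKeyId"])

# Later, once you've confirmed nothing legitimate still depends on the old key:
# iam.delete_access_key(UserName=USER_NAME, AccessKeyId=LEAKED_KEY_ID)
```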

Prevention Best Practice:
If your runtime compute environment runs in AWS, use IAM Roles instead of IAM Users.  If you must use IAM Users, treat those credentials as you would any other critical operations secret.  And if you are launching these applications directly from a logged on session, protect the access with MFA.

Disgruntled employee does something unfathomable

Scenario:
You had a bad seed on the team, and you let him go. But you didn’t realize all of the routes he had into your production environment, and you didn’t get him locked out.  He decided to seek revenge, and now your environment is toast.  It happened to WebEx.

Severity:
Worst case scenario, the insider job.

Recovery Strategy:
Cross-account backup/recovery to an account that he was unable to access.
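
One way to implement that cross-account copy for a database is to share a snapshot with a separate, locked-down account and then copy it from that account, so production-side credentials can no longer delete it. A sketch with hypothetical identifiers (encrypted snapshots also require sharing the KMS key):

```python
import boto3

SNAPSHOT_ID = "prod-db-2024-01-01"   # hypothetical manual RDS snapshot
BACKUP_ACCOUNT_ID = "210987654321"   # hypothetical locked-down backup account

# In the production account: allow the backup account to copy/restore this snapshot.
prod_rds = boto3.client("rds")
prod_rds.modify_db_snapshot_attribute(
    DBSnapshotIdentifier=SNAPSHOT_ID,
    AttributeName="restore",
    ValuesToAdd=[BACKUP_ACCOUNT_ID],
)

# In the backup account: copy the shared snapshot so the copy is owned (and only deletable) there.
backup_rds = boto3.Session(profile_name="backup-account").client("rds")  # hypothetical credentials profile
backup_rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier="arn:aws:rds:us-east-1:123456789012:snapshot:prod-db-2024-01-01",  # hypothetical ARN
    TargetDBSnapshotIdentifier="prod-db-2024-01-01-vaulted",
)
```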

Prevention Best Practice:
Single sign-on everywhere you can, including at the command line, so that deactivating a single identity closes off every route into your environment.

Automation goes haywire

Scenario:
You’re using an infrastructure-as-code solution to manage your environment.  It has been an extremely reliable mode of operation, until the one time you fail to notice that the change you’re pushing live will replace (instead of update) a stateful resource.

Severity:
Moderate to severe.  You hopefully have backups, but it’ll take a while to restore them.

Recovery Strategy:
In-place restore of your data.

Prevention Best Practice:
Always run “terraform plan” or use CloudFormation changesets before applying changes to prod.  And pay close attention to what they’re telling you!
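
If you’re on CloudFormation, the “read before you apply” habit can even be automated: create a change set and refuse to execute it if anything would be replaced. A minimal sketch, with a hypothetical stack name and template path:

```python
import boto3

cfn = boto3.client("cloudformation")

STACK_NAME = "prod-app"                  # hypothetical production stack
CHANGE_SET_NAME = "review-before-apply"

with open("template.yaml") as f:         # hypothetical template file
    template_body = f.read()

cfn.create_change_set(
    StackName=STACK_NAME,
    ChangeSetName=CHANGE_SET_NAME,
    TemplateBody=template_body,
)
cfn.get_waiter("change_set_create_complete").wait(
    StackName=STACK_NAME, ChangeSetName=CHANGE_SET_NAME
)

# "Replacement" == "True" means the resource would be destroyed and recreated.
changes = cfn.describe_change_set(StackName=STACK_NAME, ChangeSetName=CHANGE_SET_NAME)["Changes"]
replacements = [
    c["ResourceChange"]["LogicalResourceId"]
    for c in changes
    if c["ResourceChange"].get("Replacement") == "True"
]
if replacements:
    raise SystemExit(f"Refusing to apply: these resources would be replaced: {replacements}")

cfn.execute_change_set(StackName=STACK_NAME, ChangeSetName=CHANGE_SET_NAME)
```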

Human error in the cloud console

Scenario:
This can happen in so many ways, but a catastrophic example is that while trying to clean up a non-production resource (like a database), somebody accidentally cleans up the production resource.  And if they didn’t check the box to “create one final snapshot” before launching the request, there may not be a good backup to restore.

Severity:
Moderate to severe – it just depends on what the mistake was.

Recovery:
You need to re-create that resource, through whatever means makes sense.  If it’s a stateful resource (a database, a server, etc.) then you also need to restore its data.

Prevention Best Practice:
Aside from limiting access to production environments so that a minimal number of people can potentially make this mistake, you definitely want to have good backups that you store outside of the production environment.  You wouldn’t want the human error to delete your data and your backups.
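
One extra guardrail worth considering alongside those (our suggestion here, not something the scenario requires) is database deletion protection, which turns the accidental console delete into a hard error until someone deliberately flips the flag off. A boto3 sketch with a hypothetical instance identifier:

```python
import boto3

rds = boto3.client("rds")

# With DeletionProtection enabled, a "delete" of the production database is rejected outright.
rds.modify_db_instance(
    DBInstanceIdentifier="prod-db",   # hypothetical production instance
    DeletionProtection=True,
    ApplyImmediately=True,
)
```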

Mistakenly ran a pre-prod command in production

Scenario:
You thought you were logged into the test environment, but you were wrong.  And now the production environment is broken.

Severity:
Again, depends on what you mistakenly deleted.  This can be really bad.

Recovery:
Restore whatever was broken.  If that’s a catastrophic mistake, you may need to restore much of the environment.

Prevention Best Practice: 
Again, limit access to production so you’ve minimized the number of people who can make this mistake.  And ensure you’ve got a solid recovery plan in place in case it still happens.
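
A cheap safety net for destructive scripts is to check which AWS account the current credentials actually resolve to before doing anything irreversible. A sketch with a hypothetical production account ID:

```python
import sys

import boto3

PROD_ACCOUNT_ID = "123456789012"   # hypothetical production account

def refuse_to_run_in_prod():
    """Abort if the current credentials point at the production account."""
    account = boto3.client("sts").get_caller_identity()["Account"]
    if account == PROD_ACCOUNT_ID:
        sys.exit(f"Refusing to run: credentials resolve to production account {account}")

refuse_to_run_in_prod()
# ... destructive test-environment cleanup would go below this line ...
```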

Bad actor deletes your servers and your backups

Scenario: 
The bad actor that hacked into your environment and held you for ransom was smart enough to find your backups and delete them in the process.  If those were your only backups, you might not be able to get your data back.  Ever.

Severity:
Severe, maybe even existential.

Recovery:
Restore in a clean environment (the hacked one is tainted – don’t go back there) from the backups that you’re storing outside of your primary environment.  You have those backups, right?

Prevention Best Practice:
It’s best if the bad actor never gets into your environment in the first place, but somehow cyber attackers and insider threats keep breaking through.  Even if you’ve severely locked down the ship, make sure you’re storing backups outside of your production environment.  Copying them to a locked-down AWS account will do the trick.
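
Here’s what that cross-account copy can look like for EBS snapshots, with hypothetical IDs. The important property is that the copy is owned by the backup account, so credentials stolen from production can’t delete it:

```python
import boto3

SNAPSHOT_ID = "snap-0123456789abcdef0"   # hypothetical EBS snapshot in production
BACKUP_ACCOUNT_ID = "210987654321"       # hypothetical locked-down backup account
REGION = "us-east-1"                     # hypothetical region

# In the production account: share the snapshot with the backup account.
prod_ec2 = boto3.client("ec2", region_name=REGION)
prod_ec2.modify_snapshot_attribute(
    SnapshotId=SNAPSHOT_ID,
    Attribute="createVolumePermission",
    OperationType="add",
    UserIds=[BACKUP_ACCOUNT_ID],
)

# In the backup account: copy the shared snapshot so the copy is owned there.
backup_ec2 = boto3.Session(profile_name="backup-account").client("ec2", region_name=REGION)  # hypothetical profile
backup_ec2.copy_snapshot(
    SourceRegion=REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description="Vaulted copy, outside the production account",
)
# Note: encrypted snapshots also require sharing the KMS key with the backup account.
```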

Natural Disaster

Scenario:
An earthquake in Northern California, a wildfire in Oregon, a tornado in Ohio, or Hurricane Sandy #2 in Northern Virginia.  Natural disasters have a habit of destroying data centers (or the infrastructure they rely upon) in very unpredictable ways.

Severity:
Moderate.  Usually the downtime from these events is capped at “days” not “months.”

Recovery:
Restore service in an alternate region that isn’t impacted, and keep operating there while the dust clears.  Make sure you can recover quickly, to beat the stampede of other users attempting the same thing.

Prevention Best Practice:
Certainly it helps to be distributed across multiple AZs in a region, but AZs and workloads have interdependencies that can undermine this resilience.
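
Recovering quickly in another region is much easier if the building blocks are already sitting there. A minimal sketch that copies a production AMI into a standby region ahead of time (the region names and AMI ID are hypothetical):

```python
import boto3

PRIMARY_REGION = "us-east-1"   # hypothetical primary region
STANDBY_REGION = "us-west-2"   # hypothetical standby region

# Copy the production AMI into the standby region, so instances can be launched
# there even if the primary region is unreachable.
standby_ec2 = boto3.client("ec2", region_name=STANDBY_REGION)
copy = standby_ec2.copy_image(
    Name="prod-app-standby",
    SourceImageId="ami-0123456789abcdef0",   # hypothetical production AMI
    SourceRegion=PRIMARY_REGION,
)
print("Standby AMI:", copy["ImageId"])
```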

State actor hacks your cloud provider

Scenario:
Highly sophisticated hackers go after highly prized targets, and you can’t exactly air-gap (disconnect from the internet) a cloud provider.  If a state actor compromised a cloud provider, and decided to torch their environment, it’s probably an act of war.

Severity:
Catastrophic

Recovery:
Rebuild in a completely different infrastructure, probably on a completely different cloud provider.  Make sure you have a copy of your data.

Prevention Best Practice:
You really can’t do anything to prevent this.  But you can mitigate the risk by ensuring that your critical systems are backed up outside of your cloud provider.
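
Because many non-AWS object stores speak the S3 API, even a simple script can keep an off-provider copy of critical objects. A rough sketch; the endpoint, buckets, and credentials are all hypothetical, and a real job would filter, parallelize, and track progress:

```python
import boto3

SOURCE_BUCKET = "prod-critical-data"          # hypothetical S3 bucket in AWS
DEST_BUCKET = "prod-critical-data-offsite"    # hypothetical bucket at another provider

aws_s3 = boto3.client("s3")

# Many non-AWS object stores expose an S3-compatible endpoint, so boto3 can target them directly.
offsite_s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.other-provider.example",  # hypothetical endpoint
    aws_access_key_id="OFFSITE_KEY_ID",                      # hypothetical offsite credentials
    aws_secret_access_key="OFFSITE_SECRET",
)

# Stream every object across to the offsite copy.
paginator = aws_s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET):
    for obj in page.get("Contents", []):
        body = aws_s3.get_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])["Body"]
        offsite_s3.upload_fileobj(body, DEST_BUCKET, obj["Key"])
```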

Compute vulnerability discloses instance role access tokens

Scenario:
The code you’re running in production has a bug, and an attacker has figured out how to leverage it to dump your environment variables.  Conspicuously on display are the access tokens that your production compute role uses to access AWS.

Severity:
Moderate to Severe

Recovery:
Revoke those access tokens, harden the compute instance, and start repairing the damage that was done.  If it’s bad enough, it might be time to rebuild in a clean, uncompromised environment.
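
Instance-role credentials can’t be deleted the way IAM user keys can, but you can invalidate every session issued before a cutoff time by attaching a deny policy conditioned on aws:TokenIssueTime (this is what the console’s “revoke sessions” action does). A sketch with a hypothetical role name:

```python
import json
from datetime import datetime, timezone

import boto3

iam = boto3.client("iam")

ROLE_NAME = "prod-app-instance-role"   # hypothetical compromised role
cutoff = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Deny everything to sessions created before "now"; the instance refreshes to post-cutoff
# credentials on its own, while the attacker's stolen tokens stop working.
revoke_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {"DateLessThan": {"aws:TokenIssueTime": cutoff}},
    }],
}

iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="RevokeOlderSessions",
    PolicyDocument=json.dumps(revoke_policy),
)
```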

Prevention Best Practice: 
Make sure your production roles and instance profiles have the minimum privilege necessary to do their job, and stay up to date on your patches.
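
Minimum privilege tends to drift unless someone looks at it occasionally. A small audit sketch (the role name is hypothetical) that dumps the managed and inline policies attached to the production role so they can be reviewed for over-broad permissions:

```python
import boto3

iam = boto3.client("iam")
ROLE_NAME = "prod-app-instance-role"   # hypothetical production role

# Managed policies attached to the role.
for policy in iam.list_attached_role_policies(RoleName=ROLE_NAME)["AttachedPolicies"]:
    print("managed:", policy["PolicyName"], policy["PolicyArn"])

# Inline policies defined directly on the role.
for name in iam.list_role_policies(RoleName=ROLE_NAME)["PolicyNames"]:
    doc = iam.get_role_policy(RoleName=ROLE_NAME, PolicyName=name)["PolicyDocument"]
    print("inline:", name, doc)
```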