Building a Resilient Cloud: 6 Keys to Redundancy in AWS

In the ever-evolving landscape of IT infrastructure on AWS, achieving resilience is paramount for IT decision-makers, system architects, and cloud engineers. Here are six crucial elements to consider when designing and implementing a resilient cloud.

1. Embrace the Resilient Cloud Dynamic Duo: High Availability and Disaster Recovery

Achieving resilience in cloud architecture involves effectively incorporating both High Availability (HA) and Disaster Recovery (DR) approaches. HA protects against downtime from common hardware failures and power disruptions. But when HA solutions fall short, DR steps up to save the day (and your business).

AWS simplifies achieving high availability within a region through features like availability zones and regional services. However, extending HA across regions introduces new challenges. Applications initially designed with in-region HA in mind often require a complete reevaluation and rearchitecture to achieve multi-region HA. It’s also important to recognize that no HA approach can fully safeguard against cyber attacks, insider threats, or human errors.

Striking the right balance between costs and benefits typically involves prioritizing HA within a region and disaster recovery across regions. This approach not only aligns with AWS's capabilities but also acknowledges the trade-offs involved in optimizing resilience against diverse threats and challenges.

2. The Easy Button for In-Region HA: AWS Managed Services

One of the primary reasons that in-region HA is relatively easy is that it's built into most of AWS's managed services. For instance, setting up a redundant database deployment traditionally involves installing a second database server, configuring network communication, establishing replication from the primary to the secondary server, and developing (along with testing and managing) software for monitoring and failover handling. With AWS, you can simply enable the Multi-AZ option in RDS and let Amazon take care of all these complexities with a single click. Other managed services are multi-AZ by default, making your HA setup simple. Embracing managed services is fundamental to building a resilient cloud architecture.
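To make this concrete, here is a minimal sketch of turning on Multi-AZ for an existing RDS instance with boto3. The instance identifier, region, and the choice to apply the change immediately are assumptions for illustration, not a prescription.

```python
# Sketch: enable Multi-AZ on an existing RDS instance (instance name is assumed).
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Convert a single-AZ instance to Multi-AZ; AWS provisions the standby,
# manages replication, and handles automatic failover for you.
rds.modify_db_instance(
    DBInstanceIdentifier="app-db",  # assumed instance name
    MultiAZ=True,
    ApplyImmediately=True,          # apply now rather than at the next maintenance window
)
```

The same toggle is available in the console, CloudFormation, and Terraform; the point is that the failover machinery is AWS's problem, not yours.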

3. Back Up Out of Region and Out of Account

By default, AWS's built-in backups are stored in the same region and the same account as the workloads they protect: not great for a true disaster recovery scenario. A crucial step in fortifying data resilience is placing backups outside the production environment, in a different region and outside your production account. Enabling capabilities such as cross-region replication in Amazon S3 establishes geographic redundancy. This not only shields against catastrophic failures but also aligns with compliance requirements. To further enhance security and resilience in the face of compromised systems, implement cross-account backups, coupled with automated backup policies and routine testing, to ensure your backup strategy is reliable and effective.
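As a sketch of what that looks like in practice, the snippet below configures S3 replication from a production bucket to a bucket in another region owned by a different account. The bucket names, account IDs, and replication role are assumptions; both buckets must have versioning enabled, and the role needs the appropriate replication permissions.

```python
# Sketch: replicate a production bucket to a different region and account.
# All names, ARNs, and account IDs below are illustrative assumptions.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_replication(
    Bucket="prod-data",  # source bucket (versioning must be enabled)
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/s3-replication-role",  # assumed role
        "Rules": [
            {
                "ID": "replicate-everything",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter = replicate all objects
                "DeleteMarkerReplication": {"Status": "Enabled"},
                "Destination": {
                    # Destination bucket lives in another region and another account.
                    "Bucket": "arn:aws:s3:::dr-backup-bucket",
                    "Account": "222222222222",
                    # Hand object ownership to the destination account so a
                    # compromise of production can't touch the copies.
                    "AccessControlTranslation": {"Owner": "Destination"},
                },
            }
        ],
    },
)
```

Pair this with AWS Backup plans that copy recovery points into a vault in the backup account, and test restores on a schedule rather than hoping they work.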

4. Recover from Cyber Disasters into a Clean AWS Account

Recovering into a compromised production account is not just a bad move; it's downright foolish. The critical concern is that the attacker who compromised the production account may still have a presence there, potentially having installed backdoors. You don't want to bet your recovery on the hope that every such vulnerability has been addressed. The solution is simple: recover into a clean, locked-down AWS account, where you don't have to worry about backdoors or stolen credentials.
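If your backups have already been copied into a vault owned by the clean account (as in the previous section), recovery can start there. Below is a rough sketch, run with credentials for the clean account, of kicking off a restore from one of those recovery points. The ARNs, role, and resource type are assumptions, and the restore metadata keys vary by resource type (the DynamoDB example shown is just one case).

```python
# Sketch: start a restore in the CLEAN account from a recovery point that was
# copied into that account's backup vault. ARNs and names are assumptions.
import boto3

backup = boto3.client("backup", region_name="us-west-2")

backup.start_restore_job(
    RecoveryPointArn="arn:aws:backup:us-west-2:333333333333:recovery-point:example",  # assumed
    IamRoleArn="arn:aws:iam::333333333333:role/clean-account-restore-role",            # assumed
    ResourceType="DynamoDB",
    # Restore metadata is resource-type specific; for a DynamoDB backup it names
    # the table to create. Other resource types need different keys.
    Metadata={"targetTableName": "app-table-restored"},
)
```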

5. Automate Infrastructure Recovery

When disaster strikes, the urgency to get back online cannot be overstated. Manually rebuilding your cloud environment and data can take a long time, and it introduces the risk of human error, especially if you're not sure you have the right skills on hand. The solution? Automation. Automation, especially using tools like CloudFormation and Terraform in AWS, speeds recovery, improves accuracy, and reduces the risk of human error. Beyond the immediate benefits, automation enables regular and swift testing, increasing confidence that your recovery approach will work when you most need it. This isn't just a convenience; it's a critical practice that optimizes resources, serves as documentation for compliance, and delivers rapid, reliable restoration when it matters most. Don't let disasters dictate your downtime: automate your way to swift, error-free recovery.
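As a minimal sketch of what "automated recovery" can mean, the snippet below recreates infrastructure in the recovery region from a version-controlled CloudFormation template and waits for it to finish before data restoration begins. The stack name, template URL, and region are assumptions for illustration.

```python
# Sketch: rebuild infrastructure in the recovery region from a stored
# CloudFormation template. Names and URLs are illustrative assumptions.
import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")  # recovery region

cfn.create_stack(
    StackName="app-recovery",
    TemplateURL="https://example-bucket.s3.amazonaws.com/app-stack.yaml",  # assumed template location
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required if the template creates IAM resources
)

# Block until the stack is fully created so data-restore steps can follow.
cfn.get_waiter("stack_create_complete").wait(StackName="app-recovery")
```

Because the whole environment comes from code, the same script doubles as documentation and can be exercised in game-day drills without touching production.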

6. Manage Deploy-Time Dependencies

The resilience of your systems is only as strong as your dependencies. Therefore, as you probably already know, it’s crucial to understand the resilience posture of your runtime dependencies.

However, managing deploy-time dependencies is also critical. If your DR plan requires you to re-deploy your applications, deploy-time dependencies such as CI/CD pipelines and open-source packages served from the internet must be available. If those dependencies are impacted by the same disaster event, you might not be able to recover.
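One hedge against that scenario is to mirror deploy-time artifacts somewhere you control in the recovery region, rather than pulling them from the internet or from a pipeline that may be down. The sketch below uploads a pre-built wheelhouse of Python packages to an S3 bucket; the bucket name, local path, and the idea of using `pip download` to produce the wheelhouse are assumptions for illustration.

```python
# Sketch: mirror deploy-time dependencies (a local wheelhouse of packages)
# into an S3 bucket in the recovery region. Names and paths are assumptions.
import pathlib
import boto3

s3 = boto3.client("s3", region_name="us-west-2")  # recovery region

# Wheelhouse produced earlier, e.g. with `pip download -r requirements.txt -d wheelhouse`
wheelhouse = pathlib.Path("./wheelhouse")

for wheel in wheelhouse.glob("*.whl"):
    # Store each package so a re-deploy can install from the mirror,
    # even if PyPI or the CI/CD pipeline is unreachable during the disaster.
    s3.upload_file(str(wheel), "dr-deploy-dependencies", f"wheelhouse/{wheel.name}")
```

The same idea applies to container images, OS packages, and Terraform providers: if a re-deploy needs it, keep a copy where the disaster can't take it away.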

Bonus Tip: Back Up Your Secrets

Sure, you already know the drill about backing up data, but don't forget your secrets. Passwords, API keys, and other credentials are just as critical to recovery as the data itself: without them, your restored systems can't connect to anything. Keep copies of your secrets outside the production region and account so they come in clutch when things go haywire.
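For example, here is a rough sketch, assuming you use AWS Secrets Manager, of replicating every secret to a second region so credentials remain reachable if the primary region goes down. The region names are assumptions, and you would still want to pair this with your cross-account backup strategy for secrets.

```python
# Sketch: replicate all Secrets Manager secrets to a second region.
# Regions are assumptions; secrets already replicated there will raise an error.
import boto3
from botocore.exceptions import ClientError

sm = boto3.client("secretsmanager", region_name="us-east-1")

for page in sm.get_paginator("list_secrets").paginate():
    for secret in page["SecretList"]:
        try:
            sm.replicate_secret_to_regions(
                SecretId=secret["ARN"],
                AddReplicaRegions=[{"Region": "us-west-2"}],  # assumed DR region
            )
        except ClientError as err:
            # Skip secrets that are already replicated (or otherwise can't be).
            print(f"Skipping {secret['Name']}: {err}")
```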