
On the heels of the major outage in AWS’s Northern Virginia (us-east-1) region the morning of October 20, we thought it was a good time to share some technical details about how Arpio is built to be resilient to regional events, ensuring that our customers can count on us when they need us.

In case you missed the event, AWS had a significant outage in its largest region starting around 3 a.m. ET on October 20. AWS has not yet disclosed the cause, but early indications point to a failure in DNS resolution of the DynamoDB API endpoint in us-east-1.

Arpio’s Impact

Arpio has a hard dependency on DynamoDB. We love it as a database, and it’s key to how we deliver a massively scalable and massively reliable platform for cloud resilience. But when DynamoDB goes down in us-east-1, Arpio is impacted.

Luckily, we have known from day 1 that to be in the AWS disaster recovery business, we have to be resilient to an AWS regional disaster ourselves. We built Arpio as a multi-region workload that operates in us-east-1 (Northern Virginia), us-east-2 (Ohio), and us-west-2 (Oregon). The goal was to ensure that Arpio can survive even the unlikely event of a two-region outage and still recover customer workloads.

Our “processing region” algorithm for any given workload ensures that we can help a customer operate in safe, unaffected regions in the event of an outage. For example, if a customer’s production workload lives in us-east-1, Arpio will choose between us-east-2 and us-west-2 for running recoveries, ensuring we can recover us-east-1 workloads when us-east-1 is impaired.
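To make the idea concrete, here is a minimal sketch of the filtering step, not Arpio’s actual implementation. The function name and the constant are hypothetical; the only rule taken from the text above is that a workload’s production region is never a candidate for processing it.

```python
# Hypothetical illustration only; names are assumptions, not Arpio's code.
DEPLOYMENT_REGIONS = ("us-east-1", "us-east-2", "us-west-2")


def candidate_processing_regions(primary_region, deployment_regions=DEPLOYMENT_REGIONS):
    """Return the deployment regions that can process a workload whose
    production environment lives in primary_region."""
    # Exclude the production region so an outage there cannot block recovery.
    return [region for region in deployment_regions if region != primary_region]


# A workload in us-east-1 is processed from one of the other two regions.
print(candidate_processing_regions("us-east-1"))
```

How the final region is picked from the candidates is not described here, so the sketch stops at the filter.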

Recovering TO us-east-1

Ironically, because Arpio’s cross-region disaster recovery works as intended for us-east-1 workloads, the impact of a us-east-1 outage on Arpio falls mainly on customers who host their production workloads in other regions. For customers who use us-east-1 as a recovery region, our processing region algorithm might ordinarily choose us-east-1 to execute their backups and recoveries.

During these events, we make a deliberate decision about whether to keep the impaired region in the list of Arpio deployment regions or remove it. Once the region is removed, the algorithm chooses a different processing region, and everything picks up cleanly.
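The effect of removing a region can be sketched as follows. This is an illustration under stated assumptions, not Arpio’s code: the function name is hypothetical, and the rule that the recovery region is preferred when it is available is an assumption made only to reproduce the behavior described above.

```python
# Hypothetical illustration only; names and the preference rule are assumptions.
def choose_processing_region(primary_region, recovery_region, deployment_regions):
    """Pick a region to run backups and recoveries for one workload."""
    # Never process a workload from its own production region.
    candidates = [r for r in deployment_regions if r != primary_region]
    if not candidates:
        raise RuntimeError("no deployment region available for this workload")
    # Assumed preference: use the recovery region when it is a deployment region.
    if recovery_region in candidates:
        return recovery_region
    # Otherwise fall back to the first remaining deployment region.
    return candidates[0]


regions = ["us-east-1", "us-east-2", "us-west-2"]

# Production in us-west-2, recovering to us-east-1, all regions healthy.
before = choose_processing_region("us-west-2", "us-east-1", regions)

# An operator removes the impaired region from the deployment list.
regions.remove("us-east-1")
after = choose_processing_region("us-west-2", "us-east-1", regions)

print(before, after)
```

In this sketch the workload is processed in us-east-1 before the removal and in us-east-2 afterward, without any change to the workload’s own configuration.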

Resilient Architecture Saves the Day

Overall, Arpio’s architecture performed as designed during the event: us-east-1 workloads remained fully protected throughout the outage, and recovery capabilities remained available for all customers. Backup jobs were delayed for some non-us-east-1 workloads but fully recovered around 5:30 a.m., when the DynamoDB DNS issue was resolved.

***

If you’d like to learn more about how you can safeguard your critical applications and recover quickly from any regional event, get in touch or request a demo of Arpio!