How Arpio Outage Recovery Ensures You'll Never Miss Your SLA

Think back to your most recent bad outage. How was it handled? Was it orderly and predictable? Were you able to detect, diagnose, and fix the problem before your customers were impacted? Did you have confidence throughout that your services could be restored in a timely manner? And did you have reasonable alternatives if the outage couldn’t be resolved quickly?

Or, was the process completely ad-hoc? Was it stressful? Were you reliant on one or two experts to perform bespoke troubleshooting, racing to find a root cause before your largest customers screamed and your CEO got involved?

Most outages are frightful events, and you probably invest significantly to ensure they don’t happen in the first place. But despite these efforts, occasionally things go down. And when that happens, most companies meander through an unpracticed and high-stakes diagnostic process, with unpredictable outcomes. There must be a better way.

Introducing Arpio Outage Recovery

Indeed there is a better way. This may sound like magic, but it is possible to restore service in minutes, no matter what the cause of an outage may be. It is possible to restore service without even finding and fixing the cause of the outage. And because of this, you can begin restoring service as soon as an outage is detected, in parallel with your troubleshooting efforts to diagnose the root cause.

To achieve this, we’re going to rely on the tried and tested techniques of redundancy and failover. You’re surely using these techniques in your infrastructure today. But redundancy and failover are typically applied at the component level, and only suffice for component-level failures.

If the root cause is something bigger — data loss, ransomware, cloud-platform impairment — a component-level approach falls short.

But what if this redundancy approach were applied across your entire application, workload, or environment? What if you could spin up a duplicate environment, with all of your data, in a different location, in a matter of minutes? And what if you could roll back that environment (when appropriate) to an earlier point in time, before the outage struck?

This is Arpio’s concept of outage recovery. Essentially, let’s apply the concepts of redundancy and failover above the component level – at the application level – to quickly resolve outages from failures at any level.

Is This New?

Outage recovery isn’t entirely new. You’re probably familiar with the concept of “disaster recovery,” whereby workloads and data are re-created in an alternate location in the event of an IT catastrophe. Disaster recovery has been around for ages.

But the downsides of disaster recovery solutions are punitive in so many ways that they’re not appropriate for anything short of an absolute business catastrophe.

Penalty #1: They’re tremendously expensive to implement.

DR solutions are custom-engineered to the specific architecture of a given application. They take weeks to implement and require ongoing maintenance as your environment evolves. They may require duplicative hardware purchases for the recovery environment. And they rely on specialized software licenses with hefty price tags. Because of these expenses, they’re only appropriate to implement for the most mission-critical services in deep-pocketed organizations.

Penalty #2: They’re slow to recover.

DR solutions focus on data protection and restoration. But to restore service, you need to recover your data and your network, operating systems, containerized workloads, application software, security, IAM, and everything else that makes your application tick. With typical investments in automation, this recovery takes several hours to execute. Without automation, it can take days.

Penalty #3: Data loss.

DR solutions are built for catastrophic situations. They ask the question, “would you like none of your data, or most of your data?” and you gladly choose the latter. But with most DR solutions, the last few hours of data are lost, and that’s a tough pill to swallow. You probably won’t swallow it for a common outage.

Penalty #4: Manual failback.

Most DR solutions are minimally engineered to recover from a disaster. They aren’t built to restore service in the primary environment when the disaster has concluded. If you pull the trigger on recovery, you’ve signed up for an expensive and painful process of manual failback when the outage’s root causes are addressed.

Outage recovery solutions don’t incur these penalties. Suddenly, you can use application-level failover to confidently recover from even minor impairments.

How Does It Work?

Arpio Outage Recovery solves the aforementioned problems.

General-Purpose, Platform-Layer Approach

Rather than engineering a custom recovery solution for each workload, Arpio Outage Recovery leverages AWS’s platform-layer replication and recovery techniques to clone your entire AWS environment. Now, a single solution applies consistently across all workloads running on the platform. This general-purpose approach eliminates the expense of custom-engineering and specialized software licenses associated with today’s disaster recovery solutions.

Comprehensive Recovery Automation

Arpio Outage Recovery automates the complete recovery process with a single click. Your workload is up and running in just a few minutes.

Zero Data Loss

Arpio Outage Recovery leverages real-time replication to synchronize data to your recovery environment. In common outages, where replication typically hasn’t failed, all data is available in the recovery environment and there is no data loss. In catastrophic outages where replication may also fail, the potential data loss is measured in seconds.

It’s important to note that in some outages, a little data loss is actually desirable. If your outage is the result of data corruption, or possibly ransomware, your replication strategy may have replicated the outage to your recovery environment. For this reason, Arpio Outage Recovery allows you to rewind to a point in time before the outage occurred.
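
To make the rewind idea concrete, here is a minimal sketch in Python of how one might choose a recovery point: take the most recent point captured strictly before the outage began, and report how much data that choice gives up. This is an illustration only, not Arpio's implementation or API. The function names, the list of timestamps, and the incident time are all hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from typing import Optional, Sequence


def pick_recovery_point(
    recovery_points: Sequence[datetime],
    incident_start: datetime,
) -> Optional[datetime]:
    """Return the latest recovery point strictly before incident_start.

    Hypothetical helper: returns None when every available point was
    captured at or after the incident began (nothing clean to rewind to).
    """
    ordered = sorted(recovery_points)
    # bisect_left gives the index of the first point >= incident_start,
    # so everything to its left predates the incident.
    idx = bisect_left(ordered, incident_start)
    return ordered[idx - 1] if idx > 0 else None


def data_loss_window(recovery_point: datetime, incident_start: datetime) -> timedelta:
    """Time between the chosen recovery point and the start of the incident."""
    return incident_start - recovery_point


# Hypothetical example: points captured every five minutes, corruption at 12:07.
points = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 5),
    datetime(2024, 1, 1, 12, 10),
]
incident = datetime(2024, 1, 1, 12, 7)

chosen = pick_recovery_point(points, incident)
if chosen is not None:
    print(f"Rewind to {chosen}, giving up {data_loss_window(chosen, incident)} of data")
```

The 12:10 point is skipped because it was captured after the corruption began and would carry the problem into the recovery environment. The trade-off is the small, deliberate data loss between the chosen point and the start of the incident.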

Automated Failback

Arpio Outage Recovery is not naive about the need to return service to your production environment when the original outage is remediated. It solves for both failover and failback. When the initial outage is resolved, the same automation philosophy that made recovery a seamless experience makes failback a seamless experience as well.

By addressing the penalties associated with traditional DR solutions, Arpio Outage Recovery is applicable for use in more than disaster scenarios. It’s a reliable solution to quickly restore service during any outage.

Wrapping Up

Our industry assumes that the only way to eliminate downtime is to eliminate every possible outage. But we all know this is an unattainable goal, and when an outage does occur we hope we’ll be able to find and fix the causes before business impacts become meaningful.

Arpio Outage Recovery gives you a reliable and predictable alternative to the ad-hoc recovery processes that organizations rely upon today. Instead of depending on bespoke troubleshooting and diagnostic techniques, Arpio Outage Recovery can restore your entire environment in an alternate location, and without data loss, allowing you to resume operations in a matter of minutes.

Want to know more? Let’s have a chat.
