Doug Neumann
I really appreciate you guys having us, and I'm excited to talk through what we're doing and what problem it really solves. That really starts with understanding the disaster risk of workloads that run in the cloud and AWS, and I thought maybe the best way to do this would be to tell three stories. I want to start by walking us through a couple of stories about outages and disasters that have happened. These are all relatively recent, just to help people think about what they need to be contemplating for their own resilience strategies. I think DR for on-premises workloads is a pretty well-understood problem these days, but when you move those into the cloud and inherit some additional resilience that the cloud platforms provide, you have to ask the question: well, what problems do I now retain responsibility for?

I think this is a good way for us to walk through that. So I want to start by talking about the most recent major AWS outage. This was on December 7th, and it actually occurred to me this morning that it was the anniversary of the Pearl Harbor attack. I'm not sure I should really be equating those two disasters, but all that said, back on December 7th Amazon had a major outage of the Northern Virginia region of AWS. What happened here was that, throughout the day, they routinely scale their own services within AWS up and down, and they had an internal auto-scaling event occur. Somehow they scaled to a point that triggered a scalability bug in their systems. That caused a bunch of systems to flood a specific network segment with what appears to be DNS requests. They effectively DoS-ed their own internal DNS services, which are essential for running all of the services in the Northern Virginia region.

The net result was that the control plane for AWS, their API surface area, was broken. It was so severely broken that Amazon had a lot of trouble actually using their own tools, which are built on their own platforms, to troubleshoot and discover the root cause. Most people know about this outage because on December 7th you couldn't log into the AWS console. There were some back doors that might get you into a different region of AWS, but the login page for the AWS console is hosted in the Northern Virginia region. With that region being down, you had to know the special URL to get to another region. And certain services, like the single sign-on service for AWS, live in the Northern Virginia region, so you couldn't use SSO to log in. Lots of pain ensued for people during this day.

Many companies were down entirely, especially companies with very dynamic, cloud-native workloads that relied on this, and it was a seven-hour outage. I've lived through a lot of outages in my career, and seven hours is a very painful outage to live through. It's not the most painful outage; we're going to talk about some others that are a lot worse. The thing that I remember from dealing with prior AWS outages is that in hour one, you don't know that it's going to be over at seven hours. In hour two, you still don't know. You have zero control during these events to actually mitigate your own situation if you have not invested in that up front. The way that companies sidestepped this particular outage was to have a multi-region approach where they could move their traffic to a different region. As we'll talk about later on, that's an essential part of preparing for cloud outages and being resilient to these types of events.

Leon, you remember this event? Did this impact you guys?

Leon Thomas
Yes, we were definitely all hands on deck.

Doug Neumann
Yup. So that's an AWS outage. That's what I think a lot of people think about when they're considering DR in the cloud: what do I do if AWS is down? But there are probably a lot more companies dealing with ransomware, trying to understand their ransomware risk or recover from a ransomware event. These days you hear about this a lot more often. I wanted to highlight one that happened, also in December, and it actually spanned into January. It's a company that you guys may know called Kronos. They do a lot of different workforce management solutions: things like timecards, punching in employees, and so on. They operate a lot of infrastructure for those services. Well, back in mid-December they were hit with a ransomware attack. They actually have not been terribly transparent about what happened and why it took them so long, but the impact was that the services they provide to companies, services supporting 8 million different employees, were offline for 42 days, in the holiday period when everybody wants to make sure they're getting their paycheck. The companies that use the Kronos service were forced to go and invent manual processes for running payroll, processes that are perhaps error prone, slow to execute, and that delay people being paid. 8 million employees in the United States were effectively either not getting paid, getting paid late, or getting paid inaccurately for a 42-day period. I think what's really interesting about this one is not just the outage, but the impact this ransomware attack will have on Kronos's business. These are major companies using the service. I think this screenshot here mentions Tesla and Pepsi, who were unable to use these services during this period. These companies are currently pressing lawsuits against Kronos to recoup their own costs. I cannot imagine a customer that sues your business and intends to remain a customer.
They will lose a lot of business as a result of this. Think about how this represents what is fundamentally an existential threat for a business, where a ransomware attack might undermine their service so comprehensively that, a year from this attack, they probably won't look like the same business they were beforehand. It's really all because they were unprepared to deal with a major ransomware event, and it took them 42 days to recover from it. So, Leon, when was the last time you had to help a customer with a ransomware attack?

Leon Thomas
Last week, as a matter of fact!

Doug Neumann
Oh really? Okay. So these are happening all the time. You can see a lot of them in the press, certainly when they're this big. You also don't hear about a lot of them, because nobody wants to raise their hand and share publicly that they've been the victim of a ransomware attack. But dealing with these can be catastrophic if you haven't really thought them through. Your disaster recovery strategy needs to contemplate recovering not just from infrastructure outages and data center fires and things like that, but also from cyber events such as this one. The next story I wanted to tell is actually extremely recent; they just resolved it last week. There's a company called Atlassian that many of you may know and use. They build very popular tools for project management, IT service management, software engineering efforts, and things like that.

At the beginning of this month, they were performing some maintenance operations, which involved cleaning up some data from a legacy service they were deprecating. They had one team producing the IDs of the data that should be cleaned up, as an input to this process, and another team executing a script to clean up that data. A miscommunication in that process resulted in the wrong IDs being provided: instead of being the IDs for the specific data that needed to be cleaned up, they were the IDs for the sites that hosted that data. The team then went and ran that script and accidentally deleted the sites of 400 of their largest customers. I think the number I heard was something like 500,000 users impacted. Now, Atlassian is a very mature organization. They have a lot of IT processes in place and they have good backup practices. They also have a six-hour recovery time objective for outages of their services, but their backup processes are more automated than their recovery processes are. It turns out that six-hour recovery time objective applies at an individual site level. When you have hundreds of sites impacted at the same time, you can only do so many in parallel. It ultimately took them 15 days to recover from this particular outage. They are back online, so kudos to them. I'm sure this was not a fun event to go through, but 15 days for a company like Atlassian to not provide service to some of their largest customers is another huge reputational hit. And it was, you know, a simple miscommunication. There was no malice involved. There was no infrastructure outage. There just happened to be two teams that didn't communicate as effectively as they were supposed to, and it resulted in this major event for them. I can't tell you how many times I've been involved in something like this early in my career. I don't know if you've seen this before.
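To make the math in this story concrete, here's a small Python sketch. The six-hour per-site RTO, the 400 lost sites, and the roughly 15-day (360-hour) recovery come from the story itself; the concurrency figure is a back-of-the-envelope inference of mine, not anything Atlassian published.

```python
import math


def total_recovery_hours(sites, rto_hours_per_site, parallel_restores):
    """Wall-clock recovery time when only `parallel_restores` sites can
    be restored at once: number of batches times the per-site RTO."""
    batches = math.ceil(sites / parallel_restores)
    return batches * rto_hours_per_site


# One site at a time meets the advertised six-hour RTO:
print(total_recovery_hours(1, 6, 1))    # 6 hours

# 400 sites with ~7 concurrent restores (the inferred figure) takes
# 58 batches * 6 hours = 348 hours, which lines up with the roughly
# 15 days (360 hours) Atlassian actually needed.
print(total_recovery_hours(400, 6, 7))  # 348 hours
```

The point the sketch makes is that a per-item RTO says almost nothing about a mass-loss event: total recovery time is dominated by how many restores you can run in parallel.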

Leon Thomas
I was just going to comment that, judging by the names in the participant group here, I see several Atlassian customers. It certainly can happen to anybody.

Doug Neumann
Hopefully the participants here were not impacted by this particular outage, but if they were, hopefully they're at least happy to have their stuff back. So, the last thing I want to talk about isn't really a story about an outage; it's something that's really relevant right now, which is that we're obviously at a time of some global uncertainty, I'll call it. The US government, the Department of Homeland Security, has asked us, as a community of enterprises in the United States that often operate critical infrastructure or systems that support critical infrastructure, to be on a heightened level of alert. It's the "Shields Up" alert: they're anticipating that there is going to be an increase in cyber activity, in this case state-sponsored cyber activity. Certainly that might look like ransomware attacks, or it might look like other kinds of things, but we should all be anticipating right now what would happen if my particular properties got attacked, as well as what would happen if the infrastructure of our country got attacked. If, for example, the power grid behind a region of the country where we're operating a cloud workload were to go down for an extended period of time. I think it's a really good time to be thinking about what our disaster recovery strategy is, how we're prepared for these kinds of events, and whether we're treating this with the appropriate level of urgency. That's effectively the landscape of disasters that I think people really should be thinking about as they're wondering what investments they should be making as a company. The next thing I think we should do is talk about how we avoid these outcomes. If this were to happen to my organization, how would I recover from it? That gets us talking about what disaster recovery for a cloud workload really looks like.
I want to talk about the three things that I believe, and that we tell our customers, are most essential to be contemplating when thinking about the resilience of their workloads, and it starts with good backup processes.

I think a lot of us benefit from the built-in backup capabilities in a cloud platform like AWS: being able to turn on backup, or actually inheriting backup for free because it's built into something like a managed database platform. What we really need to be considering is what would happen if we had a cyber event occur, or a malicious employee; there are some other stories I can tell you about employees connecting into environments and doing something unfathomable because they were upset. If they were to go in and compromise this environment, would they be able to find our backups? Would they be able to undermine our ability to recover from the event that's about to occur? I can't tell you how many times I've heard stories of people getting hit by ransomware where the ransomware also encrypted the backups, and those encrypted backups are not valuable when you're trying to restore your data and not pay a ransom.

The essential practice here is to consider how we are protecting those backups. Are we air-gapping our backups and moving them into a location where a bad actor would not be able to get to them? That location generally should be outside of your production environment, so that if your production environment is compromised, the backups aren't sitting right next to it. In AWS, there are a couple of different ways you can accomplish this. The way that is most common, and what we advocate for our customers to do, is to move those backups into a different AWS account. That different account is a completely different security domain. The attack vectors that might let a bad actor get into your production account don't let them get over there. You don't necessarily need to share access to that account with anybody on your team if you've got an automated solution in place to replicate everything over there. Fundamentally, air-gapping your backups is an essential practice that everybody needs to be invested in if they aren't already.
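As a minimal sketch of what the cross-account pattern can look like with boto3 (the AWS SDK for Python): the snapshot ID, account ID, and regions below are placeholders, boto3 and configured credentials are assumed, and a real setup would also need to handle KMS key sharing for encrypted snapshots.

```python
def share_snapshot_with_backup_account(snapshot_id, backup_account_id, region):
    """Run with *production* credentials: grant a separate backup account
    permission to copy an EBS snapshot."""
    import boto3  # AWS SDK for Python; assumed installed and configured

    ec2 = boto3.client("ec2", region_name=region)
    ec2.modify_snapshot_attribute(
        SnapshotId=snapshot_id,
        Attribute="createVolumePermission",
        OperationType="add",
        UserIds=[backup_account_id],
    )


def copy_shared_snapshot(shared_snapshot_id, source_region, dest_region):
    """Run with the *backup* account's credentials: make an independent
    copy that production-side credentials can no longer delete or encrypt."""
    import boto3

    ec2 = boto3.client("ec2", region_name=dest_region)
    resp = ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=shared_snapshot_id,
        Description="Air-gapped backup copy owned by the backup account",
    )
    return resp["SnapshotId"]
```

The property that matters is ownership: the copy made in the second step belongs to the backup account, so credentials stolen from production can't reach it.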

The second thing that's really important to understand is the level of redundancy you have in your cloud workload. There is a lot of redundancy built into the cloud platform, into AWS itself. In particular, when you deploy a workload into AWS, most workloads are deployed into a specific region of AWS. Within that region, there is a concept called availability zones. You can think of availability zones as independent data centers that are in the same general geographic region. They might be separated by miles or tens of miles, but you can deploy your workloads so that they run across multiple zones. A traditional data center outage like a fire or a flood hopefully wouldn't impact multiple zones, and your workload could continue operating in the other zone. But the thing that's really important to understand is that availability zones still share a lot of foundational infrastructure, and this December 7th outage we talked about is a really good example. All of the availability zones were impacted by this control plane outage. The AWS control plane is regional; it's not specific to an availability zone. Therefore, it didn't matter if you were deployed in multiple AZs: if your workload depended on the AWS control plane, then your workload was not going to work that particular day. Now, in addition to availability zones, there are multiple regions, and when Amazon builds regions, they build them to be completely independent. Each region of AWS has its own control plane, and its own implementation of regional services like S3. Something like IAM is a global service, but it has regional footprints.
Fundamentally, if you want to be resilient to any AWS outage in a region, you have to build to be multi-region, and you have to contemplate what it would mean to recover your workload in a different region if your primary region were not available.
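If you want to see this zone-versus-region structure for yourself, a small boto3 sketch can list both (the region name is a placeholder; boto3 and AWS credentials are assumed):

```python
def show_zone_and_region_structure(region="us-east-1"):
    """List the availability zones inside one region, then every region.
    AZs share a regional control plane; regions are independent of each
    other, which is why multi-region is the stronger posture."""
    import boto3  # AWS SDK for Python; assumed installed and configured

    ec2 = boto3.client("ec2", region_name=region)

    # Availability zones within the chosen region.
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print("AZ:", az["ZoneName"], az["State"])

    # Independent regions, each with its own control plane.
    for r in ec2.describe_regions()["Regions"]:
        print("Region:", r["RegionName"])
```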

The last thing that's really essential is practice. It's so easy to say, "I turned on backups; I'm not going to worry about this problem," but then actual disaster strikes and you find out that your backup process was not successful, or you hadn't turned it on for all of the pieces of your infrastructure that you needed. Or, even though we have backups, we haven't practiced the recovery process. We don't understand what it means to recover 400 lost properties; we only know what it means to recover a single lost property, like in the Atlassian outage. It's really critical that you don't just invest in the backup process; you invest in and practice the recovery process, so that your team is trained on what it takes to recover your service in the dire event where you need it.

If you aren't authentically testing your process by actually bringing up your service and validating that it works in its recovered form, then you can't know that you can recover it within the time objective you've specified. You don't know what all of the steps are. You don't know the unknowns of your recovery process. That's what we purport is essential for people to understand.

That gets us to what Arpio is. What we built with Arpio is intended to solve these exact problems. The idea is that you could go invest in building out disaster recovery for your own workload yourself: stitching together the backup processes for all of the different services you're using, and figuring out how to automate, or manually execute, recreating all of the infrastructure that would be necessary in the event of an outage. But building disaster recovery is really not strategic for most organizations. Their time is better spent investing in the business and the things that will help them grow the business and win new customers. You have to have a DR plan to responsibly operate your infrastructure, but in the best case scenario you don't actually have to spend a lot of time on that DR plan. Arpio solves this problem for AWS workloads. What we've done is look at the AWS platform and ask, for each of the 200 or so services that exist in AWS, what is the right strategy to recover that service from an outage? If it's a data-heavy service, what does it mean to back up that data? If it's an infrastructure-only service, how do you automate recreating that infrastructure exactly as it was in its original environment? What you get is functionally equivalent to what you started with, and you don't have to manually stitch all of those things back together. Arpio gives you multi-region redundancy by default: you go into our console, and I'll show you this in a second, check a couple of checkboxes, and tell us which region you want to be able to recover in. Arpio handles replicating everything over there and gives you the ability to turn that up at a moment's notice if you need it.
Similarly, Arpio also protects your data so that you do have that air-gapped solution, and it makes sure you are resilient to an outage of your production environment that might undermine your recovery capability by attacking the backups themselves. Lastly, the product focuses on automating all of this. I often say backup is generally always automated, but the recovery process is oftentimes very manual. A lot of people I talk to say, "Well, if my services go down, it might take me a day to recover them, but that's going to be okay for the business." The problem is that if it takes you 24 hours to recover your service, it takes you 24 hours to test your ability to recover the service. And that is three full working days. Very few people have an appetite to actually spend three days of uninterrupted focus on a recovery process. The great thing about an automated recovery solution is you can spin that environment up in 10 minutes, you can smoke test it in 10 more minutes, and then you can turn it back down. Thirty minutes later, you've exercised your full disaster recovery capability. You know that it's going to work for you if you need it. That is the idea behind Arpio and the automation that it brings.
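The arithmetic behind that last point, as a small sketch: the 24-hour and 10-minute figures come from the discussion above, while the eight-hour working day is an assumption of mine.

```python
WORKDAY_HOURS = 8  # assumed length of an uninterrupted working day


def drill_working_days(recovery_hours, smoke_test_hours, teardown_hours=0.0):
    """Working days consumed by one full end-to-end recovery rehearsal."""
    total = recovery_hours + smoke_test_hours + teardown_hours
    return total / WORKDAY_HOURS


# Manual recovery: 24 hours of recovery alone is three full working
# days per drill, before you even validate anything.
print(drill_working_days(24, 0))  # 3.0

# Automated recovery: 10 minutes up, 10 minutes of smoke tests,
# 10 minutes of teardown -- the whole drill fits in half an hour.
print(drill_working_days(10 / 60, 10 / 60, 10 / 60) * WORKDAY_HOURS)  # 0.5
```

The practical consequence is that drill cost, not backup cost, is what determines whether a team actually rehearses; cheap drills get run, expensive ones get skipped.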