{"id":265,"date":"2020-06-14T08:41:49","date_gmt":"2020-06-14T08:41:49","guid":{"rendered":"http:\/\/box5442.temp.domains\/~arpioio\/?p=265"},"modified":"2020-06-14T08:41:49","modified_gmt":"2020-06-14T08:41:49","slug":"behind-the-ibm-cloud-outage","status":"publish","type":"post","link":"https:\/\/arpio.io\/staging\/8013\/behind-the-ibm-cloud-outage\/","title":{"rendered":"What the IBM Cloud Outage Teaches Us About Resilience Engineering"},"content":{"rendered":"<body><p class=\"\">Last week, IBM Cloud customers were impacted by a <a href=\"https:\/\/www.bleepingcomputer.com\/news\/technology\/ibm-cloud-global-outage-caused-by-incorrect-bgp-routing\/\" target=\"_blank\" rel=\"noopener noreferrer\">3-hour global networking outage<\/a>, rendering those customers\u2019 services unavailable for much of Tuesday afternoon\/evening. Compared to AWS, Microsoft Azure, and Google Cloud, IBM is a small player in the public cloud, yet this outage still <a href=\"https:\/\/news.ycombinator.com\/item?id=23471993\" target=\"_blank\" rel=\"noopener noreferrer\">managed to impact about 10%<\/a> of the services on the internet.<\/p>\n<p class=\"\">In this blog post, we\u2019ll break down what happened last week, and why this class of outage has been hitting all of the cloud providers lately. Finally, we\u2019ll outline how Arpio makes it easy for AWS customers to avoid this and similar outages when they happen.<\/p>\n<h4>What do we know?<\/h4>\n<p class=\"\">Reports of a major outage in the IBM Cloud began surfacing on social media around 4:50 pm CDT on Tuesday, June 9th. As with other cloud outages we\u2019ve seen in the past, the impacted services included IBM\u2019s status website, so much of the conversation centered on the lack of information about what was going on.<\/p>\n<p class=\"\">Initial reports focused on IBM\u2019s Dallas POP being inaccessible, but pretty quickly the other locations in IBM Cloud were also implicated. 
This appears to have been a global outage.<\/p>\n<p class=\"\">IBM released a statement the next day that points to the root cause: \u201cAn investigation shows an external network provider flooded the IBM Cloud network with incorrect routing.\u201d This was yet another case of BGP hijacking.<\/p>\n<h4>What is a BGP Hijack?<\/h4>\n<p class=\"\">The internet is a \u201cnetwork of networks,\u201d which means that independent computer networks are themselves interconnected. At the highest level, large-scale internet service providers connect to each other so that customers of one ISP can access websites and services that operate through another ISP.<\/p>\n<p class=\"\">BGP, the Border Gateway Protocol, is the mechanism that these ISPs use to publish connectivity to each other.<\/p>\n<p class=\"\">A BGP hijack occurs when one provider erroneously or maliciously publishes bad connectivity information to its peers. Typically, the publishing provider advertises an unusually attractive route to a given destination, encouraging its peers to send traffic in that direction. If the network infrastructure on that route cannot handle all of the traffic it suddenly starts receiving, an outage occurs.<\/p>\n<p class=\"\">BGP hijacks are sadly common on the internet. In April, <a href=\"https:\/\/www.zdnet.com\/article\/russian-telco-hijacks-internet-traffic-for-google-aws-cloudflare-and-others\/\" target=\"_blank\" rel=\"noopener noreferrer\">suspicious routes published by a Russian telecom provider<\/a> caused a one-hour outage for customers accessing Amazon, Facebook, and Google from over 200 networks around the world. In 2018, hackers used BGP to <a href=\"https:\/\/www.internetsociety.org\/blog\/2018\/04\/amazons-route-53-bgp-hijack\/\" target=\"_blank\" rel=\"noopener noreferrer\">hijack Amazon Route 53 traffic<\/a> and direct customers of the MyEtherWallet cryptocurrency service to an imposter site. 
Last year, <a href=\"https:\/\/blog.cloudflare.com\/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today\/\" target=\"_blank\" rel=\"noopener noreferrer\">Cloudflare publicly shamed Verizon<\/a> for a BGP hijack that resulted in a multi-hour outage for many of the largest sites on the internet.<\/p>\n<h4>The sad thing about BGP Hijacks<\/h4>\n<p class=\"\">Usually, when a provider has an outage, the fault is its own. With BGP hijacks, that\u2019s not necessarily the case.<\/p>\n<p class=\"\">The IBM Cloud outage last week apparently resulted from another provider publishing erroneous routes to IBM. IBM probably should have rejected those routes, but the external provider is at least partly to blame.<\/p>\n<p class=\"\">The recent AWS and Cloudflare outages resulted entirely from external providers exchanging erroneous routes to those platforms. The cloud platforms weren\u2019t even involved. Yet their services, and their customers, paid the price.<\/p>\n<h4>What can we learn?<\/h4>\n<p class=\"\">There is a <a href=\"http:\/\/www.manrs.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">movement in the industry<\/a> to implement technical solutions that reduce the risk of BGP hijacks. But stepping back, these events are just more examples of the numerous things that can break systems on the internet.<\/p>\n<p class=\"\">We can, and should, continue to mitigate the risks of systemic failures by improving the underlying systems we build upon. We need to fix the vulnerabilities in BGP that result in these problems. And we\u2019ll need to fix the new vulnerabilities that arise when we fix the current ones.<\/p>\n<p class=\"\">But this game of whack-a-mole will never end. There will always be resiliency problems that have not yet been solved. 
So we need to concurrently invest in solutions that allow our systems to persevere and recover in the face of these unsolved, and often unknown, vulnerabilities.<\/p>\n<p class=\"\">In computing, this has always been the purview of redundancy. Decades ago, that was about uninterruptible power supplies and redundant arrays of inexpensive disks. These days, the common practice is metropolitan-area geo-redundancy, such as AWS\u2019s Availability Zones. But metropolitan-area redundancy cannot protect against an event that impairs an entire region; for that, we need to engineer resilience at a global scale.<\/p>\n<h4>How does Arpio help?<\/h4>\n<p class=\"\">Arpio is a global resilience solution for applications that run in Amazon Web Services. When a BGP hijack (or another event) impairs an AWS region, Arpio makes it quick and easy to continue operating in another part of the world.<\/p>\n<\/body>","protected":false},"excerpt":{"rendered":"<p>Last week, IBM Cloud customers were impacted by a 3-hour global networking outage, rendering those customers\u2019 services unavailable for much of Tuesday afternoon\/evening. 
Compared to AWS, Microsoft Azure, and Google&#8230;<\/p>\n","protected":false},"author":1,"featured_media":266,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","content-type":"","inline_featured_image":false,"footnotes":""},"categories":[3],"tags":[],"class_list":["post-265","post","type-post","status-publish","format-standard","has-post-thumbnail","category-blog"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>What the IBM Cloud Outage Teaches Us About Resilience Engineering \u2014 Arpio<\/title>\n<meta name=\"description\" content=\"Last week\u2019s IBM Cloud outage is just another example of the many things that can go wrong for applications on the internet.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"What the IBM Cloud Outage Teaches Us About Resilience Engineering \u2014 Arpio\" \/>\n<meta property=\"og:description\" content=\"Last week\u2019s IBM Cloud outage is just another example of the many things that can go wrong for applications on the internet.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/\" \/>\n<meta property=\"og:site_name\" content=\"Arpio Disaster Recovery Made Easy\" \/>\n<meta property=\"article:published_time\" content=\"2020-06-14T08:41:49+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/arpio.io\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg\" \/>\n\t<meta property=\"og:image:width\" content=\"300\" \/>\n\t<meta property=\"og:image:height\" content=\"200\" \/>\n\t<meta property=\"og:image:type\" 
content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"6805pwpadmin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"6805pwpadmin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/\"},\"author\":{\"name\":\"6805pwpadmin\",\"@id\":\"https:\/\/arpio.io\/staging\/8013\/#\/schema\/person\/0a2437a37056190db7e46201a6a65095\"},\"headline\":\"What the IBM Cloud Outage Teaches Us About Resilience Engineering\",\"datePublished\":\"2020-06-14T08:41:49+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/\"},\"wordCount\":774,\"commentCount\":0,\"image\":{\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/arpio.io\/staging\/8013\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg\",\"articleSection\":[\"Blog\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/\",\"url\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/\",\"name\":\"What the IBM Cloud Outage Teaches Us About Resilience Engineering \u2014 
Arpio\",\"isPartOf\":{\"@id\":\"https:\/\/arpio.io\/staging\/8013\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/arpio.io\/staging\/8013\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg\",\"datePublished\":\"2020-06-14T08:41:49+00:00\",\"author\":{\"@id\":\"https:\/\/arpio.io\/staging\/8013\/#\/schema\/person\/0a2437a37056190db7e46201a6a65095\"},\"description\":\"Last week\u2019s IBM Cloud outage is just another example of the many things that can go wrong for applications on the internet.\",\"breadcrumb\":{\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#primaryimage\",\"url\":\"https:\/\/arpio.io\/staging\/8013\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg\",\"contentUrl\":\"https:\/\/arpio.io\/staging\/8013\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg\",\"width\":300,\"height\":200},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/arpio.io\/staging\/8013\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"What the IBM Cloud Outage Teaches Us About Resilience Engineering\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/arpio.io\/staging\/8013\/#website\",\"url\":\"https:\/\/arpio.io\/staging\/8013\/\",\"name\":\"Arpio Disaster Recovery Made Easy\",\"description\":\"AWS Disaster 
Recovery\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/arpio.io\/staging\/8013\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/arpio.io\/staging\/8013\/#\/schema\/person\/0a2437a37056190db7e46201a6a65095\",\"name\":\"6805pwpadmin\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/arpio.io\/staging\/8013\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/bbce7316dd4979a6199ddcdaed836e357939826f60c7be919373136535d247b6?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/bbce7316dd4979a6199ddcdaed836e357939826f60c7be919373136535d247b6?s=96&d=mm&r=g\",\"caption\":\"6805pwpadmin\"},\"sameAs\":[\"http:\/\/support.pagely.com\"],\"url\":\"https:\/\/arpio.io\/staging\/8013\/author\/6805pwpadmin\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"What the IBM Cloud Outage Teaches Us About Resilience Engineering \u2014 Arpio","description":"Last week\u2019s IBM Cloud outage is just another example of the many things that can go wrong for applications on the internet.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/","og_locale":"en_US","og_type":"article","og_title":"What the IBM Cloud Outage Teaches Us About Resilience Engineering \u2014 Arpio","og_description":"Last week\u2019s IBM Cloud outage is just another example of the many things that can go wrong for applications on the internet.","og_url":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/","og_site_name":"Arpio Disaster Recovery Made Easy","article_published_time":"2020-06-14T08:41:49+00:00","og_image":[{"width":300,"height":200,"url":"https:\/\/arpio.io\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg","type":"image\/jpeg"}],"author":"6805pwpadmin","twitter_card":"summary_large_image","twitter_misc":{"Written by":"6805pwpadmin","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#article","isPartOf":{"@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/"},"author":{"name":"6805pwpadmin","@id":"https:\/\/arpio.io\/staging\/8013\/#\/schema\/person\/0a2437a37056190db7e46201a6a65095"},"headline":"What the IBM Cloud Outage Teaches Us About Resilience Engineering","datePublished":"2020-06-14T08:41:49+00:00","mainEntityOfPage":{"@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/"},"wordCount":774,"commentCount":0,"image":{"@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#primaryimage"},"thumbnailUrl":"https:\/\/arpio.io\/staging\/8013\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg","articleSection":["Blog"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/","url":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/","name":"What the IBM Cloud Outage Teaches Us About Resilience Engineering \u2014 Arpio","isPartOf":{"@id":"https:\/\/arpio.io\/staging\/8013\/#website"},"primaryImageOfPage":{"@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#primaryimage"},"image":{"@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#primaryimage"},"thumbnailUrl":"https:\/\/arpio.io\/staging\/8013\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg","datePublished":"2020-06-14T08:41:49+00:00","author":{"@id":"https:\/\/arpio.io\/staging\/8013\/#\/schema\/person\/0a2437a37056190db7e46201a6a65095"},"description":"Last week\u2019s IBM Cloud outage is just another example of the many things that can go wrong for applications on the 
internet.","breadcrumb":{"@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#primaryimage","url":"https:\/\/arpio.io\/staging\/8013\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg","contentUrl":"https:\/\/arpio.io\/staging\/8013\/wp-content\/uploads\/2020\/08\/image-asset-1-1.jpeg","width":300,"height":200},{"@type":"BreadcrumbList","@id":"https:\/\/arpio.io\/behind-the-ibm-cloud-outage\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/arpio.io\/staging\/8013\/"},{"@type":"ListItem","position":2,"name":"What the IBM Cloud Outage Teaches Us About Resilience Engineering"}]},{"@type":"WebSite","@id":"https:\/\/arpio.io\/staging\/8013\/#website","url":"https:\/\/arpio.io\/staging\/8013\/","name":"Arpio Disaster Recovery Made Easy","description":"AWS Disaster 
Recovery","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/arpio.io\/staging\/8013\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/arpio.io\/staging\/8013\/#\/schema\/person\/0a2437a37056190db7e46201a6a65095","name":"6805pwpadmin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/arpio.io\/staging\/8013\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/bbce7316dd4979a6199ddcdaed836e357939826f60c7be919373136535d247b6?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/bbce7316dd4979a6199ddcdaed836e357939826f60c7be919373136535d247b6?s=96&d=mm&r=g","caption":"6805pwpadmin"},"sameAs":["http:\/\/support.pagely.com"],"url":"https:\/\/arpio.io\/staging\/8013\/author\/6805pwpadmin\/"}]}},"_links":{"self":[{"href":"https:\/\/arpio.io\/staging\/8013\/wp-json\/wp\/v2\/posts\/265","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/arpio.io\/staging\/8013\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/arpio.io\/staging\/8013\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/arpio.io\/staging\/8013\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/arpio.io\/staging\/8013\/wp-json\/wp\/v2\/comments?post=265"}],"version-history":[{"count":0,"href":"https:\/\/arpio.io\/staging\/8013\/wp-json\/wp\/v2\/posts\/265\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/arpio.io\/staging\/8013\/wp-json\/wp\/v2\/media\/266"}],"wp:attachment":[{"href":"https:\/\/arpio.io\/staging\/8013\/wp-json\/wp\/v2\/media?parent=265"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/arpio.io\/staging\/8013\/wp-json\/wp\/v2\/categories?post=265"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/arpio.io\/staging\/8013\/wp-jso
n\/wp\/v2\/tags?post=265"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}