Recent AWS Outages: Should You Panic?
Amazon cloud services ran into significant difficulties in December 2021. Three major AWS outages affected financial institutions, airline reservation systems, online retailers, and dating apps. The impact was global. Companies like Spotify, Venmo, Netflix, Tinder, and Roku experienced service disruptions. One of the outages took down Amazon’s own delivery app, which led to packages piling up in Amazon warehouses. Small businesses came to a standstill.
With every outage in December, businesses worldwide waited for AWS to find the root cause and resolve the issues. Organizations realized they were totally dependent on Amazon. The outages brought to light the vulnerability of relying on a single cloud provider. In the aftermath, businesses want to figure out how to safeguard themselves from similar situations in the future.
The adoption of public cloud services has seen an upward trajectory since the emergence of the big three providers, Amazon Web Services (AWS), Microsoft Azure, and Google Cloud. Businesses have good reasons to embrace cloud computing. Public clouds are high speed, low maintenance, cost-effective, scalable, and secure. To learn more about public cloud advantages, read our article 7 Key benefits of cloud computing for Your Business.
However, the dark side of the public cloud is outages. When an organization relies on a sole cloud provider, a cloud outage can threaten the whole business. Cloud outages expose enterprises to significant financial risks. A 2020 Global Survey of Data Center and IT Managers conducted by Uptime Institute found that two-thirds of outage incidents cost more than $100,000, with some of the reported outages costing $1 million or more. So, it makes sense for businesses to worry about cloud outages.
The end of 2021 was not a good time for AWS. The major outages were as follows:
Impairment of Network Devices (December 7, 2021) – The AWS outage began around 12:30 pm Eastern time in the US-EAST-1 region, and it took AWS about 5 hours to resolve the issue. The cause was several impaired network devices: the devices that handle traffic forwarding and network address translation (NAT) were overwhelmed. The outage caused disruptions for Netflix, Disney+, Fidelity Investments, Roku, and Ring. It also halted Amazon package deliveries because the scanning and routing apps were unavailable.
Network Congestion and Packet Loss (December 15, 2021) – The AWS incident began around 7:15 am Pacific time in the US-WEST-2 region. The root cause was identified as network congestion due to internal engineers unexpectedly moving traffic to the AWS backbone. Due to packet loss, applications like Okta, Workday, and Slack were unreachable. By 12:00 pm Pacific time, the issue was resolved.
Power Outage in Data Center (December 22, 2021) – The outage started around 4:11 am PST. The cause was determined to be a power outage in a single data center in the USE1-AZ4 availability zone of the US-EAST-1 region. The power loss shut down EC2 instances and EBS volumes. It affected services like Slack, Imgur, Epic Games, and Asana. It took 12 hours to resolve the issue.
Businesses want reliable infrastructure and 100% outage-free environments. Unfortunately, there is no way to design a system or architecture with 100% uptime. So, cloud outages will always be a part of the IT and DevOps life.
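To put uptime targets in perspective, the short Python sketch below converts an availability percentage into a yearly downtime budget. The function name and the sample targets are illustrative assumptions, not figures from any AWS service-level agreement.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only; check your provider's actual SLA.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

By this measure, a single 12-hour incident (720 minutes) like the December 22 outage would use up more than a full year's budget for any workload held to a 99.9% target.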
Even though the December outages created an uproar, AWS has gone through many incidents over the years. Here are a few highlights:
- November 2020 – A capacity update to Amazon Kinesis Data Streams caused an outage.
- June 2016 – A storm caused power outages in a Sydney data center and took out EC2 and EBS services.
- November 2014 – An AWS CloudFront DNS server caused CDN outages.
- September 2013 – A load balancing issue resulted in service unavailability for customers in Virginia.
- December 2012 – Customers had problems due to Elastic Load Balancing (ELB).
- June 2012 – An electric storm in Northern Virginia caused an outage.
- August 2011 – A transformer caused a power outage, leading to EC2, EBS, and RDS problems.
Notably, cloud outages can be caused by natural disasters, human errors, and unpredictable network conditions. It’s impossible to design a cloud service that can avoid all calamities.
The risk of cloud outages is unavoidable. However, businesses can take steps to mitigate the risks. Here are a few steps you can consider:
Replicating infrastructure across multiple clouds, also known as a multi-cloud strategy, can help you avoid falling into the single cloud dependency trap. During an outage of one cloud provider, you can use your backup provider to keep services running. Another strategy is to split your resources between on-premises and public cloud to create a hybrid cloud approach. You can use your on-premises resources during public cloud outages. To learn more about hybrid cloud, download Survey Report: Hybrid Cloud Management 2020.
If multi-cloud and hybrid cloud strategies are out of your price range, you can still benefit from multi-region redundancy. Even if a region goes down, you will have access to other regions to keep your applications running. For example, if you are running your services from the US East (Northern Virginia) Region and your backup servers are running in the Europe (Ireland) Region, you have better protection against a catastrophic event in Virginia. If a storm takes out the data centers in Virginia, you can still run your operations from Ireland, although the services may be slower due to physical distance. Businesses with mission-critical services can even distribute their services across more than two regions to increase redundancy. Of course, the cost can be a prohibitive factor. Also, if your business has restrictions on where data can reside, you might be unable to use multi-region redundancy as a risk mitigation strategy.
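The failover decision described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `pick_region` and the `status` dictionary are hypothetical names, and the health data stands in for whatever real probes (HTTP checks, DNS health checks, and so on) you would use. The region codes `us-east-1` and `eu-west-1` correspond to Northern Virginia and Ireland.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in priority order, that passes its health check."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated as an unhealthy region.
            continue
    return None  # every region failed; time to update the status page


# Hypothetical health data: the primary region is down, the backup is up.
status = {"us-east-1": False, "eu-west-1": True}
active = pick_region(["us-east-1", "eu-west-1"], lambda r: status.get(r, False))
print(active)  # eu-west-1
```

Listing regions in priority order keeps traffic in the nearest region while it is healthy and only accepts the extra latency of the distant region during an outage.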
Being prepared for disaster recovery will help you during outages. You should run test cases and check the effectiveness of your disaster recovery (DR) and business continuity plans (BCP). A great way to improve your cloud resiliency is to look into chaos engineering. Netflix pioneered the concept of chaos engineering by developing Chaos Monkey. The tool randomly disables cloud production instances to verify that services can withstand such failures. Chaos Monkey is available as an open-source tool, and there are other tools like Chaos Mesh, Gremlin, and Litmus. Also, monitor your applications so that you are alerted at the first sign of trouble. Automated monitoring can help you stay ahead of any disaster scenarios. To learn about automating your monitoring tasks, check out Automated Application Monitoring – Solution Brief.
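The idea behind a chaos experiment can be shown with a toy simulation in Python. This is not how Chaos Monkey or any of the tools above is implemented; `run_chaos_round`, the instance names, and the `min_healthy` threshold are assumptions made up for illustration, and nothing here touches real infrastructure.

```python
import random
from typing import List, Optional


def run_chaos_round(
    instances: List[str],
    min_healthy: int,
    rng: Optional[random.Random] = None,
) -> bool:
    """Simulate terminating one random instance and report whether enough remain.

    Instance names are assumed to be unique.
    """
    rng = rng or random.Random()
    if not instances:
        return min_healthy <= 0
    victim = rng.choice(instances)
    survivors = [name for name in instances if name != victim]
    print(f"terminated {victim}; {len(survivors)} instance(s) still running")
    return len(survivors) >= min_healthy


# Hypothetical fleet: three web servers, and the service needs at least two.
fleet = ["web-1", "web-2", "web-3"]
print("service survives:", run_chaos_round(fleet, min_healthy=2))
```

In a real experiment, the pass/fail check would be an actual health probe against the service rather than a simple count of remaining instances.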
Even with mitigation strategies, you might be unable to keep your services going during an outage. If your services go down, you need to inform your users. So, set up channels to give status updates during an outage situation. Honest communication will help you earn your customers’ trust.
Vendor-agnostic tech partners can decrease risk exposures. They can help you implement multi-vendor cloud solutions. A vendor-agnostic consultant will not try to sell you a particular cloud product or service. So, you have a better chance of getting robust solutions that can handle cloud outages. Vendor-agnostic solutions will ensure the tools you use are aligned with your strategic requirements and risk tolerance.
In a recent webinar, Stay Out of Outages – The BCP Element No One Talks About, we discussed multi-vendor strategies to combat cloud outages.
Cloud outages have real financial consequences. If your applications and services are offline, it can lead to missed revenues, lost clients, and reputation damage. As an industry standard, cloud service-level agreements (SLAs) provide a remedy through service credits. But the service credits are not enough to cover the losses you incur. Moreover, cloud providers can classify downtime as degradation if some connectivity is available during an incident, which further restricts your ability to recoup any loss. In response, some insurance companies have started providing downtime insurance for cloud users. Downtime insurance can help you mitigate outage risks. Before buying downtime insurance, you should run a cost-benefit analysis to determine if the insurance cost is worth it for your business.
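A first-pass cost-benefit check for downtime insurance can be as simple as the Python sketch below. Every number in it is a made-up placeholder, and the function names are hypothetical. The model is deliberately simplified: it compares expected covered losses against the premium and ignores deductibles, payout caps, and risk tolerance.

```python
def expected_annual_outage_loss(
    outage_hours_per_year: float,
    loss_per_hour: float,
    sla_credit: float = 0.0,
) -> float:
    """Estimate yearly outage losses after subtracting SLA service credits."""
    return max(outage_hours_per_year * loss_per_hour - sla_credit, 0.0)


def insurance_is_worthwhile(
    expected_loss: float,
    annual_premium: float,
    coverage_ratio: float,
) -> bool:
    """Return True when the expected payout exceeds the premium."""
    return expected_loss * coverage_ratio > annual_premium


# Placeholder inputs: replace with your own estimates.
loss = expected_annual_outage_loss(
    outage_hours_per_year=10, loss_per_hour=20_000, sla_credit=5_000
)
print(f"expected uncovered loss: ${loss:,.0f}")
print("insurance worthwhile:", insurance_is_worthwhile(loss, 50_000, 0.8))
```

With these placeholder inputs, the expected loss is $195,000 and an 80% payout ($156,000) exceeds the $50,000 premium, so the policy would pass this rough test.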
In today’s business environment, it is impractical to avoid the cloud. But cloud outages are inevitable. Your best bet is to mitigate your risks through multi-vendor cloud strategies and vendor-agnostic tech partners.
At GlobalDots, our senior-level DevOps engineers and experts have experience working with multiple vendors. We take a holistic approach to evaluating and mitigating risks for multi-cloud environments. To learn more about our risk mitigation services, contact GlobalDots today.