How Availability Zones Could Have Saved Azure from the Outage Pain

Earlier this month, Microsoft Azure experienced something of a cloud infrastructure meltdown when a powerful lightning storm hit one of its San Antonio datacenters. The surge caused a voltage swell in the utility feeds, overwhelming the facility’s surge suppressors and knocking out its cooling systems. The backup “load-dependent thermal buffer” was completely depleted, air temperatures rose, and the hardware began shutting down automatically.

As a result, the outage affected approximately 40 Azure services in the South Central US region, including many of the critical Office 365 services that businesses across the globe rely on, such as Exchange, SharePoint, and Teams. It has also been recorded as one of the longest outages in the history of Microsoft’s Visual Studio Team Services (now known as Azure DevOps), starting at 2:45 am PST on September 5th and ending at 5:05 pm the same day.

“This shutdown mechanism is intended to preserve infrastructure and data integrity, but in this instance, temperatures increased so quickly in parts of the data center that some hardware was damaged before it could shut down. A significant number of storage servers were damaged, as well as a small number of network devices and power units.”

– Buck Hodges, Microsoft Azure DevOps Director of Engineering

Here’s What’s Clear

Buck Hodges of the Azure DevOps team wrote a postmortem apologizing for the incident and recapping the team’s plans to move forward. You can view the full postmortem here. Looking at the bigger picture, however, the key thing that stood out to our team was the importance of Availability Zones in your cloud provider’s environment.

How This Could Have Been Prevented

Looking at the situation at hand, it’s clear to our team that the proper use of Availability Zones could potentially have prevented this painful incident. While Azure is beginning to introduce the concept, Availability Zones are currently supported in only three Azure regions. Azure’s primary solution thus far has been automatic SQL database backup and storage replication.

On the flip-side, Amazon Web Services (AWS) is a strong supporter of Availability Zones. Currently, AWS Cloud spans 55 Availability Zones within 18 geographic Regions and 1 Local Region around the world, which are connected with low latency, high throughput, and highly redundant networking. These Availability Zones offer AWS customers an easier and more effective way to design and operate applications and databases, making them more highly available, fault tolerant, and scalable than traditional single datacenter infrastructures or multi-datacenter infrastructures.

AWS Availability Zone Basics

If you’re unfamiliar with Availability Zones, I encourage you to take some time to learn more about them. They’re truly a foundational piece of the equation when understanding how to build applications in a global cloud infrastructure. In this example, I will use a diagram from Amazon Web Services (AWS).

Every AWS Region consists of at least two Availability Zones, and all new Regions will have at least three. An AWS Region is a geographical location with a collection of Availability Zones mapped to physical data centers in that Region. Each Region is physically isolated from and independent of every other Region in terms of location, power, network, etc.

Inside each Region, you will find two or more Availability Zones, with each zone hosted in data centers that are separate from those of the other zones. An Availability Zone is a logical data center in a Region, available for use by any AWS customer. Each zone has redundant and separate power, networking, and connectivity to reduce the likelihood of two zones failing at the same time.

Why Do Availability Zones Matter?

The ability to leverage multiple Availability Zones spread across various locations is the foundation for building a highly available, resilient, and fault-tolerant application. Availability Zones in AWS enable you to easily architect applications that automatically fail over between zones without interruption.

In the below diagram, we’ve illustrated how one can span an application across multiple zones. In other words, you’re diversifying your risk by scaling your application across various zones. By placing cloud instances or virtual servers for each tier in each zone, you eliminate any single point of failure that could bring your entire system down. In this model, we can use synchronous database replication between a master and a slave to enable seamless failover.
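To make the idea of spreading each tier across zones a little more concrete, here is a minimal Python sketch. It makes no AWS API calls. The function names, tier names, and zone names are hypothetical and chosen purely for illustration. It places the instances of each tier across zones in round-robin order, then checks whether losing any single zone would leave a tier with no running instances.

```python
from collections import defaultdict
from itertools import cycle


def spread_across_zones(tiers, zones, instances_per_tier):
    """Assign the instances of each tier to zones in round-robin order."""
    placement = defaultdict(list)  # zone name -> list of (tier, instance index)
    for tier in tiers:
        zone_cycle = cycle(zones)  # restart at the first zone for every tier
        for index in range(instances_per_tier):
            placement[next(zone_cycle)].append((tier, index))
    return dict(placement)


def tiers_surviving_zone_loss(placement, failed_zone):
    """Return the tiers that still have at least one instance after a zone fails."""
    return {
        tier
        for zone, instances in placement.items()
        if zone != failed_zone
        for tier, _ in instances
    }


def has_single_point_of_failure(placement, tiers):
    """True if losing any one zone would leave some tier with no instances."""
    return any(
        tiers_surviving_zone_loss(placement, zone) != set(tiers)
        for zone in placement
    )


# Hypothetical three-tier application; zone names are illustrative only.
tiers = ["web", "app", "db"]
multi_zone = spread_across_zones(tiers, ["zone-a", "zone-b"], instances_per_tier=2)
single_zone = spread_across_zones(tiers, ["zone-a"], instances_per_tier=2)
```

With two zones, every tier keeps an instance running if either zone is lost. With a single zone, the same check reports a single point of failure, which is essentially the situation a single-datacenter deployment is in.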

Availability Zones are a fundamental building block of AWS’ offerings. Azure is still working to retrofit this concept to its existing and new regions. Reflecting on the recent Microsoft Azure situation, it is possible that if the Microsoft team had leveraged multi-Availability Zone scaling up front, the pain from this outage might have been prevented.

What are your plans to recover?

Are you interested in moving to AWS to take advantage of their Region and Availability Zone model to help prevent your apps from going down? Get in touch to have an introductory conversation with one of our cloud architects to review your disaster recovery and Availability Zone strategy.
