Leap Year and Windows Azure Cloud Outage: Root Cause Analysis
March 12, 2012
Windows Azure
Bill Laing of Microsoft has posted a “Root Cause Analysis” (RCA) that not only gives a detailed account of the Leap Year software bug and related issues that hit the Windows Azure servers starting February 28, but also provides unprecedented insights into the Azure cloud architecture.
With the release of the RCA blog, Microsoft announced that it has extended a 33% service credit to ALL customers of Windows Azure. Bill Laing writes: “Microsoft recognizes that this outage had a significant impact on many of our customers. We stand behind the quality of our service and our Service Level Agreement (SLA), and we remain committed to our customers.”
On March 12, Gartner analyst Kyle Hilgendorf blogged that he “was very pleased with the level of detail in Microsoft’s RCA.” Referring to Amazon Web Services’ EBS outage of April 2011, he wrote that “an RCA… is one of the best insights into architecture, testing, recovery, and communication plans in existence at a cloud provider. Microsoft’s RCA was no exception.”
A Closer Look
Bill Laing’s starts his analysis with a background description of the architectural issues that helped propagate the Leap Year bug. While a detailed reproduction of the RCA is beyond the scope of this Talkin Cloud blog, suffice to say that proper virtual machine functioning on Azure requires that a Guest Agent (GA) have a valid transfer certificate, which is encrypted to protect both proprietary and client data. The certificates are valid for one year from the date of creation.
The Leap Year software bug, which according to the RCA triggered at 4:00 pm PST on February 28 (midnight UTC, just as Leap Day began), caused new transfer certificates to be stamped with a valid-to date of February 29, 2013. Because 2013 is not a leap year, that date does not exist; the system rejected it and the certificates failed.
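The RCA attributes the failure to exactly this kind of calendar arithmetic. As a purely illustrative sketch (written in Python with hypothetical function names, not Microsoft’s actual GA code), the snippet below shows how naively adding one to the year breaks on a leap day, and one defensive way to handle it:

from datetime import date

def naive_valid_to(created: date) -> date:
    # "Same day, next year": on 2012-02-29 this asks for 2013-02-29,
    # a date that does not exist, so replace() raises ValueError.
    return created.replace(year=created.year + 1)

def safe_valid_to(created: date) -> date:
    # One defensive convention: clamp the leap day to February 28
    # when the target year has no February 29.
    try:
        return created.replace(year=created.year + 1)
    except ValueError:
        return created.replace(year=created.year + 1, day=28)

leap_day = date(2012, 2, 29)
print(safe_valid_to(leap_day))            # 2013-02-28
try:
    naive_valid_to(leap_day)
except ValueError as exc:
    print("naive arithmetic failed:", exc)  # day is out of range for month

Any implementation that validates the computed expiry date, or clamps it to February 28 (or rolls it forward to March 1), avoids producing a date that does not exist.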
This led to a cascade effect: the monitoring processes misread the repeated GA failures as corrupted hardware and shut down numerous Azure server clusters. Moreover, a software update was already underway on some clusters (which complicated matters the next day; see below), and the bug spread faster there than on servers that were not being actively updated.
Microsoft identified the bug at 6:38 pm PST. At 6:55 pm PST, the company disabled all client service management worldwide for Windows Azure. Bill Laing writes: “This is the first time we have ever taken this step.”
The company worked to fix the GAs and roll them out on the evening of February 28 and through the night. At 5:23 am PST on February 29, Microsoft announced that client service management was back online for the majority of Azure server clusters.
Then, Leap Year Strikes
But as (bad) luck would have it, Azure suffered a second outage on February 29. On the previous day, seven server clusters had been partway through the software update when the bug hit them. Microsoft decided to combine the older components on these clusters with the new GA rolled out the night before. Confident in the fix, Microsoft opted for a quicker “blast” update that pushed the package to all of the affected servers at once at 2:47 pm PST.
Microsoft failed to notice, and Bill Laing was very candid on this point, that the package combining the older components with the new GA also included a network plugin created for the newer update, and the plugin was not compatible with those older components. The result was a second major loss of service for Azure clients. It was not until 2:15 am PST on March 1 that Microsoft determined that all servers were cleaned up and fully functional.
Laing concludes the RCA with specific follow-up steps to improve the Azure system and a reaffirmation of Microsoft’s commitment to its Azure customers. As a result of the outage, Azure has come under considerable criticism and increased scrutiny as a cloud service. But in terms of openness and transparency, this RCA blog from Microsoft rose to the occasion.
Talkin Cloud will keep you updated on Microsoft’s work in providing uninterrupted Windows Azure cloud services.