Cloud Outage: Microsoft Explains Sept. 8 Service Disruption
September 22, 2011
Microsoft has explained the cloud outage that took out Microsoft Office 365, Hotmail and other Windows Live services on Sept. 8, 2011, in a blog entry that lays the blame on a corrupted file in its DNS service, traced to the firmware of its load-balancing devices. First and foremost: Microsoft claims no customer data was lost or damaged when Office 365 went down.
As covered in TalkinCloud’s original recap of the outage, service restoration began at 10:23 p.m. Pacific the night of Sept. 8, with the incident resolving at 11:35 p.m.
The Windows Live Team can say it better than I can, so here is the team’s explanation of what exactly went wrong, taken from that blog entry:
We determined the cause to be a corrupted file in Microsoft’s DNS service. The file corruption was a result of two rare conditions occurring at the same time. The first condition is related to how the load balancing devices in the DNS service respond to a malformed input string (i.e., the software was unable to parse an incorrectly constructed line in the configuration file). The second condition was related to how the configuration is synchronized across the DNS service to ensure all client requests return the same response regardless of the connection location of the client. Each of these conditions was tracked to the networking device firmware used in the Microsoft DNS service.
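To make the first condition concrete, here is a minimal sketch of the general defense against it: validate every line of a configuration file before it is synchronized anywhere, and refuse to push a file that contains a line the parser cannot handle. This is not Microsoft's code or its file format. The `hostname ip weight` record layout, the function names and the `replicas` dictionary are all assumptions made up for illustration.

```python
import ipaddress


def parse_line(line: str) -> tuple[str, str, int]:
    """Parse one hypothetical 'hostname ip weight' record.

    Raises ValueError if the line is malformed.
    """
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    host, ip, weight = parts
    ipaddress.ip_address(ip)  # raises ValueError on an invalid address
    weight_value = int(weight)  # raises ValueError on a non-integer weight
    if weight_value < 0:
        raise ValueError(f"negative weight: {line!r}")
    return host, ip, weight_value


def validate_config(text: str) -> tuple[list, list]:
    """Return (records, errors) for a whole config file.

    Blank lines and lines starting with '#' are skipped.
    """
    records, errors = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        try:
            records.append(parse_line(line))
        except ValueError as exc:
            errors.append(f"line {lineno}: {exc}")
    return records, errors


def sync_if_valid(text: str, replicas: dict) -> bool:
    """Copy the config to every replica only if every line parses.

    `replicas` is a hypothetical stand-in for remote DNS nodes:
    a dict mapping node name to that node's current config text.
    """
    _, errors = validate_config(text)
    if errors:
        return False  # keep the last known-good config everywhere
    for name in replicas:
        replicas[name] = text
    return True
```

The design choice worth noting is that validation and synchronization are separate steps, so one bad line leaves every node on its last known-good file instead of spreading the corruption.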
Microsoft is preparing a handful of new processes to make sure it doesn’t happen again. Its cloud teams are enhancing the protocols for problem identification, monitoring and recovery; hardening the DNS service with additional redundancy and failover; and adding a recovery process that lets specific properties fail over and then come back once the DNS problem is resolved. Microsoft is also enhancing its recovery tools in an effort to cut time to recovery further.
And the blog entry closed with an apology for the inconvenience and a promise to do better.
On the one hand, this really does sound like the kind of freak accident that could have happened to anyone. But on the other, it’s not the first time Microsoft Office 365 has had this kind of problem, and it’s only been on the market a few months.
Microsoft Online Services offered customers a service credit the last time, and I’m wondering if it will repeat that tack. But more than that, I’m wondering whether a credit would be enough to restore lost confidence.