Reboot Command Typo Takes Down Joyent US-East-1
Who says spelling isn't important these days? A simple typo in a reboot command was responsible for a Joyent data center outage this week. Joyent plans to put procedures into place to avoid any other such outages.
May 30, 2014
Who says spelling isn't important these days? Or at least the ability to type without mistakes? Students, take note, because a simple typo took down the entire Joyent US-East-1 data center this week. According to Joyent, the cause was "operator error": an operator mistyped the command to reboot a select set of new systems and instead specified a reboot of all systems.
Mental note to all future admins: Double-check your typing before executing. All told, the downtime wasn't severe, but it was certainly longer than it should have been. Joyent's Eastern data center was down for a total of one hour, 17 minutes.
"Unfortunately, the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay," the Joyent team wrote on the company blog.
Joyent goes into detail about the technical aspects of the downtime cause, but in the end, it was essentially a typo, showing that more often than not, human error is the biggest cause of downtime. And such disasters can be avoided.
Joyent noted in its blog post that it "will be taking several steps to prevent this failure mode from happening again, and ensuring that other business disaster scenarios are able to recover more quickly." Joyent's plans are three-fold:
The company plans to improve the tooling that people and systems interact with so that input validation will become stricter and will not allow all servers and control plane servers to be rebooted simultaneously. "We have already begun putting in place a number of immediate fixes to tools that operators use to mitigate this, and we will be rethinking what tools are necessary over the coming days and weeks so that 'full power' tools are not the only means by which to accomplish routine tasks," the team wrote.
The cloud provider is also investigating what extra steps in control plane recovery can be taken so that it can safely reboot all nodes simultaneously without operator intervention. "We will not be able to serve requests during a complete outage, but we will ensure that we can record state in each node such that we can recover without human intervention," the team wrote.
Joyent plans to assess migrating customer instances off of older legacy hardware platforms more aggressively.
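To make the first of those steps concrete, here is a minimal sketch of the kind of stricter input validation Joyent describes. It is not Joyent's tooling: the function name, the exception, the server names and the 25 percent limit are all hypothetical, chosen only to illustrate a check that refuses an empty or overly broad target list and keeps control plane servers out of a bulk reboot.

```python
from typing import Iterable, Set


class RebootRefused(Exception):
    """Raised when a reboot request fails validation (hypothetical name)."""


def validate_reboot_targets(
    requested: Iterable[str],
    inventory: Set[str],
    control_plane: Set[str],
    max_fraction: float = 0.25,  # assumed limit, not a Joyent figure
) -> Set[str]:
    """Return the validated target set, or raise RebootRefused."""
    targets = set(requested)

    # An empty selection must never fall through to "everything".
    if not targets:
        raise RebootRefused("no targets given; refusing to default to all servers")

    # A mistyped name should stop the command, not be silently ignored.
    unknown = targets - inventory
    if unknown:
        raise RebootRefused(f"unknown servers: {sorted(unknown)}")

    # Control plane servers are kept out of bulk reboots.
    if targets & control_plane:
        raise RebootRefused("control plane servers must be rebooted separately")

    # Cap how much of the data center one command may touch.
    if len(targets) > max_fraction * len(inventory):
        raise RebootRefused(
            f"{len(targets)} of {len(inventory)} servers requested; "
            f"limit is {max_fraction:.0%} per command"
        )

    return targets
```

With a check like this in front of the reboot command, a request that resolves to every server in the inventory is rejected before anything is restarted, and the operator has to narrow the selection or take an explicit, separate step.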
The company apologized for the downtime, and it appears to have a sound plan in place for dealing with similar issues in the future.