In the more than 1,000 articles written about Amazon’s April 21 cloud outage, I found myriad “lessons,” “reasons,” “perspectives,” and a “black eye,” a “wake up call” and a “dawn of cloud computing.”
What I struggled to find was a simple, factual explanation of what actually happened at Amazon. Thankfully, Amazon did eventually post a lengthy summary of the outage. It is an amazing read for IT geeks, a group to which the author of this blog proudly belongs, but it may induce temporary insanity in a more normal individual. So this blog is my attempt to describe the event in simple, non-technical terms. But if I slip once or twice into geek-speak, please show some mercy.
The Basics of Amazon’s Cloud Computing Solution
When you use Amazon to get computing resources, you’d usually start with a virtual server (aka EC2 instance). You’d also want to get some disk space (aka storage), which the company provides through Amazon Elastic Block Store (EBS). Although EBS gives you what looks like storage, in reality it’s a mirrored cluster of two nodes … oops, sorry … two different physical pieces of storage containing a copy of the same data. EBS won’t let you work using only one piece of storage (aka one node) because if it goes down you’d lose all your data. Thus in Amazon’s architecture, a lonely node without its second half gets depressed and dedicates all its effort to finding its mate. (Not unlike some humans, I might add.)
Amazon partitions its storage hardware into geographic Regions (think of them as large datacenters) and, within each Region, into Availability Zones (think of them as small portions of a datacenter, separated from each other). This is done to avoid losing the whole Region if one Availability Zone is in trouble. All this complexity is managed by a set of software services collectively called the “EBS control plane” (think of it as a Dispatcher).
There is also a network that connects all of this, and yes, you guessed right, it is also mirrored. In other words, there are two networks… primary and secondary.
Now, the Play-by-Play
At 12:47 a.m. PDT on April 21, Amazon’s team started a significant, but largely uneventful, upgrade to its primary network, and in doing so redirected all the traffic to another segment of the network. For some reason (not explained by Amazon), all user traffic was shifted to the secondary network. Now you may reply, “So what… that’s what it’s there for!” Well, not exactly. According to Amazon, the secondary network is a “lower capacity network, used as a back-up,” and hence it was overloaded instantly. This is an appropriate moment for you to ask “Huh?” but I’ll get back to this later.
From this point on, the EBS nodes (aka pieces of storage) lost connections to their mirrored counterparts and assumed their other halves were gone. Now remember that EBS will not let the nodes operate without their mates, so they started frantically looking for storage space to create a new copy, a frenzy that Amazon eloquently dubbed the “re-mirroring storm.” This is likely when reddit.com, the New York Times and a bunch of others started noticing that something was wrong.
By 5:30 a.m. PDT, the poor EBS nodes started realizing the harsh reality of life, i.e., that their other halves were nowhere to be found. In a desperate plea for help, they started sending “WTF?” requests (relax, in IT terminology WTF means “Where is The File?”) to the Dispatcher (aka the EBS control plane). So far, the problems had only affected one Availability Zone (remember, a Zone is a partitioned piece of a large datacenter). But the overworked Dispatcher started losing its cool and was about to black out. Realizing the Dispatcher’s meltdown would affect the whole Region, causing an outage with even greater reach, Amazon at 8:20 a.m. PDT made the tough, but highly needed, decision to disconnect the affected Availability Zone from the Dispatcher. That’s when reddit.com and others in that Availability Zone lost their websites for the next three days.
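For the geeks who want to see why a shortage of spare storage turns into a Dispatcher problem, here is a toy sketch of the storm described above. Everything in it is an assumption made for illustration: the function name, the once-per-tick retry rule and all the numbers are hypothetical, and this is in no way Amazon’s actual algorithm.

```python
def simulate_storm(stranded, free_slots, dispatcher_capacity, ticks):
    """Toy model of a re-mirroring storm (hypothetical, not Amazon's logic).

    stranded            -- nodes that lost their mirror and need a new one
    free_slots          -- spare storage slots available for new mirrors
    dispatcher_capacity -- requests the Dispatcher can handle per tick
    Returns a list of (stranded, free_slots, backlog) tuples, one per tick.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        # Every stranded node asks the Dispatcher for space once per tick.
        backlog += stranded
        handled = min(backlog, dispatcher_capacity)
        backlog -= handled
        # A handled request only helps if a free slot is still available.
        placed = min(handled, free_slots, stranded)
        free_slots -= placed
        stranded -= placed
        history.append((stranded, free_slots, backlog))
    return history


if __name__ == "__main__":
    # Made-up numbers: 100 lonely nodes, 30 spare slots, 20 requests per tick.
    for tick, state in enumerate(simulate_storm(100, 30, 20, 5), start=1):
        print(f"tick {tick}: stranded={state[0]} free={state[1]} backlog={state[2]}")
```

With these made-up numbers the spare slots run out after two ticks, the remaining nodes keep asking anyway, and the Dispatcher’s backlog grows on every tick. That is the shape of the problem, even if the real system is far more sophisticated.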
Unfortunately, although Amazon sacrificed the crippled Availability Zone to allow other Zones to operate, customers in other Zones started having problems too. Amazon described them as “elevated error rates,” which brings us to our second “Huh?” moment.
By 12:04 p.m. PDT, Amazon got the situation under control and completely localized the problem to one Availability Zone. The team started working on recovery, which proved to be quite challenging in a real production environment. In essence, these IT heroes were changing tires on a moving vehicle, replacing a bulb in a live lamp, filling a cavity in a chewing mouth… and my one word for them here is “Respect!”
At this point, as the IT folks started painstakingly finding mates for lonely EBS nodes, two problems arose. First, even when a node was reunited with its second half (or a new one was provided), it would not start operating without sending a happy message to the Dispatcher. Unfortunately, the Dispatcher couldn’t answer the message because it had been taken offline to avoid a Region-wide meltdown. Second, the team ran out of physical storage capacity for new “second halves.” Our third “Huh?”
Finally, at 6:15 p.m. PDT on April 23, all but 2.2 percent of the most unfortunate nodes were up and running. The system was fully operational on April 24 at 3:00 p.m. PDT, but 0.07 percent of all nodes were lost irreparably. Interestingly, this loss led to the loss of 0.4 percent of all database instances in the Zone. Why the difference? Because when a relational database resides on more than one node, loss of one of them may lead to loss of data integrity for the whole instance. Think of it this way: if you lose every second page in a book, you can’t read the entire thing, right?
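For readers who prefer arithmetic to books with missing pages, here is a back-of-the-envelope sketch of that amplification. It assumes that nodes fail independently of one another and that a database instance is lost as soon as any one of its nodes is lost. Both are simplifying assumptions of mine, the instance sizes are hypothetical, and none of this comes from Amazon’s summary.

```python
def instance_loss_probability(node_loss_rate, nodes_per_instance):
    """Chance that an instance loses at least one of its nodes."""
    # The instance survives only if every one of its nodes survives.
    survival = (1.0 - node_loss_rate) ** nodes_per_instance
    return 1.0 - survival


if __name__ == "__main__":
    node_loss_rate = 0.0007  # 0.07 percent, the figure quoted above
    for nodes in (1, 2, 4, 6, 8):  # hypothetical instance sizes
        p = instance_loss_probability(node_loss_rate, nodes)
        print(f"{nodes} node(s): {p * 100:.2f} percent of instances lost")
```

Under these assumptions, an instance spread over about six nodes would be lost roughly 0.4 percent of the time. That is in the neighborhood of the reported figure, but it is only an illustration of why the two percentages differ, not a reconstruction of Amazon’s numbers.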
Now let’s take a quick look at our three “Huh?s”:
- Relying on SLAs alone (Amazon’s promises 99.95 percent uptime) can be a risky strategy for critical applications. But as with any technological component of a global services solution, you must understand the supplier’s policies.
- Cloud platforms are complex, and outages will happen. One thing we can all be certain of in IT: if it cannot happen, it will happen anyway.
- When you give away your architectural control, you, well, lose your architectural control. Unfortunately, Amazon did not have the needed storage capacity at the time it was required. But similar to my first point above, all technological components of a global services solution have upsides and downsides, and all buyer organizations must make their own determinations on what they can live with, and what they can’t.
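To put the 99.95 percent figure from the first point into perspective, here is a quick calculation of how much downtime such a promise allows in a year. It is a rough sketch only: the function names are mine, the 72-hour outage is a round number standing in for “almost three days,” and the fine print of Amazon’s actual SLA (what counts as “unavailable” and how it is measured) is not modeled.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years


def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime an uptime percentage permits over a period."""
    return (100.0 - sla_percent) / 100.0 * period_hours


def implied_uptime_percent(outage_hours, period_hours=HOURS_PER_YEAR):
    """Uptime percentage for a period containing a single outage."""
    return (1.0 - outage_hours / period_hours) * 100.0


if __name__ == "__main__":
    budget = allowed_downtime_hours(99.95)
    print(f"99.95 percent uptime allows about {budget:.2f} hours down per year")
    # Assumption: treat "almost three days" as a round 72 hours.
    uptime = implied_uptime_percent(72)
    print(f"A 72-hour outage alone means about {uptime:.2f} percent uptime")
```

In other words, a 99.95 percent promise leaves room for a little over four hours of downtime a year, so a customer who was down for three days was well outside that budget.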
An interesting, final point: after being down for almost three days, reddit.com stated, “We will continue to use Amazon’s other services as we have been. They have some work to do on the EBS product, and they are aware of that and working on it.” Now, folks, what do you think?