AWS went down again last Friday. I wouldn't normally care, I only run non-critical toy projects out of their infrastructure, but I know that it disrupted a friend's wedding and that's just not cool.
Amazon's public statement about the event is fairly detailed and fairly believable. In one of their Northern Virginia datacenters, "each generator independently failed". They don't state how many generators they have, but their vagueness and their references to "primary and backup power generators" seem to indicate that they have two.
Since they had UPS systems, a power outage with generator failure starting at 7:24pm PDT meant that the datacenter only lost power between 8:04pm PDT and 8:24pm PDT, and apparently many systems had power restored from 8:14pm PDT. So why was the outage for customers so long?
According to Amazon, the majority of EBS servers had been brought back up by 12:25am PDT on Saturday; however, EBS volumes that had in-flight writes at the time of the power loss could have been left in an inconsistent state.

I always understood that the value of having a UPS was two-fold: you could survive small power interruptions, and you could shut down cleanly so that when power was restored your systems would come back without manual intervention. The Amazon cloud does not seem to be good at the latter.
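None of this needs anything exotic, either. Here's a rough sketch of the sort of thing any box with a monitored UPS can do; it assumes Network UPS Tools is installed with a UPS named "myups", and in a real deployment upsmon or apcupsd would be doing this job for you:

    #!/usr/bin/env python
    # Illustrative only: poll a NUT-managed UPS and shut down cleanly
    # if we've been on battery for too long. The UPS name and grace
    # period are assumptions.
    import subprocess
    import time

    UPS = "myups@localhost"      # assumed NUT UPS name
    GRACE_SECONDS = 300          # shut down after 5 minutes on battery

    def on_battery():
        # "OB" in ups.status means the UPS is running on battery.
        status = subprocess.check_output(["upsc", UPS, "ups.status"]).decode()
        return "OB" in status

    def main():
        battery_since = None
        while True:
            if on_battery():
                battery_since = battery_since or time.time()
                if time.time() - battery_since > GRACE_SECONDS:
                    subprocess.check_call(["shutdown", "-h", "now"])
                    return
            else:
                battery_since = None
            time.sleep(10)

    if __name__ == "__main__":
        main()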
At the most basic level it would seem prudent to force EBS servers to switch to a more cautious mode as soon as grid power is lost. If a server is running on batteries or even on a generator, then forcing disks to remain in a consistent state is a pretty basic precaution. How hard is it to run mount -o remount,sync automatically? Obviously there's performance degradation with that, but it seems a small price to pay on the rare occasions when there's a clear and present risk of data loss. Who wouldn't take an occasional performance hit in exchange for reliable disks and shorter outages?
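The mechanics aren't hard. A rough sketch, with placeholder mount points, that could hang off the same on-battery notification as the shutdown script above:

    #!/usr/bin/env python
    # Illustrative only: when told grid power is gone, remount the data
    # filesystems with the sync option so in-flight writes hit disk.
    # The mount points here are placeholders.
    import subprocess

    DATA_MOUNTS = ["/mnt/ebs0", "/mnt/ebs1"]   # placeholder mount points

    def remount(options):
        for mountpoint in DATA_MOUNTS:
            subprocess.check_call(["mount", "-o", "remount," + options, mountpoint])

    def on_grid_power_lost():
        remount("sync")      # slower, but writes are no longer buffered

    def on_grid_power_restored():
        remount("async")     # back to normal write-behind caching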
Bringing back EC2 instances is a harder problem. Fundamentally, the machines that run EC2 instances don't know or care much about the VM images that run on them. That's what makes them easy to manage, and that's what makes it easy to spin up new images. On the other hand, my simple web service that went down for hours last week does, in fact, simply boot back up when its instance restarts. Because it's deployed into this automatically managed cloud, it has to. Had I been running on my own hardware in the exact same datacenter, my downtime would have been on the order of 20 minutes rather than hours.
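About the best a customer can do is script the recovery themselves. A rough sketch using the boto3 SDK, assuming credentials are already configured and that the instances you care about carry a hypothetical auto-restart tag:

    #!/usr/bin/env python
    # Illustrative only: after an outage, find our stopped EBS-backed
    # instances and start them again. The auto-restart tag is an
    # assumption, not anything AWS provides.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:auto-restart", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["stopped"]},
    ])

    stopped = [instance["InstanceId"]
               for reservation in resp["Reservations"]
               for instance in reservation["Instances"]]

    if stopped:
        ec2.start_instances(InstanceIds=stopped)
        print("restarted:", ", ".join(stopped))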
Because we're building on top of a system involving half a million servers for compute alone, we're subject to the complexities of very large scale systems, even for our very simple ones. Each time a set of cascading failures causes extensive downtime, we have to ask ourselves whether the benefits of such complicated systems outweigh the cost.