Amazon EC2 outage: summary and lessons learned (rightscale.com)
66 points by sarahbacon on April 25, 2011 | hide | past | favorite | 22 comments



I got the directive this weekend to transfer stuff off of the cloud immediately. We lost a weekend of work for a lot of people, and nothing I can say as a tech will account for that. Upper management wants us the hell off.

I'd argue that the overall cost is much less than having all of these services in house, and in house services go down too.

But I think for the moment, they want someone to yell at, and Amazon's communication has been thoroughly unhelpful, with not even a remote ETA, and that's unacceptable.


I think you'd be surprised at how much better/cheaper you could have done it in house.

Given proper motivation (say 20c bonus on every dollar saved) I think we'd see this argument vanish pretty quickly.

If you listen to the wrong people (sales-guys from "Enterprise Grade" vendors), or pinch too many pennies it can easily be a disaster. It's dangerous water to tread on your own for sure.


Beware of wrong incentives. This is exactly how people are tempted to improve the average case at the expense of the worst case. Dollars saved in the short term can cause catastrophic problems later.


Business guys just don't understand any of this stuff. They hear the cloud buzzwords and want to jump on the bandwagon so they can be one of the cool kids, and then they panic when a failure, native and incident to any third-party service, occurs. Really silly.

If it's critical not to have problems if/when Amazon goes down, you have to plan for that ahead of time, just the same as anything else. It's not like throwing the "cloud" label on something makes it invincible.

As computers become increasingly integral to the daily operations of a business, the business guys are not going to have any choice but to learn some basics. We're really already at this point, but maybe after enough failures similar to those witnessed this weekend it will finally sink in.

I am amazed how many people want to blindly follow the buzzword bandwagon without obtaining even a vague notion of the technical implications first. The fact that Amazon would be controlling Amazon EC2 and that if a failure occurs at Amazon, their EC2 product may be affected, is the most blatant thing about using "Amazon EC2" or any other external service provider.


I don't feel like you're adding a lot to the discussion by stereotyping and calling out "business guys". As a comparatively non-technical person compared to the average Hacker News user, I would expect Amazon to quickly and clearly communicate to me why they are unable to provide the contracted service, what they are doing to remedy the issue, when it will be restored, and what we can do to avoid a similar outage in the future. From there, I can determine my risk aversion and the opportunity cost of choosing such a solution.

From what I gather, this is not how it was handled (I may be wrong), and for that I could not put my trust in them.


Technical failures in complex applications are not always able to be "clearly communicated". I don't use EC2 in any major capacity so I may be wrong here, but my understanding of the failure is that Amazon acknowledged the system had failed and that they were working to get the systems back up. I don't know what else you expect in the midst of an outage -- the fact is that if the technical failures were foreseen and planned, there wouldn't be an outage, so you have to give the technicians time to figure out what happened and figure out a way to fix it.

The details you want don't come until the crisis is over and the users are back online. Occasionally you may be able to get a meaningful ETA, but it really depends on the nature of the failure(s) that caused the outage. I'm glad Amazon didn't cave to the pressure and just throw a random guess out.


Amazon has promoted RightScale in the past (and presumably the two continue to have a close relationship). So it seems understandable that RightScale would want to adopt a diplomatic tone.

However, imo an executive summary that starts with "The Amazon cloud proved itself in that sufficient resources were available world-wide such that many well-prepared users could continue operating with relatively little downtime. But because Amazon’s reliability has been incredible, many users were not well-prepared leading to widespread outages. Additionally, some users got caught by unforseen failure modes rendering their failure plans ineffective." seems a little too supportive of Amazon.


That's like saying, The commuter rail service proved itself in that customers who also owned cars were able to drive to work when the train stopped running.


No, it isn't. It is like saying the highway system proved itself when the 101 was closed because people could take 280 instead. If for some reason you had only ever planned to take 101 and weren't ready to take an alternate route, yes, you got screwed, but that was kind of your own lack of planning for this particular failure mode. (stretched metaphor.)


The metaphor works if you pretend 101 is on the east coast and the 280 is on the west coast. :D


The metaphors are still flawed, since it's both the route and the destination that changes, which makes things a lot more complex than just taking a different route.


It's more like: Travelers who knew how to quickly get to Terminal X to take an airline that routed flights through <HUB B> instead of <HUB A> got home in time for Christmas. Those who didn't were locked out and slept in the airport.


http://ee.lbl.gov/papers/sync_94.pdf

I posted this yesterday, with the conjecture that it may have been a sudden sync problem. It's a good read.


This is a great paper. If you haven't read it, it suggests a common scenario where endemic network delays tend to nudge all participants in a periodic broadcast protocol to send their broadcasts at the same time, so that some hours after you start all the participants, everyone has synchronized and on a timer saturates the network with updates.

The solution (I didn't reread so this is from memory) is to add random jitter to each participant's timer.
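For anyone who wants to see what "add random jitter" looks like in practice, here is a minimal Python sketch. It is not taken from the paper or from anything Amazon runs; the function names, the `jitter_fraction` parameter, and its default of 0.5 are all assumptions made for illustration. Each participant draws its next delay uniformly from a window around the base period instead of firing on a fixed timer.

```python
import random


def jittered_interval(base_period, jitter_fraction=0.5, rng=random):
    """Return the next timer delay in seconds.

    The delay is drawn uniformly from
    [base_period * (1 - jitter_fraction), base_period * (1 + jitter_fraction)].
    Both the name and the default fraction are illustrative assumptions.
    """
    if not 0 <= jitter_fraction <= 1:
        raise ValueError("jitter_fraction must be between 0 and 1")
    low = base_period * (1 - jitter_fraction)
    high = base_period * (1 + jitter_fraction)
    return rng.uniform(low, high)


def broadcast_times(base_period, count, jitter_fraction=0.5, seed=None):
    """Simulate the send times of one participant over `count` broadcasts."""
    rng = random.Random(seed)  # private generator so runs can be reproduced
    now = 0.0
    times = []
    for _ in range(count):
        # Re-draw the delay on every cycle, not just once at startup,
        # so participants keep drifting apart instead of locking together.
        now += jittered_interval(base_period, jitter_fraction, rng)
        times.append(now)
    return times
```

Calling `broadcast_times(30.0, 5)` for several participants gives each one a different schedule. With `jitter_fraction=0.0` every participant sends at exactly 30, 60, 90, ... seconds, which is the fixed-timer case that allows the lock-step behavior described above.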

However, is there evidence to suggest that's what happened to Amazon? I can see this being a big issue in '93, when high-latency, low-bandwidth links were commonplace. But do we think that Amazon wasn't engineered well enough to deal with multiple-orders-of-magnitude spikes in C&C traffic?

Thank you, though, for posting a (much needed) technical comment to this discussion.


I don't think it was a symptom of routing synchronization specifically, but I'd be curious to know if it was a case of unexpected and undesired synchronization. (E.G. An independent and random cluster of blocks suddenly updated; the network was saturated; it pulled in more updates; ...)

And yes, the paper talked about randomization. It also pointed out the magnitude of randomization required was larger than expected.


Has there been an official explanation?


As far as I'm aware, no. That's why RightAWS said they get an F for communication.


For those of us waiting to learn what happened, the title is baity and misleading. A more accurate headline would be: "Rightscale outage: some speculation and customer service suggestions for Amazon".

"At the time of writing Amazon has not yet posted a root cause analysis. I will update this section when they do. Until then, I have to make some educated guesses."

That pretty much sums it up. Well, that plus some contradictory lessons learned, such as "the biggest problem was that more than one availability zone was affected", followed by "must have live replication across multiple availability zones".


I agree. This is just another article of someone speculating what has gone wrong, possibly in order to get hits on his or her blog.


The author's suggestion that service providers should make predictions is exactly what status updates aren't supposed to do.

Amazon's communication during this was on point. There's a line between "tell me what's wrong" and "fix what's wrong", and all of the author's suggestions on how to fix the "communication problem" are on the wrong side of that line.


Mhh, this is weird. I didn't suggest that they tell me what to do but that they tell me about the derivative. From the status messages it appeared that things were getting better, albeit slowly, when in fact they kept getting worse and affected more machines 12 hours after the initial problem.


I agree but it's worth noting that not all of the author's suggestions are on the wrong side of that line. The first three suggestions and the penultimate suggestion are along the lines of "tell me what's wrong".



