On the Amazon S3 downtime…
UPDATED: Fri Feb 15 15:57:21 EST 2008
UPDATED: Sat Feb 16 23:49:46 EST 2008
I’ve been seeing these reports this morning, but didn’t really want to say much about it. It sucks, but it happens. Every place gets downtime. But I came across this particularly silly reporting from TechCrunch… Let’s just see:
It’s a short piece - 2 grafs, the first one’s just some of the facts. The fun starts with the second graf:
This could just be growing pains for Amazon Web Services, as more startups and other companies come to rely on it for their Web-scale computing infrastructure. But even if the outage only lasted a couple hours, it is unacceptable.
Hmm.. Ok. I guess. Let’s move along.
Nobody is going to trust their business to cloud computing unless it is more reliable than the data-center computing that is the current norm.
Here we show a distinct lack of understanding of why cloud computing is more desirable than data-center computing. It is not simply an alternative - it gives you the ability to scale very easily. That’s the big win - when you get your spike you aren’t worried about that spike taking down your site. With S3 serving your images, you aren’t worried about bandwidth constraints. With EC2 running your machines, you can easily fire up a few more instances to handle the increased traffic and turn them off when you’re done. So, to me, cloud computing would only have to match the reliability of a data-center setup and it’s a win.
Cloud computing needs to be 99.999 percent reliable if Amazon and others want it to become more widely adopted.
This is a good one, because it’s such a nicely encapsulated thought - nobody likes downtime! But that’s old school thinking. Even 37signals (a client of S3 affected by the outage) points out how getting from 99% to 99.999% is really expensive for very little gain. Nobody’s really looking for that anymore, and service level guarantees of it are largely for marketing purposes.
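To put rough numbers on those percentages, here’s a quick back-of-the-envelope sketch. It assumes a plain 365-day year and treats availability as simple uptime divided by total time, which is my simplification and not how any particular SLA measures it. The function name is made up for illustration.

```python
# Back-of-the-envelope: how much downtime per year does each
# availability level allow? Assumes a 365-day year (8760 hours).
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Under those assumptions, 99% works out to about 87.6 hours a year and 99.999% to a little over 5 minutes. So a single outage of a couple of hours blows through a five-nines budget for the whole year, but sits comfortably inside 99.9% (about 8.76 hours).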
For me the real story was the horrible PR handling it got. Amazon’s good at selling things and dealing with individual complaints. They are brand new to this big services thing and they need to figure it out fast. For something as high profile as this, a few terse forum posts were not the way to handle the problem. Amazon should have had a good bit more of the touchy feely going on in there, with some deeper explanation and whatnot. Hopefully the whys and wherefores are still forthcoming, but I don’t think the problem was well handled in the public arena, and it shows Amazon’s inexperience dealing with this sort of issue.
UPDATE: Please read the SmugMug blog on this. The real news isn’t that they weren’t affected, but their very reasoned expectations for such a service. I agree with him on the communications bit - I suspect Amazon’s going to grow into it. Faster, rather than slower, by listening to his suggestions.
UPDATE 2: Here’s some closure on this. Their statement and Nicholas Carr’s analysis. Looks like they’re heeding SmugMug’s advice and working on a service health dashboard.







