Oren Michels | Co-Founder & GM
April 29, 2011

Shame on Who?


The infamous AWS outage of April 2011 is not even a week behind us, and already many are debating which has made bigger waves across the internet: the actual event or the ensuing rhetoric. Before the dust settles, I suppose it is my turn to weigh in.

A tremendous amount has been written about ‘who was wrong’ or ‘who is bad’: Amazon or its customers. Like many things in life, the answer is not quite that binary. Amazon, and cloud services in general, are good for some things and bad for others. Amazon is not meant to be all things to all people. Many have asked me, as an Amazon customer, how we plan to react. We’re not going to "dump Amazon" or "dump the cloud" because they had some downtime, any more than we would have stopped using data centers when 365 Main lost power or Rackspace went down after a truck hit a power pole outside of their facility.

Amazon is great for a great many reasons, but everything goes down at some point and everything has pros and cons. You understand them and work around them as best you can. Understanding is a key word here.

Much of the debate has focused on Amazon’s SLAs. As someone who has read every word of the Amazon SLA, I can tell you it is a joke. After last week’s outage, we all watched as people wrote, blogged and tweeted "How could they go down? They have an SLA!" Most of these people never bothered to read what the SLA actually promises, and so never understood how unlikely it is that the SLA could ever pay off.

There is a huge difference between someone using a service that falsely makes certain promises and someone assuming certain service levels that were never promised. Those who assert, "I'm shocked that EBS doesn't have this or that level of redundancy" fall into the latter category. EBS is a great service for people whose database is not mission critical - or who consciously choose rapid development and feature innovation over bulletproof reliability.

As proven by Twitter, Foursquare and even Gmail, it turns out it’s ok to occasionally fail. In certain businesses, you can have failures and still build a huge, multi-billion dollar company. No, it’s not optimal (the failure, that is), nor ever intentional, but for sending tweets, checking in or even sending free e-mails it’s tolerable to the vast majority of customers. Customers may complain or wring their hands in grief, but these businesses have made a conscious decision to release quickly, watch it break and iterate vs. a more enterprisey approach of slow releases with more guarantee of reliability.

The reality is that the Amazon SLA essentially guarantees the same thing that a data center's SLA does - that the overall power won't go dark. This leaves a disconnect between the SLA guarantees and what people are buying. Most (if not all) Amazon customers are buying a service several layers up the stack, so people expect that SLAs and warranties will cover this. Amazon could easily provide these levels of warranty, but in order to do so, they’d have to charge considerably more for their service... which people wouldn’t like.

Customers need to look at what companies like Amazon are selling, look at what they’re paying for and see if it’s consistent with what they need. It’s then up to the customer to make the appropriate decisions on reliability, failover, redundancy, architecture, etc. for their own business. One size does not fit all.

Ultimately we are paying Amazon a lot less than we were several years ago for more service. Yes, they could be more transparent with their roadmap or how the whole service works. Yes, their SLAs (when you read them) guarantee very little. But Amazon does not lie, nor are they misleading.

What is misleading is the way in which many folks chose to understand what they were buying in the first place. I dare say that those who understood well the services and levels of guarantee provided by Amazon were not affected by the issues last week.

We have the redundancy we have because we've learned the hard way that we need it. A while back we had a database failure that caused us some misery. Based on that (very) tough learning experience, I doubt we'd use EBS unless we thought it was way more reliable than it is.

For those who have recently accused Amazon of “bad behavior”, I say Shame on You, unless failing to deliver something you never promised is “bad.” Amazon has offered a service that has allowed companies like ours to exist, grow, and prosper with zero capex in a way that would have been impossible before. They have added innovations like MapReduce, virtual private clouds and, yes, RDS for people to innovate even more quickly and cheaply.

Working with the known risks and imperfections, companies can easily architect solutions that meet their required standards.

Mashery co-founder and original architect Clay Loveless covers the tough business of tradeoffs in technical architecture (http://claylo.com/post/4844798650/failure-is-not-an-option) as well as his experience with Amazon (http://claylo.com/post/4817029650/where-there-are-clouds-it-sometimes-rains). Both are great reads, recounted with considerably more humor than this post.