Backing Up the Backup – Time to Stop Whining about Amazon
It’s time for another round of lame Internet outage jokes. Can’t get on Reddit? Great, go back to work. Couldn’t solve your hotel problems with Airbnb yesterday? Maybe it’s a sign you would be more rested after a staycation.
Amazon Web Services (AWS) is so pervasive that even a minor hiccup, which is pretty much what we saw on Monday afternoon, leaves a whole lot of our connected lives gasping for air. This in turn causes the inevitable hand-wringing across the service industry. But as in the past, I hope sanity reigns.
I’ve blogged about this before, specifically the lack of clarity in understanding the ersatz SLAs that Amazon would never accept as a customer but tries to convince us have value and meaning. I’m sure I’ll have reason to blog about it again. But here’s where we are now: On Oct. 22, an apparent problem at AWS caused a number of domains—perhaps up to 60, according to Compuware’s Outage Analyzer—to suffer from degraded performance. For the record, not all users were affected by any means, but there were enough issues to generate considerable attention.
That may be in part because the affected sites include Reddit, GitHub and Airbnb—all household names in the online service space. It was just as bad the last time something like this happened, back in July, when the sites hardest hit included Netflix, Instagram and Pinterest. Outages at companies like these elevate the issue from an IT support matter to major consumer inconvenience and brand damage.
Amazon's own dashboard gives us a real-time close-up of the problem, which in this case showed that the East Coast region experienced degraded performance from the Elastic Block Store. For those of us at Mashery, that was of particular interest.
First, let’s acknowledge how the issue affected our own customers at Mashery. We’re set up to automatically sense when any part of either of our redundant networks fails, and to divert traffic around any outage. We’re still reviewing the data, but from what we’ve seen so far, a few very brief failovers occurred due to network congestion within the affected AWS zones. The real impact on our customers was a small percentage of failed calls for less than two minutes on Monday, just before lunch in San Francisco. The other elements of our platform - our portal, dashboard, reporting and other core components of our infrastructure - were not affected.
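The sense-and-divert behavior described above can be sketched in a few lines. This is a hypothetical illustration, not Mashery's actual implementation: the endpoint names, the three-strikes threshold, and the routing rule are all assumptions chosen for clarity.

```python
# Hypothetical sketch of health-check-based failover across two
# redundant networks. Illustrative only -- not Mashery's real code.

FAILURE_THRESHOLD = 3  # consecutive failed probes before diverting traffic


class Endpoint:
    def __init__(self, name):
        self.name = name
        self.consecutive_failures = 0

    def record_check(self, ok):
        # A success resets the counter; failures accumulate.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    @property
    def healthy(self):
        return self.consecutive_failures < FAILURE_THRESHOLD


def route(primary, secondary):
    # Divert traffic to the secondary network when the primary looks
    # unhealthy; if both are degraded, stay on the primary rather than
    # flap back and forth.
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    return primary


# Example: three failed probes on the primary trigger a failover.
east = Endpoint("us-east-1")  # hypothetical zone names
west = Endpoint("us-west-1")
for _ in range(3):
    east.record_check(ok=False)
assert route(east, west) is west
```

The point of the sketch is the asymmetry: a brief burst of congestion trips the threshold and causes a short failover, which matches the "small percentage of failed calls for less than two minutes" we observed.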
There’s a reason for this. Based on the experience of running a multitenant API platform in the cloud for the past six and a half years, we’ve made some decisions about architecture that were purposefully meant to steer clear of things that looked like they could stir trouble. One of these conscious choices made a while ago was not to rely on Amazon’s EBS, specifically because of its history of outages.
To be clear, we’re still fans of cloud services in general and Amazon Web Services in particular. Amazon is Mashery’s single largest vendor, by a long shot. But “cloud” is not inherently “good” – you need to understand what is ready for prime time, and what isn’t. Amazon’s Elastic Block Store is seductive – it seems to be flexible and powerful and easy to use, and appears to promise virtually endless redundancy. That said, we live in the real world.
In our world, we understand that services sometimes fail. We also appreciate that rapid development and ground-breaking innovation can exact a toll on reliability. None of this has stopped great companies like Twitter and Foursquare, which have gone down on occasion, from becoming wildly successful. Even a momentary failure of service isn’t optimal, of course, but it’s the price we pay for constant innovation and advancement. That’s exactly what defines our industry. Would caution and compromise really serve us better?
For each component of our platform, we evaluate the tradeoffs between trying something new that might reduce cost or development time versus using the tried and true, which is perhaps more time-intensive. Things that could impact the traffic manager component, and therefore affect our customers’ ability to properly process API calls, or that could cause security issues, are likely to skew well toward the “tried and true and bulletproof” end of the spectrum. Things that would affect the developer portal, and therefore not impact API calls but could cause documentation or analytics to be temporarily offline, are still important (I never like to see posts like this appear on @masheryops), but part of what makes our portal awesome is the speed at which we introduce new features.
For us, EBS hasn’t yet met the standard of reliability for use in any significant component of the Mashery API management platform. Our analysis concluded that the risk is too high – a conclusion that appears to be accurate, at least for now.
My previous blog on this topic followed industry-wide hand-wringing about the failure of theoretically infallible and redundant systems. This time, I’m hoping we can skip all that.
Cloud computing gives businesses unprecedented advantages in cost, scalability, flexibility and reach for critical business functions, including the distribution and scale of an API. Amazon Web Services is among the key players enabling our game. But in using Amazon, we can never abdicate, or even outsource, our management responsibilities to the cloud or to the service providers we retain. That would be irresponsible and unwise.
In the real world, just as every Broadway play has an understudy, we need to develop our own levels of redundancy and reliability. We also make choices about where we can push the envelope and accept the consequences, and where we need redundancy even if it is expensive.
I’m writing this post at 39,000 feet on a flight to a conference. In front of me, behind the hardened cockpit door, is a pilot…and a copilot. Two sets of controls. Two radios. Two navigation systems. The redundancy is expensive, but because of it I don’t worry about getting to my destination in one piece. But back here where the passengers sit, there is an overhead luggage compartment that is duct-taped together, with stickers all over it that say “Inoperative – Please do not use”. So polite. But also not mission critical. Doesn’t bother me in the slightest.
Everything fails at some point; looking for someone to blame, or even issuing an immediate requisite mea culpa won’t be enough. The growing market for cloud services to run everything from finance to fan sites will eventually favor companies able to think through failover scenarios, and adequately prepare for them…but who know when to take a calculated risk and when to insist on the highest level of redundancy. Anything short of that is simply naive.