Ops, The World Cup and Keeping the Family Happy

Thu, 01 Jul 2010 23:43:49 +0000

If you are in operations or engineering at a technology business, you can probably still remember most of the details of the worst outage you've managed through. That cold-sweat moment when your monitoring system starts alerting, your phone starts vibrating and beeping, your adrenaline hits and all you can think of is logging on (despite prior commitments to the kids, wife and/or dog). You're trying to get shell access, but the systems are overloaded and it's taking for-e-ver. Instant messaging starts lighting up with messages from your guys. Your boss is calling on your cell phone amongst the flurry of SMS messages. You try to block out all the noise because you just need time to start figuring out what's happening. Ugh. External load. Out of capacity. Now what?

This past week, Twitter broke new volume records, but also set another kind of record: poor availability for its API. As someone responsible for a product's availability, I feel for the team over there. I can just about feel the stress level and the fatigue of the guys in the trenches. I'm sure they did awesome work managing through it and need a break… but here comes another big weekend for them: World Cup quarterfinals and the Tour de France. *Sigh*. Ops guys are heroes. They deserve their nicotine, caffeine, beer and doughnuts.

All of this has me thinking about what we do at Mashery. We spend a lot of time thinking about all aspects of running successful API programs, from strategy to developer outreach to launches to operations. We've learned a lot about running programs in a variety of different industries and environments. Watching what happened to Twitter over the past week makes me think of a few operational lessons I've learned over the years.

Rate limiting and throttling are necessary.

Some will call foul on this since you're effectively turning away business, but it's reality. I wish there were such a thing as infinite capacity. Even if you have the good fortune to have tons of it, make sure you still have a way to hit the panic button and rate limit if necessary. Twitter has one of the world's biggest APIs, and when they broke their volume record due to the World Cup, throttling traffic is what allowed them to stay afloat. It can happen on a 100 QPS API or a 3200 QPS API. And really, people, have a way to do this outside of modifying your code. When you need the panic button, you shouldn't be calling engineering and worrying about whether the build is good.
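To make that panic button concrete, here's a minimal sketch in Python (hypothetical names and config path; this isn't Twitter's or Mashery's actual implementation) of a token-bucket limiter whose ceiling lives in an ops-editable config file rather than in code:

```python
import json
import time

# Hypothetical path: ops edits this file during an incident; no build or deploy involved.
CONFIG_PATH = "/etc/api/rate_limit.json"

class TokenBucket:
    """Token-bucket limiter whose refill rate is re-read from config on every check."""

    def __init__(self, default_qps=100.0):
        self.default_qps = default_qps
        self.tokens = default_qps
        self.last_refill = time.monotonic()

    def _max_qps(self):
        # The "panic button": lower max_qps in the config file and the limiter
        # tightens on the very next request.
        try:
            with open(CONFIG_PATH) as f:
                return float(json.load(f).get("max_qps", self.default_qps))
        except (OSError, ValueError, AttributeError):
            return self.default_qps

    def allow(self):
        qps = self._max_qps()
        now = time.monotonic()
        self.tokens = min(qps, self.tokens + (now - self.last_refill) * qps)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should answer with a fast 429/503 instead of touching the backend


limiter = TokenBucket(default_qps=100.0)
```

The point isn't the bucket math; it's that the knob lives outside the codebase, so the person on call can turn traffic down without waiting on a build.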

Separate traffic and developer management from your backend API servers.

APIs are different from websites. With your website, you mostly know things like concurrent connections and bandwidth. When you max out capacity, there's not a ton you can do but start adding capacity, or turn some things off for everybody and ride out the wave. But if you've done it right with an API and have traffic controls and developer management in front of your API, you have a shot at managing the wave rather than just hoping it dies down.

Firstly, having a traffic management layer allows you to limit load on your backend while you try to log in and troubleshoot your situation. If you use a service like Mashery, you can apply maintenance blocks, turn off certain less important endpoints/methods, or even just throttle back how much traffic you'll accept while politely blocking the rest with fast-response error messages.
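As a rough illustration (hypothetical switches, not Mashery's actual feature set), the front-edge decision might look something like this, where every denial is a cheap canned response rather than a request that reaches the backend:

```python
import random

# Hypothetical ops-controlled switches; in practice these would live in a shared config
# store or admin UI so they can be flipped mid-incident without a deploy.
MAINTENANCE_MODE = False
BLOCKED_ENDPOINTS = {"/search", "/trends"}  # less important methods turned off for now
ACCEPT_FRACTION = 0.75                      # politely shed 25% of traffic at the edge

def front_edge_check(path):
    """Return (status, message) decided at the edge, before any backend work happens."""
    if MAINTENANCE_MODE:
        return 503, "API temporarily unavailable for maintenance"
    if path in BLOCKED_ENDPOINTS:
        return 503, "This endpoint is temporarily disabled"
    if random.random() > ACCEPT_FRACTION:
        return 429, "Over capacity, please retry shortly"
    return 200, "OK: forward the request to the backend"
```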

Secondly, an API done right has a developer management component at your front edge. With this in place, you actually know which developers/partners/applications are sending you traffic, and you can make more focused decisions. In Twitter's case, they didn't pick and choose who would get the most capacity (as far as we know publicly). Instead, they chose a general ratcheting down of access rights for all developers in their community; probably the right political choice for them to make.

For lots of other companies, though, not all developers or partners are created equal. In this situation, picking which applications to allocate the most capacity to might be critical. A retailer may need to ensure specific applications are 100% available on Cyber Monday. Maybe you need to manage specific contractual obligations if you have a pay-for-access API with SLAs. Or maybe a certain app is launching on iTunes today and you want to protect its success. Having visibility into developer-specific traffic and the ability to throttle or limit each independently is necessary in these situations, and requires this component on your front edge.
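Here's the same idea sketched per developer (illustrative keys and numbers only): once the front edge knows which key a request belongs to, each one can get its own ceiling.

```python
# Illustrative per-key limits; a real program would drive these from contracts,
# SLAs and launch plans rather than hard-coded values.
PER_KEY_QPS = {
    "retail-storefront-app": 500,  # must stay 100% available on Cyber Monday
    "paid-partner-with-sla": 300,  # contractual obligation
    "itunes-launch-app":     200,  # protect today's launch
}
DEFAULT_QPS = 20                   # everyone else gets the community rate

def qps_for(api_key):
    """Look up the capacity allocated to a specific developer/application."""
    return PER_KEY_QPS.get(api_key, DEFAULT_QPS)
```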

Additionally, in a moment of purely coincidental yet serendipitously beneficial timing, Mashery yesterday announced the expansion of our traffic and developer management capabilities with the release of API Access Tiers. This gives our customers even more capability to group their developer partners and prioritize traffic according to their business objectives. When that panic button is needed, you can make simple, UI-driven decisions for whole tiers of developer partners.
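The announcement is about grouping, so a tier-level sketch (again hypothetical, not the actual API Access Tiers data model) is closer to how that one-click decision plays out:

```python
# Hypothetical tier table: one change to a tier applies to every partner assigned to it.
TIERS = {
    "platinum":  {"qps": 500, "enabled": True},
    "partner":   {"qps": 100, "enabled": True},
    "community": {"qps": 10,  "enabled": True},
}
KEY_TO_TIER = {
    "retail-storefront-app": "platinum",
    "hobby-dev-123":         "community",
}

def limit_for(api_key):
    """Resolve a developer's limit through their tier; ops adjusts tiers, not individuals."""
    tier = TIERS[KEY_TO_TIER.get(api_key, "community")]
    return tier["qps"] if tier["enabled"] else 0
```

Dialing "community" down to 2 QPS, or disabling it outright, during an incident would then apply to every developer in that tier at once.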

While we're advocates of using API analytics to plan capacity and avoid outages altogether, the reality is that unplanned-for events happen. Having tools in place for managing traffic is a great way to keep business running smoothly and help ops guys get home to their families. After all, that's a worthwhile cause.

- Chris Lippi, VP of Product and Operations