Blogger: Richard Watson
At QCon in London last Friday, I heard Evan Weaver from Twitter talk about "Improving Running Components at Twitter". Evan joined Twitter last summer to address the well-publicized scaling problems.
Twitter's approach to solving their performance and scalability issues is a great example of thinking big while taking small steps. The team set about removing bottlenecks iteratively. First they tackled multi-level caching (do less work), then the message queuing that decouples API requests from the critical request path (spread the work), then the distributed cache (memcached) client (do what you need to do quickly). Evan was asked about strategic work to take them to the next 10x growth. He responded that they were so resource constrained (can you believe there are only 9 in the services engineering team?) and so under water with volume that they have to work on the stuff that gives them the most value (take small steps). But crucially, they design each fix so that whatever they fix will not appear on the top 3 problem list again (which is thinking big – well, as big as you can when you’re growing like a hockey stick).
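To make the first two steps concrete, here is a minimal Python sketch of a two-level read-through cache and a queue that takes work off the request path. It is my own illustration, not Twitter's code: the names `TwoLevelCache`, `handle_request` and `drain` are hypothetical, and a plain dict stands in for memcached.

```python
import queue


class TwoLevelCache:
    """Read-through cache: a small in-process dict in front of a shared store."""

    def __init__(self, shared, load):
        self.local = {}       # level 1: in-process, fastest
        self.shared = shared  # level 2: stand-in for a distributed cache (assumption)
        self.load = load      # slow path, e.g. a database query
        self.loads = 0        # how many times the slow path was taken

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.shared:
            value = self.shared[key]
        else:
            value = self.load(key)
            self.loads += 1
            self.shared[key] = value
        self.local[key] = value
        return value


# Decoupling: the request handler only enqueues; a worker drains the queue later.
pending = queue.Queue()


def handle_request(message):
    pending.put(message)
    return "accepted"


def drain(process):
    """Process everything currently queued and return how many items were handled."""
    done = 0
    while not pending.empty():
        process(pending.get())
        done += 1
    return done
```

The point of the sketch is the shape, not the details: a hit at either cache level skips the slow load entirely (do less work), and the handler returns as soon as the message is queued (spread the work).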
We also got an insight into how organizations deal with constraints in technical decision making. Evan was asked if they considered using Erlang for the middleware given its suitability, yadda, yadda. "Yeah, but we’re only 9 people, and we have years of JVM engineering experience". Can they drop tools to learn Erlang? They went with Scala on the JVM for the messaging.
They also demonstrated lifecycle control of the chastest kind – deploying, regression testing and redeploying the new messaging code on one of three servers for 3 months before releasing onto the full cluster. This chastity directly relates to Twitter's business. If they blew the cache during a software update, it could take weeks to build it again. How long would the Twitterati wait? Twitter is in a pre-monetized place right now, but their value proposition includes keeping the service performing and scaling. The same problem nearly killed Last.fm in 2008. After a fire in their data centre, the Last.fm caches got flushed and it took weeks to rebuild them. With typical understatement, Evan did admit Twitter "runs hot" and is putting "pretty significant" work into their cache replication.
This was a killer talk, from a guy clearly knee-deep in the muddy trenches. The classic quote from Paul Butterworth about deploying software updates to running systems being like "changing the tires on a moving car" was never truer than at internet scale.


Hi Richard. Nice recap of challenges Twitter is facing scaling - especially use of memcached.
Posted by: Bret Clement | March 20, 2009 at 09:25 AM