
The Twitter Fail Whale and Global Optimization

Monday, 22 February 2010 10:28

Blogger: Richard Watson

I was looking for a good example to explain global vs. local optimization, and lo, one fell right out of the twittersphere at me. It came from the Twitter engineering team themselves. Ed Ceaser (@asdf) and Nick Kallen (@nk) recently posted a blog entry entitled "The Anatomy of the Whale". The entry discusses the team’s efforts to track down a capacity problem that caused too many people to see the 'fail whale', Twitter's visual representation of an HTTP 503 Service Unavailable error. As the guys say:

"Discovering the root cause can be very difficult because Whales are an indirect symptom of a root cause that can be one of many components. In other words, the only concrete fact that we knew at the time was that there was some problem, somewhere."

It struck me then that finding a performance bottleneck is a fantastic example of a global process optimization problem. Any seasoned developer knows that finding performance issues is real detective work. In my years as a developer, I learned to recognize performance blind alleys and red herrings (to name but two clichés), such as investing time in optimizing one part of an end-to-end process without knowing whether that part actually limits the whole. Ed and Nick offer the following great advice for any optimization effort: "Focus on the biggest contributor to the problem". Tracking down that biggest contributor is where the need for visibility comes in.
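To put rough numbers on the difference between a local and a global win, here is a minimal Python sketch. The stage names and latencies are invented for illustration and do not come from the Twitter post; the helper `total_after_speedup` is likewise hypothetical.

```python
# Hypothetical per-stage latencies (ms) for a single request.
# These numbers are made up; they are not Twitter's measurements.
STAGES_MS = {"web": 20.0, "app": 30.0, "cache": 140.0, "db": 10.0}


def total_after_speedup(stages_ms, stage, factor):
    """Return end-to-end latency if `stage` alone becomes `factor` times faster."""
    return sum(
        ms / factor if name == stage else ms
        for name, ms in stages_ms.items()
    )


baseline = sum(STAGES_MS.values())
# Local optimization: double the speed of a stage that barely matters.
local_win = total_after_speedup(STAGES_MS, "db", 2.0)
# Global optimization: double the speed of the biggest contributor.
global_win = total_after_speedup(STAGES_MS, "cache", 2.0)

print(f"baseline: {baseline} ms")
print(f"db twice as fast: {local_win} ms")
print(f"cache twice as fast: {global_win} ms")
```

With these made-up figures, the same 2x improvement is worth 5 ms in one place and 70 ms in the other, which is why it pays to know the biggest contributor before optimizing anything.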

Show me that call stack again, Watson

Visibility

Gaining visibility into any process has a number of aspects:

  • Measuring local things credibly

  • Aggregating the local measurements into an end-to-end picture

  • Presenting the metrics visually to gain insight into their relationships
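The three aspects above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample tuples, component names, and the `aggregate` and `render` helpers are hypothetical, and a real system would collect measurements from instrumentation rather than a hard-coded list.

```python
from collections import defaultdict

# Hypothetical local measurements: (request_id, component, elapsed_ms).
SAMPLES = [
    ("r1", "app", 30.0), ("r1", "cache", 120.0), ("r1", "db", 10.0),
    ("r2", "app", 25.0), ("r2", "cache", 160.0), ("r2", "db", 15.0),
]


def aggregate(samples):
    """Sum local timings per component and rank them by share of total time."""
    totals = defaultdict(float)
    for _request_id, component, elapsed_ms in samples:
        totals[component] += elapsed_ms
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / grand_total) for name, ms in ranked]


def render(ranked, width=40):
    """Return a crude text bar chart, one line per component."""
    lines = []
    for name, _ms, share in ranked:
        bar = "#" * round(share * width)
        lines.append(f"{name:<6} {bar} {share:.1%}")
    return "\n".join(lines)


ranked = aggregate(SAMPLES)
print(render(ranked))
```

Even a text chart this crude makes the relationship between the parts visible at a glance: the first line is the component to investigate first.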

The Twitter team describes this measurement data problem as follows:

"Debugging performance issues is really hard. But it's not hard due to a lack of data; in fact, the difficulty arises because there is too much data. We measure dozens of metrics per individual site request, which, when multiplied by the overall site traffic, is a massive amount of information about Twitter's performance at any given moment. Investigating performance problems in this world is more of an art than a science. It's easy to confuse causes with symptoms and even the data recording software itself is untrustworthy."

By visualizing the performance data, the Twitter team discovered that the problem lay in their distributed caching subsystem, which is based on Memcached: the throughput of data it delivered decayed during peak loads. Armed with this information, they tackled the problem in two ways: reducing the volume of calls to Memcached (they found that 7 out of 17 calls to Memcached were unnecessary) and beefing up the Memcached cluster.
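I don't know exactly why those calls were unnecessary, but one common cause is asking the cache for the same key several times within a single request. The Python sketch below illustrates that one case only, as an assumption of mine rather than a description of Twitter's fix. `CountingCache` is a stand-in that counts round trips, not a real Memcached client, and `RequestScopedCache` is a hypothetical wrapper.

```python
class CountingCache:
    """Stand-in for a remote cache client; counts round trips.

    This is not a real Memcached API, just enough to show the idea.
    """

    def __init__(self, data):
        self._data = dict(data)
        self.calls = 0

    def get(self, key):
        self.calls += 1
        return self._data.get(key)


class RequestScopedCache:
    """Remembers lookups for the lifetime of one request so repeats stay local."""

    _MISSING = object()

    def __init__(self, backend):
        self._backend = backend
        self._seen = {}

    def get(self, key):
        value = self._seen.get(key, self._MISSING)
        if value is self._MISSING:
            # First time this request asks for the key: go to the backend.
            value = self._backend.get(key)
            self._seen[key] = value
        return value


backend = CountingCache({"user:1": "alice", "timeline:1": [101, 102]})
request_cache = RequestScopedCache(backend)

# Hypothetical request that asks for the same keys more than once.
for key in ["user:1", "timeline:1", "user:1", "user:1", "timeline:1"]:
    request_cache.get(key)

print(f"lookups requested: 5, backend round trips: {backend.calls}")
```

The wrapper is created per request and thrown away afterwards, so it trades a little memory for fewer network round trips without changing what the rest of the code sees.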

Transparency

There is a further point to make about the Twitter blog post: it is another example of the Twitter team doing their engineering in public. This transparency gives them credibility (Amazon and Salesforce.com, are you listening?) and positively affects Twitter’s relationship with their customers, potential paying customers, and investors. I have blogged about the work of the Twitter engineering team before, and their commitment to transparency continues to impress me. I'd rather hear my service providers say "look, we have a problem, here's how we measured it, and here are the steps we are taking to resolve it" than have them operate on a "Wizard-of-Oz-behind-the-magic-curtain" basis.


Posted: 2010-02-22 17:28:49
