I moderated a discussion with Jesse Robbins, Kripa Krishnan, John Allspaw about Learning to Embrace Failure. This is the first time you'll see Google reveal what they've been doing since 2006. Read the entire discussion in the new issue of ACM Queue magazine: Resilience Engineering: Learning to Embrace Failure
Participants include Jesse Robbins, the architect of GameDay at Amazon, where he was officially called the Master of Disaster. Robbins used his training as a firefighter in developing GameDay, following similar principles of incident response. He left Amazon in 2006 and founded the Velocity Web Performance and Operations Conference, the annual O'Reilly meeting for people building at Internet scale. In 2008, he founded Opscode, which makes Chef, a popular framework for infrastructure automation. Running GameDay operations on a slightly smaller scale is John Allspaw, senior vice president of technological operations at Etsy. Allspaw's experience includes stints at Salon.com and Friendster before joining Flickr as engineering manager in 2005. He moved to Etsy in 2010. He also recently took over as chair of the Velocity conference from Robbins. Google's equivalent of GameDay is run by Kripa Krishnan, who has been with the program almost from the time it started six years ago. She also works on other infrastructure projects, most of which are focused on the protection of users and their data.
The full article is here: http://queue.acm.org/detail.cfm?id=2371297
This is the 2nd of 3 articles on the subject that I'm involved with. Part 1 was published last week. Part 3 is kind of a surprise and will be out in less than a month. Watch my blog for the announcement.