This year LISA is in Washington, D.C., from Nov 8-13. If you are on the East Coast, this is a good opportunity to attend the premier system administration conference.

Register now.

This year's schedule is packed with amazing talks. I'd like to point out...

  • "Go for Sysadmins" from Chris "Mac" McEniry, Sony Network Entertainment
  • "Neighborly Nagios" from David Josephsen, Librato
  • "systemd, the Next-Generation Linux System Manager" from Alison Chaiken, Mentor Graphics
  • "Software Defined Networking: Principles and Practice" from Nick Feamster, Princeton University
  • "How to Not Get Paged: Managing On-call to Reduce Outages" from Thomas A. Limoncelli, Stack Overflow

Register now.

Posted by Tom Limoncelli

You're gonna want this book. Pre-order it now.

(Pre-orders are paper-only right now; it should be available on Kindle soon. The official release date is Oct 25.)

This is the best book I've ever read about Postmortems and creating a Blameless operations culture.


Posted by Tom Limoncelli in DevOps

I had an interesting conversation with Ryan Coleman, product manager at Puppet Labs. He gave me a preview of some of the things being announced soon and highlighted at PuppetConf next week. If you can't attend, you can livestream the conference for free. In particular, the keynote is on Thurs, Oct 8th at 9am PT (noon ET).

How to livestream the entire conference is here:

It isn't too late to grab a ticket and attend in-person!


Posted by Tom Limoncelli in Puppet

I was interviewed about DevOps, automation, and more. Interestingly enough, the person interviewing me was Barry Burd, a professor of mine 20 years ago.

View it here:


Posted by Tom Limoncelli

Next week's "LISA Conversations" podcast will be a discussion about the LISA '14 talk "Making Push on Green a Reality". We'll be interviewing the presenter, Daniel V. Klein, about the talk and what he has to say about it nearly a year later.

"Push On Green" means automatically pushing code to production with no human gates. If all the tests pass, the new code is pushed to production automatically. This enables Google to push code more frequently and with higher confidence than (for example) monthly or weekly code pushes.
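To make the idea concrete, here is a minimal sketch of a "push on green" gate. It is only an illustration of the rule described above, not how Google (or anyone else) actually implements it; the function name `push_on_green` and the stand-in checks and deploy step are hypothetical.

```python
from typing import Callable, Iterable


def push_on_green(checks: Iterable[Callable[[], bool]],
                  deploy: Callable[[], None]) -> bool:
    """Run every check and deploy only if all of them pass.

    Hypothetical helper: `checks` are callables returning True on success,
    `deploy` is whatever pushes the new code to production.
    """
    # Run all checks (no short-circuit) so every failure would be visible.
    results = [check() for check in checks]
    # An empty list of checks is treated as "not green" to avoid blind pushes.
    if results and all(results):
        deploy()
        return True
    return False


# Stand-ins for a real test suite and a real deploy step.
deployed = []
ok = push_on_green([lambda: True, lambda: True],
                   lambda: deployed.append("v2"))
blocked = push_on_green([lambda: True, lambda: False],
                        lambda: deployed.append("v3"))
```

The point of the sketch is that no human sits between "tests passed" and "code is live"; the only gate is the test results.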

Watch the video from LISA '14 and get ready to watch us record the podcast live on September 29, 2015, at 3:30pm PDT. We take questions live during the session. If you can't tune in live, the video will be available shortly after.

You won't want to miss this!

See you there!

For more info about LISA Conversations, visit

Posted by Tom Limoncelli in LISA Conversations

TV Alerts


(the links point to TiVo's page to set up 1-step recording for that series)

Posted by Tom Limoncelli

We'll be recording Episode 3 of USENIX LISA Conversations on Tuesday, September 29, 2015.

Our next conversation will be with Daniel V. Klein who presented "Making Push on Green a Reality" at LISA14. Watch his talk beforehand, and then join us at 3:30 pm PDT/6:30 pm EDT on Tuesday, September 29, 2015, at the Google Hangout On Air. We'll discuss the talk and what he's been doing since. If you miss the live session, you can view the recording on the USENIX YouTube channel.

This month's hosts will be Lee Damon and Tom Limoncelli (me!).

More info about the series can be found on the USENIX LISA Conversations homepage.

Posted by Tom Limoncelli in LISA Conversations

After listening to Jon Taffer's interview on The Nerdist Podcast about "Bar Rescue", I'm convinced that I should do a TV show called "IT Rescue" where we visit an IT department that is failing hard and set them up for success.

Hollywood... call me!

Posted by Tom Limoncelli

I hadn't realized that Google Play permits book reviews. Strata, Christine, and I are very pleased to read these:

Ivan Dimitrov wrote:

Simplely the best book for system administrators and their managers. Packed with great stuff from first page to the last. If you have to read one chapter - it's the Appendix A :)

Adrian Colley wrote:

This book covers about 85% of what any programmer needs to know to be a fully competent Google Site Reliability Engineer. It's written like a textbook for a training course, but it serves well as a reference text. I never tire of recommending it to my colleagues, even though spreading this knowledge reduces the scarcity of my personal marketable skill set. Despite the use of "cloud" in the title, this is not just for 1000+-node IaaS providers or their customers. It is a guide to modern system administration techniques that any online business will need eventually if it dreams of being depended on by millions of users. (Full disclosure: I used to work in Google SRE, and while there I even got Tom to autograph one of his other books for me.)

Adrian also wrote on Google Plus:

This has become my favourite technical book. It codifies many of the lessons learned the hard way through experience in Internet site operations, but which are not written down anywhere else. It's like a book that fell through a time warp from 10 years in the future. The best bit is that I know for a fact that everything in it is true, because my time at Google permitted me to see these lessons being learned the hard way (that is, through outages, post-mortem analyses, and war stories).

Thanks, folks! Keep writing those reviews!

Someone wrote to me recently asking for advice about how to re-organize his company's documentation stash. Basically they had a directory on a fileserver that had become a free-for-all, collect-everything "cosmic abyss" (his words). Tons of documents. No organization. Most of it out-of-date or of unknown quality.

Did I have any advice that didn't involve complex document control philosophy and best practices?


Here's a strategy I've used at 2 different organizations. It is very simple and low-overhead:

Find a way to mark all old docs as "old", then find a way to review docs and mark them as "reviewed". You might convince your team to do a day-long "review day" (or maybe "week") where other projects are put on hold and people try to do all the reviews. What doesn't get reviewed is just left somewhere that people can find.

We recently re-organized our wiki this way. We made 5 new subfolders: procedures, servicedocs, templates, styleguides, policies (plus a "trash" subfolder). We moved all legacy docs (the entire hierarchy) into a subfolder called "unreviewed".

To track the reviews, we listed all the unreviewed docs in a Google spreadsheet. The spreadsheet had 3 columns: filename, volunteer, status.
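If you have a lot of files, you don't have to type that list by hand. Here is a rough sketch of how the starting sheet could be generated as CSV and then imported into a shared spreadsheet. It assumes the legacy docs live in a plain directory named "unreviewed"; the function names `tracking_rows` and `to_csv` and the sample files are made up for the example.

```python
import csv
import io
import tempfile
from pathlib import Path

COLUMNS = ["filename", "volunteer", "status"]


def tracking_rows(unreviewed_dir: Path) -> list[dict[str, str]]:
    """Return one row per file under unreviewed_dir, unclaimed and not done."""
    names = sorted(
        p.relative_to(unreviewed_dir).as_posix()
        for p in unreviewed_dir.rglob("*")
        if p.is_file()
    )
    return [{"filename": n, "volunteer": "", "status": ""} for n in names]


def to_csv(rows: list[dict[str, str]]) -> str:
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Demo against a throwaway directory with two made-up legacy docs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "unreviewed"
    (root / "oncall").mkdir(parents=True)
    (root / "oncall" / "pager.txt").write_text("old notes")
    (root / "backup-howto.txt").write_text("old notes")
    sheet = to_csv(tracking_rows(root))

print(sheet)
```

The volunteer and status columns start out blank on purpose: people fill them in as they claim and finish docs.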

We picked a day to do our "wiki fixit". People spent a day reviewing docs and had permission to put all non-emergency work on hold. They'd pick a doc to work on and write their name in column 2 of the spreadsheet to "own" it. They'd review the doc, moving it into the right folder (or "trash"). When done, they'd write the word "DONE" in the "status" column. In a day we got all the important docs reviewed and moved into one of the 5 new places. The remaining docs were mostly obsolete crap.

Weeks later we would still find an occasional doc that was still "unreviewed" but it was easy to move it to the right folder. Some day we'll be brave and remove the "trash" and "unreviewed" folders, but it's only disk space so that day may never come.

This was nice because it gave us a 'clean slate' feeling while still carrying over the important docs.

If you haven't used a multi-user spreadsheet for something like that, I highly recommend it. Everyone can see what everyone else is doing. You can reserve in advance docs you want to work on. It creates peer-pressure to get a lot done, since everyone can see who the laggards are. It also creates a record of who got how much done, which is useful if you want to gamify the process and give rewards for most documents processed, etc. A good reward, IMHO, is to have the company take everyone out to dinner at the end. To go to the dinner you must have participated. The person that completed the most docs gets to pick the restaurant.
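If you export the sheet at the end of the day, tallying the results takes a few lines. This is just a sketch: it assumes each row has the same three columns (filename, volunteer, status) and that finished docs are marked "DONE"; the `leaderboard` function and the sample names are invented for illustration.

```python
from collections import Counter


def leaderboard(rows: list[dict[str, str]]) -> list[tuple[str, int]]:
    """Count DONE docs per volunteer, highest count first."""
    done = Counter(
        row["volunteer"].strip()
        for row in rows
        if row["status"].strip().upper() == "DONE" and row["volunteer"].strip()
    )
    return done.most_common()


# Made-up export of the tracking sheet.
rows = [
    {"filename": "backup-howto.txt", "volunteer": "alice", "status": "DONE"},
    {"filename": "oncall/pager.txt", "volunteer": "bob", "status": "DONE"},
    {"filename": "vpn-setup.txt", "volunteer": "alice", "status": "done"},
    {"filename": "old-rack-map.txt", "volunteer": "carol", "status": ""},
    {"filename": "mystery.doc", "volunteer": "", "status": ""},
]

board = leaderboard(rows)
restaurant_picker = board[0][0] if board else None
```

Docs that were claimed but never finished don't count, and neither do rows nobody claimed.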

Posted by Tom Limoncelli
