Recently in DevOps Category

One of the most anticipated DevOps books in years is about to start shipping! DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations by Gene Kim, Jez Humble, Patrick Debois, and John Willis is the practical guide to doing all the wonderful things that The Phoenix Project talks about.

I've received an early copy of the book and it is excellent. It is very down-to-earth, practical advice. I'll write more next week when I've had time to read the entire thing.

You can pre-order it directly from IT Revolution or via Amazon.

Check it out!

Posted by Tom Limoncelli in DevOps

One of the things my team at StackOverflow does is maintain the CI/CD system which builds all the software we use and produce. This includes the Stack Exchange Android App.

Automating the CI/CD workflow for Android apps is a PITA. The process is full of trips and traps. Here are some notes I made recently.

First, [this is the paragraph where I explain why CI/CD is important. But I'm not going to write it because you should know how important it is already. Plus, Google definitely knows already. That is why the need to write this blog post is so frustrating.]

And therefore, there are two important things that vendors should provide that make CI/CD easy for developers:

  • Rule 1: Builds should work from the command line on a multi-user system.
    1. Builds must work from a script, with no UI popping up. A CI system only has stdin/stdout/stderr.
    2. Multiuser systems protect files that shouldn't be modified from being modified.
    3. The build process should not rely on the internet. If it must download a file during the build, then we can't do builds if that resource disappears.
  • Rule 2: The build environment should be reproducible in an automated fashion.
    1. We must be able to create the build environment on a VM, tear it down, and build it back up again. We might do this to create dozens (or hundreds) of build machines, or we might delete the build VM between builds.
    2. This process should not require user interaction.
    3. It should be possible to automate this process, in a language such as Puppet or Chef. The steps should be idempotent.
    4. This process should not rely on any external systems.
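Rule 1 can be spot-checked before a build ever reaches the CI server. The following is a minimal sketch, not part of any vendor tooling: run_headless is a hypothetical helper name that runs a command with no terminal input and no X display, roughly the way a CI agent would. A build step that hangs or fails under it is probably expecting a human or a GUI.

```shell
# Hypothetical helper (not from any SDK): run a command the way a CI agent
# would, with stdin closed off and no X display available.
run_headless() {
    # "env -u DISPLAY" removes DISPLAY so GUI toolkits cannot open a window
    # (-u is supported by GNU and BSD env). Reading from /dev/null makes any
    # interactive prompt hit end-of-file immediately instead of waiting.
    env -u DISPLAY "$@" </dev/null
}
```

For example, run_headless ./build.sh (where build.sh stands in for your real build command) returns the build's own exit status.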

Android builds can be done from the command line. However, the process itself updates files in the build area. Creating the build environment simply cannot be automated without repackaging all of the files (something I'm not willing to do).

Here are my notes from creating a CI/CD system using TeamCity (a commercial product comparable to Jenkins) for the StackOverflow mobile developers:

Step 1. Install Java 8

The manual way:

CentOS has no pre-packaged Oracle Java 8 package. Instead, you must download it and install it manually.

Method 1: Download it from the Oracle web site. Pick the latest release, 8uXXX where XXX is a release number. (Be sure to pick "Linux x64" and not "Linux x86").

Method 2: Use the above web site to figure out the URL, then use this code to automate the downloading: (H/T to this SO post)

# cd /root
# wget --no-cookies --no-check-certificate --header \
    "Cookie:; oraclelicense=accept-securebackup-cookie" \

Dear Oracle: I know you employ more lawyers than engineers, but FFS please just make it possible to download that package with a simple curl or wget. Oh, and the fact that the certificate is invalid means that if this did come to a lawsuit, people would just claim that a MITM attack forged their agreement to the licence.

Install the package:

# yum localinstall jdk-8u102-linux-x64.rpm

...and make a symlink so that our CI system can specify JAVA8_HOME=/usr/java/jdk and not have to update every individual configuration.

# ln -sfn /usr/java/jdk1.8.0_102 /usr/java/jdk
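Rule 2 asks for idempotent steps, and the symlink is an easy one to make idempotent. This is a minimal sketch rather than what our Puppet code does; ensure_symlink is a hypothetical helper name. It reports whether it changed anything, so automation that runs twice an hour stays quiet when there is nothing to do.

```shell
# Point a stable symlink at a versioned directory, changing nothing if the
# link is already correct. Prints "changed" or "unchanged".
ensure_symlink() {
    local target=$1
    local link=$2
    if [ -L "$link" ] && [ "$(readlink "$link")" = "$target" ]; then
        echo "unchanged"
        return 0
    fi
    # -n replaces an existing symlink itself instead of creating a new
    # link inside the directory the old symlink points to.
    ln -sfn "$target" "$link" && echo "changed"
}
```

Usage would look like: ensure_symlink /usr/java/jdk1.8.0_102 /usr/java/jdk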

We could add this package to our YUM repo, but the benefit would be negligible, plus it is questionable whether the license permits this.

EVALUATION: This step violates Rule 2 above because the download process is manual. It would be better if Oracle provided a YUM repo. In the future I'll probably put it in our local YUM repo. I'm sure Oracle won't mind.

Step 2. Party like it's 2010.

The Android tools are compiled for 32-bit Linux. I'm not sure why. I presume it is because they want to be friendly to the few developers out there that still do their development on 32-bit Linux systems.

However, I have a few other theories: (a) The Android team has developed a time machine that lets them travel back to 2010 because I happen to know for a fact that Google moved to 64-bit Linux internally around 2011; they created teams of people to find and eliminate any 32-bit Linux hosts. Therefore the only way the Android team could actually still be developing on 32-bit Linux is if they either hid their machines from their employer, or they have a time machine. (b) There is no "b". I can't imagine any other reason, and I'm jealous of their time machine.

Therefore, we install some 32-bit libraries to gain backwards compatibility. We do this and pray that the other builds happening on this host won't get confused. Sigh. (This is one area where containers would be very useful.)

# yum install -y glibc.i686 zlib.i686 libstdc++.i686

EVALUATION: B-. Android should provide 64-bit binaries.

Step 3. Install the Android SDK

The SDK has a command-line installer. The URL is obscured, making it difficult to automate this download. However you can find the current URL by reading this web page, then clicking on "Download Options", and then selecting Linux. The last time we did this, the URL was:

You can install this in 1 line:

cd /usr/java && tar xzpvf /path/to/android-sdk_r24.4.1-linux.tgz

EVALUATION: Violates Rule 2 because it is not in a format that can easily be automated. It would be better to have this in a YUM repo. In the future I'll probably put this tarfile into an RPM with an install script that untars the file.

Step 4. Install/update the SDK modules.

Props to the Android SDK team for making an installer that works from the command line. Sadly it is difficult to figure out which modules should be installed. Once you know the modules you need, specifying them on the command line is "fun"... which is my polite way of saying "ugly."

First I asked the developers which modules they need installed. They gave me a list, which was wrong. It wasn't their fault. There's no history of what got installed. There's no command that shows what is installed. So there was a lot of guess-work and back-and-forth. However, we finally figured out which modules were needed.

The command to list all modules is:

/usr/java/android-sdk/tools/android list sdk -a

The modules we happened to need are:

  1- Android SDK Tools, revision 25.1.7
  3- Android SDK Platform-tools, revision 24.0.1
  4- Android SDK Build-tools, revision 24.0.1
  6- Android SDK Build-tools, revision 23.0.3
  7- Android SDK Build-tools, revision 23.0.2
  9- Android SDK Build-tools, revision 23 (Obsolete)
 19- Android SDK Build-tools, revision 19.1
 29- SDK Platform Android 7.0, API 24, revision 2
 30- SDK Platform Android 6.0, API 23, revision 3
 39- SDK Platform Android 4.0, API 14, revision 4
141- Android Support Repository, revision 36
142- Android Support Library, revision 23.2.1 (Obsolete)
149- Google Repository, revision 32

If that list looks like it includes a lot of redundant items, you are right. I don't know why we need 5 versions of the build tools (one of which is marked "obsolete") and 3 versions of the SDK. However I do know that if I remove any of those, our builds break.

You can install these with this command:

/usr/java/android-sdk/tools/android update sdk \
    --no-ui --all --filter 1,3,4,6,7,9,19,29,30,39,141,142,149

However there's a small problem with this. Those numbers might be different as new packages are added and removed from the repository.

Luckily there is a "name" for each module that (I hope) doesn't change. However the names aren't shown unless you specify the -e option:

# /usr/java/android-sdk/tools/android list sdk -a -e

The output looks like:

Packages available for installation or update: 154
id: 1 or "tools"
     Type: Tool
     Desc: Android SDK Tools, revision 25.1.7
id: 2 or "tools-preview"
     Type: Tool
     Desc: Android SDK Tools, revision 25.2.2 rc1
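Those quoted names can be pulled out of the listing mechanically. A minimal sketch, assuming every module line looks exactly like the id: N or "name" lines above (the real output may contain other lines, which are ignored); sdk_module_names is a hypothetical helper name:

```shell
# Read the output of "android list sdk -a -e" on stdin and print the quoted
# name from each line of the form:   id: 1 or "tools"
sdk_module_names() {
    sed -n 's/^id: [0-9][0-9]* or "\(.*\)"$/\1/p'
}
```

It would be used as a filter: /usr/java/android-sdk/tools/android list sdk -a -e | sdk_module_names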

Therefore a command that will always install that set of modules would be:

/usr/java/android-sdk/tools/android update sdk --no-ui --all \
    --filter tools,platform-tools,build-tools-24.0.1,\

Feature request: The name assigned to each module should be listed in the regular listing (without the -e) or the normal listing should end with a note: "For details, add the -e flag."

EVALUATION: Great! (a) Thank you for the command-line tool. The docs could be a little bit better (I had to figure out the -e trick) but I got this to work. (b) Sadly, I can't automate this with Puppet/Chef because they have no way of knowing if a module is already installed, therefore I can't make an idempotent installer. Without that, the automation would blindly re-install the modules every time it runs, which is usually twice an hour. (c) I'd rather have these individual modules packaged as RPMs so I could just install the ones I need. (d) I'd appreciate a way to list which modules are installed. (e) update should not re-install modules that are already installed, unless a --force flag is given. What are we, barbarians?
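One workaround for point (b), sketched under assumptions: since the SDK offers no "is this installed?" query, the automation can keep its own record. This is a stamp-file technique swapped in from general sysadmin practice, not something the Android tools provide. run_once and the marker path are hypothetical names, and the marker only records that the command succeeded once, not that the module is still intact.

```shell
# Hypothetical helper: run a command only if its marker file is absent, and
# create the marker only when the command succeeds.
run_once() {
    local marker=$1
    shift
    if [ -e "$marker" ]; then
        return 0    # already done on an earlier run; do nothing
    fi
    "$@" && touch "$marker"
}
```

A call might look like: run_once /var/lib/sdk-stamps/platform-tools /usr/java/android-sdk/tools/android update sdk --no-ui --all --filter platform-tools (the /var/lib/sdk-stamps directory is made up for illustration and would need to exist). Puppet's exec resource has a "creates" attribute that plays a similar role.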

Step 4: Install license agreements

The software won't run unless you've agreed to the license. According to Android's own website you do this by asking a developer to do it on their machine, then copy those files to the CI server. Yes. I laughed too.

EVALUATION: There's no way to automate this. In the future I will probably make a package out of these files so that we can install them on any CI machine. I'm taking suggestions on what I should call this package. I think android-sdk-lie-about-license-agreements.rpm might be a good name.
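If you do go the copy-the-files route, the mechanical part is small. A minimal sketch, assuming the accepted-license files live in a "licenses" directory under the SDK root (that layout is an assumption; check your own install) and that you have already saved a copy of that directory from a developer's machine. install_sdk_licenses is a hypothetical helper name.

```shell
# Copy license-acceptance files saved from a developer's machine into an
# SDK install. Both paths are supplied by the caller.
install_sdk_licenses() {
    local saved_dir=$1    # directory of license files copied from a developer
    local sdk_root=$2     # e.g. /usr/java/android-sdk
    # Assumption: the SDK looks for the files in <sdk_root>/licenses.
    mkdir -p "$sdk_root/licenses" &&
        cp -p "$saved_dir"/* "$sdk_root/licenses/"
}
```

Running it twice is harmless because it overwrites the same files with the same contents.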

Step 5: Fix stupidity.

At this point we thought we were done, but the app build process was still breaking. Sigh. I'll save you the long story, but basically we discovered that the build tools want to be able to write to /usr/java/android-sdk/extras

It isn't clear if they need to be able to create files in that directory or write within the subdirectories. Fuck it. I don't have time for this shit. I just did:

chmod 0775 /usr/java/android-sdk/extras
chown $BUILDUSER /usr/java/android-sdk
chown -R $BUILDUSER /usr/java/android-sdk/extras

("$BUILDUSER" is the username that does the compiles. In our case it is teamcity because we use TeamCity.)

Maybe I'll use my copious spare time some day to figure out if the -R is needed. I mean... what sysadmin doesn't have tons of spare time to do science experiments like that? We're all just sitting around with nothing to do, right? In the meanwhile, -R works so I'm not touching it.
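For anyone who does have the spare time, the experiment is not hard to set up. A minimal sketch, assuming it is acceptable to run one throwaway build: touch a marker file, run the build, then list what changed under the SDK directory. files_changed_since is a hypothetical helper name, and it only catches files that were created or modified, not ones that were merely opened for writing.

```shell
# List regular files under a directory that were modified after a marker
# file was last touched.
files_changed_since() {
    local marker=$1
    local dir=$2
    find "$dir" -type f -newer "$marker"
}
```

The sequence would be: touch /tmp/before-build, run a build, then files_changed_since /tmp/before-build /usr/java/android-sdk. Whatever it prints is the set of paths that really need to be writable.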

EVALUATION: OMG. Please fix this, Android folks! Builds should not modify themselves! At least document what needs to be writable!

Step 6: All done!

At this point the CI system started working.

Some of the steps I automated via Puppet, the rest I documented in a wiki page. In the future when we build additional CI hosts Puppet will do the easy stuff and we'll manually do the rest.

I don't like having manual steps but at our scale that is sufficient. At least the process is repeatable now. If I had to build dozens of machines, I'd wrap all of this up into RPMs and deploy them. However then the next time Android produces a new release, I'd have to do a lot of work wrapping the new files in an RPM, testing them, and so on. That's enough effort that it should be in a CI system. If you find that you need a CI system to build the CI system, you know your tools weren't designed with automation in mind.

Hopefully this blog post will help others going through this process.

If I have missed steps, or if I've missed ways of simplifying this, please post in the comments!

P.S. Dear Android team: I love you folks. I think Android is awesome and I love that you name your releases after desserts (though I was disappointed that "L" wasn't Limoncello.... but that's just me being selfish.). I hope you take my snark in good humor. I am a sysadmin that wants to support his developers as best he can and fixing these problems with the Android SDK would really help. Then we can make the most awesome Android apps ever.... which is what we all want. Thanks!

Posted by Tom Limoncelli in DevOps, Technical Tips

Imagine if job advertisements were completely honest. Most companies advertising for IT workers would state that the job is mostly great except for twice a year when "hell month" arrives and everyone scrambles to deploy the new release of some major software system. This month is so full of stress, fear, and blame that it makes you hate your employer, your job, and your life. Oh, and by the way, the software releases are often late, so you can't predict which month will be hell month. As a result, you can't schedule any kind of vacation. Without time off to relax, stress builds and makes your life even worse.

Sadly, at many companies hell month is every month.

A company that adopts the DevOps principles is different. A rapid release environment deploys upgrades to production weekly, daily or more often. It is not a stressful event. It is just another day. There is no fear of an upcoming hell month.

Launching new software releases is fully automated and self-service. The developers do it. SRE is only involved for special cases. The SRE team can therefore focus on writing tools to improve operations and so on.

Imagine if an auto manufacturer's employees spent most of their time assembling cars but when a car actually left the factory it was a fearful, stressful, month of hell. It would be unacceptable to run a car company that way. It should be unacceptable to manage technology that way too.

Adopting DevOps techniques is not just better for the company, it is better for you. Over time more and more companies will adopt these techniques not just because it is better for their bottom line, but because they will otherwise find it impossible to recruit and hire technical talent.

Who would want to work anywhere else?

Posted by Tom Limoncelli in DevOps

There are two things you can do if you want to understand the future of system administration.

First, if you want to see what DevOps will be like 5-10 years out, you can read the amazing new book, Site Reliability Engineering: How Google Runs Production Systems. I read a preview copy and it was excellent. Many different Google SRE teams got together to produce a very well-rounded book that covers all aspects of Google's SRE program, which is easily 5-10 years ahead of the industry. (Pre-order from O'Reilly or Amazon Kindle or Paper) Congrats to the editors Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy on a great addition to the IT canon.

Second, if you want to see what SRE will look like in 30-50 years, you should watch the 2009 movie "Moon" starring Sam Rockwell. Anything I say would be a spoiler, so you'll just have to trust me. (BTW, the trailer is full of spoilers. Don't watch it!)

Congrats to all of Google SRE on the publication of the new O'Reilly book! I predict it will be a big hit! (and thanks for letting me blurb the back cover!)

[P.S. I apologize in advance for the very link-bait'y title. What I mean to say is that Google SRE is devops at an incredible scale.]

Posted by Tom Limoncelli in DevOps

Whether or not you are in a DevOps environment, please take this survey. The data is useful for helping improve the situation for system administrators of all kinds.

Posted by Tom Limoncelli in DevOps

Wait... you didn't know there are songs about DevOps? Hell to the yeah!

Best DevOps Song of 2015: Uptown Funk (Mark Ronson ft. Bruno Mars)

Uptown Funk is exemplary of good DevOps operations: It encourages being evidence-driven.

An important principle of DevOps is that you should base decisions on evidence and data, not lore and intuition. Intuition is great but only gets you so far. With a tiny system it is possible for a single sysadmin to know enough about it to make good guesses. However modern systems are complex enough that we must collect data, analyze it, and base decisions on that data.

This means we must also be willing to revert a change if the data doesn't pan out as we predict. That's the scientific method. We measure something, we do an experiment, and we measure again. Then we decide whether we keep the change based on the data. Of course, this requires that our systems are observable, which means the days of un-monitorable systems are long gone.

A more scientific way of saying this is DevOps insists that we prove our assertions by gathering empirical evidence.

So, why is Uptown Funk exemplary of this DevOps principle? Well... duh! The point of this song can be summed up by this line:

Cause Uptown Funk gon' give it to you

Now let me tell you something. A lot of people don't believe that the uptown funk gon' give it to you. I understand. It might be difficult to believe if you have not experienced the uptown funk. I don't mean to brag, but I happen to have a lot of uptown funk-related experience. However, you shouldn't believe me just because I have so much experience.

Likewise, Uptown Funk's author Mark Ronson doesn't insist that you simply believe him. He assures you that if you use scientific principles, you will discover on your own just how right he is. In particular, he says:

Don't believe me, just watch

See? Total devops.

Sure, he could have said, "empirical evidence proves my assertion" but he states it more lyrically.

Let me quote the entire chorus to be clear:

Girls hit your hallelujah (whoo)
Girls hit your hallelujah (whoo)
Girls hit your hallelujah (whoo)
'Cause uptown funk gon' give it to you
'Cause uptown funk gon' give it to you
'Cause uptown funk gon' give it to you
Saturday night and we in the spot
Don't believe me just watch (come on)

Mark feels so strongly about being evidence-driven that he implores you to do so 3 times in a row!

Worst DevOps Song of 2015: Shake It Off (Taylor Swift)

Technically Shake It Off was released in 2014 but it won the 2015 Grammy and People's Choice Awards so I count it as 2015 too.

To be clear, I love the song from a music and dance perspective. In fact, I own a copy of the CD that this song appears on. By the way, the CD is titled "1989" which refers to the year she was born. [For those of you that were born in 1989 and are reading this, a CD is a way that people used to distribute music. Ask your parents.]

That said, I don't think it exemplifies good DevOps practice.

The main message of the song is simple: ignore the negativity in your life

Or, as she says it:

'Cause the players gonna play, play, play, play, play
And the haters gonna hate, hate, hate, hate, hate
Baby, I'm just gonna shake, shake, shake, shake, shake
I shake it off, I shake it off

Now if TayTay had a better DevOps background it would be more like this:

  • Teams should be encouraged to openly discuss disagreements.
  • File bugs. Don't suffer in silence. Amplify feedback.
  • Blameless postmortems for major outages.

Making that rhyme is left as an exercise for the reader.

Openly discussing disagreements is an important part of breaking down silos. When I was at Google I held periodic team-to-team meetings which would be open forums to discuss a process that affected both teams. Both sides would list out the steps in the process and point out the "pain points" and annoyances about the process. Many bugs and feature requests would be filed during these meetings and often bugs would be fixed in real time. Engineers would have their laptops open and would fix minor issues during the meeting.

Of course we need a way to deal with larger issues too. If enough haters hate, hate, hate, we should perform a postmortem. 2015 brought us a great book on the topic, Beyond Blame by Dave Zwieback.

Sorry, T-Sway, I can't just shake it off. We need to get to the bottom of this by finding the contributing factors.

I hope you agree that we should be more evidence driven, that the uptown funk gon' give it to you, and haters should file bugs or otherwise open more productive channels of communication.

I hope to bring more DevOps music reviews to you in 2016.

Feel free to post your DevOps-related songs in the comments!

Happy new year and have a great 2016!

Posted by Tom Limoncelli in DevOps, Funny

This month's NYCDevOps meeting (hosted at the Stack Overflow HQ) has special guest speakers Bridget Kromhout and Casey West talking about running Docker images in Cloud Foundry's Elastic Runtime and orchestrating containerized workloads on Lattice.

  • Date: Tuesday, December 15, 2015
  • Time: 6:30 PM
  • Place: The Stack Overflow HQ (near Wall St.)
  • You must RSVP and bring an ID to get into the building.

You should join me at this Meetup. Check it out and RSVP!

Posted by Tom Limoncelli in Community, DevOps

Adam Bertram wrote an excellent piece in InfoWorld: 7 signs you're doing devops wrong

Posted by Tom Limoncelli in DevOps

You're gonna want this book. Pre-order it now.

(Pre-orders are paper right now; it should be available on Kindle soon. Official release date is Oct 25)

This is the best book I've ever read about Postmortems and creating a Blameless operations culture.


Posted by Tom Limoncelli in DevOps

Pearson is running a promotion through August 11th on many open-source related books, including Volume 2, The Practice of Cloud System Administration. Use discount code OPEN2015 during checkout and receive 35% off any one book, or 45% off 2 or more books.

See the website for details.

Posted by Tom Limoncelli in DevOps

Someone on Quora recently asked, Why did Google include the 'undo send' feature on Gmail?. They felt that adding the 30-second delay to email delivery was inefficient. However rather than answering the direct question, I explained the deeper issue. My (slightly edited) answer is below. NOTE: While I previously worked at Google, I was never part of the Gmail team, nor do I even know any of their developers or the product manager(s). What I wrote here is true for any software company.

Why did Google include this feature? Because the "Gmail Labs" system permits developers to override the decisions of product managers. This is what makes the "Labs" system so brilliant.

A product manager has to decide which features to implement and which not to. This is very difficult. Each new feature takes time to design (how will it work from the user perspective), architect (how will the internals work), implement (write the code that makes it all happen), and support (documentation, and so on). There are only so many hours in the day, and only so many developers assigned to Gmail. The product manager has to say "no" to a lot of good ideas.

If you were the product manager, would you select features that are likely to attract millions of new users, or features that help a few existing users have a slightly nicer day? Obviously you'll select the first category. IMHO Google is typically concerned with growth, not retention. New users are more valuable than slight improvements that will help a few existing users. Many of these minor features are called "fit and finish"... little things that help make the product sparkle, but aren't things you can put in an advertisement because they have benefits that are intangible or would only be understood by a few. Many of the best features can't be appreciated or understood until they are available for use. When they are "on paper", it is difficult to judge their value.

Another reason a product manager may reject a proposed feature is politics. Maybe the idea came from someone that the product manager doesn't like, or doesn't trust. (possibly for good reason)

The "Labs" framework of Google products is a framework that lets developers add features that have been rejected by the product manager. Google engineers can, in their own spare time or in the "20% time" they are allocated, implement features that the product manager hasn't approved. "Yes, Mr Product Manager, I understand that feature x-y-z seems stupid to you, but the few people that want it would love it, so I'm going to implement it anyway and don't worry, it won't be an official feature."

The Third Way of DevOps is about creating a culture that fosters two things: continual experimentation (taking risks and learning from failure) and understanding that repetition and practice is the prerequisite to mastery. Before the Labs framework, adding any experimental feature had a huge overhead. Now most of the overhead is factored out so that there is a lower bar to experimenting. Labs-like frameworks should be added to any software product where one wants to improve their Third Way culture.

Chapter 2 of The Practice of Cloud System Administration talks about many different software features that developers should consider to assure that the system can be efficiently managed. Having a "Labs" framework enables features to be added and removed with less operational hassle because it keeps experiments isolated and easy to switch off if they cause an unexpected problem. It is much easier to temporarily disable a feature that is advertised as experimental.

What makes the "Labs" framework brilliant is that it not only gives a safe framework for experimental features to be added, but it gathers usage statistics automatically. If the feature becomes widely adopted, the developer can present hard cold data to the product manager that says the feature should be promoted to become an official feature.

Of course, the usage statistics might also show that the feature isn't well-received and prove the product manager correct.

A better way of looking at it is that the "labs" feature provides a way to democratize the feature selection process and provides a data-driven way to determine which features should be promoted to a more "official" status. The data eliminates politically-driven decision making and "I'm right because my business card lists an important title"-business as usual. This is one of the ways that Google's management is so brilliant.

I apologize for explaining this as an "us vs. them" paradigm i.e. as if the product managers and developers are at odds with each other. However, the labs feature wouldn't be needed if there wasn't some friction between the two groups. In a perfect world there would be infinite time to implement every feature requested, but we don't live in that world. (Or maybe the "Labs" feature was invented by a brilliant product manager that hated to say "no" and wanted to add an 'escape hatch' that encouraged developers to experiment. I don't know, but I'm pessimistic and believe that Labs started as an appeasement.)

So, in summary: Why did Google include the 'undo send' feature on Gmail? Because someone thought it was important, took the time to implement it under the "labs" framework, users loved the feature, and product management promoted it to be an official Gmail feature.

I wish more products had a "labs" system. The only way it could be better is if non-Googlers had a way to add features under the "labs" system too.

Hey Google, when do we get that?

Posted by Tom Limoncelli in DevOps

If you own a Boeing 787 Dreamliner, and I'm sure many of our readers do, you should reboot it every 248 days. In fact, more frequently than that because at about the 248-day mark, the power system will fail due to a software bug.

Considering that 248 days is about 2^31 hundredths of a second, it is pretty reasonable to assume there is a timer with 10 millisecond resolution held in a signed 32-bit int. It would overflow every 248 days.
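The arithmetic is easy to check. A minimal sketch, assuming the counter ticks 100 times per second (10 ms resolution) and overflows after 2^31 ticks, which is an inference from the 248-day figure and not something Boeing has published here; overflow_days is a hypothetical helper name:

```shell
# Whole days until a counter with the given number of value bits overflows
# when it is incremented ticks_per_second times each second.
overflow_days() {
    local bits=$1
    local ticks_per_second=$2
    echo $(( (1 << $bits) / $ticks_per_second / 86400 ))
}

overflow_days 31 100      # signed 32-bit counter, 10 ms ticks
overflow_days 32 100      # unsigned 32-bit counter, 10 ms ticks
overflow_days 31 10000    # signed 32-bit counter, 100 microsecond ticks
```

This prints 248, 497, and 2. Only the first combination matches the 248-day mark; a 100 microsecond timer would have overflowed in about two days.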

"Hell yeah, I did it! I saved 4 bytes every time we store a timestamp. Screw you. It's awesome."
- a software engineer that makes planes but doesn't have to operate them.

Reminds me of all the commercial software I've seen that was written by developers that didn't seem to care, or were ignorant of, the operational realities that their customers live with.

Last week at DevOpsDays NYC 2015 I was reminded time and time again that the most important part of DevOps is shared responsibility: The opposite of workers organized in silos of responsibilities, ignorant and unempathetic to the other silos.

Posted by Tom Limoncelli in DevOps

Where does it come from?

Have you read the 2014 State of DevOps report? The analysis is done by some of the world's best IT researchers and statisticians.

Be included in the 2015 edition!

A lot of the data used to create the report comes from the annual survey done by Puppet Labs. I encourage everyone to take 15 minutes to complete this survey. It is important that your voice and experience are represented in next year's report. Take the survey

But I'm not important enough!

Yes you are. If you think "I'm not DevOps enough" or "I'm not important enough" then it is even more important that you fill out the survey. The survey needs data from sites that are not "DevOps" (whatever that means!) to create the basis of comparison.

Well, ok, I'll do it then!

Great! Click the link:

Thank you,

Posted by Tom Limoncelli in DevOps

As you know, I live in New Jersey. I'm excited to announce that some fine fellow New Jerseyans have started a DevOps Meetup. The first meeting will be on Monday, Aug 18, 2014 in Clifton, NJ. I'm honored to be their first speaker.

More info at their MeetUp Page:

DevOps and Automation NJ Group

Hope to see you there!

Posted by Tom Limoncelli in Community, DevOps

[This article first appeared in the SAGE-AU newsletter.]

Have you heard about the New York City Broadway show Spider-Man: Turn Off the Dark? It should have been a big success. The music was written by Bono and the Edge from U2. It was directed by Julie Taymor, who had previously created many successful shows including The Lion King. Sadly, before it opened, the show was already making headlines due to six actors getting seriously injured and other issues.

The show opened late, but it did finally open. It ran from June 2011 to January 2014.

When the show closed Taymor said that one of the biggest problems with bringing the show to production was that they were using new technology that was difficult to work with. Many of the scenes involved choreography that was highly dependent on pre-programmed robotics. Any changes that involved the robotics required a 5 hour wait.

A 5 hour wait?

Normally directors and choreographers can attempt dozens of changes in a day of rehearsal to get a scene or dance number "just right." Imagine finding yourself in a situation where you can only attempt a new change once or twice a day.

The ability to confidently make changes at will is key to being able to innovate. Innovation means trying new things and keeping what works, throwing away what doesn't. If you can't make changes, then you can't innovate.

Consider the opposite of innovation. We've all been at a company that resists change or has calcified to the point where they are unable to make change. Nothing can get better if you can't change policies, procedures, or technology. Since entropy means things slowly get worse over time, an organization that is unable to change, by definition, is an organization that is spiraling towards doom.

I was reminded of this recently by the Heartbleed security issue. Like most system administrators, I had to spend a lot of time upgrading the software and firmware of nearly every system in my organization. For many of us it meant discovering systems that hadn't been upgraded in so long that the implications were unknown. As sysadmins we wanted to protect ourselves against this security flaw, but we also had to face our own fear of change.

We need to create a world where we are able to change, or "change-able".

There are many factors that enable us to be "change-able". One factor is frequency: we can make change, one after the next, in rapid succession.

Software upgrades: Every 1-3 years there is a new Microsoft Windows operating system and upgrading requires much care and planning. Systems are wiped and reloaded because we are often re-inventing the world from scratch with each upgrade. On the other hand, software that is upgraded frequently requires less testing each time because the change is less of a "quantum leap". In addition, we get better at the process because we do it often. We automate the testing, the upgrade process itself, we design systems that are field-upgradable or hot-upgradable because we have to... otherwise these frequent upgrades would be impossible.

Procedures: Someone recently told me he doesn't document his process for doing something because it only happens once a year and by then the process has changed. Since he has to reinvent the procedure each time the best he can do is keep notes about how the process worked last time. Contrast this to a procedure that is done weekly or daily. You can probably document it well enough that, barring major changes, you can delegate the process to a more junior person.

Software releases: If you collaborate with developers who put out releases infrequently, each release contains thousands of changes. A bug could be in any of those changes. Continuous Delivery systems compile and test the software after every source code change. Any new bugs discovered are likely to be found in the very small change that was recently checked in.
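To make the difference concrete, here is a minimal sketch (not taken from any particular CI product) that compares how many changes are suspect when a build first fails. The change names, the `is_buggy` check, and the batch sizes are all hypothetical, chosen only for illustration.

```python
from typing import Callable, List


def first_failing_batch(
    changes: List[str],
    is_buggy: Callable[[str], bool],
    batch_size: int,
) -> List[str]:
    """Return the changes contained in the first build that fails.

    Each build covers `batch_size` consecutive changes. A build fails if
    any change in it is buggy, so every change in that build is a suspect.
    """
    for start in range(0, len(changes), batch_size):
        batch = changes[start:start + batch_size]
        if any(is_buggy(change) for change in batch):
            return batch
    return []  # no build failed


# Hypothetical history: 1000 changes, exactly one of which has a bug.
changes = [f"change-{i}" for i in range(1, 1001)]


def is_buggy(change: str) -> bool:
    return change == "change-637"


# One big release: all 1000 changes are built and tested together.
infrequent = first_failing_batch(changes, is_buggy, batch_size=1000)

# Build and test after every change, as a Continuous Delivery system does.
continuous = first_failing_batch(changes, is_buggy, batch_size=1)

print(len(infrequent))  # 1000 suspects
print(continuous)       # ['change-637']
```

With the big release, every one of the thousand changes is a suspect. With a build after every change, the failing build points at a single change.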

Another factor in being "change-able" is how difficult it is to make a change.

I've been at companies where making a DNS change required editing 5 different files on two different systems, manually running a series of tests and then saying a prayer. I've been at others where I typed a command to insert or delete the record, and the rest just happened for me.

When changes are difficult to make, we make them less often. We are tempted to avoid any action that requires that kind of change. This has a domino effect that slows and delays other projects. Or, it means we make the decision to live with a bad situation rather than fix it. We settle for less.

When we make changes less frequently, we get worse at doing them. Therefore they become more risky to do. Because they are more risky, we do them even less. It becomes a downward spiral.

DevOps is, if anything, about making operations more "change-able". Everyone has their own definition of DevOps, but what they all have in common is that DevOps makes operations better able to change: change more frequently and change more easily. The result is confidence in our ability to make changes. In that way, confidence is a precondition to being able to innovate.

Which brings us back to Spider-Man Turn Off the Dark. How much innovation could really happen if each change took 5 hours? Imagine paying a hundred dancers, actors, and technicians to do nothing for 5 hours waiting for the next iteration. You can't send them home. You can't tell them "come in for a few minutes every 5 hours". You would, instead, avoid change and settle for less. You would settle for what you have instead of fixing the problems.

Would DevOps have saved Spider-Man? Would a more change-able world make me less fearful of the next Heartbleed?


Posted by Tom Limoncelli in DevOps, Writing

Yulia Sheynkman and Dave Zwieback are repeating their "Awesome Postmortems" workshop on July 10.

It's a great way to get the team--and not just ops--offsite to experience a healthier way of dealing with and learning from failure.

If you are in the NYC-area, this is a great opportunity to learn how to make postmortems an integrated part of how to improve reliability and prevent future outages.

When we wrote our "how to do postmortems" section of the upcoming The Practice of Cloud System Administration, we asked Dave for advice because we respect his expertise. Now you can get a full day of training directly from Yulia and Dave!

(full description below the fold)

You've probably seen experiments where a mouse gets cheese as a reward for pulling a lever. If he or she receives the cheese right away, the brain associates work (pulling the lever) with reward (the cheese) and it motivates the mouse. They want to do more work. It improves job satisfaction.

If the mouse receives the cheese a month later, the brain won't associate the work with the reward. A year later? Fuggedaboutit!

Now imagine you are a software developer, operations engineer, or system administrator working on a software project. The software is released every 6 months. The hard work you do gets a reward every 6 months. Your brain isn't going to associate the two.

Now imagine monthly or weekly releases. The interval between work and reward is shorter. The association is stronger. Motivation and job satisfaction go up.

Now imagine using a continuous build/test system. You see the results of your work in the form of "test: pass" or "test: fail". Instant gratification.

Now imagine using a continuous deploy system. Every change results in a battery of tests which, if they pass, results in the software being launched into production. The interval is reduced to hours, possibly minutes.
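As a rough illustration of that gate, here is a minimal sketch. The function name `deploy_if_green`, the test names, and the `deploy` callback are all hypothetical, not the API of any real CI/CD tool. Every change runs the whole battery of tests and is launched only if all of them pass.

```python
from typing import Callable, Dict, List


def deploy_if_green(
    change_id: str,
    tests: Dict[str, Callable[[], bool]],
    deploy: Callable[[str], None],
) -> bool:
    """Run every test for a change; deploy only if all of them pass."""
    results = {name: test() for name, test in tests.items()}
    for name, passed in results.items():
        # The instant feedback the developer sees for each test.
        print(f"{name}: {'pass' if passed else 'fail'}")
    if all(results.values()):
        deploy(change_id)
        return True
    return False


# Hypothetical stand-in for production: just record what was launched.
launched: List[str] = []

good = deploy_if_green(
    "change-101",
    {"unit": lambda: True, "integration": lambda: True},
    launched.append,
)
bad = deploy_if_green(
    "change-102",
    {"unit": lambda: True, "integration": lambda: False},
    launched.append,
)
print(launched)  # only the change whose tests all passed
```

A real pipeline would run tests in isolated workers and handle crashes and timeouts. The point here is only the shape of the feedback loop: change in, results out, and a launch when everything is green.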

I think that's pretty cool.

Posted by Tom Limoncelli in DevOps

Someone recently asked me if it was reasonable to expect their RelEng person to also be responsible for the load balancing infrastructure and the locally-run virtualization system they have.

Sure! Why not! Why not have them also be the product manager, CEO, and the company cafeteria's chief cook?

There's something called "division of labor" and you have to draw the line somewhere. Personally I find that line usually gets drawn around skill-set.

Sarcasm aside, without knowing the person or the company, I'd have to say no. RelEng and Infrastructure Eng are two different roles.

Here's my longer answer.

A release engineer is concerned with building a "release". A release is the end result of source code, compiled and put into packages, and tested. Many packages are built. Some fail tests and do not become "release candidates". Of the candidates, some are "approved" for production.
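One way to picture that life cycle is as a small state machine. The sketch below is only an illustration under assumed names (`Stage`, `Package`, `record_test_result`, `approve`); it is not how any particular release tool models packages.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    BUILT = "built"
    REJECTED = "failed tests"
    CANDIDATE = "release candidate"
    APPROVED = "approved for production"


@dataclass
class Package:
    version: str
    stage: Stage = Stage.BUILT


def record_test_result(pkg: Package, passed: bool) -> None:
    """A freshly built package becomes a candidate only if its tests pass."""
    if pkg.stage is not Stage.BUILT:
        raise ValueError(f"{pkg.version} has already been tested")
    pkg.stage = Stage.CANDIDATE if passed else Stage.REJECTED


def approve(pkg: Package) -> None:
    """Only release candidates can be approved for production."""
    if pkg.stage is not Stage.CANDIDATE:
        raise ValueError(f"{pkg.version} is not a release candidate")
    pkg.stage = Stage.APPROVED


# Hypothetical versions: many packages are built, fewer survive each step.
packages = [Package("1.0.1"), Package("1.0.2"), Package("1.0.3")]
record_test_result(packages[0], passed=False)
record_test_result(packages[1], passed=True)
record_test_result(packages[2], passed=True)
approve(packages[2])

for pkg in packages:
    print(pkg.version, "->", pkg.stage.value)
```

Refusing to approve anything that is not a candidate is the useful property: a package that failed its tests can never reach production by accident.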

Sub-question A: Should RelEng include pushing into production?

In some environments the RelEng pushes the approved packages into production. In other environments that's the sysadmin's job. Both can work, but IMHO sysadmins should build the production environment because they have the right expertise. Depending on the company size and shape, I can be convinced either way but in general I think RelEng shouldn't have that responsibility. On the other hand, if you have Continuous Deployment set up, then the RelEng person should absolutely be involved or own that aspect of the process.

Sub-question B: Should RelEng build the production infrastructure?

RelEng people are now expected to build AWS and Docker images, and therefore are struggling to learn things that sysadmins used to have a monopoly on. However you still need sysadmins to create the infrastructure under Docker or whatever virtual environment you are using.

Longer version: Traditionally sysadmins build the infrastructure that the service runs on. They know all the magic related to storage SANs, Cisco switches, firewalls, RAM/CPU specs for machines, OS configuration and so on. However this is changing. All of those things are now virtual: storage is virtual (SANs), machines are virtual (VMs), and now networks are too (SDN). So, you can now describe the infrastructure in code. The puppet/cfengine/whatever configs are versioned just like all other software. Thus, should they be the domain of RelEng or sysadmins?

I think it is pretty reasonable to expect RelEng people to be responsible for building Docker images (possibly with some help from sysadmins) and AWS images (possibly with a lot of help from sysadmins).

But what about the infrastructure under Docker/VMware/etc? It should also be "described in code" and therefore be kept under revision control, driven by Jenkins/TeamCity/whatever, and so on. I think some RelEng people can do that job, but it is a lot of work and highly specialized, so the need for a "division of labor" outweighs whether or not a RelEng person has those skills. In general I'd have separate people doing that kind of work.

What do we do at StackExchange? Well, our build and test process is totally automated. Our process for pushing new releases into production is totally automated too, but requires a human to trigger it (possibly something we'll eliminate some day). So, the only RelEng we need a person for is to maintain the system and add occasional new features. Therefore, that role is done by Devs but the SREs can back-fill. The infrastructure itself is designed and run by SREs. So, basically the division of labor described above.

Obviously "your mileage may vary". If you are entirely running out of AWS or GCE you might not have any infrastructure of your own.


Posted by Tom Limoncelli in DevOps

  1. People that complain that the enterprise world doesn't get DevOps but don't participate in enterprise conferences.
  2. The lack of a "sound bite" definition of DevOps, which leads to confusion. I was recently told "DevOps means developers have to carry pagers... that's why our developers don't want anything to do with it." If that's the definition that is getting out, we're in trouble.
  3. Engineers thinking that "if something is good, it doesn't need marketing". Tell that to the many inventions of Nikola Tesla that never got turned into products. The "build a better mouse trap and people will beat a path to your door" myth was debunked years ago.

So... what are you doing to change this?

Posted by Tom Limoncelli in DevOps

Teams working through The Three Ways need an unbiased way to judge their progress. That is, how do you know "Are we there yet?"

Like any journey there are milestones. I call these "look-for's". As in, these are the things to "look for" to help you determine how a group is proceeding on their journey.

Since there are 3 "ways" one would expect there to be 4 milestones, the "starting off point" plus a milestone marking the completion of each "Way". I add an additional milestone part way through The First Way. There is an obvious sequence point in the middle of The First Way where a team goes from total chaos to managed chaos.

The Milestones

DevOps Assessment Levels: Crayola Maraschino, Tangerine, Lemon, Aqua and Spring.

Posted by Tom Limoncelli in Best of Blog, DevOps

There is a DevOps-related talk in every hour of this year's Usenix LISA conference. Usenix LISA is a general conference with many tracks going on at any time. A little analysis finds there is always at least one DevOps-related talk (usually more than one). This is very impressive. The problem, however, is that many of the talk titles don't make this clear. No worries, I've done the research for you.

[I apologize in advance for any typos or errors. Please report any problems in the comments. The conference website has the latest information. Other lists of presentations: Programming, Unix/Linux administration technical skills, Cloud Computing, and Women at Usenix LISA.]

Posted by Tom Limoncelli in Conferences, DevOps, LISA

I wonder if he knew how much influence he had on DevOps culture. The Three Ways of DevOps are essentially The Toyota Way applied to system administration.

Eiji Toyoda, Promoter of the Toyota Way and Engineer of Its Growth, Dies at 100

Posted by Tom Limoncelli in DevOps

As a system administrator you hate to see it happen:

A user has a problem. They don't report it to you (enter a bug report, file a ticket). They whine to their friends, or suffer in silence. Months later you find out and ask, "Why didn't you file a ticket? I could have fixed it!" They either didn't have time, didn't feel it would do any good, or whatever.

Annoying right?

What's 100 times more annoying? When sysadmins do it to each other.

I've seen it many times. Walking through a process (say... setting up a new machine) and some of the steps require... umm... "interesting work-arounds". I ask "is there a bug filed about that?" and am told, "No, they know it's a problem".

Oh do they?

Things can't be fixed unless someone knows it is broken. Assuming that someone knows about a problem is assuming that everyone else has ESP. Don't expect your coworkers to be mind-readers.

In a recent case the person told me "they know about it" but it was a task that he was responsible for doing. The others wouldn't possibly know about this problem unless he went on vacation and the task happened to be needed. That wasn't likely.

When we use a service we often know the "client" end of things much better than the people responsible for the service itself. Don't assume they use the service they provide as much as you do!

I would rather have a duplicate bug filed than no bug filed at all.

These problems, inconveniences, inefficiencies, "issues", "could-be-betters" and "should-be-betters" need to be recorded somewhere so they can be prioritized, sorted, and worked on. Whether the list is kept in a bugtracking system, helpdesk "ticket" system, wiki page, or spreadsheet: it has to be anywhere other than your brain.

In Gene Kim's "Three Ways" the 2nd way is the "Feedback" way. Amplify issues, don't hide them! If there is a process problem you are working around, file a bug report. If there is a crash make sure the programmer gets woken up at 3am so he or she feels the pain too. If a process has a "long version" that is only needed "occasionally" then publish how many times each month the "long path" is needed so that people are aware just how occasional "occasionally" is.

Speaking of The Three Ways, Tim Hunter has re-interpreted them in a much more comprehensible way in this blog post on IT Revolution. I highly recommend reading it.

Posted by Tom Limoncelli in DevOps

I'll be the speaker at LOPSA NYC's meeting in January. This will be a repeat of the "[email protected]" talk that I gave at LISA 2011. It was very well-received. If you missed it at LISA, this may be your last chance to see it live.

Official announcement:

(Please register so you can get into the building. The registration form is at the bottom of the blog post.)

The talk starts at 7pm. Please come early so you can get through security.

As mentioned previously, Mark Burgess, creator of CFEngine, will be speaking at the NYC DevOps MeetUp tonight:

  • When: Wednesday, May 25, 2011, 7:00 PM
  • Topic: Mark Burgess presents DevOps and The Future of Configuration Management
  • Where: New York... exact location revealed when you RSVP to the MeetUp

Posted by Tom Limoncelli in Community, DevOps

This month's NYC DevOps meetup has a special speaker: Mark Burgess, inventor of CFEngine, talking on the future of configuration management.

Wednesday, May 25, 2011, 7:00 PM

Topic: Mark Burgess presents DevOps and The Future of Configuration Management

Mark Burgess is the founder, chairman, CTO and principal author of Cfengine. He is Professor of Network and System Administration at Oslo University College and has led the way in theory and practice of automation and policy based management for 20 years. In the 1990s he underlined the importance of idempotent, autonomous desired state management ("convergence") and formalised cooperative systems in the 2000s ("promise theory"). He is the author of numerous books and papers on Network and System Administration and has won several prizes for his work.

Check out his blog here:

Check out Cfengine here:

Posted by Tom Limoncelli in Community, DevOps, NYC
