Recently in DevOps Category

As you know, I live in New Jersey. I'm excited to announce that some fine fellow New Jerseyans have started a DevOps Meetup. The first meeting will be on Monday, Aug 18, 2014 in Clifton, NJ. I'm honored to be their first speaker.

More info at their MeetUp Page:

DevOps and Automation NJ Group

Hope to see you there!

Posted by Tom Limoncelli in Community, DevOps

[This article first appeared in the SAGE-AU newsletter.]

Have you heard about the New York City Broadway show Spider-Man: Turn Off the Dark? It should have been a big success. The music was written by Bono and the Edge from U2. It was directed by Julie Taymor, who had previously created many successful shows, including The Lion King. Sadly, before it opened, the show was already making headlines due to six actors getting seriously injured and other issues.

The show opened late, but it did finally open. It ran from June 2011 to January 2014.

When the show closed, Taymor said that one of the biggest problems with bringing the show to production was that they were using new technology that was difficult to work with. Many of the scenes involved choreography that was highly dependent on pre-programmed robotics. Any change that involved the robotics required a 5-hour wait.

A 5-hour wait?

Normally directors and choreographers can attempt dozens of changes in a day of rehearsal to get a scene or dance number "just right." Imagine finding yourself in a situation where you can only attempt a new change once or twice a day.

The ability to confidently make changes at will is key to being able to innovate. Innovation means trying new things, keeping what works, and throwing away what doesn't. If you can't make changes, then you can't innovate.

Consider the opposite of innovation. We've all been at a company that resists change, or has calcified to the point where it is unable to change. Nothing can get better if you can't change policies, procedures, or technology. Since entropy means things slowly get worse over time, an organization that is unable to change is, by definition, an organization that is spiraling towards doom.

I was reminded of this recently by the Heartbleed security issue. Like most system administrators, I had to spend a lot of time upgrading the software and firmware of nearly every system in my organization. For many of us it meant discovering systems that hadn't been upgraded in so long that the implications of an upgrade were unknown. As sysadmins we wanted to protect ourselves against this security flaw, but we also had to face our own fear of change.

We need to create a world where we are able to change, or "change-able".

There are many factors that enable us to be "change-able". One factor is frequency: being able to make changes, one after the next, in rapid succession.

Software upgrades: Every 1-3 years there is a new Microsoft Windows operating system, and upgrading requires much care and planning. Systems are wiped and reloaded because we are often re-inventing the world from scratch with each upgrade. On the other hand, software that is upgraded frequently requires less testing each time because the change is less of a "quantum leap". In addition, we get better at the process because we do it often. We automate the testing and the upgrade process itself, and we design systems that are field-upgradable or hot-upgradable because we have to... otherwise these frequent upgrades would be impossible.

Procedures: Someone recently told me he doesn't document his process for doing something because it only happens once a year and by then the process has changed. Since he has to reinvent the procedure each time the best he can do is keep notes about how the process worked last time. Contrast this to a procedure that is done weekly or daily. You can probably document it well enough that, barring major changes, you can delegate the process to a more junior person.

Software releases: If you collaborate with developers who put out releases infrequently, each release contains thousands of changes. A bug could be in any of those changes. Continuous Delivery systems compile and test the software after every source code change. Any new bugs discovered are likely to be found in the very small change that was recently checked in.
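To make that concrete, here is a minimal sketch (in Python, with a hypothetical repo path and make targets) of the idea: test every commit as it lands, so a failure points at the one small change that just arrived rather than at months of accumulated work.

```python
import subprocess

def test_latest_commit(repo_dir="/srv/myapp"):
    """Build and test the most recent commit; report which change is suspect."""
    # Pull the newest change (repo path and make targets are placeholders).
    subprocess.run(["git", "pull", "--ff-only"], cwd=repo_dir, check=True)
    sha = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True
    ).stdout.strip()
    build = subprocess.run(["make", "build"], cwd=repo_dir)
    tests = subprocess.run(["make", "test"], cwd=repo_dir)
    if build.returncode == 0 and tests.returncode == 0:
        print(f"commit {sha}: test: pass")
    else:
        # The failure is almost certainly somewhere in this one small commit.
        print(f"commit {sha}: test: fail -- suspect this change")

if __name__ == "__main__":
    test_latest_commit()
```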

Another factor in being "change-able" is how difficult it is to make a change.

I've been at companies where making a DNS change required editing 5 different files on two different systems, manually running a series of tests, and then saying a prayer. I've been at others where I typed a command to insert or delete the record, and the rest just happened for me.
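For illustration, here is a rough sketch of what that "one command" experience might look like: a tiny wrapper around BIND's nsupdate that adds or deletes an A record. The server name, TTL, and lack of authentication are simplifying assumptions; a real setup would also use TSIG keys and run a verification query afterward.

```python
import subprocess
import sys

DNS_SERVER = "ns1.example.com"   # hypothetical master server
TTL = 300

def change_record(action, name, address=None):
    """action is 'add' or 'delete'; build an nsupdate script and run it."""
    if action == "add":
        update = f"update add {name} {TTL} A {address}"
    else:
        update = f"update delete {name} A"
    script = f"server {DNS_SERVER}\n{update}\nsend\n"
    # nsupdate reads its commands from stdin and applies them dynamically.
    subprocess.run(["nsupdate"], input=script, text=True, check=True)
    print(f"{action} {name}: done")

if __name__ == "__main__":
    # e.g.: dns-change add www.example.com 10.0.0.5
    change_record(sys.argv[1], sys.argv[2], sys.argv[3] if len(sys.argv) > 3 else None)
```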

When it is difficult to make a change, we make changes less often. We are tempted to avoid any action that requires that kind of change. This has a domino effect that slows and delays other projects. Or it means we make the decision to live with a bad situation rather than fix it. We settle for less.

When we make changes less frequently, we get worse at doing them. Therefore they become more risky to do. Because they are more risky, we do them even less. It becomes a downward spiral.

DevOps is, if anything, about making operations more "change-able". Everyone has their own definition of DevOps, but what they all have in common is that DevOps makes operations better able to change: change more frequently and change more easily. The result is confidence in our ability to make changes. In that way, confidence is a precondition to being able to innovate.

Which brings us back to Spider-Man: Turn Off the Dark. How much innovation could really happen if each change took 5 hours? Imagine paying a hundred dancers, actors, and technicians to do nothing for 5 hours waiting for the next iteration. You can't send them home. You can't tell them "come in for a few minutes every 5 hours". You would, instead, avoid change and settle for less. You would settle for what you have instead of fixing the problems.

Would DevOps have saved Spider-Man? Would a more change-able world make me less fearful of the next Heartbleed?

Absolutely.

Posted by Tom Limoncelli in DevOps, Writing

Yulia Sheynkman and Dave Zwieback are repeating their "Awesome Postmortems" workshop on July 10.

https://ti.to/mindweather/awesome-postmortems

It's a great way to get the team--and not just ops--offsite to experience a healthier way of dealing with and learning from failure.

If you are in the NYC area, this is a great opportunity to learn how to make postmortems an integral part of how you improve reliability and prevent future outages.

When we wrote our "how to do postmortems" section of the upcoming The Practice of Cloud System Administration, we asked Dave for advice because we respect his expertise. Now you can get a full day of training directly from Yulia and Dave!

(full description below the fold)

You've probably seen experiments where a mouse gets cheese as a reward for pulling a lever. If the mouse receives the cheese right away, its brain associates work (pulling the lever) with reward (the cheese), and that motivates the mouse. It wants to do more work. It improves job satisfaction.

If the mouse received the cheese a month later, the brain wouldn't associate the work with the reward. A year later? Fuggedaboutit!

Now imagine you are a software developer, operations engineer, or system administrator working on a software project. The software is released every 6 months. The hard work you do gets a reward every 6 months. Your brain isn't going to associate the two.

Now imagine monthly or weekly releases. The interval between work and reward is shorter. The association is stronger. Motivation and job satisfaction go up.

Now imagine using a continuous build/test system. You see the results of your work in the form of "test: pass" or "test: fail". Instant gratification.

Now imagine using a continuous deploy system. Every change triggers a battery of tests which, if they pass, result in the software being launched into production. The interval is reduced to hours, possibly minutes.
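As a sketch of that gate, assuming hypothetical make targets and a placeholder push_release.sh script, the whole loop can be as small as: run the battery of tests, and only if every test passes does the new build go out.

```python
import subprocess
import sys

def run_tests():
    """Return True only if the whole battery of tests passes."""
    return subprocess.run(["make", "test"]).returncode == 0

def deploy_to_production(version):
    # Stand-in for whatever actually ships the build (packages, rsync, etc.).
    subprocess.run(["./push_release.sh", version], check=True)

def main(version):
    if run_tests():
        deploy_to_production(version)
        print(f"{version}: tests passed, deployed")   # reward arrives within minutes
    else:
        print(f"{version}: tests failed, deploy blocked")
        sys.exit(1)

if __name__ == "__main__":
    main(sys.argv[1])
```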

I think that's pretty cool.

Posted by Tom Limoncelli in DevOps

Someone recently asked me if it was reasonable to expect their RelEng person to also be responsible for the load balancing infrastructure and the locally-run virtualization system they have.

Sure! Why not! Why not have them also be the product manager, CEO, and the company cafeteria's chief cook?

There's something called "division of labor" and you have to draw the line somewhere. Personally I find that line usually gets drawn around skill-set.

Sarcasm aside, without knowing the person or the company, I'd have to say no. RelEng and Infrastructure Eng are two different roles.

Here's my longer answer.

A release engineer is concerned with building a "release". A release is the end result of source code, compiled and put into packages, and tested. Many packages are built. Some fail tests and do not become "release candidates". Of the candidates, some are "approved" for production.

Sub-question A: Should RelEng include pushing into production?

In some environments the RelEng person pushes the approved packages into production. In other environments that's the sysadmin's job. Both can work, but IMHO sysadmins should build the production environment because they have the right expertise. Depending on the company's size and shape I can be convinced either way, but in general I think RelEng shouldn't have that responsibility. On the other hand, if you have Continuous Deployment set up, then the RelEng person should absolutely be involved in, or own, that aspect of the process.

Sub-question B: Should RelEng build the production infrastructure?

RelEng people are now expected to build AWS and Docker images, and are therefore struggling to learn things that sysadmins used to have a monopoly on. However, you still need sysadmins to create the infrastructure under Docker or whatever virtual environment you are using.

Longer version: Traditionally, sysadmins build the infrastructure that the service runs on. They know all the magic related to storage SANs, Cisco switches, firewalls, RAM/CPU specs for machines, OS configuration, and so on. However, this is changing. All of those things are now virtual: storage is virtual (SANs), machines are virtual (VMs), and now networks are too (SDN). So, you can now describe the infrastructure in code. The puppet/cfengine/whatever configs are versioned just like all other software. Thus, should they be the domain of RelEng or sysadmins?

I think it is pretty reasonable to expect RelEng people to be responsible for building Docker images (possibly with some help from sysadmins) and AWS images (possibly with a lot of help from sysadmins).

But what about the infrastructure under Docker/VMware/etc.? It should also be "described in code" and therefore be kept under revision control, driven by Jenkins/TeamCity/whatever, and so on. I think some RelEng people can do that job, but it is a lot of work and highly specialized, so the need for a "division of labor" outweighs whether or not a RelEng person has those skills. In general I'd have separate people doing that kind of work.

What do we do at StackExchange? Well, our build and test process is totally automated. Our process for pushing new releases into production is totally automated too, but it requires a human to trigger it (possibly something we'll eliminate some day). So the only RelEng work we need a person for is maintaining the system and adding occasional new features. Therefore, that role is done by Devs, but the SREs can back-fill. The infrastructure itself is designed and run by SREs. So, basically, the division of labor described above.

Obviously "your mileage may vary". If you are running entirely out of AWS or GCE you might not have any infrastructure of your own.

Tom

Posted by Tom Limoncelli in DevOps

  1. People who complain that the enterprise world doesn't get DevOps but don't participate in enterprise conferences.
  2. Lack of a "sound bite" definition of DevOps, which leads to confusion. I was recently told "DevOps means developers have to carry pagers... that's why our developers don't want anything to do with it." If that's the definition that is getting out, we're in trouble.
  3. Engineers thinking that "if something is good, it doesn't need marketing". Tell that to the many inventions of Nikola Tesla that never got turned into products. The "build a better mousetrap and people will beat a path to your door" myth was debunked years ago.

So... what are you doing to change this?

Posted by Tom Limoncelli in DevOps

Teams working through The Three Ways need an unbiased way to judge their progress. That is, how do you know "Are we there yet?"

Like any journey there are milestones. I call these "look-for's". As in, these are the things to "look for" to help you determine how a group is proceeding on their journey.

Since there are 3 "ways", one would expect there to be 4 milestones: the "starting off point" plus a milestone marking the completion of each "Way". I add an additional milestone partway through The First Way. There is an obvious sequence point in the middle of The First Way where a team goes from total chaos to managed chaos.

The Milestones

DevOps Assessment Levels: Crayola Maraschino, Tangerine, Lemon, Aqua and Spring.

Posted by Tom Limoncelli in Best of Blog, DevOps

There is a DevOps-related talk in every hour of this year's Usenix LISA conference. Usenix LISA is a general conference with many tracks going on at any time. A little analysis finds that there is always at least one DevOps-related talk (usually more than one). This is very impressive. The problem, however, is that many of the talk titles don't make this clear. No worries; I've done the research for you.

[I apologize in advance for any typos or errors. Please report any problems in the comments. The conference website has the latest information. Other lists of presentations: Programming, Unix/Linux administration technical skills, Cloud Computing, and Women at Usenix LISA.]

Posted by Tom Limoncelli in Conferences, DevOps, LISA

I wonder if he knew how much influence he had on DevOps culture. The Three Ways of DevOps are essentially The Toyota Way applied to system administration.

Eiji Toyoda, Promoter of the Toyota Way and Engineer of Its Growth, Dies at 100

Posted by Tom Limoncelli in DevOps

As a system administrator you hate to see it happen:

A user has a problem. They don't report it to you (enter a bug report, file a ticket). They whine to their friends, or suffer in silence. Months later you find out and ask, "Why didn't you file a ticket? I could have fixed it!" They either didn't have time, didn't feel it would do any good, or whatever.

Annoying right?

What's 100 times more annoying? When sysadmins do it to each other.

I've seen it many times. I'm walking through a process (say... setting up a new machine) and some of the steps require... umm... "interesting work-arounds". I ask, "Is there a bug filed about that?" and am told, "No, they know it's a problem."

Oh do they?

Things can't be fixed unless someone knows it is broken. Assuming that someone knows about a problem is assuming that everyone else has ESP. Don't expect your coworkers to be mind-readers.

In a recent case the person told me "they know about it" but it was a task that he was responsible for doing. The others wouldn't possibly know about this problem unless he went on vacation and the task happened to be needed. That wasn't likely.

When we use a service we often know the "client" end of things much better than the people responsible for the service itself. Don't assume they use the service they provide as much as you do!

I would rather have a duplicate bug filed than no bug filed at all.

These problems, inconveniences, inefficiencies, "issues", "could-be-betters" and "should-be-betters" need to be recorded somewhere so they can be prioritized, sorted, and worked on. Whether the list is kept in a bug-tracking system, helpdesk "ticket" system, wiki page, or spreadsheet: it has to be somewhere other than your brain.

In Gene Kim's "Three Ways", the Second Way is the "Feedback" way. Amplify issues, don't hide them! If there is a process problem you are working around, file a bug report. If there is a crash, make sure the programmer gets woken up at 3am so he or she feels the pain too. If a process has a "long version" that is only needed "occasionally", then publish how many times each month the "long path" is needed so that people are aware just how occasional "occasionally" is.

Speaking of The Three Ways, Tim Hunter has re-interpreted them in a much more comprehensible way in this blog post on IT Revolution. I highly recommend reading it.

Posted by Tom Limoncelli in DevOps

I'll be the speaker at LOPSA NYC's meeting in January. This will be a repeat of the "DevOps@Google" talk that I gave at LISA 2011. It was very well-received. If you missed it at LISA, this may be your last chance to see it live.

Official announcement: http://www.lopsa-nyc.org/content/google-sre-when-developers-and-sysadmins-collaborate-things-get-better-faster

(Please register so you can get into the building. The registration form is at the bottom of the blog post.)

The talk starts at 7pm. Please come early so you can get through security.

As mentioned previously, Mark Burgess, creator of CFEngine, will be speaking at the NYC DevOps MeetUp tonight:

http://www.meetup.com/nycdevops/events/17211427/

  • When: Wednesday, May 25, 2011, 7:00 PM
  • Topic: Mark Burgess presents DevOps and The Future of Configuration Management
  • Where: New York... exact location revealed when you RSVP to the MeetUp

Posted by Tom Limoncelli in Community, DevOps

This month's NYC DevOps meetup has a special speaker: Mark Burgess, inventor of CFEngine, talking on the future of configuration management.

http://www.meetup.com/nycdevops/events/17211427/

Wednesday, May 25, 2011, 7:00 PM

Topic: Mark Burgess presents DevOps and The Future of Configuration Management

Mark Burgess is the founder, chairman, CTO and principal author of Cfengine. He is Professor of Network and System Administration at Oslo University College and has led the way in theory and practice of automation and policy based management for 20 years. In the 1990s he underlined the importance of idempotent, autonomous desired state management ("convergence") and formalised cooperative systems in the 2000s ("promise theory"). He is the author of numerous books and papers on Network and System Administration and has won several prizes for his work.

Check out his blog here: http://www.iu.hio.no/~mark/index.html

Check out Cfengine here: http://www.cfengine.org/

Posted by Tom Limoncelli in Community, DevOps, NYC