There are two things you can do if you want to understand the future of system administration.
First, if you want to see what DevOps will be like 5-10 years out, you can read the amazing new book, Site Reliability Engineering: How Google Runs Production Systems. I read a preview copy and it was excellent. Many different Google SRE teams got together to produce a very well-rounded book that covers all aspects of Google's SRE program, which is easily 5-10 years ahead of the industry. (Pre-order from O'Reilly, or from Amazon in Kindle or paper editions.) Congrats to the editors Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy on a great addition to the IT canon.
Second, if you want to see what SRE will look like in 30-50 years, you should watch the 2009 movie "Moon" starring Sam Rockwell. Anything I say would be a spoiler, so you'll just have to trust me. (BTW, the trailer is full of spoilers. Don't watch it!)
Congrats to all of Google SRE on the publication of the new O'Reilly book! I predict it will be a big hit! (and thanks for letting me blurb the back cover!)
[P.S. I apologize in advance for the very link-bait'y title. What I mean to say is that Google SRE is devops at an incredible scale.]
Whether or not you are in a DevOps environment, please take this survey. The data is useful for helping improve the situation for system administrators of all kinds. http://itrevolution.com/2016devopssurvey/
Wait... you didn't know there are songs about DevOps? Hell to the yeah!
Best DevOps Song of 2015: Uptown Funk (Mark Ronson ft. Bruno Mars)
Uptown Funk is exemplary of good DevOps operations: It encourages being evidence-driven.
An important principle of DevOps is that you should base decisions on evidence and data, not lore and intuition. Intuition is great but only gets you so far. With a tiny system it is possible for a single sysadmin to know enough about it to make good guesses. However, modern systems are complex enough that we must collect data, analyze it, and base decisions on that data.
This means we must also be willing to revert a change if the data doesn't pan out as we predict. That's the scientific method. We measure something, we do an experiment, and we measure again. Then we decide whether we keep the change based on the data. Of course, this requires that our systems are observable, which means the days of un-monitorable systems are long gone.
A more scientific way of saying this is DevOps insists that we prove our assertions by gathering empirical evidence.
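The measure/experiment/measure loop can be sketched in a few lines of code. This is a minimal illustration, not a real experimentation framework; the metric (median latency) and the 5% improvement threshold are assumptions chosen for the example.

```python
import statistics

def keep_change(before_ms, after_ms, min_improvement=0.05):
    """Decide whether to keep a change based on measured latency samples.

    before_ms / after_ms: request latencies (milliseconds) collected
    before and after the change. Keep the change only if the median
    improves by at least min_improvement (5% by default); otherwise
    revert.
    """
    baseline = statistics.median(before_ms)
    candidate = statistics.median(after_ms)
    return candidate <= baseline * (1 - min_improvement)

# Example: the change cut median latency from ~199ms to ~150ms, so keep it.
before = [195, 201, 198, 205, 199]
after = [148, 152, 150, 155, 149]
print(keep_change(before, after))  # -> True
```

The important part is not the arithmetic but the discipline: the decision to keep or revert is made by the data, not by whoever argues loudest.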
So, why is Uptown Funk exemplary of this DevOps principle? Well... duh! The point of this song can be summed up by this line:
Cause Uptown Funk gon' give it to you
Now let me tell you something. A lot of people don't believe that the uptown funk gon' give it to you. I understand. It might be difficult to believe if you have not experienced the uptown funk. I don't mean to brag, but I happen to have a lot of uptown funk-related experience. However, you shouldn't believe me just because I have so much experience.
Likewise, Uptown Funk's author Mark Ronson doesn't insist that you simply believe him. He assures you that if you use scientific principles, you will discover on your own just how right he is. In particular, he says:
Don't believe me, just watch
See? Total devops.
Sure, he could have said, "empirical evidence proves my assertion" but he states it more lyrically.
Let me quote the entire chorus to be clear:
Girls hit your hallelujah (whoo)
Girls hit your hallelujah (whoo)
Girls hit your hallelujah (whoo)
'Cause uptown funk gon' give it to you
'Cause uptown funk gon' give it to you
'Cause uptown funk gon' give it to you
Saturday night and we in the spot
Don't believe me just watch (come on)
Mark feels so strongly about being evidence-driven that he implores you to do so 3 times in a row!
Here's the music video:
Worst DevOps Song of 2015: Shake It Off (Taylor Swift)
Technically Shake It Off was released in 2014 but it won the 2015 Grammy and People's Choice Awards so I count it as 2015 too.
To be clear, I love the song from a music and dance perspective. In fact, I own a copy of the CD that this song appears on. By the way, the CD is titled "1989" which refers to the year she was born. [For those of you that were born in 1989 and are reading this, a CD is a way that people used to distribute music. Ask your parents.]
That said, I don't think it exemplifies good DevOps practice.
The main message of the song is simple: ignore the negativity in your life.
Or, as she says it:
'Cause the players gonna play, play, play, play, play
And the haters gonna hate, hate, hate, hate, hate
Baby, I'm just gonna shake, shake, shake, shake, shake
I shake it off, I shake it off
Now if TayTay had a better DevOps background it would be more like this:
Teams should be encouraged to openly discuss disagreements.
File bugs. Don't suffer in silence. Amplify feedback.
Blameless postmortems for major outages.
Making that rhyme is left as an exercise for the reader.
Openly discussing disagreements is an important part of breaking down silos. When I was at Google I held periodic team-to-team meetings which would be open forums to discuss a process that affected both teams. Both sides would list out the steps in the process and point out the "pain points" and annoyances about the process. Many bugs and feature requests would be filed during these meetings and often bugs would be fixed in real time. Engineers would have their laptops open and would fix minor issues during the meeting.
Of course we need a way to deal with larger issues too. If enough haters hate, hate, hate, we should perform a postmortem. 2015 brought us a great book on the topic, Beyond Blame by Dave Zwieback.
Sorry, T-Sway, I can't just shake it off. We need to get to the bottom of this by finding the contributing factors.
Here's the music video:
I hope you agree that we should be more evidence driven, that the uptown funk gon' give it to you, and haters should file bugs or otherwise open more productive channels of communication.
I hope to bring more DevOps music reviews to you in 2016.
Feel free to post your DevOps-related songs in the comments!
This month's NYCDevOps meeting (hosted at the StackOverflow.com HQ) has special guest speakers Bridget Kromhout and Casey West talking about running Docker images in Cloud Foundry's Elastic Runtime and orchestrating containerized workloads on Lattice.
Date: Tuesday, December 15, 2015
Time: 6:30 PM
Place: The Stack Overflow HQ (near Wall St.)
You must RSVP and bring an ID to get into the building.
Pearson / InformIT.com is running a promotion through August 11th on many open-source related books, including Volume 2, The Practice of Cloud System Administration. Use discount code OPEN2015 during checkout and receive 35% off any one book, or 45% off 2 or more books.
Someone on Quora recently asked, Why did Google include the 'undo send' feature on Gmail? They felt that adding the 30-second delay to email delivery was inefficient. However, rather than answering the direct question, I explained the deeper issue. My (slightly edited) answer is below. NOTE: While I previously worked at Google, I was never part of the Gmail team, nor do I even know any of their developers or the product manager(s). What I wrote here is true for any software company.
Why did Google include this feature? Because the "Gmail Labs" system permits developers to override the decisions of product managers. This is what makes the "Labs" system so brilliant.
A product manager has to decide which features to implement and which not to. This is very difficult. Each new feature takes time to design (how will it work from the user perspective), architect (how will the internals work), implement (write the code that makes it all happen), and support (documentation, and so on). There are only so many hours in the day, and only so many developers assigned to Gmail. The product manager has to say "no" to a lot of good ideas.
If you were the product manager, would you select features that could attract millions of new users, or features that help a few existing users have a slightly nicer day? Obviously you'll select the first category. IMHO Google is typically concerned with growth, not retention. New users are more valuable than slight improvements that will help a few existing users. Many of these minor features are called "fit and finish"... little things that help make the product sparkle, but aren't things you can put in an advertisement because they have benefits that are intangible or would only be understood by a few. Many of the best features can't be appreciated or understood until they are available for use. When they are "on paper", it is difficult to judge their value.
Another reason a product manager may reject a proposed feature is politics. Maybe the idea came from someone that the product manager doesn't like, or doesn't trust. (possibly for good reason)
The "Labs" framework of Google products lets developers add features that have been rejected by the product manager. Google engineers can, in their own spare time or in the "20% time" they are allocated, implement features that the product manager hasn't approved. "Yes, Mr Product Manager, I understand that feature x-y-z seems stupid to you, but the few people that want it would love it, so I'm going to implement it anyway and don't worry, it won't be an official feature."
The Third Way of DevOps is about creating a culture that fosters two things: continual experimentation (taking risks and learning from failure) and understanding that repetition and practice are the prerequisite to mastery. Before the Labs framework, adding any experimental feature had a huge overhead. Now most of the overhead is factored out so that there is a lower bar to experimenting. Labs-like frameworks should be added to any software product where one wants to improve their Third Way culture.
Chapter 2 of The Practice of Cloud System Administration talks about many different software features that developers should consider to assure that the system can be efficiently managed. Having a "Labs" framework enables features to be added and removed with less operational hassle because it keeps experiments isolated and easy to switch off if they cause an unexpected problem. It is much easier to temporarily disable a feature that is advertised as experimental.
What makes the "Labs" framework brilliant is that it not only gives a safe framework for experimental features to be added, but it gathers usage statistics automatically. If the feature becomes widely adopted, the developer can present hard cold data to the product manager that says the feature should be promoted to become an official feature.
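The core of such a framework is small: a per-user opt-in flag plus a usage counter. Here is a toy sketch; the class and method names are hypothetical, and real systems (Gmail Labs included) are far more involved. The point is that enabling and using a feature both leave a data trail that can later justify promoting it.

```python
from collections import Counter

class Labs:
    """Toy sketch of a Labs-style experimental-feature registry."""

    def __init__(self):
        self.enabled = {}       # user -> set of opted-in features
        self.usage = Counter()  # feature -> times actually used

    def enable(self, user, feature):
        self.enabled.setdefault(user, set()).add(feature)

    def use(self, user, feature):
        """Record a use; users who never opted in get default behavior."""
        if feature in self.enabled.get(user, set()):
            self.usage[feature] += 1
            return True
        return False

    def adoption_report(self):
        """The hard, cold data to bring to the product manager."""
        return dict(self.usage)

labs = Labs()
labs.enable("alice", "undo-send")
labs.use("alice", "undo-send")
labs.use("bob", "undo-send")    # bob never opted in; not counted
print(labs.adoption_report())   # -> {'undo-send': 1}
```

Because every experiment funnels through the same registry, disabling a misbehaving feature or pulling adoption numbers requires no per-feature plumbing.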
Of course, the usage statistics might also show that the feature isn't well-received and prove the product manager correct.
A better way of looking at it is that the "labs" feature provides a way to democratize the feature selection process and provides a data-driven way to determine which features should be promoted to a more "official" status. The data eliminates politically-driven decision making and the "I'm right because my business card lists an important title" style of business as usual. This is one of the ways that Google's management is so brilliant.
I apologize for explaining this as an "us vs. them" paradigm i.e. as if the product managers and developers are at odds with each other. However, the labs feature wouldn't be needed if there wasn't some friction between the two groups. In a perfect world there would be infinite time to implement every feature requested, but we don't live in that world. (Or maybe the "Labs" feature was invented by a brilliant product manager that hated to say "no" and wanted to add an 'escape hatch' that encouraged developers to experiment. I don't know, but I'm pessimistic and believe that Labs started as an appeasement.)
So, in summary: Why did Google include the 'undo send' feature on Gmail? Because someone thought it was important, took the time to implement it under the "labs" framework, users loved the feature, and product management promoted it to be an official Gmail feature.
I wish more products had a "labs" system. The only way it could be better is if non-Googlers had a way to add features under the "labs" system too.
Considering that 248 days is about 2^31 hundredths of a second, it is pretty reasonable to assume there is a timer with 1/100-second resolution held in a signed 32-bit int. It would overflow every 248 days.
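The back-of-the-envelope arithmetic is easy to verify:

```python
# A signed 32-bit counter of hundredths of a second overflows
# after roughly 248 days.
ticks = 2**31         # signed 32-bit counter wraps at 2^31 ticks
tick_seconds = 0.01   # 1/100 second per tick
days = ticks * tick_seconds / 86400
print(round(days, 1))  # -> 248.6
```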
"Hell yeah, I did it! I saved 4 bytes every time we store a timestamp. Screw you. It's awesome.
Sincerely, a software engineer who makes planes but doesn't have to operate them."
Reminds me of all the commercial software I've seen that was written by developers who didn't seem to care about, or were ignorant of, the operational realities their customers live with.
Last week at DevOpsDays NYC 2015 I was reminded time and time again that the most important part of DevOps is shared responsibility: the opposite of workers organized in silos of responsibility, ignorant of and unempathetic toward the other silos.
A lot of the data used to create the report comes from the annual survey done by Puppet Labs. I encourage everyone to take 15 minutes to complete this survey. It is important that your voice and experience is represented in next year's report. Take the survey
But I'm not important enough!
Yes you are. If you think "I'm not DevOps enough" or "I'm not important enough" then it is even more important that you fill out the survey. The survey needs data from sites that are not "DevOps" (whatever that means!) to create the basis of comparison.
As you know, I live in New Jersey. I'm excited to announce that some fine fellow New Jerseyans have started a DevOps Meetup. The first meeting will be on Monday, Aug 18, 2014 in Clifton, NJ. I'm honored to be their first speaker.
Have you heard about the New York City Broadway show Spider-Man: Turn Off the Dark? It should have been a big success. The music was written by Bono and the Edge from U2. It was directed by Julie Taymor, who had previously created many successful shows including The Lion King. Sadly, before it opened, the show was already making headlines due to six actors getting seriously injured and other issues.
The show opened late, but it did finally open. It ran from June 2011 to January 2014.
When the show closed Taymor said that one of the biggest problems with bringing the show to production was that they were using new technology that was difficult to work with. Many of the scenes involved choreography that was highly dependent on pre-programmed robotics. Any changes that involved the robotics required a 5 hour wait.
A 5 hour wait?
Normally directors and choreographers can attempt dozens of changes in a day of rehearsal to get a scene or dance number "just right." Imagine finding yourself in a situation where you can only attempt a new change once or twice a day.
The ability to confidently make changes at will is key to being able to innovate. Innovation means trying new things and keeping what works, throwing away what doesn't. If you can't make changes, then you can't innovate.
Consider the opposite of innovation. We've all been at a company that resists change or has calcified to the point where they are unable to make change. Nothing can get better if you can't change policies, procedures, or technology. Since entropy means things slowly get worse over time, an organization that is unable to change, by definition, is an organization that is spiraling towards doom.
I'm reminded of this recently due to the Heartbleed security issue. Like most system administrators, I had to spend a lot of time upgrading the software and firmware of nearly every system in my organization. For many of us it meant discovering systems that hadn't been upgraded in so long that the implications were unknown. As sysadmins we wanted to protect ourselves against this security flaw, but we also had to face our own fear of change.
We need to create a world where we are able to change, or "change-able".
There are many factors that enable us to be "change-able". One factor is frequency: we can make change, one after the next, in rapid succession.
Software upgrades: Every 1-3 years there is a new Microsoft Windows operating system and upgrading requires much care and planning. Systems are wiped and reloaded because we are often re-inventing the world from scratch with each upgrade. On the other hand, software that is upgraded frequently requires less testing each time because the change is less of a "quantum leap". In addition, we get better at the process because we do it often. We automate the testing, the upgrade process itself, we design systems that are field-upgradable or hot-upgradable because we have to... otherwise these frequent upgrades would be impossible.
Procedures: Someone recently told me he doesn't document his process for doing something because it only happens once a year and by then the process has changed. Since he has to reinvent the procedure each time the best he can do is keep notes about how the process worked last time. Contrast this to a procedure that is done weekly or daily. You can probably document it well enough that, barring major changes, you can delegate the process to a more junior person.
Software releases: If you collaborate with developers who put out releases infrequently, each release contains thousands of changes. A bug could be in any of those changes. Continuous Delivery systems compile and test the software after every source code change. Any new bugs discovered are likely to be found in the very small change that was recently checked in.
Another factor in being "change-able" is how difficult it is to make a change.
I've been at companies where making a DNS change required editing 5 different files on two different systems, manually running a series of tests, and then saying a prayer. I've been at others where you typed one command to insert or delete the record, and the rest just happened automatically.
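The "one command" approach usually means a thin wrapper that generates input for a dynamic-update tool such as BIND's nsupdate. Here is a hedged sketch; the helper function, server name, and record details are all hypothetical illustrations, not a real tool's interface.

```python
def nsupdate_script(action, name, value=None, rtype="A", ttl=3600,
                    server="ns1.example.com"):
    """Generate input for BIND's `nsupdate` dynamic-update tool.

    Hypothetical helper: the output would be piped into
    `nsupdate -k /path/to/key`, so one command replaces hand-editing
    zone files on multiple systems.
    """
    assert action in ("add", "delete")
    lines = [f"server {server}"]
    if action == "add":
        lines.append(f"update add {name} {ttl} {rtype} {value}")
    else:
        lines.append(f"update delete {name} {rtype}")
    lines.append("send")
    return "\n".join(lines)

print(nsupdate_script("add", "host.example.com", "192.0.2.10"))
```

Once the change is a single, tested command, it can be code-reviewed, logged, and safely delegated, which is exactly what makes changes cheap enough to do often.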
When it is difficult to make a change, we make them less often. We are tempted to avoid any action that requires that kind of change. This has a domino effect that slows and delays other projects. Or, it means we make the decision to live with a bad situation rather than fix it. You settle for less.
When we make changes less frequently, we get worse at doing them. Therefore they become more risky to do. Because they are more risky, we do them even less. It becomes a downward spiral.
DevOps is, if anything, about making operations more "change-able". Everyone has their own definition of DevOps, but what they all have in common is that DevOps makes operations better able to change: change more frequently and change more easily. The result is confidence in our ability to make changes. In that way, confidence is a precondition to being able to innovate.
Which brings us back to Spider-Man Turn Off the Dark. How much innovation could really happen if each change took 5 hours? Imagine paying a hundred dancers, actors, and technicians to do nothing for 5 hours waiting for the next iteration. You can't send them home. You can't tell them "come in for a few minutes every 5 hours". You would, instead, avoid change and settle for less. You would settle for what you have instead of fixing the problems.
Would DevOps have saved Spider-Man? Would a more change-able world make me less fearful of the next Heartbleed?
It's a great way to get the team--and not just ops--offsite to experience a healthier way of dealing and learning from failure.
If you are in the NYC-area, this is a great opportunity to learn how to make postmortems an integrated part of how to improve reliability and prevent future outages.
When we wrote our "how to do postmortems" section of the upcoming The Practice of Cloud System Administration, we asked Dave for advice because we respect his expertise. Now you can get a full day of training directly from Yulia and Dave!
You've probably seen experiments where a mouse gets cheese as a reward for pulling a lever. If he or she receives the cheese right away, the brain associates work (pulling the lever) with reward (the cheese) and it motivates the mouse. They want to do more work. It improves job satisfaction.
If the mouse received the cheese a month later, the brain won't associate the work with the reward. A year later? Fuggedaboutit!
Now imagine you are a software developer, operations engineer, or system administrator working on a software project. The software is released every 6 months. The hard work you do gets a reward every 6 months. Your brain isn't going to associate the two.
Now imagine monthly or weekly releases. The interval between work and reward is improved. The association is stronger. Motivation and job satisfaction goes up.
Now imagine using a continuous build/test system. You see the results of your work in the form of "test: pass" or "test: fail". Instant gratification.
Now imagine using a continuous deploy system. Every change results in a battery of tests which, if they pass, results in the software being launched into production. The interval is reduced to hours, possibly minutes.
Someone recently asked me if it was reasonable to expect their RelEng person also be responsible for the load balancing infrastructure and the locally-run virtualization system they have.
Sure! Why not! Why not have them also be the product manager, CEO, and the company cafeteria's chief cook?
There's something called "division of labor" and you have to draw the line somewhere. Personally I find that line usually gets drawn around skill-set.
Sarcasm aside, without knowing the person or the company, I'd have to say no. RelEng and Infrastructure Eng are two different roles.
Here's my longer answer.
A release engineer is concerned with building a "release". A release is the end result of source code, compiled and put into packages, and tested. Many packages are built. Some fail tests and do not become "release candidates". Of the candidates, some are "approved" for production.
Sub-question A: Should RelEng include pushing into production?
In some environments the RelEng pushes the approved packages into production. In other environments that's the sysadmin's job. Both can work, but IMHO sysadmins should build the production environment because they have the right expertise. Depending on the company size and shape, I can be convinced either way but in general I think RelEng shouldn't have that responsibility. On the other hand, if you have Continuous Deployment set up, then the RelEng person should absolutely be involved or own that aspect of the process.
Sub-question B: Should RelEng build the production infrastructure?
RelEng people are now expected to build AWS and Docker images, and therefore are struggling to learn things that sysadmins used to have a monopoly on. However you still need sysadmins to create the infrastructure under Docker or whatever virtual environment you are using.
Longer version: Traditionally sysadmins build the infrastructure that the service runs on. They know all the magic related to storage SANs, Cisco switches, firewalls, RAM/CPU specs for machines, OS configuration and so on. However this is changing. All of those things are now virtual: storage is virtual (SANs), machines are virtual (VMs), and now networks are too (SDN). So, you can now describe the infrastructure in code. The puppet/cfengine/whatever configs are versioned just like all other software. Thus, should they be the domain of RelEng or sysadmins?
I think it is pretty reasonable to expect RelEng people to be responsible for building Docker images (possibly with some help from sysadmins) and AWS images (possibly with a lot of help from sysadmins).
But what about the infrastructure under Docker/VMware/etc? It should also be "described in code" and therefore be kept under revision control, driven by Jenkins/TeamCity/whatever, and so on. I think some RelEng people can do that job, but it is a lot of work and highly specialized, so the need for a "division of labor" outweighs whether or not a RelEng person has those skills. In general I'd have separate people doing that kind of work.
What do we do at StackExchange? Well, our build and test process is totally automated. Our process for pushing new releases into production is totally automated too, but requires a human to trigger it (possibly something we'll eliminate some day). So, the only RelEng we need a person for is to maintain the system and add occasional new features. Therefore, that role is done by Devs but the SREs can back-fill. The infrastructure itself is designed and run by SREs. So, basically the division of labor described above.
Obviously "your mileage may vary". If you are entirely running out of AWS or GCE you might not have any infrastructure of your own.
People that complain that the enterprise world doesn't get DevOps but don't participate in enterprise conferences.
Lack of a "sound bite" definition of DevOps leads to confusion. I was recently told "DevOps means developers have to carry pagers... that's why our developers don't want anything to do with it." If that's the definition that is getting out, we're in trouble.
Engineers thinking that "if something is good, it doesn't need marketing". Tell that to the many inventions of Nikola Tesla that never got turned into products. The "build a better mouse trap and people will beat a path to your door" myth was debunked years ago.
Teams working through The Three Ways need an unbiased way to judge their progress. That is, how do you know "Are we there yet?"
Like any journey there are milestones. I call these "look-for's". As in, these are the things to "look for" to help you determine how a group is proceeding on their journey.
Since there are 3 "ways" one would expect there to be 4 milestones: the "starting off point" plus a milestone marking the completion of each "Way". I add an additional milestone part way through The First Way. There is an obvious sequence point in the middle of The First Way where a team goes from total chaos to managed chaos.
There is a devops-related talk in every hour of this year's Usenix LISA conference. Usenix LISA is a general conference with many tracks going on at any time. A little analysis finds there is always at least one DevOps related talk (usually more than one). This is very impressive. The problem, however, is that many of the talk titles don't make this clear. No worries, I've done the research for you.
As a system administrator you hate to see it happen:
A user has a problem. They don't report it to you (enter a bug report, file a ticket). They whine to their friends, or suffer in silence. Months later you find out and ask, "Why didn't you file a ticket? I could have fixed it!" They either didn't have time, didn't feel it would do any good, or whatever.
What's 100 times more annoying? When sysadmins do it to each other.
I've seen it many times. Walking through a process (say... setting up a new machine) and some of the steps require... umm... "interesting work-arounds". I ask "is there a bug filed about that?" and am told, "No, they know it's a problem".
Oh do they?
Things can't be fixed unless someone knows they are broken. Assuming that someone knows about a problem is assuming that everyone else has ESP. Don't expect your coworkers to be mind-readers.
In a recent case the person told me "they know about it" but it was a task that he was responsible for doing. The others wouldn't possibly know about this problem unless he went on vacation and the task happened to be needed. That wasn't likely.
When we use a service we often know the "client" end of things much better than the people responsible for the service itself. Don't assume they use the service they provide as much as you do!
I would rather have a duplicate bug filed than no bug filed at all.
These problems, inconveniences, inefficiencies, "issues", "could-be-betters" and "should-be-betters" need to be recorded somewhere so they can be prioritized, sorted, and worked on. Whether the list is kept in a bugtracking system, helpdesk "ticket" system, wiki page, or spreadsheet: it has to be anywhere other than your brain.
In Gene Kim's "Three Ways" the 2nd way is the "Feedback" way. Amplify issues, don't hide them! If there is a process problem you are working around, file a bug report. If there is a crash make sure the programmer gets woken up at 3am so he or she feels the pain too. If a process has a "long version" that is only needed "occasionally" then publish how many times each month the "long path" is needed so that people are aware just how occasional "occasionally" is.
Speaking of The Three Ways, Tim Hunter has re-interpreted them in a much more comprehensible way in this blog post on IT Revolution. I highly recommend reading it.
I'll be the speaker at LOPSA NYC's meeting in January. This will be a repeat of the "[email protected]" talk that I gave at LISA 2011. It was very well-received. If you missed it at LISA, this may be your last chance to see it live.
Topic: Mark Burgess presents DevOps and The Future of Configuration Management
Mark Burgess is the founder, chairman, CTO and principal author of Cfengine. He is Professor of Network and System Administration at Oslo University College and has led the way in theory and practice of automation and policy based management for 20 years. In the 1990s he underlined the importance of idempotent, autonomous desired state management ("convergence") and formalised cooperative systems in the 2000s ("promise theory"). He is the author of numerous books and papers on Network and System Administration and has won several prizes for his work.