Awesome Conferences

January 2015 Archives

The Queen of Code is a 17-minute documentary about Grace Hopper. It just came out today and I assure you that if you watch it, you'll be glad you did.

On a personal note, Grace Hopper was going to be the graduation speaker when I graduated from Drew University in 1991 but she was ill. She passed away about 6 months later. I wish I could have met her.

Posted by Tom Limoncelli

Dear Tom,

I've been asked to document our company's System Integration process. Do you have any advice?

A reader.

Dear Reader,

I get this question a lot, whether it is about system integration, setting up new computers, handling customer support calls, or just about anything else. Documenting a process is an important first step toward clarifying what the process is. It is a prerequisite to improving it, automating it, or both.

My general advice is to find the process that exists and document exactly how it is done now. Only after that can you evaluate what steps work well and which need to be improved. Don't try to invent a new system to replace the existing chaos. That chaos works (for some definition of "works") and embodies a lot of knowledge about all the little things that have to happen, including a lot of "realities" that may be invisible to managers. This is similar to why it is bad to try to rewrite software systems from scratch.

Creating the document involves interviewing the people who do the tasks, taking notes, and building up a big document. If the process has branches and options, draw a diagram. Meet with people one at a time or in small groups and ask them to explain what they think the process is. Ask clarifying questions. Don't ask them what the process should be; ask what they currently do. If they start talking about improvements, write down what they say (so they feel listened to) but then get them back on the subject of what the process is, not what it should be.

If possible, you'll want to get to the point where you can do the process yourself, by following the document you wrote. The next step is to hand someone else the document and see if they can get through it without your help. If each step is done by a different team or department, you may need to get everyone in the same room and walk through the steps together.

When documenting the process (either by interviewing people or by working through the process solo), you'll find plenty of "issues":

  • Steps that are done differently depending on who does them. These need to be reconciled. Get both people in the same room and help them work it out. Or, document both routes so that management is aware.
  • Steps that are undefined. If nobody can explain what happens at a certain step but the work is getting done somehow, it is better to document that the step needs research than to leave it out of your document.
  • Steps that are ill-defined. There may be steps that, for various reasons, one has to figure out in an ad hoc manner each time. If this is a one-in-a-million edge case, that's OK. If it is in the main path, the actual steps need to be clarified. A good start is to define the end goal and come back later to work out how it actually gets done.

Each of these "problem steps" should be marked in the document as an "area for improvement" or "TODO". A good process engineer will, over time, eliminate these TODOs. Tracking how many TODOs remain will impress your management. If, for example, every Monday you add a line to a spreadsheet with the current count, eventually you can produce a graph that shows progress. It is also more professional to say "there are 40 remaining TODOs" than "OMG this project is f---ed!". Having the graph makes this data-driven: it gives management visibility into the actual amount of chaos in the project. They might not be technical, but they'll understand that 500 is worse than 100, and that "progress" looks like a decreasing number.
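Capturing that weekly count can be as small as a script run from cron. Here's a minimal sketch; the file name, CSV format, and the convention of literal "TODO" markers in the document are my own assumptions:

```python
import csv
from datetime import date

def record_todo_count(doc_text, log_path="todo_counts.csv"):
    """Count TODO markers in the process document and append
    today's tally to a CSV file that can be graphed later."""
    count = doc_text.upper().count("TODO")  # case-insensitive match
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), count])
    return count
```

Run it every Monday and the CSV becomes exactly the decreasing-numbers graph management can follow.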

In DevOps terminology this is called getting the "flow" right. The First Way of DevOps is about flow. First you need to get the process to be repeatable (i.e. no more "TODO"s). Then you can focus on making the process better: eliminating duplicate work, replacing steps that are problematic, finding and fixing bottlenecks.

The Second Way of DevOps is about the communication between the people involved in the steps. If each step (or groups of steps) is done by a different person or group of people, do they have a way to give feedback to each other? Do they attend each other's status meetings? If one team does something that causes problems for another, does that team muddle through it and suffer, or do they have a channel to raise the issue and get it fixed?

Once the process is documented (defined), you'll want to improve it. Some general ways to do this are:

  • Tracking. If many sub-teams are involved, having a way to track which step is active and how things are handed off becomes critical. People need visibility into the entire system so they know what is coming to them and who is waiting for them.
  • Identify and Fix Bottlenecks. Every system has a bottleneck. Chapter 12 Section 12.4.3 of The Practice of Cloud System Administration discusses this more.
  • Improve steps. Are there steps that are unreliable or that cause most of the failures? Fix the biggest problems first.
  • Automation. Automation generally reduces variation, improves speed, and saves labor. More important than saving labor, it frees people to do other work, thus multiplying the labor force.
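The "tracking" idea above can be sketched in a few lines: give every work item a current step and a handoff history so anyone can see where it is and where it has been. The step names here are invented for illustration:

```python
from dataclasses import dataclass, field

# Hypothetical steps in an integration pipeline.
STEPS = ["intake", "build", "integrate", "verify", "deliver"]

@dataclass
class WorkItem:
    name: str
    step_index: int = 0
    history: list = field(default_factory=list)

    @property
    def current_step(self):
        return STEPS[self.step_index]

    def hand_off(self):
        """Advance to the next step, recording the handoff so
        everyone has visibility into where the item has been."""
        self.history.append(self.current_step)
        if self.step_index < len(STEPS) - 1:
            self.step_index += 1
        return self.current_step
```

In practice this is a ticketing system or a kanban board, but the data model is the same: one current step, one audit trail.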

The Practice of Cloud System Administration has lots of advice about all of these next steps.

Posted by Tom Limoncelli

SAGE-AU is doing some great media work, making the case that the proposed data-retention law in Australia would create a nightmare for businesses that use computers. They point out that every ill-defined or vague point in the law creates more and more problems.

"It's very immature legislation proposal. It's more holes than cheese. There's more questions around it than there are answers," he said.

Read the full article here: Australian sysadmins cop brunt of data-retention burden

Every country should have an organization that speaks for its IT workers. Go SAGE-AU!

Posted by Tom Limoncelli

Dear readers: I need your help. I feel like I've lost touch with what new sysadmins go through. I learned system administration 20+ years ago. I can't imagine what new sysadmins go through now.

In particular, I'd like to hear from new sysadmins about what their "rite of passage" was that made them feel like a "real sysadmin".

When I was first learning system administration, there was a rite of passage called "setting up an email server". Everyone did it.

This was an important project because it touches on so many different aspects of system administration: DNS, SMTP, Sendmail configuration, POP3/IMAP4, setting up a DNS server, debugging clients, and so on and so on. A project like this might take weeks or months depending on what learning resources you have, if you have a mentor, and how many features you want to enable and experiment with.

Nowadays it is easier to do that: Binary packages and better defaults have eliminated most of the complexity. Starter documentation is plentiful, free, and accessible on the web. DNS domain registrars host the zone too, and make updates easy. Email addressing has become banal, mostly thanks to uniformity (and the end of UUCP).

More "nails in the coffin" for this rite of passage include the fact that ISPs now provide email service (this didn't used to be true), hosted email services like Google Apps have more features than most open source products, and ...oh yeah... email is passé.

What is the modern rite of passage for sysadmins? I want to know.

If you became a sysadmin in the last 10 years: What project or "rite of passage" made you feel like you had gone from "beginner" to "a real sysadmin"?

Please tell me here.

Posted by Tom Limoncelli in Education

Someone asked me in email for advice about how to move many machines to a new corporate standard. I haven't dealt with desktop/laptop PC administration ("fleet management") in a while, but I explained this experience and thought I'd share it on my blog:

I favor using "the carrot" over "the stick". The carrot is making the new environment better for the users so they want to adopt it, rather than using management fiat or threats to motivate people. Each has its place.

The more people feel involved in the project the more likely they are to go along with it. If you start by involving typical users by letting them try out the new configuration in a test lab or even loaning them a machine for a week, they'll feel like they are being listened to and will be your partner instead of a roadblock.

Once I was in a situation where we had to convert many PCs to a corporate standard.

First we made one single standard PC. We let people try it out and find problems. We resolved or found workarounds to any problems or concerns raised.

At that point we had a rule: all new PCs would be built using the standard config. No regressions. The number of standard PCs should only increase. If we did that and nothing more, eventually everything would be converted as PCs only last 3 years.

That said, preventing any back-sliding (people installing PCs with the old configuration by mistake, out of habit, or because they wanted an "exception") was a big effort. The IT staff had to be vigilant. "No regressions!" was our battlecry. Management had to have a backbone. We had to police ourselves and our users.

We knew waiting for the conversion to happen over 3 years was much too slow. However before we could accelerate the process, we had to get those basics correct.

The next step was to convert the PCs of people that were willing and eager. The configuration was better, so some people were eager to convert. Updates happened automatically. They got a lot of useful software pre-installed. We were very public about how the helpdesk was able to support people with the new configuration better and faster than the old configuration.

Did some people resist? Yes. However there were enough willing and eager people to keep us busy, so we let those "late adopters" have their way. Still, we'd mentally prepare them for the eventual upgrade by saying things like (in a cheerful voice), "Oh, you're a late adopter! No worries. We'll see you in a few months." Calling them "late adopters" instead of "resisters" or "hard cases" reframed the issue: they were "eventual", not "never".

Some of our "late adopters" converted without prompting. Some got a new machine and didn't have a choice. Others saw that other people were happy with the new configuration and didn't want to be left behind. Nobody wants to be the only kid on the block without the new toy that all the cool kids have.

(Oh, did I mention that the system for installing PCs the old way was broken and we couldn't fix it? Yeah, kind of like how parents tell little kids that the "Frozen" disc isn't working and they'll have to try again tomorrow.)

Eventually those conversions were done and we had the time and energy to work on the long tail of "late adopters". Some of these people had verified technical issues, such as software that didn't work on the new system. Each of these could mean many hours or days of helping the user make the software work or finding replacement products. In some cases, we'd extract the user's disk into a virtual machine ("p2v") so that it could run in the old environment.

However, eventually we had to get rid of the last few hold-outs. If the support cost of the old environment was $x per year and there were 100 remaining machines, $x/100 per machine isn't a lot of money. When there are 50 remaining machines the cost is $x/50. Eventually the cost is $x/1, which makes that last machine very, very expensive. The faster we could get to zero, the better.
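That arithmetic is worth making explicit. A toy calculation, using a made-up $50,000/year support cost:

```python
def cost_per_machine(total_support_cost, machines_remaining):
    """The cost of keeping the old environment alive is roughly fixed,
    so each remaining machine's share grows as the count shrinks."""
    return total_support_cost / machines_remaining

# With a hypothetical $50,000/year support cost:
for n in (100, 50, 1):
    print(n, cost_per_machine(50_000, n))  # 100 -> 500.0, 50 -> 1000.0, 1 -> 50000.0
```

The fixed cost doesn't shrink with the fleet, which is why the last hold-out is the most expensive machine in the building.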

We announced that unconverted machines would be unsupported after date X, and would stop working (the file servers wouldn't talk to them) by date Y. We had to get management support on X and Y, and a commitment to not make any exceptions. We communicated the dates broadly at first; eventually only the specific people affected (and their managers) received the warnings. Some people figured out that they could convince (trick?) their manager into buying them a new PC as part of all this... we didn't care, as long as we got rid of the old configuration. (If I were doing this today, I'd use 802.1X to kick old machines off the network after date Z.)

One excuse we could not tolerate was "I'll just support it myself". The old configuration didn't automatically receive security patches and "self-supported machines" were security problems waiting to happen. The virtual machines were enough of a risk.

Speaking of which... the company had a loose policy about people taking home equipment that was discarded. A lot of kids got new (old) PCs. We were sure to wipe the disks and to be clear that the helpdesk would not assist with a machine once it was disposed of. (In hindsight, we should have put a sticker on each machine saying so.)

Conversion projects like this pop up all the time. Sometimes it is due to a smaller company being bought by a larger company, a division that didn't use centralized IT services adopting them, or moving from an older OS to a newer OS.

If you are tasked with a similar conversion project, you'll need to adjust these techniques to your situation. Doing this for 10 machines, 500 machines, or 10,000 machines each requires a different approach.

If you manage server farms instead of desktop/laptop PC fleets, similar techniques work.

Posted by Tom Limoncelli in Technical Tips

Tom will be the speaker at the Wed, January 14, 2015 meeting of BBLISA, which meets in Cambridge, MA. I'll be talking about our new book, The Practice of Cloud System Administration.

For more info:

Are you a software developer who is facing rapidly changing markets, technologies, and platforms? This new conference is for you.

ACM's new Applicative conference, Feb. 25-27, 2015 in Midtown Manhattan, is for software developers who work in rapidly changing environments. Technical tracks will focus on emerging technologies in system-level programming and application development.

The list of speakers is very impressive. I'd also recommend sysadmins attend as a way to stay in touch with the hot technologies that your developers will be using (and demanding) soon.

Early bird rates through Jan. 28 at

Posted by Tom Limoncelli in Conferences

Hi Boston-area friends! I'll be giving my "Radical ideas from The Practice of Cloud System Administration" talk at the Back Bay LISA user group meeting on Wednesday, January 14, 2015. Visit for more info.

Short version: My mailing list server no longer generates bounce messages for unknown accounts, thus eliminating the email backscatter it generates.

Longer version:

I have a host set up exclusively for running mailing lists using Mailman, and battling spam there has been quite a burden. I finally "gave up" and made all the lists members-only. Luckily that was possible with the email lists being run there; if I had had any open mailing lists, I wouldn't have been so lucky. This change eliminated all spam, and I was able to disable SpamAssassin and the other measures I had put in place. SpamAssassin had been using more and more CPU time while letting more and more spam through.

That was a few years ago.

However, then the problem became spam backscatter. Spammers were sending to nearly every possible username in hopes of getting through. Each of these attempts resulted in a bounce message being sent to the (forged) email address the attempt claimed to come from. It got to the point where 99% of the email traffic on the machine was these bounces. The host was occasionally being blocked as punishment for generating so many bounces. Zero of these bounces were "real": every bounce went to an address that didn't actually send the original message and didn't care about the contents of the bounce message.

These unwanted bounce messages are called "Spam Backscatter".

My outgoing mail queue was literally filled with these bounce messages, which were re-tried for weeks until Postfix gave up on them. I changed Postfix to delete them after a shorter amount of time, but the queue was still huge.

This weekend I updated the system's configuration so that it simply doesn't generate bounces to unknown addresses on the machine. While this is something you absolutely shouldn't do on a general-purpose email server (people who mistype the addresses of your users would get very confused), doing it on a highly specialized machine makes sense.
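I won't bore you with the exact settings I changed, but for the curious: one common way to achieve this in Postfix is to list every valid list address and reject everything else during the SMTP dialog, so no bounce message is ever queued. A sketch (the map file name is an assumption for illustration, not my actual config):

```
# /etc/postfix/main.cf (sketch)
# Accept mail only for addresses in the recipient map; refuse all
# others with a 550 during the SMTP conversation, so no bounce
# message is ever generated for unknown users.
local_recipient_maps = hash:/etc/postfix/list_recipients
unknown_local_recipient_reject_code = 550
```

Rejecting at SMTP time puts the burden of notifying the sender on the connecting server, which for forged spam means no notification happens at all.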

I can now proudly say that for the last 48 hours the configuration has worked well. The machine is no longer a source of backscatter pollution on the internet. The mail queue is empty. It's a shame my other mail servers can't benefit from this technique.

Posted by Tom Limoncelli in System News, Technical Tips