Recently in Checklists Category

My Pre-LISA checklist

  • Get haircut
  • Print out 2-factor "rescue codes" in case my 2-factor fob is lost or dies.
  • De-junk my wallet.
  • Practice slides for the Ganeti tutorial and the Time Management tutorials.
  • Reach out to co-workers about coverage while I'm away.
  • Verify flights and hotel information.
  • Pack

What's on your pre-LISA checklist? Please post in the comments. I'd like to know!

See you in San Diego!
Tom

Posted by Tom Limoncelli in Checklists, Conferences

In my speaking and writing I always encourage people to automate what they can and document what they can't. If something can't be automated (or isn't worth automating), writing a bullet list of the steps makes the task less error-prone and easier for others on the team to do.

I get replies like, "What if I automate myself out of a job?" or "but if I document what I do, anyone can do it and I won't be needed!"

Sysadmin, please! Neither could be further from the truth.

First of all, there's always more IT work to be done. Automating your job is more likely to get you a new assignment. Being the person who automated something makes you a hot property, and other projects will come to you.

Secondly, if I have 10 people working for me and I'm told that due to downsizing I have to pick 1 person to let go, I'm not going to pick the person that writes good documentation and automation. The (thankfully few) times I've had to lay off people I did anything I could to save the people that are "force multipliers": they create tools that let team members do more with less. A wiki full of accurate, up-to-date, "how to" bullet lists is just as much of a "tool" as a script that automates a task. If you don't know how to program, learn. If you don't want to learn, learn to write "how to" bullet lists.

The C programming language let one programmer be as productive as 100 assembly-language programmers. Did this mean that 99% of all programmers were fired? No, it meant that more programmers were hired than ever before: the profession was opened to more people and the pent-up demand for software was released.

There's another way to think about it: This is totally different than the auto industry where the introduction of robots resulted in layoffs. In IT, you are the robot!

P.S. I count 21 different tutorials at Usenix LISA that teach how to automate and 2 tutorials related to documentation. That's just the tutorials... there are 4 other tracks in the program!

Usenix LISA 2011 is Dec 4-9 in Boston. You can register any time, but you get a discount if you register by Nov 14. I look forward to seeing you there!

Posted by Tom Limoncelli in Checklists

Since I can not attend the LISA Workshop on Teaching System Administration (I'll be teaching system administration that day!), I'd like to take a moment to say something to the attendees.

Often we are in the thick of things and we lose sight of how valuable our work is.

What you are doing is incredibly important; maybe more important than you realize. IT isn't just important, it is scary-important. The usual old sayings about how important IT is are now obsolete. It isn't that IT is a part of how food gets from the farm to our plate; we, as a society, no longer know how to provide food without IT. Medicine isn't just billed and administered with the assistance of IT; we can't provide medical services without IT anymore. Sysadmins are not just "important"; the existence of excellence in system administration is key to sustaining civilization as we know it.

Those teaching system administrators need to step up to the plate. Our world depends on you.

It is time for an organization to take a leadership role in defining a standard sysadmin curriculum and get it adopted at all 4-year and 2-year schools. The 2-year training is embarrassingly bad. The 4-year training is bad to mediocre.[1]

Students are graduating from 4-year programs without understanding the internals of systems or how they are used en masse in the real world. This would be like auto mechanics not being taught how an internal combustion engine works, or doctors somehow graduating medical school without knowing that patients are alive between office visits.

10% of us know the right way to do things. The other 90% don't. Why the uneven distribution of knowledge? The trouble this brings is far-reaching. Sarbanes-Oxley essentially says, "If you are going to be so unbelievably stupid as to do backups without testing them, create accounts without having a mechanism to make sure they are disabled when the employee leaves, and let developers have unrestricted raw access to live databases, then we're going to legislate how you have to do your job." HIPAA essentially says that our industry has proven itself too incompetent to be trusted with securing databases or WiFi networks in hospitals. Therefore, how to do our jobs is being written into legislation.[2]

What's next? What will be the next example of rampant incompetence that leads to more legislation that tells us how we have to do our jobs? What crap caused by the worst of us will ruin it for the rest of us? What other obvious best practice that sites somehow still successfully ignore will become required by law? "Have a helpdesk that doesn't suck"? "Track your customer requests with a 'ticket' system"? "Buy load balancers in pairs"? "Ping a machine after you've unplugged it to make sure you unplugged the right one"? "Lock your screen when you leave your desk"? Many of these were "rocket science" 10 years ago. Now it's just embarrassing to see IT teams that are blind to these ideas.

This is a problem that is bigger than any one person can solve. You and I know this. We've written books to try to educate, but how much can one person do? This is the greatest challenge our industry has ever faced. This is the kind of thing that requires group effort.

Creating such a curriculum would take a long time, and getting it widely adopted even longer. However, with the power of Usenix, the expertise of LOPSA, and the academic ubiquity of ACM, this could really happen.

I hope that the members of the workshop take the time to think big.

Things don't get better on their own.

Sincerely, Tom Limoncelli

[1] These are based on indirect experience. The truth is that we don't have a measure for how to quantify if a school is doing a good job. First we need a standard to measure institutions by, then we need to go around measuring institutions. Providing a self-evaluation kit would even be a major step forward.

[2] One might say that it is the executive management of hospitals that is to blame. I disagree. We are at fault for not being able to explain the issue in a way that gets executive attention. Worse, often we are at companies that are selling systems with known problems. Why do we even offer a known-bad solution? Is it our own ignorance, or is it like the consultant I once saw explain three options to a customer, one of which he said he recommended against? Of course the customer wanted the one he was recommending against. Why did he even mention that option? It wasn't an option. The customer wouldn't have thought of it on their own. It was a counter-example that you turned into an option. Knucklehead!

[ This is the kind of topic I'll be covering in detail at my training class at Usenix LISA 2010: Time Management: Team Efficiency.]

I saw this question in email the other day...

In a small team managing [what I think to be] a fairly diverse environment like ours how do you handle [cross-]training/redundancy? I know there are a large number of things right now that no one other than me knows how to do. I think I have the everyday issues well documented, but non-routine issues may cause issues.

First, let's define "cross training". A department with many sub-teams wants everyone to be able to handle tasks from the other sub-teams. For example, you have an IT department with three sub-teams: a Linux sub-team, a Networking sub-team, and a Storage sub-team. In an ideal world, the Storage sub-team members should be able to handle 80% of the requests of the Linux sub-team, and vice versa. Being able to handle 80% of the Storage-related requests probably means knowing about 20% of what someone on the Storage sub-team knows. That's ok. 80% of the requests are probably things like add/delete/change requests (add a new virtual partition, increase the size of an existing one, etc.) and common problems (what to do with an NFS stale file handle, etc.). If everyone in the department could handle those tasks, the individual teams could focus on higher-order issues like scaling, monitoring, and optimization.

The #1 thing you need to be able to do is document those "sharable tasks". But everyone hates writing documentation, right? So don't write documentation: write checklists. You only have to document the steps, and you can use language that assumes the person has basic knowledge. If you keep the checklists on a system that permits anyone to edit the documents (e.g. a wiki or a source code repository), then they can fine-tune your docs as they use them.

Some standard checklists to write are:

  1. Things we do for each newhire.
  2. Things we do when an employee is terminated.
  3. How to: allocate space in the machine room, set up a new server, deploy a workstation, add to the puppet configuration, etc.

If you do cross-training right, rather than a pager rotation for each sub-team, you can have one global pager rotation. That means rather than being on call once every 3 weeks, you might be on call once every 12 weeks!
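The arithmetic behind that claim can be sketched in a few lines. This is only an illustration: the post doesn't say how big the teams are, so the numbers below assume four sub-teams of three people each and one-week shifts, and the function name is made up for the example.

```python
def weeks_between_shifts(people_in_rotation: int, shift_length_weeks: int = 1) -> int:
    """Weeks from the start of one of your shifts to the start of your next one."""
    if people_in_rotation < 1 or shift_length_weeks < 1:
        raise ValueError("need at least one person and a shift of at least one week")
    return people_in_rotation * shift_length_weeks


# Assumptions, not figures from the post: team sizes and team count.
PEOPLE_PER_SUB_TEAM = 3
SUB_TEAMS = 4

per_team = weeks_between_shifts(PEOPLE_PER_SUB_TEAM)
merged = weeks_between_shifts(PEOPLE_PER_SUB_TEAM * SUB_TEAMS)

print(f"Per-team rotation: on call once every {per_team} weeks")
print(f"Global rotation:   on call once every {merged} weeks")
```

With different team sizes the numbers change, but the point holds: everyone who can work the shared checklists is one more person in the rotation.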

To set up cross-training for a pager rotation, I also suggest checklists.

For each page that you might receive, have a checklist of what actions to take: check this, try rebooting that, look at the logs for messages that say such-and-such. The last step should always be "if all that failed to fix the problem, escalate to so-and-so." If so-and-so feels he/she is getting called in the middle of the night too much, ask them to improve the checklist.

Encourage people to write the checklist when they add the alert rule to the monitoring system. If someone won't or doesn't write the checklist for a particular alert it just means they have agreed to be called in the middle of the night every time.
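Here is a rough sketch of how an alert and its checklist might be kept together. It isn't tied to any real monitoring system; the class, field, and alert names are all invented for illustration. It shows the two rules above: the last step is always the escalation, and an alert with no checklist simply pages its owner every time.

```python
from dataclasses import dataclass, field


@dataclass
class AlertChecklist:
    """Hypothetical record pairing one alert rule with its on-call checklist."""
    alert: str
    owner: str  # the "so-and-so" who gets escalations
    steps: list[str] = field(default_factory=list)

    def render(self) -> list[str]:
        """Return the steps to follow, always ending with the escalation."""
        if not self.steps:
            # Nobody wrote a checklist, so the owner has agreed to be called.
            return [f"No checklist written: call {self.owner}."]
        escalation = (
            f"If all that failed to fix the problem, escalate to {self.owner}."
        )
        return [*self.steps, escalation]


# Example alert names and steps are made up.
documented = AlertChecklist(
    alert="fileserver_disk_full",
    owner="the storage on-call",
    steps=[
        "Check which partition is full.",
        "Look at the logs for out-of-space messages.",
    ],
)
undocumented = AlertChecklist(alert="mystery_alert", owner="whoever added the alert")

for checklist in (documented, undocumented):
    print(checklist.alert)
    for number, step in enumerate(checklist.render(), start=1):
        print(f"  {number}. {step}")
```

Keeping the checklist next to the alert definition means the person adding the alert is the one who decides whether to write the steps or take the calls.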

These checklists will grow and improve over time. Every time you have an outage, augment the checklists that would prevent that problem in the future.

In my tutorial at Usenix LISA, I'll expand on this and show how this system can be used to coordinate training in general, especially in ways that help bring newhires up to speed. This is material I've never written about or included in any other tutorial, and you can only see it at Usenix LISA in San Jose, Nov 7-12, 2010. Seating is limited. Register today!

Posted by Tom Limoncelli in Checklists, Conferences

At the risk of being a total fan-boy for Atul Gawande's 'The Checklist Manifesto: How to Get Things Right' (book and ebook), I want to point readers to this extract published in The Financial Times.

It covers a VC who uses checklists to get better results when selecting investments, and gives a dramatic description of checklist use during US Airways flight 1549, where Captain Chesley B. "Sully" Sullenberger III made an emergency landing in the Hudson River.

Three favorite quotes:

Posted by Tom Limoncelli in Checklists

 