May 2009 Archives

My high school had a radio station.  Volunteering there taught me a lot about planning, timing, music, and electronics.  Plus, we had a CP/M machine that was used to store our inventory of albums and that gave me an excuse to spend hours with a real live computer.

One of the pieces of equipment they had was a machine that recorded the station any time the microphone was on.  Firstly, this kept students from saying "bad words" on the air, as it provided evidence.  More importantly it was a training device.  After a show we could listen to the tape to understand what we sounded like and make improvements.

My first tape sounded something like this:
  • That was a really great song.  Now here's ____ with ____.  It's a really great song.
  • click
  • That was a really great song.  Now here's ____ with ____.  It's a really great song.
  • click
  • That was a really great song.  Now here's ____ with ____.  It's a really great song.
  • click
  • That was a really great song.  Now here's ____ with ____.  It's a really great song.
  • click
You see, dear readers, it seems that I felt it was really important for you to know that the music I was playing was, and I clearly meant this in a heart-felt way, "great."

It was laughable how repetitive I was. My adviser explained two things:
  1. We only play great music on this radio station. Therefore, you don't have to tell the audience it is a great song.  In fact, one would say that the fact that we played it means it is a great song. We're a great station; we define greatness.
  2. Every new DJ makes the same mistake.
These points were learned and relearned by every student that joined the DJ staff.

When introducing a song you don't need to say something is great. It is more powerful to talk about the song's qualities and let the listener realize that it is great. For example one might say that it is "their newest release", that it was "requested by a caller", or that "I've been waiting to play this all week".  All of those things say volumes more about the song than saying it is "great".

This holds true for anything we introduce: Introducing a friend to another, introducing information we're about to give to co-workers (formally or informally), introducing new software to users [Ever see an IT person spend 15 minutes telling users the new software is fantastic, wonderful, great, amazing, and awesome but forgetting to say what the software does?  I have!], and especially when introducing speakers at a conference.

I was reminded of this concept because yesterday I saw someone making this same mistake. At a day-long mini-conference the chair introduced every single speaker as "awesome" or "incredibly awesome."  That's how the audience was introduced to the person that came to say a few words as a representative of the event's co-sponsor.  That's how the audience was introduced to the world-famous, award-winning, well-published, keynote speaker who had traveled 200 miles to be there.  The audience did know that the keynote speaker was particularly important because her introduction included the word "awesome" at least six times. The representative of the co-sponsor was only called "awesome" once.

It was painful to watch these introductions.  I wanted to grab the microphone and offer to do the introductions myself.  I would have stated 2-3 biographical details from their bio (which were written in the program) and let those points speak for themselves.

And my introductions would have been... well... awesome.


Posted by Tom Limoncelli in Personal Growth

On a mailing list recently someone asked, "Does anyone have any recommendations for useful metrics to measure the performance of the systems team? (i.e. not the systems themselves)"

Fundamentally you want to define an SLA and then demonstrate that you are meeting it (or how close you are to meeting it, with improvement over time).  The problem is how do you define an SLA?  Here are some example metrics:

  1. 90% of all tickets will be closed in 3 days (measure the number of tickets that are older than 3 days)
  2. VPN and remote access services up 99.99% of the time (measure uptime outside of scheduled maintenance windows)
  3. New users have accounts/machines/etc. within n days of their start (preferably n=-1)
  4. IMAP latency below n microseconds (measure how long it takes to do a simulated login, read of 100 messages, and log out)
I prefer measuring things that can be measured automatically.  All of the above can be.  Asking humans to take manual measurements is a burden and error prone.
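As a rough sketch of how the first metric could be measured automatically, here is a small Python function.  It assumes you can export each ticket's open and close times from your ticket system; the function name and the data format are made up for illustration, not taken from any particular product.

```python
from datetime import datetime, timedelta


def sla_compliance(tickets, now, limit=timedelta(days=3)):
    """Return the fraction of tickets that met the close-within-limit goal.

    tickets is a list of (opened, closed) datetime pairs; closed is None
    for a ticket that is still open.  (Hypothetical export format.)
    """
    met = missed = 0
    for opened, closed in tickets:
        # For open tickets, measure their age as of "now".
        age = (closed or now) - opened
        if age > limit:
            missed += 1      # closed late, or still open past the limit
        elif closed is not None:
            met += 1         # closed within the limit
        # Still open but younger than the limit: not decided yet, so skip it.
    total = met + missed
    return met / total if total else 1.0
```

Run it nightly from cron against the exported data and compare the result to 0.90.  Tickets that are still open but younger than the limit are left out because they have neither met nor missed the goal yet.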

I recently started a new assignment where I was supposed to write down the number of open tickets at the beginning and end of the day, and keep count of how many tickets I had completed.  Oh brother.  As you can imagine, I failed.  There wasn't a single day that I remembered to collect all three data points.  Eventually I found a script that extracts this data from our ticket system.

Some things that can't be automatically measured:

  • Customer happiness.  Yes, you can send out surveys but I don't think that's accurate.  People don't respond to surveys unless they are dissatisfied with you or compulsive survey-takers.  It is better to give people a way to tell a manager that they were unhappy so that the team can be "educated".  The problem becomes, how do I ask for that kind of feedback from our users?  Sometimes it helps to disguise that in the form of a survey.  A single-question survey ("On a rank of 1 to 5, how did we do?") followed by a big, big, optional comment box.   The rank data you collect might be useful if your boss likes pretty graphs (especially if you graph over long periods of time).  The real value will be in the comments you get.  Listen to the comments you get and make sure the person that made the comment gets a personal phone call or visit not to defend or explain, but to ask for their suggestions on how we could do better.  Angry customers want to be listened to more than anything else.  In fact, they want to be listened to more so than they want the problem fixed.  (Oh, you'll get compliments too.  Print them out and put them on the wall for everyone to see!)
  • "Time to Return to Service" i.e. when there is an outage (dead disk, dead router, etc.) how long before you were able to return the service to an operational state.  Don't measure this.  Measuring that distracts engineers from building systems that prevent outages (RAID, redundant routers, and so on).  If you instead measure uptime you are driving good behavior without micromanaging.  If I was measured on my "return to service" times, I'd stop building systems with RAID or redundant routers so that I can have a lot of outages and tons of data to show how good I am at swapping in new hardware.  That disk that you paid for shouldn't be sitting in a box next to the computer, it should be part of a RAID system that automatically recovers when there is an outage.
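Measuring uptime outside of scheduled maintenance windows can also be automated if you keep a log of outages.  Here is a minimal Python sketch; the function names and the (start, end) log format are my own assumptions, and it assumes that maintenance windows don't overlap each other and that outages don't overlap each other.

```python
from datetime import datetime, timedelta

ZERO = timedelta(0)


def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals (zero if none)."""
    return max(min(a_end, b_end) - max(a_start, b_start), ZERO)


def uptime_percent(period_start, period_end, outages, maintenance):
    """Percent uptime over a period, not counting scheduled maintenance.

    outages and maintenance are lists of (start, end) datetime pairs
    (a hypothetical log format).
    """
    # Time inside maintenance windows is removed from the measured period.
    scheduled = sum(
        (overlap(period_start, period_end, s, e) for s, e in maintenance), ZERO
    )
    measured = (period_end - period_start) - scheduled
    if measured <= ZERO:
        raise ValueError("nothing to measure outside maintenance windows")

    down = ZERO
    for o_start, o_end in outages:
        # Clip each outage to the reporting period.
        o_start, o_end = max(o_start, period_start), min(o_end, period_end)
        if o_end <= o_start:
            continue
        # Downtime that falls inside a maintenance window is not counted.
        excused = sum((overlap(o_start, o_end, s, e) for s, e in maintenance), ZERO)
        down += (o_end - o_start) - excused
    return 100.0 * (1 - down / measured)
```

Notice that this rewards the behavior you want: an outage that RAID or a redundant router absorbs never appears in the log at all.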

My last recommendation is controversial.  You should penalize people that beat their SLA too well.  If the SLA says there will be 99.9% uptime, and I provide 99.999% uptime then I am probably doing one of two bad things:  Either I'm paying for redundancy that is wasteful or I'm avoiding important system upgrades and therefore impeding innovation.   If I am hovering around 99.9% by +/- 0.1% then I've demonstrated that I can balance uptime with budget and innovation.  If management complains about outages but I'm still at 99.9%, then they need to change the SLA and be willing to fund the resources to achieve it, or accept the intangible costs of a slower rate of upgrades.  They may back down or they may choose one of the other options.  That's fine.  If you think about it the essential role of management is to set goals and provide resources to meet those goals.  By working to hit (not exceed) your SLA you are creating an environment where they can perform their essential role whether they realize it or not.  Similarly, if they want to save money you can respond with scenarios that include fewer upgrades (higher risk of security problems, less productivity due to the opportunity cost of lacking new features) or by accepting a lower SLA due to an increase in outages.
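To make the gap between 99.9% and 99.999% concrete, it helps to turn each figure into a yearly downtime budget.  This is just arithmetic; the function name is mine, and it ignores leap years.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def downtime_budget_minutes(sla_percent):
    """Minutes of downtime per year that an uptime SLA allows."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.9, 99.999):
    print(f"{sla}% uptime allows {downtime_budget_minutes(sla):.1f} minutes down per year")
```

A 99.9% SLA allows about 526 minutes (a little under 9 hours) of downtime a year.  At 99.999% the budget is about 5 minutes.  The difference is time that could have been spent on upgrades.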

Tom

Posted by Tom Limoncelli in Technical Management

You know that here at E.S. we're big fans of monitoring.  Today I saw on a mailing list a post by Erinn Looney-Triggs who wrote a module for Nagios that uses dmidecode to gather a Dell's serial number then uses their web API to determine if it is near the end of the warranty period.  I think that's an excellent way to prevent what can be a nasty surprise.

Link to the code is here: Nagios module for Dell systems warranty using dmidecode
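To show the general shape of a check like this, here is a rough Python sketch of the decision logic.  This is not Erinn's code.  The function name and the 90-day and 30-day thresholds are my own assumptions, and the vendor lookup is left out entirely; the only things borrowed from Nagios are the standard plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
from datetime import date

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def warranty_status(end_date, today, warn_days=90, crit_days=30):
    """Return (exit_code, message) for a warranty that ends on end_date.

    The thresholds are illustrative defaults, not taken from the real module.
    """
    days_left = (end_date - today).days
    if days_left <= crit_days:
        return CRITICAL, f"CRITICAL: warranty days left: {days_left}"
    if days_left <= warn_days:
        return WARNING, f"WARNING: warranty days left: {days_left}"
    return OK, f"OK: warranty days left: {days_left}"
```

A real plugin would read the serial number (for example with `dmidecode -s system-serial-number`), look up the warranty end date from the vendor, print the message, and exit with the code.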

What unique things do you monitor for on your systems?

Posted by Tom Limoncelli in Technical Tips

The term "Warehouse-Scale Machines" has been coined.  The term describes the specific design that sites like Google use.  The data centers that Google runs aren't like other data centers where each rack has a mish-mosh of machines that result as various people request and fill rack space.  It's more like a single huge machine running many processes.  A machine has memory, CPUs, and storage and buses that connect them all.  A warehouse-scale machine has thousands of machines all with a few, specific, configurations.  You treat the machines as CPUs and/or storage; the network is the bus that connects them all.

There is a new on-line book (108 pages!) by the people at Google who are in charge of the Google data center operations (disclaimer: Urs is my boss's boss's boss's boss's boss).

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
by Luiz André Barroso and Urs Hölzle, Google Inc.

Abstract

As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope it will be useful to architects and programmers of today's WSCs, as well as those of future many-core platforms which may one day implement the equivalent of today's WSCs on a single board.


http://www.morganclaypool.com/toc/cac/4/1


Dear fellow sysadmins,

The surest sign that sysadmins are misunderstood is how difficult it is to install, debug, or maintain various products.  Any sysadmin can tell if the installation process was designed as an afterthought. Any sysadmin can point to a variety of... I'll be polite and say... "design decisions" that make a product completely and utterly impossible to debug.

I've talked with product managers about why their product is the speedbump that slows me down when debugging a problem that is buried in a network of 150 devices from 15 different companies.  In the old days I was told, "that's why you should buy everything from one vendor... us!" and in today's multi-platform arena I'm told, "but our goal is to make our product so easy to use you don't need to debug it."

I'm sure that last sentence made you cringe.   You get it.

I've explained how GUIs are bad when they prevent the basic principles of system administration: change management, automated auditing, backups, and unfettered debugging.  We have practices and methodologies we need to implement!  Don't get in our way!

The more enlightened product managers understand that the easier it is to automate the installation of their product, the easier it is for me to buy a lot of their product.  The more enlightened product managers understand that an ASCII configuration file can be checked in to Subversion, audited by a Perl script, or even generated automagically from a Makefile.  Sadly, those product managers are rare.

One would think that companies would be investing millions of dollars in research to make sure their products are beloved by sysadmins.

I like to think that somewhere out there is a group of researchers studying this kind of thing. I imagine that they find sysadmins that volunteer to be videotaped as they do their job.  I imagine the researchers (or their graduate students) poring over those tapes as they try to understand our strange ways. I imagine Dian Fossey studying not Gorillas in the Mist but Sysadmins at the Keyboard.

These researchers do exist.

I've seen them.

For the last two years they've met and exchanged ideas at a conference called CHIMIT.

Some of them actually videotape sysadmins and examine what it is about products that makes our lives good and !good.

My favorite moment was watching a researcher describing their observation of a sysadmin in the heat of a real outage.  The sysadmin closed the firewall's GUI and connected to the command line interface in two different windows.  In one they kept repeating a command to output some debugging information. In the other they typed commands to fix the problems. This was something the GUI would never have let him do without risking carpal tunnel syndrome.  The researcher beamed as he explained the paradigm we were witnessing.  He sounded like he had been lucky enough to catch the Loch Ness Monster on film but what he had captured was something more valuable: photographic evidence of why sysadmins love command lines!

The person sitting next to me sighed and said, "Oh my god.  Is that why nobody uses the GUI we spend millions to develop?"
 
I love this conference.

These researchers study people like me and it makes the world a better place.

More than researchers attend.  Sysadmins make up a large part of the audience.
 
This year CHIMIT 2009 will be in Baltimore, MD the days following LISA 2009 which by amazing coincidence is also in Baltimore, MD.

Will you be there?  I know I will.

Mark November 7-9, 2009 on your calendar.  Registration opens soon.  Papers can be submitted now. www.chimit09.org

Tom Limoncelli

Posted by Tom Limoncelli in Conferences

I'm happy to announce that Time Management for System Administrators (O'Reilly) is now available on Kindle (Kindle 1, Kindle 2, and iPhone), and is being sold without any DRM.

It's a good time to read TM4SA: With the economic slow-down, most IT shops are being asked to "do more with less".  TM4SA is really a book about personal efficiency.  It is a self-help book for the overburdened geek.

Kindle makes it easy: No cables to wrangle.  No special lighting needed.  Read it on the train, in the park, or at the office.  Best of all, read it at your leisure.  TM4SA is the kind of book that you can read a bite at a time.  Short chapters make it perfect for reading "when you have a few minutes" while waiting for a system update to download and install.

Read the full announcement from O'Reilly.

[ Note: Both Time Management for System Administrators and The Practice of System and Network Administration (2nd Ed) are available as e-books and can be read on-line on www.safaribooksonline.com or mobile-optimized m.safaribooksonline.com. ]

SANTA CLARA -- Sun Microsystems, a small "Silicon Valley" startup, finally achieved its goal of being acquired by a larger company this week. They have been purchased by Oracle.

Most startups plan their exit strategy as either being an IPO or being acquired. Sun's unique strategy was to do both with a 23-year gap in between. This long, painfully slow, strategy included years of selling their products to big name Wall Street firms, developing cutting edge operating systems and microprocessors, convincing the entire world that RISC is better than CISC, and blowing it all by ignoring the rise of cheap x86-based PCs. Sun is also reported to have invented "the dot", an enabling technology that precipitated the "dot com" revolution.

The purchase by Oracle surprised and stunned industry observers that had been on vacation and hadn't been paying attention to anything for the last few months. Said one analyst on vacation in the Bahamas, "When I left for holiday I heard IBM was going to snatch them up. Whatever happened to that?"

The original founders, Andy von Bechtolsheim, Vinod Khosla, Bill Joy, and Scott McNealy, were excited to make the announcement at a press conference in Mountain. Now that they have completed their first startup the entire world is watching to see what they do next.

Posted by Tom Limoncelli in Funny

 