Awesome Conferences

How do I measure my group's performance?

On a mailing list recently someone asked, "Does anyone have any recommendations for useful metrics to measure the performance of the systems team? (i.e. not the systems themselves)"

Fundamentally you want to define an SLA and then demonstrate that you are meeting it (or how close you are to meeting it, with improvement over time).  The problem is how do you define an SLA?  Here are some example metrics:

  1. 90% of all tickets will be closed in 3 days (measure the number of tickets that are older than 3 days)
  2. VPN and remote access services up 99.99% of the time (measure uptime outside of scheduled maintenance windows)
  3. New users have accounts/machines/etc. within n days of their start (preferably n=-1)
  4. IMAP latency below n microseconds (measure how long it takes to do a simulated login, read of 100 messages, and log out)
I prefer measuring things that can be measured automatically.  All of the above can be.  Asking humans to take manual measurements is a burden and error prone.

I recently started a new assignment where I was supposed to write down the number of open tickets at the beginning and end of the day, and keep count of how many tickets I had completed.  Oh brother.  As you can imagine, I failed.  There wasn't a single day that I remembered to collect all three data points.  Eventually I found a script that extracts this data from our ticket system.

Some things that can't be automatically measured:

  • Customer happiness.  Yes, you can send out surveys but I don't think that's accurate.  People don't respond to surveys unless they are dissatisfied with you or compulsive survey-takers.  It is better to give people a way to tell a manager that they were unhappy so that the team can be "educated".  The problem becomes, how do I ask for that kind of feedback from our users?  Sometimes it helps to disguise that in the form of a survey.  A single-question survey ("On a rank of 1 to 5, how did we do?") followed by a big, big, optional comment box.   The rank data you collect might be useful if your boss likes pretty graphs (especially if you graph over long periods of time).  The real value will be in the comments you get.  Listen to the comments you get and make sure the person that made the comment gets a personal phone call or visit not to defend or explain, but to ask for their suggestions on how we could do better.  Angry customers want to be listened to more than anything else.  In fact, they want to be listened to more so than they want the problem fixed.  (Oh, you'll get compliments too.  Print them out and put them on the wall for everyone to see!)
  • "Time to Return to Service" i.e. when there is an outage (dead disk, dead router, etc.) how long before you were able to return the service to an operational state.  Don't measure this.  Measuring that distracts engineers from building systems that prevent outages (RAID, redundant routers, and so on).  If you instead measure uptime you are driving good behavior without micromanaging.  If I was measured on my "return to service" times, I'd stop building systems with RAID or redundant routers so that I can have a lot of outages and tons of data to show how good I am at swapping in new hardware.  That disk that you paid for shouldn't be sitting in a box next to the computer, it should be part of a RAID system that automatically recovers when there is an outage.

My last recommenation is controversial.  You should penalize people that beat their SLA too well.  If the SLA says there will be 99.9% uptime, and I provide 99.999% uptime then I am probably doing one of two bad things:  Either I'm paying for redundancy that is wasteful or I'm avoiding important system upgrades and therefore impeding innovation.   If I am hovering around 99.9% by +/- 0.1% then I've demonstrated that I can balance uptime with budget and innovation.  If management complains about outages but I'm still at 99.9%, then they need to change the SLA and be willing to fund the resources to achieve it, or accept the intangible costs of a slower rate of upgrades.  They may back down or they may choose one of the other options.  That's fine.  If you think about it the essential role of management is to set goals and provide resources to meet those goals.  By working to hit (not exceed) your SLA you are creating an environment where they can perform their essential role whether they realize it or not.  Similarly, if they want to save money you can respond with scenarios that include fewer upgrades (higher risk of security problems, less productivity due to the opportunity cost of lacking new features) or by accepting a lower SLA due to an increase in outages.


Posted by Tom Limoncelli in Technical Management

No TrackBacks

TrackBack URL:

2 Comments | Leave a comment

Grumble...Sixapart may have defined Trackback, but they don't appear to have made it easy to use on LiveJournal.

Another reason not to measure recovery time is that you may want your admins to actually diagnose a failuer instead of, say, just restarting a system. A few extra minutes of testin may prevent future failures. At the very least some more data can be collected.

Leave a comment