Non-technical strategies

In the last few weeks I've written about ways to get peers to adopt a technology you like, and how to get your managers to adopt it too. Today I'd like to point out some "non-traditional" strategies you might employ when those fail. This list was created when talking with a reader about how to get approval for installing a trouble-ticket system.

Often the non-technical push-back is against the entire concept of ticket systems and nothing will be "good enough". In that case, don't bring a knife to a gun fight. In fact, find a way to avoid the fight entirely.

The Art of War and other strategy books would suggest alternate strategies like these:

  • Privately confront the primary dissenter directly: talk privately with the person to find the reasons behind their actions and settle those issues. Enlist them as a supporter.
  • Go around the dissenter entirely: set up the ticket system of your own choosing for a project they are not involved in. When it is successful, it will be politically difficult not to expand its use to all projects.
  • Go over the dissenter's head: get the dissenter's boss on board.
  • Leverage influential people: If there is someone that the dissenter feels walks on water and can do no wrong, get an endorsement from that person.
  • Act faster: install something and put it into action before they can push back.
  • Act slower: are there benefits to putting off the decision? For example, will the dissenter retire or change jobs soon? (You may not be allowed to know that they are on the way out. If your boss smiles knowingly when you ask, maybe they know something you don't know.)
  • Produce more data: Gather data and produce charts that show undeniably that you are right (don't show a single chart that disagrees; if the dissenter doesn't have the raw data, they can't make those charts).
  • Produce less data: Work in secret to build the system.
  • The power of crowds: Can you get a lot of other people on board such that the dissenter is outvoted?
  • The Power of the Demo: Are they rejecting a system they haven't actually used? Install your preferred solution on a VM and give demos to likely supporters. (The secret to a successful demo is doing at least 5 dress rehearsals.)
  • Divide and conquer: Find out where the opposition isn't in agreement with each other and play one side against the other.
  • Isolate dissent: Identify the dissenters and exclude them from the process (find a politically viable justification for this).
  • Overload the dissenter: Give them so much other work to do that they don't have time to dissent; or put so much of the research on their shoulders that they ask to be taken out of the decision process.
  • Reduce the choices: Don't show 15 different models and hope they pick the one you want. Only show options that you will accept.
  • Give too many choices: Show so many potential products that they are overwhelmed; declare your expertise and recommend the one you want.
  • Selective comparison: Show 1 really awful system followed by a perfect demo of your system. (In a related note: At a singles bar always stand next to an ugly person.)
  • Force a "win": Get agreement to default to your solution if a decision isn't made by a certain date ("because we can't delay ProjectX"). Make sure you've given them more work than can be accomplished by that date so as to trigger the default.
  • Make the dissenter think they are making the decision: If you ask a child "what do you want for dinner?" they'll ask for ice cream. If you ask, "Should we have hamburgers or hotdogs?" they'll think they're making the decision even though you've already made it for them. (Worst of all: don't list choices one at a time, or they'll keep saying "no" until you run out of choices: "Do you want hamburgers?" "No." "Do you want hotdogs?" "No." "Umm... well, we have ice cream." "Yes!")
  • Take advantage of emergencies: In an emergency the normal decision process goes away. Can you create a situation (or wait for a situation) where you can get permission to install RT or ORTS "just for this one emergency" and then take advantage of the fact that "nothing is more permanent than a temporary solution"?
  • Bullies only respect other bullies: Declare that your solution is the ONLY solution and brow-beat anyone that disagrees.
  • Discredit the enemy: If the dissenter is always going to find reasons to reject something, don't try to deal with the points they bring up; discredit the dissenter's opinions. ("He isn't a real stake-holder, why should we listen to him?" "He rejects anything new, remember the time....", "He won't even be using the system, why is he causing trouble for us?")
  • Running code beats vaporware: a running system beats the theory that it won't work.
  • Avoid the issue: Find another project to work on that will make you a success; leave this "can't win" situation to co-workers that are suckers.

If done right, these strategies could work or could get you fired. Proceed with caution. Work with your boss, or if your boss is the problem, confer with peers.

Please post comments with your suggestions and experiences. (This website now supports OpenID and other systems.)

Posted by Tom Limoncelli at August 25, 2010 12:00 PM | Comments (0) | TrackBack

How to kill innovation

This BusinessWeek article spells out the kind of behavior that kills innovation, which I saw at Bell Labs. The article's example is at a bank, where a new product is killed because nobody else is offering it.

At Bell Labs in the 1990s I felt that the president spent most of his time alternating between two activities: Canceling projects because "if it was a good idea, why don't we hear that the competition is doing it?" and complaining that the competition had just released a product that we hadn't thought to create. The truth was that we had, but he canceled it for the former reason.

The article recommends a better way to encourage innovation: "we can turn to a third form of logic: abductive logic, the logic of what could be. To use abduction, we need to creatively assemble the disparate experiences and bits of data that seem relevant in order to make an inference--a logical leap--to the best possible conclusion."

As system administrators we often put down extremely new ideas. Centralized file servers were a bad idea, until everyone else was doing it. The web was "too much bandwidth and should be blocked." WiFi can't be made secure. Cloud computing is "untested."

Sometimes I am concerned that we get burnt out and forget that, while it is our job to measure risk, we do this to find creative ways to mitigate it, lest we find ourselves using it to justify stopping innovation.

Here's a good New Year's resolution: Make an effort to take the logical leap to see what could be.

Posted by Tom Limoncelli at January 16, 2010 10:24 AM | Comments (2) | TrackBack

Can my SLA rule work for networks? Yes.

Last week I mentioned that if you have a service that requires a certain SLA, it can't depend on things of lesser SLA.

My networking friends balked and said that this isn't a valid rule for networks. I think that violations of this rule are so rare they are hard to imagine. Or, better stated, networking people do this so naturally that it is hard to imagine violating this rule.

However, here are 3 from my experience:


  • Situation: A company whose internet connection is a DSL modem. The modem is in the hallway near the computer room, but not in the computer room. As a result, when someone knocks the modem over, the company's website is down (the web site depends on the modem). Improvement: move the modem into the computer room.
  • A computer room with excellent UPS and power infrastructure... but the router isn't on the UPS for weird historical reasons (it is depending on external power). Improvement: move the router onto the UPS.
  • An excellent computer room with fine ethernet switches... but the router is in the lab one room over. Each VLAN has its own physical cable that runs to that other room. I was told, "the researchers are doing some experiments on the router so they wanted it in their lab". Improvement: Move the router into the computer room.

3 true stories.

Posted by Tom Limoncelli at November 19, 2009 12:44 PM | Comments (2) | TrackBack

How do I measure my group's performance?

On a mailing list recently someone asked, "Does anyone have any recommendations for useful metrics to measure the performance of the systems team? (i.e. not the systems themselves)"

Fundamentally you want to define an SLA and then demonstrate that you are meeting it (or how close you are to meeting it, with improvement over time).  The problem is how do you define an SLA?  Here are some example metrics:

  1. 90% of all tickets will be closed in 3 days (measure the number of tickets that are older than 3 days)
  2. VPN and remote access services up 99.99% of the time (measure uptime outside of scheduled maintenance windows)
  3. New users have accounts/machines/etc. within n days of their start (preferably n=-1)
  4. IMAP latency below n microseconds (measure how long it takes to do a simulated login, read of 100 messages, and log out)

I prefer measuring things that can be measured automatically.  All of the above can be.  Asking humans to take manual measurements is a burden and error prone.
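
As a rough illustration of measuring the first metric automatically, here is a minimal Python sketch. It assumes the ticket system can export an opened and closed timestamp for each ticket; the `sla_compliance` function and the dictionary layout are made up for this example and do not come from any particular ticket system.

```python
from datetime import datetime, timedelta


def sla_compliance(tickets, now, max_age=timedelta(days=3)):
    """Return the fraction of tickets that met the close-within-max_age target.

    Each ticket is a dict with an 'opened' datetime and a 'closed' datetime
    (None if still open). An open ticket counts as a miss once it is older
    than max_age; younger open tickets are not counted either way yet.
    """
    met = missed = 0
    for ticket in tickets:
        closed = ticket["closed"]
        end = closed if closed is not None else now
        if end - ticket["opened"] > max_age:
            missed += 1          # took (or is taking) longer than the target
        elif closed is not None:
            met += 1             # closed inside the target window
    total = met + missed
    return met / total if total else 1.0
```

Run nightly from cron, a function like this gives you a number to compare against the 90% target without asking anyone to count tickets by hand.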

I recently started a new assignment where I was supposed to write down the number of open tickets at the beginning and end of the day, and keep count of how many tickets I had completed.  Oh brother.  As you can imagine, I failed.  There wasn't a single day that I remembered to collect all three data points.  Eventually I found a script that extracts this data from our ticket system.

Some things that can't be automatically measured:

  • Customer happiness.  Yes, you can send out surveys but I don't think that's accurate.  People don't respond to surveys unless they are dissatisfied with you or compulsive survey-takers.  It is better to give people a way to tell a manager that they were unhappy so that the team can be "educated".  The problem becomes, how do I ask for that kind of feedback from our users?  Sometimes it helps to disguise that in the form of a survey.  A single-question survey ("On a rank of 1 to 5, how did we do?") followed by a big, big, optional comment box.   The rank data you collect might be useful if your boss likes pretty graphs (especially if you graph over long periods of time).  The real value will be in the comments you get.  Listen to the comments you get and make sure the person that made the comment gets a personal phone call or visit not to defend or explain, but to ask for their suggestions on how we could do better.  Angry customers want to be listened to more than anything else.  In fact, they want to be listened to more so than they want the problem fixed.  (Oh, you'll get compliments too.  Print them out and put them on the wall for everyone to see!)
  • "Time to Return to Service" i.e. when there is an outage (dead disk, dead router, etc.) how long before you were able to return the service to an operational state.  Don't measure this.  Measuring that distracts engineers from building systems that prevent outages (RAID, redundant routers, and so on).  If you instead measure uptime you are driving good behavior without micromanaging.  If I was measured on my "return to service" times, I'd stop building systems with RAID or redundant routers so that I can have a lot of outages and tons of data to show how good I am at swapping in new hardware.  That disk that you paid for shouldn't be sitting in a box next to the computer, it should be part of a RAID system that automatically recovers when there is an outage.

My last recommendation is controversial.  You should penalize people that beat their SLA too well.  If the SLA says there will be 99.9% uptime, and I provide 99.999% uptime then I am probably doing one of two bad things:  Either I'm paying for redundancy that is wasteful or I'm avoiding important system upgrades and therefore impeding innovation.   If I am hovering around 99.9% by +/- 0.1% then I've demonstrated that I can balance uptime with budget and innovation.  If management complains about outages but I'm still at 99.9%, then they need to change the SLA and be willing to fund the resources to achieve it, or accept the intangible costs of a slower rate of upgrades.  They may back down or they may choose one of the other options.  That's fine.  If you think about it the essential role of management is to set goals and provide resources to meet those goals.  By working to hit (not exceed) your SLA you are creating an environment where they can perform their essential role whether they realize it or not.  Similarly, if they want to save money you can respond with scenarios that include fewer upgrades (higher risk of security problems, less productivity due to the opportunity cost of lacking new features) or by accepting a lower SLA due to an increase in outages.
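
To make the gap between those two numbers concrete, it helps to turn uptime percentages into a downtime budget. The Python sketch below is only arithmetic; the function names and the 365-day period are assumptions for the example. Over a year, 99.9% allows about 525.6 minutes of downtime (roughly 8.76 hours), while 99.999% allows about 5.3 minutes.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla_percent, period_days=365):
    """Minutes of downtime an uptime SLA permits over the period."""
    return (1 - sla_percent / 100) * period_days * MINUTES_PER_DAY


def measured_uptime_percent(downtime_minutes, period_days=365):
    """Uptime percentage implied by the downtime observed over the period."""
    total = period_days * MINUTES_PER_DAY
    return 100 * (total - downtime_minutes) / total
```

Put that way, a team that used only 5 minutes of a 525-minute budget has left about 520 minutes unspent that could have gone to upgrades.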

Tom
Posted by Tom Limoncelli at May 28, 2009 10:49 PM | Comments (2) | TrackBack

Queue Inversion Week

[Hal is Founder/CEO of Deer Run Associates. This article originally appeared on his blog Righteous IT.]

Reliving the last story from my days at the mid-90's Internet skunkworks reminded me of another bit of tactical IT advice I learned on that job, and which has become a proven strategy that I've used on other engagements. I call it "Queue Inversion Week".

One aspect of our operations religion at the skunkworks was, "All work must be ticketed" (there's another blog post behind that mantra, which I'll get to at some point). We lived and died by our trouble-ticketing system, and ticket priority values generally drove the order of our work-flow in the group.

The problem that often occurs to organizations in this situation, however, is what I refer to as the "tyranny of the queue". Everybody on the team is legitimately working on the highest-priority items. However, due to limited resources in the Operations group, there are lower priority items that tend to collect at the bottom of the queue and never rise to the level of severity that would get them attention. The users who have submitted these low-priority tickets tend to be very understanding (at least they were at the skunkworks) and would wait for weeks or months for somebody in my group to get around to resolving their minor issues. I suspect that during those weeks/months the organization was actually losing a noticeable amount of worker productivity due to these "minor" issues, but we never quantified how much.

What did finally penetrate was a growing rumble of unhappiness from our internal customers. "We realize you guys are working on bigger issues," they'd tell me in staff meetings, "but after a few months even a minor issue becomes really irritating to the person affected." The logic was undeniable.

I took the feedback back to my team and we started kicking around ideas. One solution that had a lot of support was to simply include time as a factor in the priority of the item: after the ticket had sat in the queue for some period of time, the ticket would automatically be bumped up one priority level. The problem is that when we started modeling the idea, we realized it wouldn't work. All of the "noise" from the bottom of the queue would eventually get promoted to the point where it would be interfering with critical work.
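
A toy model of that aging rule shows the problem. This is a sketch under assumed numbers, not what we actually ran: priority 1 is most urgent, 4 is least, and a ticket is bumped one level for every 30 days it waits. The `aged_priority` name and both parameters are hypothetical.

```python
def aged_priority(base_priority, age_days, bump_every_days=30):
    """Priority after aging; 1 is most urgent, larger numbers are less urgent.

    The ticket moves up one level for each full bump_every_days it has
    waited, and never goes above priority 1.
    """
    bumps = age_days // bump_every_days
    return max(1, base_priority - bumps)
```

With these numbers a priority-4 annoyance that has waited 90 days comes out at priority 1, the same level as a fresh outage. That is the "noise" reaching the top of the queue.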

Then my guy Josh Smift, who basically "owned" the trouble ticketing system as far as customization and updates was concerned, had the critical insight: let's just "invert" the queue for a week. In other words, the entire Operations crew would simply work items from the bottom of the queue for a week rather than the top. It was simple and it was brilliant.
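
In code terms the inversion is just a change of sort order. The Python sketch below assumes each ticket carries a numeric priority (1 is most urgent) and the day it was opened; the `work_order` function and the field names are invented for the example.

```python
def work_order(tickets, inverted=False):
    """Return tickets in the order they should be worked.

    Normal mode: most urgent first, oldest first within a priority level.
    Inverted mode: least urgent first, still oldest first within a level.
    """
    sign = -1 if inverted else 1
    return sorted(tickets, key=lambda t: (sign * t["priority"], t["opened_day"]))
```

Keeping "oldest first" as the tie-breaker in both modes means the tickets that have waited longest at the bottom are the first ones picked up during the inverted week.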

So we looked at the project schedule and identified what looked like a "slack" week and declared it to be "Queue Inversion Week." We notified our user community and encouraged them to submit tickets for any minor annoyances that they'd been reluctant to bring up for whatever reason.

To say that "Queue Inversion Week" was a raging success is to put it mildly indeed. Frankly, all I wanted out of the week was to clear our ticket backlog and get our customers off our backs, but the whole experience was a revelation. First, the morale of my Operations team went through the roof. Analyzing the reasons why, I came to several conclusions:

  • It got my folks out among the user community and back in touch with the rest of the company, rather than being locked up in the data center all day long. The people who my folks helped were grateful and expressed that to my team, which makes a nice change from the usual, mostly negative feedback IT Ops people tend to get.
  • The tickets from the bottom of the queue generally required only the simplest tactical resolutions. Each member of my team could resolve dozens of these items during the week (in fact, a friendly competition arose to see who could close the most tickets), and feel terrific afterwards because there was so much concrete good stuff they could see that they'd done.
  • Regardless of what outsiders think, I believe most people in IT Operations really want to help the people who are their customers. It's depressing to know that there are items languishing at the bottom of the queue that will never get worked on. This week gave my team an excuse to work on these issues.

I think I can reasonably claim that Queue Inversion Week also had a noticeable impact on the morale of the skunkworks as a whole. After all, many of the annoying problems that our users had been just doing work-arounds for were now removed as obstacles. Like a good spring cleaning, everybody could breathe a little easier and enjoy the extra sunshine that appeared through the newly cleaned windows.

We repeated Queue Inversion Week periodically during my tenure at the skunkworks, and every time it was a positive experience that everybody looked forward to and got much benefit from. You can't necessarily have it happen on a rigid schedule, because other operational priorities interfere, but any time it looks like you have a little "slack" in the project schedule coming up and the bottom of your queue is full of little annoying tasks, consider declaring your own "Queue Inversion Week" and see if it doesn't do you and your organization a world of good.

Posted by Hal Pomeranz at February 13, 2009 8:53 AM | Comments (2) | TrackBack

Ready for LISA 2008 in San Diego!

I'm going to LISA '08! I've registered and I've booked my hotel. Are you going to LISA 2008? On Thursday I will be doing a 90-minute open Q&A session about Time Management. Feel free to stop by and ask me anything. On Friday I will be presenting my newest talk titled, "System Administration and The Economics of Plenty". When we start to see how plentiful the world is, we think about our roles as system administrators differently. It affects everything from how we set policy to how we do our jobs. Register online today! I hope to see you there!
Posted by Tom Limoncelli at September 29, 2008 10:00 AM | Comments (0) | TrackBack

Tom @ Ohio LinuxFest 2008, Columbus, Ohio, October 10-11, 2008

Tom will be teaching two half-day tutorials: "Time Management for System Administrators" and "Interviewing and Hiring System Administrators". This is a rare opportunity to see these talks presented in the Ohio area. Register soon!

With the economy in a down-turn, Time Management is key to being efficient at what you do. With people's hiring budgets being slashed, it is important that the people you do hire are top notch. Both of these tutorials are intended for both the new and experienced system administrator or IT manager.

The sixth annual Ohio LinuxFest will be held on October 10-11, 2008 at the Greater Columbus Convention Center in downtown Columbus, Ohio. Hosting authoritative speakers and a large expo, the Ohio LinuxFest welcomes Free and Open Source Software professionals, enthusiasts, and anyone who wants to take part in the event. The Ohio LinuxFest is a free, grassroots conference for the Linux/Open Source Software/Free Software community.
Posted by Tom Limoncelli at September 28, 2008 3:09 PM | Comments (0) | TrackBack

Keeping inventory data accurate

In system administration we have to keep many lists: lists of users, lists of machines, lists of IP addresses, and so on. The only way to keep information from growing stale is to make sure key processes are driven off of the live database.

Here are three different techniques I've seen used:

Level 1. Periodically gather the information. A spreadsheet is great for this and simple. Once a year you collect information and then you spend 364 days with out-of-date information. I've seen this in a number of places. At Lucent they hired a company to document "everything with a power plug" once a year. The information was put into a big read-only database that everyone ignored. I wonder how much they paid for this "service".

Level 2. Automatic collection. You know that, at least for the machines you know about, data is being collected and it is, hopefully, staying up to date. If the process is automated, you can run the process weekly or daily. Machines can stay hidden if the "discovery" software isn't very good, or if someone wants them to stay hidden.

Level 3. Actively-used data. Rather than storing data, if you actively use it then you know it is up to date because people are dedicated to keeping it up to date. They receive a benefit, not just you. If the inventory is used to drive software upgrades, then people will complain they are "left behind" and you'll know to add them to the inventory. If patches only go to machines in the inventory, then sysadmins are compelled to keep the list accurate so they aren't dealing with security flaws.

Level 3 is a self-correcting system, which saves time and assures far greater accuracy than other solutions.
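
Here is a minimal sketch of what "driving a process off the live database" might look like for patching, in Python. It assumes you have the inventory as a list of hostnames plus a list of hosts that automatic discovery has seen on the network; `plan_patch_run` and its arguments are hypothetical names for this example.

```python
def plan_patch_run(inventory, seen_on_network):
    """Decide which hosts get patched, using the inventory as the only source.

    Returns (targets, unlisted, stale):
      targets  - hosts in the inventory; these and only these get patches
      unlisted - hosts seen on the network but missing from the inventory
      stale    - inventory entries that discovery no longer sees
    """
    inventory_set = set(inventory)
    seen_set = set(seen_on_network)
    targets = sorted(inventory_set)
    unlisted = sorted(seen_set - inventory_set)
    stale = sorted(inventory_set - seen_set)
    return targets, unlisted, stale
```

The point of the design is the first return value: because unlisted machines get no patches, their owners have their own reason to get them into the inventory.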

A company with limited IP address space found itself constantly emailing its engineers begging them to return unused IP addresses that had been allocated to them. Nobody listened. When they changed the IP allocation process to be a "rental agreement" that required semiyearly confirmation that the IP address was still in active use (they received email with a link to confirm the ownership), suddenly the list became much more accurate.
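
The "rental agreement" can be sketched in a few lines of Python. I don't know how that company implemented it; this version assumes a mapping from IP address to the date the owner last confirmed it, and treats roughly 182 days without confirmation as an expired lease. The names `CONFIRM_EVERY` and `leases_to_reclaim` are made up for the example.

```python
from datetime import date, timedelta

# Roughly semiyearly; the exact window is an assumption for this sketch.
CONFIRM_EVERY = timedelta(days=182)


def leases_to_reclaim(leases, today):
    """Return the IP addresses whose owners have not confirmed recently.

    leases maps an IP address string to the date of its last confirmation.
    """
    return sorted(
        ip for ip, confirmed in leases.items()
        if today - confirmed > CONFIRM_EVERY
    )
```

The addresses this returns are the ones that go back into the pool, which is what gives people a reason to click the confirmation link.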

The pressure for a person to keep the data accurate should be self-serving to the person, not you. Employees are quick to demand corrections to any inaccuracies related to payroll, right? The payroll department has an incentive to pay everyone their accurate salary. The employee has an incentive to make sure they are paid the correct amount, and make sure their home address and such is accurate. I once saw a company try to send holiday cards to each employee. A secretary was about to blast email to everyone asking for their home address. Since the email wasn't going to say why she needed their address (the card was a surprise), I was sure it was going to cause nothing but a big flap about privacy. Instead I encouraged her to simply get permission to use people's home address as listed in the payroll system. While monthly paychecks were direct-deposit, bonuses and tax-info were sent by paper-mail. Everyone kept that database extremely accurate.

What do you use to keep inventories and other lists of information up to date?

Posted by Tom Limoncelli at April 7, 2008 9:00 AM | Comments (0) | TrackBack

Tom @ $GROUPNAME in New Jersey, Nov 15, 2006

Tom will be doing a dress-rehearsal of his "Site Reliability @ Google" talk at $GROUPNAME this Wednesday night. Be the first to hear his new material.

The person that carpools with the most first-timers (people new to $GROUPNAME) will receive a free copy of his book, Time Management for System Administrators, from O'Reilly.

Location: CoRE Auditorium, Rutgers Busch Campus, Piscataway, NJ.

Time: 7pm

For more information, visit their web site: www.groupname.org

Posted by Tom Limoncelli at November 12, 2006 12:21 PM | Comments (0) | TrackBack

Tom + Strata @ LISA '06 in Wash D.C., Dec 3-8, 2006

Tom and Strata will be teaching and speaking at LISA 2006 in Washington D.C., Dec 3-8, 2006. This is one of our favorite conferences of the year because it is so damn useful. Get your boss to send ya. This year it is in Washington D.C., which makes it easy to get to for all the east-coasters that usually don't get around.

Tom will be speaking/teaching:

Day  Time            Type                    Title
Mon  9am-5pm         Workshop                Managing Sysadmins (co-facilitator)
Wed  2pm-3:30        Invited Talk            Site Reliability at Google/My First Year at Google
Thu  AM              Tutorial                Time Management: Getting It All Done and Not Going (More) Crazy!
Thu  12:30pm-1:30pm  Exhibition              "Meet the Authors" at Reiter's Conference Bookstore
Thu  2pm-3:30        Guru Talk               How to Get Your Paper Accepted at LISA
Thu  4pm-5:40        Guru Talk               Time Management for System Administrators
Fri  11am-12:30      Hit The Ground Running  Mac OS X

Strata Rose Chalup will be speaking/teaching:

Day  Time      Type      Title
Mon  PM        Tutorial  Project Troubleshooting
Wed  PM        Tutorial  Problem-Solving for IT Professionals
Thu  AM        Tutorial  Practical Project Management for Sysadmins and IT Professionals
Wed  9pm-10pm  BOF       Sysadmin Education

In addition, we will be hanging out in what is known as "the hallway track". In fact, if you haven't attended LISA before, you should know that a lot of the educational value is the people you meet. Tom says, "Early in my career a lot of what I learned was from the conversations in the hallway."

Incident Command for IT: What We Can Learn from the Fire Department

At LISA2005 Brent Chapman gave an excellent talk "Incident Command for IT: What We Can Learn from the Fire Department". (Slides are webified here or download the PDF).

The ICS methodology has a lot of really good points to it. Adopting it for IT work should have a lot of benefits, not just in emergencies:

If you use it for "routine" and pre-planned events like moves, upgrades, and deployments, your team will be more comfortable using it for "surprise" events like outages and security incidents.
Brent has more about LISA2005 in his blog entry.

Posted by Tom Limoncelli at December 20, 2005 9:22 AM | Comments (0) | TrackBack