
All outages are due to a failure to plan

I can't take credit for this, as a co-worker recently introduced me to this point.

All outages are, at their core, a failure to plan.

If a dead component (for example, a hard drive) failed, then there was a lack of planning for failed components. Components fail. Hard disks, RAM chips, CPUs, motherboards, power supplies, even Ethernet cables fail. If a component fails and causes a visible outage, then there was a failure to plan for enough redundancy to survive that failure. There are technologies that, with prior forethought, can be included in a design to make any single component's failure a non-issue.

If a user-visible outage is caused by human error, it is still a failure to plan: someone failed to plan for enough training, failed to plan for the right staffing levels and competencies, failed to plan disaster exercises to verify the training, and failed to plan how to validate the construction and execution of that training.

What about the kind of outages that "nobody could have expected"? Also a failure to plan.

What about the kind of outages that are completely unavoidable? That is a failure to plan to have an SLA that permits a reasonable amount of downtime each year. If your plan includes up to 4 hours of downtime each year, those first 239 minutes are not an outage. If someone complains that 4 hours isn't acceptable to them, there was a failure to communicate that everyone should only adopt plans that are ok with 4 hours of downtime each year; or the communication worked but the dependent plan failed to incorporate the agreed-upon SLA. If someone feels they didn't agree to that SLA, there was a failure to plan how to get buy-in from all stakeholders.
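As a back-of-the-envelope illustration of the "first 239 minutes are not an outage" idea, a downtime budget can be tracked against the SLA like this (a sketch; the budget figure and incident durations are made up, not from any real system):

```python
# Sketch: tracking an annual downtime budget against an SLA.
# The incident durations below are hypothetical examples.

SLA_BUDGET_MINUTES = 4 * 60  # the plan permits up to 4 hours of downtime per year

incidents_minutes = [45, 120, 74]  # durations of this year's incidents

used = sum(incidents_minutes)
remaining = SLA_BUDGET_MINUTES - used

# As long as `remaining` is non-negative, these minutes fall within the
# plan and are not "outages" in the SLA sense.
print(f"used {used} of {SLA_BUDGET_MINUTES} budgeted minutes; {remaining} remaining")
```

With these example incidents, 239 of the 240 budgeted minutes are used: the service has been down, but the plan has not been violated.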

If designs that meet the SLA are too expensive, then there was a failure to plan the budget. If the product cannot be made profitable at the expense required to meet the SLA, there was a failure to plan a workable business case.

If the problem is that someone didn't follow the plan, then the plan failed to include enough training, communication, or enforcement.

If there wasn't enough time to plan all of the above, there was a failure to start planning early enough to incorporate a sufficient level of planning.

The next time there is an outage, whether you are on the receiving end of the outage or not, think about what was the failure to plan at the root of this problem. I assure you the cause will have been a failure to plan.

Posted by Tom Limoncelli in Management

12 Comments

How does this work? It would seem to require a 5-inch binder to cover every contingency listed above, for everything.

I say this because at one site I worked at, there was a 5-inch binder covering everything one had to do to change the paper in the printer. It covered emergencies, outages, etc. There was a slightly smaller one on making coffee in the approved coffee makers. Most of it was copy-and-paste from other documents (if during a fire, power off the device and follow fire-evacuation plan A, B, or C depending on the circumstances). Bringing in new or updated hardware required long planning meetings to make sure that all the contingencies were covered... and then, when something wasn't covered, there were tiger teams to go over who forgot to plan X or who didn't follow plan Y subsection E.

For the short time I was there, the initial planning would be something simple, but it would rapidly become a "well, did you think about that?" rabbit hole of extra documents. Those documents then needed to go through various approval chains, and more questions and material would be added on. A year after I left, the data information project was ready to be implemented but had to go back to the drawing board because gopher had to be replaced by httpd.

So how does one plan for dealing with humans and their need to make sure that a plan covers every iota of every issue without fail?

Sorry, I'm not buying it.

Many (probably even most) outages are a failure to plan.

But some outages stem from a valid decision not to spend the resources needed to avoid the outage. For example, one of my clients went dark for two hours when their one and only ISP connection went out. Certainly, a plan could be put in place to have redundant internet connections, but the cost in dollars and effort isn't worth it to this client. So the outage happened. That's a willingness to accept risk, not a failure to plan.

And those seem to me to be some pretty impressive hoops you've jumped through to go from "human error" to "failed to plan." Certainly, plans can be created to mitigate and eliminate as much human error as possible. But we're fallible. Putting the smartest, best employee you have through hundreds of hours of training won't stop him/her from making errors.

Jeff, you wrote:
"But some outages stem from a valid decision to not spend the resources needed to avoid the outage."
If the decision was made not to spend the resources, then it is part of the SLA and not an outage. If someone was surprised that the SLA accepts such outages, then it was a failure to communicate the plan.

See?

I take your point, Tom, and maybe I'm just splitting hairs here, but to me SLA means someone decided "we can stand X min of downtime per year," which, I completely agree, is a plan.

Saying "don't spend the money/time cuz it ain't worth it" just strikes me as too loose to call a "plan."

(As an aside, apologies for the wacky comment 'name.' Seems like MT doesn't really like me logging with my Google ID all that much!)

-Jeff

I guess what I said originally goes along with -Jeff's comment. What constitutes a proper plan? I can state "we have accepted these failures as too expensive to prevent" and we can communicate that... but knowing human nature, many people will come back and state that they didn't realize that was the SLA, which then leads to the circle of planning where you spend days working out how, if the local nuclear plant has a meltdown, you are going to do tape backups in the irradiated zone. You end up with plans that you can point to that cover your ass but that would never really be implemented.

If a dead component (for example, a hard drive) failed, *AND THAT DISRUPTS YOUR ACTIVITIES* then there was a lack of planning for failed components.

"prior forethought" - what are the other kinds of forethought?

Yeah... no.

An outage is when a customer says they can't access a service. Whether it's your failed components or their forgetting to turn on a laptop... it's the perception of the user or organization that's paying for a product or service that matters.

Half of the above just sounds like making excuses to cover your backside when the finger pointing begins.

Since the list of possible failures can be infinitely large if you include human error, you'd need an infinite amount of time and resources to plan for every contingency. That means exhaustive planning is just procrastination.

It doesn't matter if your SLA says 4 hours is okay if you knock things out at the most inconvenient time. On the other hand, if you knock things out for a week while the user is on a two-week vacation, they don't care either (usually). At crunch time the SLA means squat. A pissed-off customer will cancel a service no matter what it says.

This is what risk analysis is for, and unfortunately a lot of sysadmins underestimate probability and impact, because everyone does perfect work.

In other words, if you're selling a cheap VoIP service for 1/2 what your incumbent sells residential telephone service, and you can expect - at best - 99.9% uptime, and everyone knows this (except the customer, who will complain mightily at any downtime at all)...

Then everything is going exactly to plan when the poo hits the fan and you're down for 2 days every year. "That's not an outage", as you say.

Hmm. After doing the math, 48 hours of downtime a year works out to only about 99.45% uptime, which is worse than 99.9%.

Of course, where the customer is concerned, a half hour of downtime a year (still better than 99.99% uptime) during peak business is devastating. Or at least, so they argue.
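The uptime percentages being traded back and forth in this thread are easy to check with a few lines of arithmetic (a sketch; the inputs match the examples above):

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def uptime_percent(downtime_hours_per_year):
    """Uptime percentage implied by a given amount of annual downtime."""
    return 100.0 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)

print(round(uptime_percent(48), 2))    # two days down per year -> 99.45
print(round(uptime_percent(0.5), 3))   # half an hour down per year -> 99.994
print(round(uptime_percent(8.76), 1))  # the downtime a 99.9% SLA permits -> 99.9
```

So two days of downtime blows well past a 99.9% SLA, while half an hour per year is comfortably inside even 99.99%.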

I think the user-error failures should be expanded to include "failure to review processes for drift" and "failure to encourage an open culture where identified gaps can be discussed openly".

All outages *are* due to a failure to plan. We could and should be planning more effectively than we do. But, trying to "plan better" in a completely direct way is only part of the answer.

To *never* fail at planning, planners must have infallible insight and infinite resources. They must be able to correctly calculate the probability of different events (not just disks or RAM failing or the other typical stories of our profession, but wars, natural disasters, pandemics, extinction events, ...), even things which have never occurred in written history and have not left sufficient geological traces for us to analyze, or things due to technologies we haven't invented yet but will six months from now. They must further determine how all relevant systems will respond to these events and determine responses, regardless of what those events are, and whether they occur separately or in conjunction with each other. Finally, they must generate an SLA that distills this prophetic information.

That obviously isn't realistic. In the real world, there are finite resources, fallible humans, and far less than complete ability to predict the future.

It's important to understand that the limits are actually part of the structure of the system. They need to be taken into consideration, even as they change over time. Things like recognizing the impact of cognitive biases on planning and incorporating information from research in areas like resilience engineering help. But, there is still a *LONG* way to go...

I'm going to lean in on this, Tom.

Failure to plan (or anticipate) is but one (not the only) contributing cause of outages, for some definition of outage.

SLAs aren't protections or plans. They are hedges and comforting concepts, not plans. Outages aren't prevented by having an SLA.

Also: human error isn't a cause, it's a symptom, and you should question every postmortem document you come across where a (hopefully they're capturing more than one) root cause was "human error". This is a 1970s approach.

While the statement you make has dramatic effect due to the generalization ("ALL outages...") it appears that you're attempting to rationalize it through logical hoops.

Availability, business impact, planning, error, SLA, causality...these are words that need context and while your attempt is noble (simplifying an approach to be better in our field) the reasoning is faulty.

In the end: we need to be *ok* with the fact that oversimplifying causality doesn't work for complex systems. If we continue to attempt it, we aren't going to make progress. Reductive Bias is a powerful thing, and it has you by the cuff of your shirt on this one.

Anticipation is absolutely a legitimate effort, skill, and organizational behavior. But so are Monitoring (in the organizational sense), Response, and Learning. A learning phase doesn't simply focus on Anticipation in order to prevent untoward surprises in the future; it looks at the other areas as well.

If you posit that causality points at a failure to plan, then you miss out on what is required to cope with complex systems failures: the adaptive capacities of humans making decisions in their work. During outage response, a large amount of that decision-making and action-taking is improvised, not planned.

I'd be up for talking about this more, because the insistence on singular causes is (IMO) the largest impediment for our field to progress.
