I can't take credit for this, as a co-worker recently introduced me to this point.
All outages are, at their core, a failure to plan.
If a component (for example, a hard drive) died, then there was a lack of planning for failed components. Components fail. Hard disks, RAM chips, CPUs, motherboards, power supplies, even Ethernet cables fail. If a component fails and causes a visible outage, then there was a failure to plan for enough redundancy to survive the failure. There are technologies that, with forethought, can be included in a design to make any single component's failure a non-issue.
If a user-visible outage is caused by human error, it is still a failure to plan: someone failed to plan for enough training, failed to plan the right staffing levels and competencies, failed to plan disaster exercises to verify training, or failed to plan to validate the construction and execution of training.
What about the kind of outages that "nobody could have expected"? Also a failure to plan.
What about the kind of outages that are completely unavoidable? That is a failure to plan to have an SLA that permits a reasonable amount of downtime each year. If your plan includes up to 4 hours of downtime each year, those first 239 minutes are not an outage. If someone complains that 4 hours isn't acceptable to them, there was a failure to communicate that everyone should only adopt plans that are OK with 4 hours of downtime each year; or the communication worked but the dependent plan failed to incorporate the agreed-upon SLA. If someone feels they didn't agree to that SLA, there was a failure to plan how to get all stakeholders' buy-in.
If designs that meet the SLA are too expensive, then there was a failure to plan the budget. If the product cannot be made profitable at the expense required to meet the SLA, there was a failure to plan a workable business case.
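To make the downtime arithmetic concrete, here is a minimal sketch of tracking a yearly downtime budget. The function names and the incident durations are hypothetical, chosen only for illustration, and the sketch assumes a 365-day year and that downtime counts against the SLA only once the cumulative total exceeds the agreed allowance.

```python
# Minimal sketch of a yearly downtime budget (names are illustrative).
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def availability_percent(allowed_downtime_minutes):
    """Availability implied by a yearly downtime allowance, as a percentage."""
    return 100.0 * (1 - allowed_downtime_minutes / MINUTES_PER_YEAR)


def remaining_budget(allowed_downtime_minutes, incident_minutes):
    """Minutes of allowed downtime left after the listed incidents."""
    return allowed_downtime_minutes - sum(incident_minutes)


def sla_breached(allowed_downtime_minutes, incident_minutes):
    """True only once cumulative downtime exceeds the allowance."""
    return remaining_budget(allowed_downtime_minutes, incident_minutes) < 0


allowed = 4 * 60  # the 4 hours per year from the text, in minutes
incidents = [90, 100]  # hypothetical incident durations in minutes
print(round(availability_percent(allowed), 3))
print(remaining_budget(allowed, incidents))
print(sla_breached(allowed, incidents))
```

Under these assumptions, 4 hours a year works out to roughly 99.95% availability, and two incidents totalling 190 minutes still leave 50 minutes of planned-for downtime.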
If the problem is that someone didn't follow the plan, then the plan failed to include enough training, communication, or enforcement.
If there wasn't enough time to plan all of the above, there was a failure to start planning early enough to incorporate a sufficient level of planning.
The next time there is an outage, whether you are on the receiving end of it or not, think about which failure to plan was at the root of the problem. I assure you the cause will have been a failure to plan.
How does this work? It would seem to require a 5-inch binder to cover every contingency listed above for everything.
I say this because at one site where I worked there was a 5-inch binder covering everything one had to do when changing the paper in the printer. It covered emergencies, outages, and so on. There was a slightly smaller one on making coffee in the approved coffee makers. Most of it was copied and pasted from other documents (in a fire, power off the device and follow fire-evacuation plan A, B, or C depending on the circumstances). Bringing in new hardware or updating existing hardware required long planning meetings to make sure that all the contingencies were covered, and when something wasn't covered there were tiger teams to go over who forgot to plan X or who didn't follow plan Y subsection E.
For the short time I was there, the initial planning would start as something simple but would rapidly become a "well, did you think about that?" hole of corner cases and extra documents. Those documents then needed to go through various approval chains, and more questions and material would be added along the way. A year after I left, the data information project was ready to be implemented but had to go back to the drawing board because gopher had to be replaced by httpd.
So how does one plan for dealing with humans and their need to make sure that a plan covers every iota of every issue without fail?