I can't take credit for this idea; a co-worker recently introduced me to it.
All outages are, at their core, a failure to plan.
If a component (for example, a hard drive) died, then there was a lack of planning for failed components. Components fail. Hard disks, RAM chips, CPUs, motherboards, power supplies, even Ethernet cables fail. If a component fails and causes a visible outage, then there was a failure to plan for enough redundancy to survive the failure. There are technologies that, with forethought, can be included in a design to make any single component's failure a non-issue.
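To put a rough number on what redundancy buys, here is a minimal sketch. It assumes component failures are independent and that any one working copy is enough to keep the service up; real failures are often correlated (shared power, shared firmware bugs), so treat the result as an optimistic upper bound. The function name and the 99% figure are hypothetical, chosen only for illustration.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components.

    Assumption: failures are independent and one working copy suffices,
    so the group is down only when every copy is down at once.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical component that is up 99% of the time.
    for copies in (1, 2, 3):
        print(copies, f"{redundant_availability(0.99, copies):.6f}")
```

Under these assumptions each added copy multiplies the chance of a total failure by the single-component failure rate, which is why planning for even one spare changes the picture so much.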
If a user-visible outage is caused by human error, it is still a failure to plan: someone failed to plan for enough training, failed to plan the right staffing levels and competencies, failed to plan disaster exercises to verify training, or failed to plan to validate the construction and execution of training.
What about the kind of outages that "nobody could have expected"? Also a failure to plan.
What about the kind of outages that are completely unavoidable? That is a failure to plan to have an SLA that permits a reasonable amount of downtime each year. If your plan includes up to 4 hours of downtime each year, those first 239 minutes are not an outage. If someone complains that 4 hours isn't acceptable to them, there was a failure to communicate that everyone should only adopt plans that can tolerate 4 hours of downtime each year; or the communication worked but the dependent plan failed to incorporate the agreed-upon SLA. If someone feels they didn't agree to that SLA, there was a failure to plan how to get all stakeholders' buy-in.
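The downtime-budget arithmetic can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the year is taken as 365 days, and the sketch assumes the budget is breached only once total downtime exceeds it (whether hitting the budget exactly counts as a breach is a policy choice the SLA itself should state).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def availability_from_budget(budget_minutes: float) -> float:
    """Fraction of the year the service must be up for a given downtime budget."""
    return 1.0 - budget_minutes / MINUTES_PER_YEAR


def remaining_budget(budget_minutes: float, incidents: list[float]) -> float:
    """Minutes of downtime still allowed this year (negative once exceeded)."""
    return budget_minutes - sum(incidents)


def is_sla_breach(budget_minutes: float, incidents: list[float]) -> bool:
    """Assumption: only downtime beyond the budget counts as a breach."""
    return remaining_budget(budget_minutes, incidents) < 0


if __name__ == "__main__":
    budget = 4 * 60  # the 4 hours per year from the text, in minutes
    print(f"required availability: {availability_from_budget(budget):.5%}")

    incidents = [30, 45]  # hypothetical incident durations, in minutes
    print("minutes left in budget:", remaining_budget(budget, incidents))
    print("breach:", is_sla_breach(budget, incidents))
```

Four hours a year works out to an availability target of roughly 99.95%, which is a more useful number to put in front of stakeholders than "we try not to go down."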
If designs that meet the SLA are too expensive, then there was a failure to plan the budget. If the product cannot be made profitable at the expense required to meet the SLA, there was a failure to plan a workable business case.
If the problem is that someone didn't follow the plan, then the plan failed to include enough training, communication, or enforcement.
If there wasn't enough time to plan all of the above, there was a failure to start planning early enough to incorporate a sufficient level of planning.
The next time there is an outage, whether you are on the receiving end of it or not, think about which failure to plan was at the root of the problem. I assure you the cause will have been a failure to plan.