Here's a good strategy to improve the reliability of your systems: Buy the most expensive computers, storage, and network equipment you can find. It is the really high-end stuff that has the best "uptime" and "MTBF".
Wait... why are you laughing? There are a lot of high-end, fault-tolerant, "never fails" systems out there. Those companies must be in business for a reason!
Ok.... if you don't believe that, let me try again.
Here's a good strategy to improve the reliability of your systems: Any time you have an outage, find who caused it and fire that person. Eventually you'll have a company that only employs perfect people.
Wait... you are laughing again! What am I missing here?
Ok, obviously those two strategies won't work. Yet system administration is full of examples of both. At the start of "the web" we achieved high uptimes by buying Sun E10000 computers costing megabucks because "that's just how you do it" to get high performance and high uptimes. That strategy lasted until the mid-2000s. The "fire anyone that isn't perfect" strategy sounds like something out of an "old school" MBA textbook. There are plenty of companies that seem to follow that rule.
We find those strategies laughable because the problem is not the hardware or the people. Hardware, no matter how much or how little you pay for it, will fail. People, no matter how smart or careful, will always make some mistakes. Not all mistakes can be foreseen. Not all edge cases are cost effective to prevent!
Good companies have outages and learn from them. They write down those "lessons learned" in a post-mortem document that is passed around so that everyone can learn. (I've written about how to do a decent postmortem before.)
If we are going to "learn something" from each outage and we want to learn a lot, we must have more outages.
However (and this is important), you want those outages to be under your control.
If you knew there was going to be an outage in the future, would you want it at 3am Sunday morning or 10am on a Tuesday?
You might say that 3am on Sunday is better because users won't see it. I disagree. I'd rather have it at 10am on Tuesday so I can be there to observe it, fix it, and learn from it.
In school we did this all the time. It is called a "fire drill". We usually did a pretty bad job on the first fire drill of the school year. However, the second one was much better. The hope is that if there is ever a real fire, it will happen after we've gotten good at it.
Wouldn't you rather just never have fires? Sure, and when that is possible let me know. Until then, I like fire drills.
Wouldn't you rather have computer systems that never fail? Sure, and when that's possible let me know. Until then I like sysadmin fire drills.
Different companies call them different things. Jesse Robbins at Twitter calls them "GameDay" exercises. John Allspaw at Etsy refers to them as "resilience testing" in his new article in ACM Queue. Google calls them something else.
The longer you go without an outage, the more rusty you get. You actually improve your uptime by creating outages periodically so that you don't get rusty. It is better to have a controlled outage than to wait for the next outage to find you out of practice.
Fire drills don't have to be visible to the users. In fact, they shouldn't be. You should be able to fail over a database to the hot spare without user-visible effects.
Systems that are fault tolerant should be periodically tested. Just like you test your backups by doing an occasional full restore (don't you?), you should periodically fail over that database server, web server, RAID system, and so on. Do it in a controlled way: plan it, announce it, make contingency plans, and so on.

Afterwards, write up a timeline of what happened, what mistakes were made, and what can be done to improve things next time. For each improvement, file a bug. Assign someone to hound people until the list of bugs is all closed. Or, if a bug is "too expensive to fix", have management sign off on that decision. I believe that being unwilling to pay to fix a problem ("allocate resources" in business terms) is the same as saying "I'm willing to take the risk that it won't happen." So make sure they understand what they are agreeing to.
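To make that concrete, here's a minimal sketch of what a scripted failover drill might look like, assuming a primary/replica database pair. The hostnames and the ping-based health check are made-up placeholders for whatever your environment actually uses; the point is that the script records a timestamped timeline as it goes, ready to paste into the write-up.

```python
#!/usr/bin/env python3
"""Sketch of a scripted failover drill. The hosts and checks below are
hypothetical placeholders; substitute the commands your site actually uses."""

import datetime
import subprocess

timeline = []  # (timestamp, event) pairs for the post-drill write-up


def log(event):
    """Record each step with a timestamp so the timeline writes itself."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    timeline.append((stamp, event))
    print(f"{stamp}  {event}")


def host_is_up(host):
    """Placeholder health check: a single ping. A real drill would check
    replication lag, application error rates, and so on."""
    return subprocess.call(["ping", "-c", "1", host],
                           stdout=subprocess.DEVNULL) == 0


def run_drill(primary, replica):
    log(f"Drill start: planned failover from {primary} to {replica}")
    if not host_is_up(replica):
        log("ABORT: replica is not healthy; fix that first and file a bug")
        return
    log(f"Simulating failure of {primary} (e.g. stop the DB service)")
    # ... operator or automation promotes the replica here ...
    log(f"Promoted {replica}; verifying the application still serves traffic")
    log("Drill end: paste the timeline below into the write-up, "
        "and file one bug per surprise")
    for stamp, event in timeline:
        print(stamp, event)


if __name__ == "__main__":
    run_drill("db-primary.example.com", "db-replica.example.com")
```

Even a toy script like this enforces the discipline: the drill aborts if the spare isn't healthy (that's a bug to file right there), and the timeline comes out of the tool rather than out of someone's memory a week later.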
Most importantly: have the right attitude. Nobody should be afraid to be mentioned in the "lessons learned" document. Instead, people should be rewarded, publicly, for finding problems and taking responsibility for fixing them. Give a reward, even a small one, to the person who fixes the most bugs filed after a fire drill. Even if the award is a dorky certificate to hang on their wall, a batch of cookies, or getting to pick which restaurant we go to for the next team dinner, it will mean a lot. Receiving this award should be something that can be listed on the person's next performance review.
The best kind of fire drill tests cross-team communication. If you can involve 2-3 teams in planning the drill, you have the potential to learn a lot more. Does everyone involved know how to contact each other? Is the conference bridge big enough for everyone? If the managers of all three teams have to pretend to be unavailable during the outage, are the three teams still able to complete the drill?
My last bit of advice is that fire drills need management approval. The entire management chain needs to be aware of what is happening and understand the business purpose of doing all this.
John's article has a lot of great advice about explaining this to management, what push-back you might expect, and so on. His article, Fault Injection in Production, is so well written even your boss will understand it. (ha ha, a little boss humor there)
[By the way... ACM Queue is really getting 'hip' lately by covering these kinds of "DevOps" topics. I highly recommend visiting queue.acm.org periodically.]
You could split this into two types of drill. One is planned in advance: everyone knows it is coming and you run through it as a process. This tests your plan and your systems.
The other type is along the lines of how Netflix uses its Chaos Monkey: things die at random times (during defined periods, i.e. working hours) so you can be sure that you have correctly engineered around failure. This tests your resilience and, if that fails, your ability to handle unexpected failure.
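For the curious, here's a minimal sketch of that second kind of drill. The target list, the SSH reboot, and the drill window are all made-up stand-ins (Netflix's real Chaos Monkey terminates cloud instances rather than working from a hard-coded host list), but the shape is the same: pick a victim at random, only during hours when people are around to watch.

```python
#!/usr/bin/env python3
"""Sketch of a Chaos-Monkey-style random failure injector, restricted to
working hours. TARGETS and kill_target() are hypothetical stand-ins."""

import datetime
import random
import subprocess

TARGETS = ["web-01", "web-02", "web-03"]   # hypothetical redundant hosts
WORK_HOURS = range(10, 16)                 # only inject failures 10:00-15:59
DRY_RUN = True                             # flip to False once you trust it


def kill_target(host):
    """Placeholder 'failure': reboot the host over SSH (or just pretend to)."""
    if DRY_RUN:
        print(f"[dry run] would have rebooted {host}")
    else:
        subprocess.call(["ssh", host, "sudo", "reboot"])


def maybe_inject_failure():
    now = datetime.datetime.now()
    if now.weekday() >= 5 or now.hour not in WORK_HOURS:
        print("Outside the defined drill window; doing nothing.")
        return
    victim = random.choice(TARGETS)
    print(f"{now.isoformat(timespec='seconds')}: injecting failure on {victim}")
    kill_target(victim)


if __name__ == "__main__":
    maybe_inject_failure()
```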