I have two fears when I reboot a server.[1]
Every time I reboot a machine I fear it won't come back up.
The first fear is that some change made since the last reboot will prevent the machine from booting. If that last reboot was 4 months ago, the culprit could be any change made in the last 4 months. You spend all day debugging the problem. Is it a startup script with a typo? Is it an incompatible DLL? Sigh. This sucks.
The second fear comes when I've made a change to a machine (say, added a new application service) and then rebooted it to make sure the service starts after the reboot. Even if the install instructions don't require a reboot, I do it anyway because I want to make sure the service will restart properly on boot. I want to discover any problem now rather than after an unexpected crash or a 4am power outage. If there is going to be a problem, I want it to happen in a controlled way, on my schedule.
The problem, of course, is that if you've made a change to a machine and the reboot fails, you can't tell which category you're in. Early in my career I had bad experiences debugging why a machine wouldn't boot, only to find that the cause wasn't my recent change but some change made months earlier. That makes the detective work a lot more difficult.
Here are some thoughts on how I've eliminated or reduced these fears.
1. Reboot, change, reboot.
If I need to make a big change to a server, first I reboot it before making any changes, just to make sure the machine can reboot on its own. This eliminates the first fear: if the machine can't boot after the change, I know the cause is my most recent change and nothing else.
Of course, if I do this consistently then there is no reason to do that first reboot, right? Well, I'm the kind of person who looks both ways even when crossing a one-way street. You never know if someone else made a change since the last reboot and didn't follow this procedure. More likely, there may have been a hardware issue or some other "externality".
Therefore: reboot, change, reboot.
Oh, and if that last reboot uncovers a problem (the service didn't start on boot, for example) and fixing it requires more changes, then you have to do another reboot. In other words, always finish with a reboot that tests the last change.
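To take some of the tedium out of the "did it come back?" check, I like a small script that polls the machine after each reboot. Here is a minimal sketch in Python; the host name and the list of ports to check are placeholders you'd replace with your own:

    #!/usr/bin/env python3
    # Sketch: after a reboot, poll until the machine and its services answer
    # on their TCP ports. Host name and port list are hypothetical placeholders.
    import socket
    import time

    def wait_for_port(host, port, timeout_s=600, interval_s=10):
        """Poll until host:port accepts TCP connections, or give up."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            try:
                with socket.create_connection((host, port), timeout=5):
                    return True
            except OSError:
                time.sleep(interval_s)
        return False

    if __name__ == "__main__":
        host = "dns1.example.com"          # hypothetical host
        checks = {"ssh": 22, "dns": 53}    # hypothetical service/port list
        for name, port in checks.items():
            up = wait_for_port(host, port)
            print(f"{name}: {'up' if up else 'STILL DOWN after timeout'}")

Run it after the pre-change reboot and again after the post-change reboot; if the second run fails, you know the most recent change is to blame.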
2. Reduce the number of services on a machine.
The reboot problem is bigger on machines that serve multiple purposes. For example, that one machine that is the company DNS server, DHCP server, and file server, plus someone put the payroll app on it, and so on and so on, until it has 15 purposes. I have less fear of rebooting a single-purpose server because there is less chance that a change to one service will break another.
The problem is that machines are expensive, so dedicating one machine to each service is very costly. It also leaves machines idle most of the time, since most applications sit idle most of the time.
The solution here is to run many virtual machines (VMs) on a single physical box. While there is more overhead than, say, running the same services natively on the same box, the manageability is better. Isolating each application gives you better confidence when patching both the application and the underlying OS.
(As to which VM solution you should use, I'm biased since I work on The Ganeti Project, an open source virtualization management system that doesn't require big, expensive SANs or other special hardware. And since I'm plugging that, I'll also plug the fact that I'll be giving talks about Ganeti at the upcoming Maryland "Crabby Sysadmins" meeting, the Cascadia IT conference in Seattle, and the PICC conference in New Jersey.)
3. Better testing.
Upgrades should never be a surprise. If you have to roll out (for example) a new DNS server patch, you should have a DNS server in a test lab where you test the upgrade first. If it succeeds, you roll out the change to each production DNS server one at a time, testing as you go.
Readers from smaller sites are probably laughing right now. A test lab? Who has that? I can't get my boss to pay for the servers I need, let alone a test lab. That is, sadly, one of the ways that system administration just plain makes more sense when done at scale. At scale a test lab isn't merely worth having; the risk of a failure that affects hundreds or thousands (or millions) of users is too great not to have one.
The best practice is to have a repeatable way to build a machine that provides a certain service. That way you can repeatably build the server with the old version of software, practice the upgrade procedure, and repeat if required. With VMs you might clone a running server and practice doing the upgrades on the clone.
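If the build, upgrade, and test steps live in scripts, the rehearsal itself can be a trivial wrapper around them. Here is a rough sketch; the three script names are made up for illustration and would be whatever your site already uses:

    #!/usr/bin/env python3
    # Sketch of an upgrade rehearsal: build a test box at the old version,
    # run the upgrade, then smoke-test it. All script names are hypothetical.
    import subprocess
    import sys

    STEPS = [
        ["./build-test-vm.sh", "--service", "dns", "--version", "old"],
        ["./run-upgrade.sh", "--service", "dns", "--version", "new"],
        ["./smoke-test.sh", "--service", "dns"],
    ]

    for step in STEPS:
        print("running:", " ".join(step))
        if subprocess.run(step).returncode != 0:
            sys.exit("rehearsal failed at: " + " ".join(step))
    print("rehearsal passed; run the same upgrade step against production")

Repeat until the rehearsal passes every time, then run the identical upgrade step against the real machines.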
4. Better automation.
Of course, as with everything in system administration, automation is our ultimate goal. If your process for building a server for a particular service is 100% automated, then you can build a test machine reliably 100% of the time. You can practice the upgrade process many times until you absolutely know it will work. The upgrade process should be automated so that, once perfected, the exact procedure will be done on the real machines.
This is called "configuration management", or CM. Common CM systems include CfEngine, Chef, and Puppet. These systems let you rapidly automate upgrades, deployments, and more. Most importantly, they generally work by having you specify the end result (what you want) while they figure out how to get there (update this file, install that package, etc.).
In a well-administered system with good configuration management, an upgrade of a service is a matter of specifying that the test machines (currently at version X) should be running version X+1. Wait for the automation to complete its work. Test the machines. Then specify that the production machines should be at version X+1 and let it do the work.
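To make the "declare the end state" idea concrete, here is a toy sketch. It is not the real CfEngine/Chef/Puppet API; the package name and versions are stand-ins, and the "installed" state is an in-memory dict rather than a real package manager:

    #!/usr/bin/env python3
    # Toy illustration of declarative configuration management: declare the
    # desired end state and let the tool figure out the steps to get there.
    # Package name/versions are hypothetical; "installed" is an in-memory
    # stand-in for querying the real package manager.

    desired = {"bind9": "9.7.3"}     # what the machines *should* be running
    installed = {"bind9": "9.7.2"}   # what this machine currently has

    def converge(installed, desired):
        """Apply only the actions needed to reach the desired state."""
        for pkg, want in desired.items():
            have = installed.get(pkg)
            if have == want:
                print(f"{pkg}: already at {want}, nothing to do")
            else:
                print(f"{pkg}: {have} -> {want}, upgrading")
                installed[pkg] = want   # a real tool would drive the package manager

    converge(installed, desired)
    converge(installed, desired)         # second run is a no-op: nothing left to change

The important property is idempotence: running it again with the same declaration does nothing, which is exactly what lets you apply the same specification to the test machines first and the production machines later.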
Again, small sites often think that configuration management is something only big sites do. The truth is that every site, big or small, can use these configuration management tools; it's also true that every site, big or small, has an endless supply of excuses to keep doing things manually. That's why the biggest adopters of these techniques are web service farms: they usually start from a "green field" and have no legacy baggage to contend with.
Which brings me to my final point. I'm sick of hearing people say "we can't use [CfEngine/Chef/Puppet] because we have too many legacy systems." You don't have to manage every byte on every machine at the beginning. In fact, that wouldn't be prudent. You want to start small.
Even if you have the most legacy-encrusted old systems, a good start is to have your CM system keep /etc/motd updated on your Unix/Linux systems. That's it. It has a business justification: there may be some standard message that should be there. Anyone claiming they are afraid you will interfere with the services on the machine can't possibly mean that modifying /etc/motd will harm their service. It reduces the objection to "we can't spare the RAM and CPU the CM system will require", which is a much more manageable problem.
Once you are past that, you can use the system to enforce security policies: make sure /etc isn't world-writable, disable telnetd, and so on. These are significant improvements in the legacy world.
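Here is a minimal sketch of what that first motd-and-permissions step looks like, written as a plain standalone script rather than in any particular CM tool's language; the message text is just a placeholder:

    #!/usr/bin/env python3
    # Sketch: keep /etc/motd at the desired content (idempotently) and report
    # one security check. The message text is a placeholder, not policy.
    import os
    import stat

    MOTD_PATH = "/etc/motd"
    DESIRED_MOTD = "Authorized use only. This host is under configuration management.\n"

    def ensure_motd():
        """Rewrite /etc/motd only if it differs from the desired content."""
        try:
            current = open(MOTD_PATH).read()
        except FileNotFoundError:
            current = None
        if current != DESIRED_MOTD:
            with open(MOTD_PATH, "w") as f:
                f.write(DESIRED_MOTD)
            print("motd: updated")
        else:
            print("motd: already correct")

    def check_etc_not_world_writable():
        """Report (don't silently fix) a world-writable /etc."""
        if os.stat("/etc").st_mode & stat.S_IWOTH:
            print("/etc is world-writable: needs fixing")
        else:
            print("/etc permissions look sane")

    if __name__ == "__main__":
        ensure_motd()
        check_etc_not_world_writable()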
Of course, now that you have the infrastructure in place, all your new machines can be built without this legacy baggage. That new web farm can be built by coding up CM modules that create your 3 kinds of machines: static web servers, database servers, and CGI/application servers. Using your CM system you build these machines. You now have all the repeatability and automation you need to scale (and as a bonus, /etc/motd contains the right message).
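A sketch of what those "3 kinds of machines" might look like as data; the role and module names are invented, and a real CM tool would have its own syntax for this:

    #!/usr/bin/env python3
    # Sketch: each machine type is defined once, as a list of CM modules.
    # Role and module names are hypothetical; a real CM tool has its own syntax.

    ROLES = {
        "static-web": ["base", "motd", "httpd"],
        "database":   ["base", "motd", "postgresql"],
        "app-server": ["base", "motd", "app-runtime"],
    }

    def modules_for(hostname):
        """Pick the role from a hostname prefix, e.g. "web-03" -> "static-web"."""
        prefix = hostname.split("-")[0]
        role = {"web": "static-web", "db": "database", "app": "app-server"}[prefix]
        return ROLES[role]

    print(modules_for("web-03"))   # ['base', 'motd', 'httpd']
    print(modules_for("db-01"))    # ['base', 'motd', 'postgresql']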
This is a "bottom up" approach: changing small things and working up to bigger things.
You can also go the other direction: use CM to create your services "the right way", make that success visible, and use it to gain trust as you slowly, one step at a time, expand the CM system to include legacy machines.
Writing about my "fear of reboot" brought back a lot of old memories. They are, let me be clear, just memories. I haven't felt the fear of reboot in years because I've been using the techniques above. None of this is rocket science. It isn't even trailblazing anymore. These are proven techniques, 12+ years old. The biggest problem in our industry is convincing people to enter the 21st century; they don't want to be reminded that they're a decade late.
Tom
[1] Especially servers that have multiple purposes. By the way, for the purposes of this article I'll use the term "service" to mean an application or group of cooperating processes that provide an application to users, "machine" to mean any computer (physical or virtual), and "server" to mean a machine that provides one or more services.