Awesome list; I especially like your piece on testing restores, which is someplace where a lot of IT shops fall short. I'd like to add one point, which I always bring up to people claiming that tape is dead and that disk-to-disk is the wave of the future.
One scenario for backups that people often forget is "disgruntled admin has trashed all data." Malice on the part of an IT administrator has gained a lot more attention as something that actually happens, largely because of Terry Childs locking out the city of San Francisco from managing their network back in 2008. It's generally the hardest to deal with, as in most organizations, the backup/storage administrators hold all the keys to the kingdom.
It's different from the case where the building burned down, because WAN replication to a remote site will usually save you in these cases. However, a rogue storage admin can usually trash the backup copy just as easily as they can trash the master, unless the software has some WORM capabilities that cannot be disabled.
Intruders are just as dangerous. WebHostingTalk experienced a situation in 2009 where their backups were allegedly trashed by an intruder before the intruder started deleting posts out of their live database:
For certain kinds of data, especially data under regulatory compliance requirements, it's extremely important to keep the data offline where nobody, not even the person primarily responsible, can get to it without setting off red flags.
This is a wonderful test/check list for all IT shops.
That said, I prefer a more holistic (top-down) approach like ITSM (IT Service Management). It is a huge methodology, but it can be adopted gradually, filling the gaps with a list like this and adapting it to the organization's size and idiosyncrasies.
I would like to see more advice on best practices for managing Virtualization/Cloud installations; there isn't much literature out there on Security, Backup, Configuration Management and Monitoring in this environment. Do you think the cloud is a generalization of a local rack of servers, or a different ball game?
I could certainly recommend ITSM and ITIL, but then the person reading my blog would be lost for 6 months or more, possibly never to return. I wanted a checklist people can read in one sitting.
Best practices of Virtualization/Cloud is a hot topic. The cloud practices overlap with a rack of local servers, but there is a lot more to it. I'm not an expert outside of my little virtualization echo chamber (http://code.google.com/p/ganeti). I'm sure people with big VMware clusters have different experience than Ganeti users.
The one-hour limit on root passwords seems a bit arbitrary and in some cases violates the canary principle. When there are tens of thousands of highly disparate systems, it is rarely a good idea to change anything all at once (in one hour). That may not be enough time to realize that the seemingly innocent root password change is actually triggering a previously unknown bug that is causing outages on a significant number of systems. It is important to manage authentication (including the password of last resort) and to be able to change it in a reliable manner. But if one says "Change a hundred passwords per minute until done", we wouldn't finish in an hour, or even a few hours. Too much focus on "magic numbers" like an hour can actually hurt, as it prevents fairly sane approaches from being considered because they're not quite fast enough.
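To make the arithmetic and the canary idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the figure of 30,000 hosts stands in for "tens of thousands", and the function names, batch sizes, and the `change`/`healthy` callbacks are hypothetical rather than taken from any real tool.

```python
import math


def rollout_minutes(total_hosts, hosts_per_minute):
    """Minutes needed to touch every host at a fixed rate."""
    return math.ceil(total_hosts / hosts_per_minute)


def canary_batches(hosts, first_batch=10, growth=2, max_batch=100):
    """Yield batches that start small and grow until they reach max_batch."""
    start = 0
    size = first_batch
    while start < len(hosts):
        yield hosts[start:start + size]
        start += size
        size = min(size * growth, max_batch)


def staged_change(hosts, change, healthy, **batch_opts):
    """Apply `change` batch by batch and stop after the first unhealthy batch.

    `change` and `healthy` are caller-supplied callables (hypothetical here):
    one performs the password change, the other checks the host afterwards.
    Returns (hosts_changed_so_far, finished_cleanly).
    """
    done = []
    for batch in canary_batches(hosts, **batch_opts):
        for host in batch:
            change(host)
        done.extend(batch)
        if not all(healthy(host) for host in batch):
            return done, False
    return done, True


# Assumed fleet size: 30,000 hosts at 100 changes per minute.
minutes = rollout_minutes(30_000, 100)
print(minutes, "minutes, i.e.", minutes / 60, "hours")
```

At the assumed numbers the fixed-rate change takes 300 minutes, which is five hours rather than one. The growing batch sizes are one possible way to limit the blast radius if the change trips an unknown bug; a real rollout would also want a pause between batches so that problems have time to surface.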
When it comes to virtualization, we pointed out that virtualization doesn't change the rules, because it isn't new. It's been around for decades, even in Unix. We just called it something else and didn't brand all the virtualization techniques under the newfangled "cloud" name. (Try telling a mainframer that virtualization is brand new technology and see what you get back.)
Mark: My take on the password change questions is that they are "can", not "do". I.e., would the answer to "We need to change every root password right now" be "oh man, this is going to take days" or "we can change them in an hour; does the risk we are mitigating outweigh the risk of making the change quickly?"
Tom: I think something you have missed, although you have hit a number of aspects of it, is "does the department have a service and system go-live checklist?" I.e., a set of criteria that a service or system is tested against before it is considered supportable and ready to go live?
This is an excellent list.
This is a great list. It's hard to add to what is already great, but more could be said about empowering.
E.g., do staff have the appropriate resources and authority to perform their jobs excellently and efficiently, including all necessary and reasonable HW, SW, training, talent, objectives, decisions, support, creativity, etc.?
It's not as objective as your questions but I'm sure you can communicate the topic better than I. It seems like a point so important should be worth additional consideration.
Is there a printable or downloadable PDF version of this post?
There isn't a PDF version, but I've put a plain version here that is more printable:
> 16. Do you use configuration management tools like cfengine/puppet/chef?
...and ten years later Lunix sysadmins discovered that Group Policy is good. All that remains is to write MS AD policies processing modules for Linux.
'If you hae 10,000 disks expect' s/hae/have/
I like the idea of the OpsDoc template.
Mediawiki makes this a bit difficult, though: 'templates' in Mediawiki don't really seem extensible enough to be used for this kind of purpose.
Do you know of a method for templating pages in mediawiki?
Great, great text, thank you so much. It summarizes the basic points of your books very well. (The books are great too, by the way!)
There are a few typos in this text, and TheLimoncelliTest.pdf differs from the-test.html (see, for example, the erroneous headline ("a [...] teams") in the PDF).
These errors are in both documents:
- s/that stopping/stopping/c (or s/that stopping/that is stopping/c)
- "has Having" -> maybe something like: "has unnecessary contents when it doesn't even need to exist. Having"
- s/what when wrong/what went wrong/c
- (maybe:) s/compile a/compile the/c
- s/role it/roll it/c
- as Chris pointed out above: s/hae/have/c
I additionally suggest these for clarity:
- s/stability then/stability, then/c
- s/optional others/optional, others/c
These are only wrong in the PDF:
- s/When What/What/c
This weblog is licensed under a Creative Commons License.