"Dear Tom, I'm a junior sysadmin and want to be more knowledgeable about the operating systems I administer. I get the feeling that a lot of my co-workers run on myth, superstitions, and folklore when it comes to their job and I want to be better. Sincerely, The Truth Is In There"
Dear Truth,
I applaud your quest to avoid superstition in your role as system administrator. Every time I fix a problem by rebooting (rather than finding the real cause and fixing it) I feel a little bit of me die inside. It hurts our industry and our profession when we develop bad habits like guessing instead of knowing.
There are three topic areas that are complicated, misunderstood, and therefore prone to folklore: the memory subsystem, the file subsystem, and processes. If I had to add a fourth it would be the security subsystem, but often understanding the first three is a prerequisite to fully understanding security.
Memory is complicated: virtual memory, swapping, and so on all interact. When you tune a system, understanding how these really work (vs. what we were taught in school) is the difference between success and failure. Understanding how modern memory systems work can result in a 9x performance improvement.
Knowing how the filesystem works is as important to a sysadmin as knowing anatomy is to a doctor. Knowing the filesystem begins with understanding how data is laid out on the disk (blocks and tracks), how files and directories are organized (what's stored in the directory structure, for example), and how the file system is buffered and how it interacts with the memory system. Ever since operating systems adopted the concept of "unified memory and file systems", good performance has come from a tight integration of the memory and file system. Also, the file system dictates the namespace of the OS, which affects everything else. Do you know what kind of access is slow in your operating system's namespace? You should.
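To make the "what's stored in the directory structure" question concrete, here is a minimal sketch, assuming a POSIX-style filesystem that supports hard links. It shows that a directory is essentially a table mapping names to inode numbers, so two names can point at the same file. The file names and the function name `directory_entries_demo` are made up for illustration.

```python
import os
import tempfile


def directory_entries_demo() -> tuple[dict[str, int], int]:
    """Create a file plus a hard link and return (name -> inode map, link count)."""
    with tempfile.TemporaryDirectory() as tmpdir:
        original = os.path.join(tmpdir, "original.txt")  # hypothetical name
        alias = os.path.join(tmpdir, "alias.txt")        # hypothetical name

        with open(original, "w") as f:
            f.write("hello\n")

        # A hard link adds a second directory entry for the same inode.
        # It does not copy the data.
        os.link(original, alias)

        # Each directory entry pairs a name with an inode number.
        entries = {entry.name: entry.inode() for entry in os.scandir(tmpdir)}

        # The inode itself records how many directory entries refer to it.
        link_count = os.stat(original).st_nlink
        return entries, link_count


if __name__ == "__main__":
    entries, link_count = directory_entries_demo()
    for name, inode in sorted(entries.items()):
        print(f"{name}: inode {inode}")
    print(f"link count: {link_count}")
```

On a filesystem that supports hard links, both names should report the same inode number. That is one reason deleting a file is really "unlinking" a name: the data goes away only when the last name (and the last open handle) is gone.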
A deep knowledge of how processes work is important because sysadmins are often required to debug problems that happen at the "edge cases" of processes: some weird scheduling mishap occurs because there isn't enough memory for all processes and the "wrong" process gets swapped out, developers come to you unsure why their new software release creates zombie processes, and so on.
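The zombie case is easy to reproduce once you know the mechanism: a child that has exited stays in the process table until its parent collects the exit status. Below is a minimal sketch, assuming Linux (it reads `/proc/<pid>/stat`, which other systems may not provide). The helper names `proc_state` and `observe_zombie` are mine, not a standard API.

```python
import os
import time


def proc_state(pid: int) -> str:
    """Return the one-letter state from /proc/<pid>/stat, or "" if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return ""
    # The format is "pid (comm) state ..."; comm may itself contain ")",
    # so split on the last one before taking the state field.
    return data.rsplit(")", 1)[1].split()[0]


def observe_zombie(timeout: float = 2.0) -> tuple[str, str]:
    """Fork a child that exits at once; return its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup handlers

    # Parent: deliberately do NOT wait yet. Poll until the kernel reports
    # the exited child as a zombie ("Z"), or until the timeout expires.
    deadline = time.monotonic() + timeout
    before = proc_state(pid)
    while before != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        before = proc_state(pid)

    os.waitpid(pid, 0)  # reap the child; this removes the process-table entry
    after = proc_state(pid)
    return before, after


if __name__ == "__main__":
    before, after = observe_zombie()
    print(f"before wait: {before!r}, after wait: {after!r}")
```

If a developer's release leaves zombies behind, the usual suspect is a parent that starts children and never calls `wait` (or its equivalent) on them. Killing the zombie does nothing, since it is already dead. The fix belongs in the parent.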
Here are my suggestions on the best books in this category:
- Windows: "Windows Internals", Russinovich, Solomon and Ionescu
- Linux: "Linux Kernel Internals", Beck, Bohme, Dziadzka, Kunitz et al
- FreeBSD: "The Design and Implementation of the FreeBSD Operating System", by McKusick and Neville-Neil
- The TCP/IP Protocol: "TCP/IP Illustrated", W. Richard Stevens
While you may not be a FreeBSD user, that book is excellent to read no matter what operating system you use. It is used as a textbook in many schools because it teaches the fundamental underpinnings of operating system design. If you use a POSIX system, consider reading it.
"TCP/IP Illustrated" is on the list because, while TCP/IP is not an operating system, it is my favorite book for learning how TCP/IP works: from ARP and ping, to telnet, to all those funny TCP sliding window issues. This book (and its two sequels) is an amazing tour of how the protocols you use every day work.
Hope that helps,
Tom
For Linux administrators, I can't recommend reading LWN's Weekly Edition enough. It's published every Thursday, and each edition becomes available for free a week after it's published. It's like getting a lecture from a group of experienced sysadmins and kernel developers every week.