Recently in Career Advice Category

U Penn's Wharton School of Business has one of the best Operations Management classes in the world, and for the last few years it has been available as a MOOC. They've announced a new session starting on September 29, 2014.

I've been watching some of the video lectures and Christian Terwiesch is an excellent lecturer.

What is learned in the class applies to system administration, running a restaurant, or running a hospital. I think that system administrators, DevOps engineers, and managers can learn a lot from this class.

Sign up for An Introduction to Operations Management or just watch the videos. Either way, you'll learn a lot!

Highly recommended.

Posted by Tom Limoncelli in Career Advice

I was reminded of this excellent blog post by Leon Fayer of OmniTI.

As software developers, we often think our job is to develop software, but, really, that is just the means to an end, and the end is to empower business to reach their goals. Your code may be elegant, but if it doesn't meet the objectives (be they time or business) it doesn't f*ing work.

Likewise I've seen sysadmin projects that spent so much time in the planning stage that they never were going to ship unless someone stood up and said, "we've planned enough. I'm going to start coding whether you like it or not". Yes, that means that some aspect of the design wasn't perfect. Yet, the suggestion that more planning would lead to the elimination of all design imperfections is simply hubris. (If not hubris, it is a sign that one's OCD or OCD-like tendencies are being used as a cowardly excuse to not get started.)

But what I really want to write about today is...

"The big project that won't ship".

There once was a team that had a large software base. One part of it was obsolete and needed to be rewritten. It was written in an unsupported language. It didn't have half the features it needed. It didn't even have a GUI.

There were two proposals:

One was to refactor and recode bits of it until the system was replaced. Along the way every few weeks the results would see the light of day. There were many milestones: add a read-only "viewer" GUI, build a better data storage system, refactor the old code to use the new GUI, enhance the GUI to include full editing, etc.

The competing proposal was to assign 4 developers to build a replacement system. They'd be given 2 years to write the new system from scratch. During that time they'd be protected and, essentially, hidden. The justification for this was that the old system was so broken that doing any kind of Frankenstein half-old, half-new system would be flatly impossible or would be a drag on efficiency. It would be more efficient to code it "pure" and not constantly be dealing with the old system.

Management approved the competing proposal. 1.5 years later the project hadn't gotten anywhere. When people were needed for other projects, management looked around and decided to steal the 4 engineers. This is because it is good management to take resources away from low priority projects and put them on high priority projects. Any project, no matter how noble, with no results for 18 months, is lower priority than a project with a burning need. In fact, the definition of a low priority project is that you can wait 2 years for the results.

The project was cancelled and 1.5 years of work was thrown away. 4 engineers times 18 months... at least a million dollars down the tube.

Meanwhile the person that proposed the incremental project had gone forward in parallel with the first milestone: a simple enhancement to the existing system that solved the biggest complaint of the system. It talked to the old datastore and would have to be re-engineered when the new datastore was finally available, but it worked and solved a very serious problem. It was a "half measure" but served its purpose.

The person that created the "half measure" had been scolded for wasting time on a parallel project. Yet, the "big" project was cancelled and this "half measure" is still in use today. At least he had the gravitas to not say "I told you so".

The biggest "cost" to a company is opportunity cost. That is, the loss of $$$ from not taking action. By shipping early and often you grab opportunity.

Imagine a factory that made widgets for 24 months, stored them in a warehouse, and then started shipping them all at once. That would be crazy. A factory sells what it makes as soon as it is manufactured. Software companies used to write code for years and then ship it. That was crazy. Now you make a minimum viable product, ship that, and use the knowledge gained to make the next iteration.

My career advice is to only do projects that produce usable output every few weeks or months. Being on a project that will not show any results for a year or more is a good way to hide from management. Being invisible is a career killer. For software projects this means setting early milestones of some kind of minimal viable product. For purely operational projects be able to announce milestones or progress (number of machines converted, number of ms latency improvement, etc.)

At StackExchange there is a big project coming up related to how we provision new machines. While a "green field" approach would be nice, I'm looking into how we can refactor the current cruddy bits so that we can do this project incrementally. The biggest problem is that we have a crappy CMDB with no API. Everything seems to touch that one element and replacing it is going to be a pain. (I'd like to evaluate Flipkart/HostDb; if anyone has opinions, let me know.) However, I think we can restructure the project into 5 independent milestones. By "independent" I mean they can be done in any order with the other 4 requiring minimal refactoring as a result.

This will have a few benefits: We'll get the benefit of each milestone as it happens. Certain milestones can be done in parallel by different sub-teams. If the first few completed milestones make the process "good enough", we don't have to do the other milestones.

Posted by Tom Limoncelli in Career Advice

If you are a junior Linux/Unix sysadmin looking to advance your technical skills, here is a list of talks, workshops, and tutorials that you should attend at Usenix LISA 2013.

These are skill-building, technical presentations. I made exceptions for a few "soft topics" talks only if they are for junior sysadmins looking to advance their careers.

[I apologize in advance for any typos or errors. Please report any problems in the comments. The conference website has the latest information. Other lists of presentations: DevOps, Programming, Unix/Linux administration technical skills, Cloud Computing, and Women at Usenix LISA.]

This year's Usenix LISA conference has two exciting events about Women and Computing:

Sunday, Nov 3, 2013:

Thursday, Nov 7, 2013:

  • 11:00 a.m.-12:30 p.m.
  • Panel: Women in Advanced Computing
  • Moderator: Rikki Endsley, USENIX Association; Panelists: Amy Rich, Mozilla Corporation; Deanna McNeil, Learning Tree International; Amy Forinash
  • Format: Panel


Participation by women at this year's conference is impressive. Here is a list of talks (I may be missing some, I'm going by first name which is an imperfect algorithm.)

Recently on a mailing list sysadmins were describing horrible management they've experienced. Here is my reply:

First, I want to say that my heart goes out to all of you describing terrible working conditions, bad management, and so on. I have huge amounts of sympathy for you all.

Health is more important than anything else. If your job is driving you crazy and giving you high BP, my prescription is, 'Try, try, then quit'. Try to change things, talk to management, work to create the workplace you desire. Try again, I'm sure you feel like you've tried a lot, but people aren't mind-readers... make sure you've had serious conversations with the right people. However step three is quit. Send resumes and get the hell out of there.

It is vitally important that we don't feel any guilt about leaving a bad job, especially if we've made a "good faith effort" to turn things around (as I'm sure you have). Just like when people being laid off are told, heartlessly, "Sorry, it was a business decision" there are times you have to tell a company, "Sorry, it was a personal decision". (I want to acknowledge that not everyone is in a position where they can just up and leave. Being able to do so is quite a privilege, but I think people that work in IT are more likely to be in this position than people in most fields.)

There are two reasons we shouldn't feel guilt about leaving these kinds of "bad jobs". First, our health is more important than anything else. Second, it is important that we don't try to 'save' companies that are intrinsically bad at IT management. I say this not as a joke and I don't say it lightly. If you feel a company is incurably bad at IT, it makes the world a better place for that company to go out of business. IT is the lifeblood of companies. It is a requirement for nearly any facet of business to function in today's world. Companies that treat IT as an appendage are dinosaurs that need to be left to die.

IT is not a "speciality". It is a skill everyone should have. Any CEO, COO, or VP who doesn't understand IT and IT MANAGEMENT and ALSO thinks they don't need to understand it is fooling themselves. Expecting IT and IT management skills to be found only in the IT department is insane. Companies don't have a 'math department' that people run to any time they need to add, subtract, multiply, or divide. They expect everyone to have basic math skills and only turn to mathematicians for advanced or specialized mathematics. Similarly, a modern company must expect that every staff member understands the basics of IT, and that every manager, VP, and CxO understands IT and IT management, as it is a fundamental, essential part of doing business.

IT and IT management are as essential to a business as accounting is. You don't expect your CEO and other managers to be experts at accounting, but you expect them to understand a lot more than just the basics. However if, during a job interview, you learned that the CEO didn't know that accountants existed, or thought financial statements "magically wrote themselves" you would run like hell as fast as possible, right? You would reject any job offers and hope, for the sake of the well-being of the economy, that such a company disappears as soon as possible.

Why wouldn't you do the same for a company that treats IT and IT management like that?

Andy Lester, author of "Land the Tech Job You Love", has an excellent blog post up called Bad Tech Job Interview Questions (and How To Answer Them).

It is a good read whether or not you are interviewing. It has good advice if you are on either side of the interview table.

Posted by Tom Limoncelli in Career Advice

This came up in discussion recently. Here's how I differentiate between a junior and senior sysadmin:

  1. A senior person understands the internal workings of the systems he/she administers and debugs issues from a place of science, not guessing or rote memorization.

  2. A senior person has enough experience to know a problem's solution because he or she has seen and fixed it before (but is smart enough to check that assumption since superficial symptoms can be deceiving).

  3. A senior person automates their way out of problems rather than "working harder". They automate themselves out of a job constantly so they can be re-assigned to more interesting projects.

But most importantly...

A senior person demonstrates technical leadership by creating the processes that other people can follow, thereby enabling delegation and multiplying their effectiveness. Maybe the senior person is the only one technical enough to work out the procedure for replacing a bad disk on a server, but they document it in a way that less experienced people can do the task. Maybe the senior person is the only one technical enough to set up a massive monitoring system, but they document how to add new devices so that everyone can add to what is monitored. Therefore they multiply their effectiveness because they use their knowledge not to do work, but to make it possible that an army of people can do the work instead. Good documentation is the first step to automating a process, so by working out the process, they start the "guess work -> repeatable -> automated" life-cycle that repetitive tasks should follow.

The old way is to maintain your "status" by hoarding information. You are the only person that knows how to do things and that is your power base. The new way is you maintain your "status" by sharing information. Everyone looks up to you because it is your documentation that taught them how to do their job. As they learn by following documentation that you wrote they get better at what they do and soon they are senior too. However now you are the senior person that helped everyone get to where they are today. In terms of corporate status, there is nothing better than that!

Tom

Posted by Tom Limoncelli in Career Advice

The coursework would be very focused on understanding the internals of each layer of the stack. To make a comparison to the auto industry: Your training won't result in you being a mechanic that can follow the manufacturer's manual: you will be the person that can write the manual because that's how much you understand how the car works.

But the real change I'd like to see is how the labs are done.

  • When you enter the school they give you 12 virtual machines on their VMware cluster (or Ganeti cluster).

  • In phase one you go through a progression that ends with turning those machines into 2 load balancers, 3 web servers, a replicated database, a monitoring host, etc. (this is done as a series of labs that start with setting up one web server, then building up to this final configuration).

  • At that point the CS department turns on a traffic generator and your system now gets a steady stream of traffic. There is a leader-board showing who has the best uptime.

  • In phase two you set up a dev and QA clone of what you did in phase one, but you do it with Puppet or cfengine. Eventually those tools have to manage your live system too, and you have to make that change while the system is in use.

  • Once you have a dev->qa->live system, your uptime stats become 20% of your grade.

  • Another element I'd like to have is that there is a certain point in which everyone has to run other people's system using only the operational documentation that the creator left behind.

  • There might be another point in which the best student's cluster is cloned to create a web hosting system that provides real service to the community. Students would run it cooperatively, maintaining everything from the software to the operational docs.

However, by the time you get your degree you'd not only know the technical side of system administration but you'd also have the practical experience that would make you extremely valuable in the market.

Update: To be clear, there should be gobs and gobs of theory. I would want the above to be the labs that match the theory. For example, theory on OS matched with Linux kernel as an example; theory of autonomic computing with cfengine/puppet as an example, and so on and so on.

Posted by Tom Limoncelli in Career Advice

Someone recently asked me what language a sysadmin should learn.

If you are a sysadmin for Windows the answer is pretty easy: PowerShell.

The answer is more complicated for Unix/Linux sysadmins because there are more choices. Rather than start a "language war", let me say this:

I think every Unix/Linux sysadmin should know shell (sh or bash) plus one of Perl, Ruby, Python. It doesn't matter which.

The above statement is more important to me than whether I think Perl, Python or Ruby is better, or has more job openings, or whatever criteria you use. Let me explain:

It is really important to learn bash because it is so fundamental to so many parts of your job, whether it is debugging an /etc/init.d script or writing a little wrapper. Every Unix/Linux sysadmin should know: how to do a for loop, while loop, if with [[ or [, and $1, $2, $3... $* and $@, case statements, understand how variable substitution works, and how to process simple command-line flags. With those basic things you can go very far. I'm surprised at how many people I meet with a lot of Unix/Linux years under their belt that can't do a loop in bash; when they learn how they kick themselves for not learning earlier.

The choice of Perl/Python/Ruby is usually driven by what is already in use at your shop. Ruby and Python became popular more recently than Perl, so a lot of shops are Perl-focused. If you use Puppet, knowing Ruby will help you extend it. I work at Google which is big on Python so I learned it after coming here; it was a shock to the system after being a Perl person since 1991 (someone recently told me Perl didn't exist in 1991... I introduced him to a little something called Wikipedia).

From a career-management point of view, I think it is important to be really really really good at one of them and know a little of the others; even if that means just reading the first few chapters of a book on the topic. Being really really really good at one of them means that you have a deep understanding of how to use it and how it works "under the hood" so you can make better decisions when you design larger programs. The reason I suggest this as a career-management issue is that if you want to be hired by a shop that uses a different language, being "the expert that is willing to learn something else" is much more important than being the person that "doesn't know anything but has great potential" or "knows a little of this and that but never had the patience to learn one thing well".

Tom

P.S. Other thoughts on this topic: Joseph Kern has advice about the three languages every sysadmin should know and Phil Pennock has great advice and an interesting summary of the major scripting languages.

Posted by Tom Limoncelli in Career Advice

Structured Speaking

I've found that a structure that gives obvious "book-ends" around each topic makes it easier for the audience to follow.

Most of my talks lately have been either 4-5 small case studies or a Top 10 List. Each case study is a repetition of "who are the players, what happened, what did we learn". The repetition gives the audience a clear understanding of "we're moving to the next topic now" because they see the pattern. In a Top 10 list there is the obvious "book end" of announcing the next number.

I started doing this after seeing too many presentations where the presenter runs topic to topic smeared together with very little separation. Sometimes I get confused because I'm still on the last topic and they've moved on without letting the audience know.

Announcing the number of case studies ahead of time is also useful. You want the audience to be focused on what you are saying, not subconsciously trying to reverse-engineer the structure you are using.

This is true for writing a paper as well as giving a talk.

Posted by Tom Limoncelli in Career Advice

Fear of Rebooting

I have two fears when I reboot a server.[1]

Every time I reboot a machine I fear it won't come back up.

The first cause of this fear is that some change made since the last reboot will prevent it from being able to reboot. If that last reboot was 4 months ago it could have been any change made in the last 4 months. You spend all day debugging the problem. Is it some startup script that has a typo? Is it an incompatible DLL? Sigh. This sucks.

The second cause of this fear is when I've made a change to a machine (say, added a new application service) and then rebooted it to make sure the service starts after reboot. Even if this reboot isn't required by the install instructions I do it anyway because I want to make sure the service will restart properly on boot. I want to discover this now rather than after an unexpected crash or a 4am power outage. If there is going to be a problem I want it to happen in a controlled way, on my schedule.

The problem, of course, is that if you've made a change to a machine and the reboot fails you can't tell if it is the first or second category. Early in my career I had bad experiences trying to debug why a machine wouldn't boot up only to find that it wasn't caused by my recent change but by some change made months earlier. It makes the detective work a lot more difficult.

Here are some thoughts on how I've eliminated or reduced these fears.

 1. Reboot, change, reboot.

If I need to make a big change to a server, first I reboot before making any changes. This is just to make sure the machine can reboot on its own. This eliminates the first category of fears. If the machine can't boot after the change then I know the reason is my most recent change only.

Of course, if I do this consistently then there is no reason to do that first reboot, right? Well, I'm the kind of person that looks both ways even when crossing a one-way street. You never know if someone else made a change since the last reboot and didn't follow this procedure. More likely there may have been a hardware issue or other "externality".

Therefore: reboot, change, reboot.

Oh, and if that last reboot discovers a problem (the service didn't start on boot, for example) and requires more changes to make things right, then you have to do another reboot. In other words, a reboot to test the last change.
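Here is a minimal sketch of the "reboot, change, reboot" procedure in Python. It is only an illustration: `reboot` and `apply_change` are hypothetical callables that you would supply yourself (for example, something that reboots the host and returns True once it is back and healthy). Nothing here is a real tool's API.

```python
from typing import Callable


def reboot_change_reboot(reboot: Callable[[], bool],
                         apply_change: Callable[[], None]) -> str:
    """Run the three steps and say which one, if any, went wrong.

    `reboot` is assumed to return True if the machine came back healthy.
    """
    # First reboot: proves the machine could boot before we touched it.
    if not reboot():
        return "pre-existing problem: machine failed to boot before any change"

    apply_change()

    # Second reboot: a failure here can only be blamed on our change.
    if not reboot():
        return "our change broke the boot"

    return "ok"
```

The point of the structure is that each failure message maps to exactly one of the two fears, so the detective work starts in the right place.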

 2. Reduce the number of services on a machine.

The reboot problem is bigger on machines that serve multiple purposes. For example that one machine that is the company DNS server, DHCP server, file server, plus someone put the Payroll app on it, and so on and so on. Now it has 15 purposes. I have less fear of rebooting a single-purpose server because there is less chance that a change to one service will break another service.

The problem is that machines are expensive so having one machine for each service is very costly. It also leaves machines idle most of the time; most applications are going to be idle a lot of the time.

The solution here is to use many virtual machines (VMs) on a single physical box. While there is more overhead than, say, running the same services natively on the same box, the manageability is better. By isolating each application it gives you better confidence when patching both the application and the underlying OS.

(As to which VM solution you should use, I'm biased since I work on The Ganeti Project which is an open source virtualization management system that doesn't require big expensive SANs or other hardware. And since I'm plugging that I'll also plug the fact that I'll be giving talks about Ganeti at the upcoming Maryland "Crabby Sysadmins" meeting, Cascadia IT conference in Seattle, and PICC conference in New Jersey.)

 3. Better testing.

Upgrades should never be a surprise. If you have to roll out (for example) a new DNS server patch you should have a DNS server in a test lab where you test the upgrade first. If successful you roll out the change to each DNS server one at a time, testing as you go.

Readers from smaller sites are probably laughing right now. A test lab? Who has that? I can't get my boss to pay for the servers I need, let alone a test lab. Well, that is, sadly, one of the ways that system administration just plain makes more sense when it is done at scale. At scale it isn't just worth having a test lab, the risk of a failure that affects hundreds or thousands (or millions) of users is too great to not have one.

The best practice is to have a repeatable way to build a machine that provides a certain service. That way you can repeatably build the server with the old version of software, practice the upgrade procedure, and repeat if required. With VMs you might clone a running server and practice doing the upgrades on the clone.
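The "one server at a time, testing as you go" rollout can be sketched in a few lines of Python. This is an illustration under assumptions, not a real deployment tool: `upgrade` and `healthy` are hypothetical callables you would write for your own environment, and the server names in the usage note are made up.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(servers: Iterable[str],
                    upgrade: Callable[[str], None],
                    healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade servers one at a time, stopping at the first failed check.

    Returns the servers that were upgraded and passed their check.
    """
    done: List[str] = []
    for server in servers:
        upgrade(server)
        if not healthy(server):
            # Halt immediately so the remaining servers keep the old version.
            raise RuntimeError(
                f"{server} failed its post-upgrade check; "
                f"halting with {len(done)} server(s) upgraded"
            )
        done.append(server)
    return done
```

Called as `rolling_upgrade(["dns1", "dns2"], upgrade, healthy)`, it touches dns2 only after dns1 has passed its check, so a bad patch takes out at most one server.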

 4. Better automation.

Of course, as with everything in system administration, automation is our ultimate goal. If your process for building a server for a particular service is 100% automated, then you can build a test machine reliably 100% of the time. You can practice the upgrade process many times until you absolutely know it will work. The upgrade process should be automated so that, once perfected, the exact procedure will be done on the real machines.

This is called "configuration management" or CM. Some common CM systems include CfEngine, Chef, and Puppet. These systems let you rapidly automate upgrades, deployments, and more. Most importantly, they generally work by having you specify the end result (what you want) while the system figures out how to get there (update this file, install this package, etc.).

In a well-administered system with good configuration management an upgrade of a service is a matter of specifying that the test machines (currently at version X) should be running version X+1. Wait for the automation to complete its work. Test the machines. Now specify that the production machines should be at version X+1 and let it do the work.
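To make the "specify the end result" idea concrete, here is a toy sketch in Python. It is not how CfEngine, Chef, or Puppet are actually written or configured; the function name `plan` and the package names and versions in the usage note are invented for illustration.

```python
from typing import Dict, List


def plan(current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Compare what a machine has with what it should have.

    Both arguments map package name to version. The result is the list
    of actions needed to reach the desired state.
    """
    actions: List[str] = []
    for package, version in sorted(desired.items()):
        have = current.get(package)
        if have is None:
            actions.append(f"install {package} {version}")
        elif have != version:
            actions.append(f"upgrade {package} {have} -> {version}")
        # If the versions already match there is nothing to do.
    return actions
```

With `plan({"bind": "X"}, {"bind": "X+1"})` you get a single upgrade action; once the machine is at X+1 the same call returns an empty list. That "run it again and nothing happens" property is what makes it safe to point the same specification at test machines first and production machines later.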

Again, small sites often think that configuration management is something only big sites do. The truth is that every site, big and small, can use these configuration management tools. The truth is that every site, big and small, has an endless number of excuses to keep doing things manually. That's why we see the biggest adopters of these techniques are web service farms because they are usually starting from a "green field" and don't have legacy baggage to contend with.

Which brings me to my final point. I'm sick of hearing people say "we can't use [CfEngine/Chef/Puppet] because we have too many legacy systems". You don't have to manage every byte on every machine at the beginning. In fact that wouldn't be prudent. You want to start small.

Even if you have the most legacy-encrusted old systems, a good start is to have your CM system keep /etc/motd updated on your Unix/Linux systems. That's it. It has business justification: there may be some standard message that should be there. Anyone claiming they are afraid you will interfere with the services on the machine can't possibly mean that modifying /etc/motd will harm their service. It reduces the problem to "we can't spare the RAM and CPU" that the CM system will require. That's a much more manageable problem.
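To show how small that first step is, here is a sketch in Python of the one thing a CM system has to do for /etc/motd: make the file match the wanted text and do nothing if it already does. The function name `ensure_file_content` and the message text are my own inventions for illustration, and the demo writes to a scratch directory rather than the real /etc/motd.

```python
import tempfile
from pathlib import Path


def ensure_file_content(path: Path, wanted: str) -> bool:
    """Make `path` contain exactly `wanted`; return True if a change was made."""
    if path.exists() and path.read_text() == wanted:
        return False  # already correct, so touch nothing
    path.write_text(wanted)
    return True


if __name__ == "__main__":
    # Demonstrate on a scratch file, not the real /etc/motd.
    with tempfile.TemporaryDirectory() as scratch:
        motd = Path(scratch) / "motd"
        print(ensure_file_content(motd, "Authorized use only.\n"))  # first run writes the file
        print(ensure_file_content(motd, "Authorized use only.\n"))  # second run changes nothing
```

A real CM tool does the same comparison for you, plus the scheduling, reporting, and distribution that make it worth running on every machine.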

Once you are past that, you can use the system to enforce security policies. Make sure /etc isn't world writable, disable telnetd, and so on. These are significant improvements in the legacy world.
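A policy like "/etc isn't world writable" follows the same check-then-fix pattern. The sketch below uses only Python's standard `os` and `stat` modules; the function names are hypothetical, and the demo deliberately runs against a scratch directory instead of /etc.

```python
import os
import stat
import tempfile


def is_world_writable(path: str) -> bool:
    """True if users outside the owner and group may write to `path`."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)


def remove_world_write(path: str) -> bool:
    """Clear the world-writable bit if it is set; return True if changed."""
    mode = os.stat(path).st_mode
    if not mode & stat.S_IWOTH:
        return False  # policy already satisfied
    os.chmod(path, stat.S_IMODE(mode) & ~stat.S_IWOTH)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        os.chmod(scratch, 0o777)            # deliberately break the policy
        print(is_world_writable(scratch))   # policy violated
        print(remove_world_write(scratch))  # fix applied
        print(is_world_writable(scratch))   # policy satisfied
```

On a legacy machine you might start by running only the check and reporting violations, then turn on the fix once everyone trusts it.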

Of course, now that you have the infrastructure in place, all your new machines can be built without this legacy baggage. That new web farm can be built by coding up CM modules that create your 3 kinds of machines: static web servers, database servers, CGI/application servers. Using your CM system you build these machines. You now have all the repeatability and automation you need to scale (and as a bonus, /etc/motd contains the right message).

This is a "bottom up" approach: changing small things and working up to bigger things.

You can also go the other direction: Use CM to create your services "the right way", have your success be visible, and use that success to gain trust as you slowly, one step at a time, expand the CM system to include legacy machines.

Writing about my "fear of reboot" brought back a lot of old memories. They are, let me be clear, memories. I haven't felt the "fear of reboot" for years because I've been using the above techniques. None of this is rocket science. It isn't even trailblazing any more. These are proven techniques that are 12+ years old. The biggest problem in our industry is convincing people to enter the 21st century and they don't want to be reminded that they're a decade late.

Tom

[1] Especially servers that have multiple purposes. By the way, for the purpose of this article I'll use the term "service" to mean an application or group of cooperating processes that provide an application to users, "machine" to mean any machine, and "server" to mean a machine that provides one or more services.

Posted by Tom Limoncelli in Career Advice

Cluster management operates on a very large scale: whereas a storage system that can hold a petabyte of data is considered large by most people, our storage systems will send us an emergency page when it has only a few petabytes of free space remaining.

This guy is not exaggerating.

This is why I love working at Google. (Did I mention that my coworker has a sign on his desk that says, "I work here because I love this shit!")

Tom

Posted by Tom Limoncelli in Career Advice

[This was originally published on The Usenix Update Blog]

We want YOU to submit a paper this year to the LISA conference.  Really.  Yes, you!  Whether you are in academia developing new algorithms that improve system administration, the leader of an open source project that sysadmins find valuable, or a practitioner in industry who has written new software to improve productivity, we believe there's a paper inside all of you that wants to get out!  (Usenix LISA is December 4-9, 2011 in Boston.)  LISA is also a great venue for student papers: it is a friendly audience and we have a "Best Student Paper" award that pays cash.

Usenix LISA is doing three big things this year to make it easier to submit a paper:

1.  We provide mentoring.

Submitting a paper to a conference can be intimidating, stressful, and a lot of work. To make the process easier, the members of the LISA Program Committee (PC) are available to provide mentoring.  You can bounce ideas off of us by email or phone, we'll proofread your drafts, and we'll try to answer any questions about the conference or submission process.  Just write, "assign me a mentor" in email to the conference co-chairs at [email protected].

Mentors can help turn your accepted abstract into a "print ready" final draft.  We'll also work with you over video chat to rehearse and strengthen your presentation for the conference.

2.  You don't have to submit a full paper.

It can be heartbreaking to write a complete paper only to learn it wasn't accepted for this year's conference.   Papers are 8 to 18 pages; that's a lot of writing. In recent years about 20 of the approximately 80 submitted papers were accepted.

While you may submit a complete paper, we also accept an "extended abstract" of 4-8 pages.  You only write the full paper by the publication deadline if your abstract is accepted.

In an extended abstract, you document the meat of your paper. You want to make sure you don't leave out important points such as what you have achieved along with how you achieved it.  Phrases like "the full paper will reveal the new algorithm" don't allow the PC to evaluate your efforts. Working with a mentor can help you through this process to ensure you submit the best abstract possible.

3. You don't have to be a scientist.

"But I haven't invented anything!"   Refereed Papers describe work that advances the art or practice of system administration and are held to high research standards.  However, LISA has an additional category called "Practice and Experience Reports" (PER) that describe a substantial system administration project whose story reveals lessons worth sharing.  In other words, you did something awesome and want to tell the world about it so they can learn from your mistakes (Did I say mistakes?  I meant "learn from your awesomeness".)  Actually, failures are often worth documenting, as that is where we learn the most!

A PER is judged on the basis of whether it addresses a pressing or rising need in the industry and the usefulness of the lessons learned. If accepted, a final draft of the full report (4-10 pages) is due by the publication deadline, just like refereed papers.

The first paper I presented at a LISA conference would have been a PER, if the category had existed then.  That was 1997!  My paper wasn't rocket science (or even computer science), but we were able to explain some valuable insights into what to do (but mostly what not to do).

We're also looking for proposals for general "Talks", special Q&A talks called "The Guru Is In", and the "Poster Session".

Conclusion

Every PC member is currently reaching out to friends, calling universities, and visiting user groups to encourage people to submit papers. We'd love for you to announce the Call For Participation at your local user group meetings (and we'll give you a little gift if you do). Let us know if you're interested in getting more involved by participating on a future PC.

LISA11 is making an extra big effort to seek out new papers and new authors.  We're doing outreach, we're making the submission process easier, and we're providing mentoring. So, if you have never submitted an abstract to LISA, maybe this is your year.  Contact us if you are on the fence.  Maybe we can answer your questions and concerns to put you on the path to becoming a successful author.

The submission deadline is June 9, 2011.  That may seem far in the future but it creeps up on us very fast.  Start brainstorming your paper now and we look forward to receiving your submission soon!

Tom Limoncelli
LISA11 Co-Chair

Key dates:

  • Submission deadline: June 9, 2011, 11:59 p.m. PDT: Extended abstracts, papers, experience reports, and proposals for invited talks, workshops, and tutorials.
  • Notification to all submitters: July 11, 2011
  • Publication deadline: September 15, 2011: Final papers and reports due
  • Poster proposals due: November 11, 2011

"Dear Tom, I'm a junior sysadmin and want to be more knowledgeable about the operating systems I administer. I get the feeling that a lot of my co-workers run on myth, superstitions, and folklore when it comes to their job and I want to be better. Sincerely, The Truth Is In There"

Dear Truth,

I applaud your quest to avoid superstition in your role as system administrator. Every time I fix a problem by rebooting (rather than knowing the real cause and fixing it) I feel a little bit of me dies inside. It hurts our industry and our profession when we develop bad habits like guessing instead of knowing.

There are three topic areas that are complicated, misunderstood, and therefore prone to folklore: the memory subsystem, the file subsystem, and processes. If I had to add a fourth it would be the security subsystem, but often understanding the first three is a prerequisite to fully understanding security.

Memory is complicated: virtual memory, swapping, and so on make it a hard topic. When tuning a system, understanding how these really work (vs. what we were taught in school) is the difference between success and failure. Understanding how modern memory systems work can result in a 9x performance improvement.

Knowing how the filesystem works is as important to a sysadmin as knowing anatomy is to a doctor. Knowing the filesystem begins with understanding how data is laid out on the disk (blocks and tracks), how files and directories are organized (what's stored in the directory structure, for example), and how the file system is buffered and how it interacts with the memory system. Ever since operating systems adopted the concept of "unified memory and file systems", good performance has come from a tight integration of the memory and file system. Also, the file system dictates the namespace of the OS, which affects everything else. Do you know what kind of access is slow in your operating system's namespace? You should.

A deep knowledge of how processes work is important because sysadmins are often required to debug problems that happen at the "edge cases" of processes: some weird scheduling mishap occurs because there isn't enough memory for all processes and the "wrong" process gets swapped out, developers come to you unsure why their new software release creates zombie processes, and so on.
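To make the zombie case concrete, here is a minimal sketch in Python. It assumes a Linux system with /proc mounted, and the function names (process_state, make_zombie_then_reap) are made up for illustration. A zombie is simply a child that has exited while its parent has not yet called wait() to collect the exit status; once the parent reaps it, the process table entry goes away.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if unavailable.

    Assumption: Linux-style /proc. On other systems this returns None.
    """
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # The state field comes right after the parenthesised command name.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie_then_reap():
    """Fork a child that exits at once, observe it, then reap it."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping Python cleanup handlers.
        os._exit(0)

    # Parent: deliberately do NOT wait yet. Poll until the child has
    # exited; at that point the kernel keeps it around as a zombie ("Z").
    state_before = None
    for _ in range(50):
        state_before = process_state(pid)
        if state_before in ("Z", None):
            break
        time.sleep(0.1)

    # Reaping: waitpid() collects the exit status and frees the entry.
    reaped_pid, _status = os.waitpid(pid, 0)
    state_after = process_state(pid)
    return state_before, reaped_pid == pid, state_after


if __name__ == "__main__":
    print(make_zombie_then_reap())
```

On a typical Linux box I'd expect the state before reaping to be Z and the /proc entry to be gone afterwards. When a developer's new release leaves zombies behind, the usual culprit is a parent that starts children and never gets around to the waitpid() step.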

Here are my suggestions on the best books in this category:

While you may not be a FreeBSD user, that book is excellent to read no matter what operating system you use. It is used as a textbook in many schools because it teaches the fundamental underpinnings of operating system design. If you use a POSIX system, consider reading it.

"TCP/IP Illustrated", while not about an operating system, is my favorite book for learning how TCP/IP works: from ARP and ping, to telnet, to all those funny TCP sliding window issues. This book (and its 2 sequels) is an amazing tour of how the protocols you use every day work.

Hope that helps,
Tom

Posted by Tom Limoncelli in Career Advice

Sysadmin Mantras

You may have already seen this, but it's good to read it again anyway.

http://dormando.livejournal.com/484577.html

This is a good post about Ops Mantras, or doing sysadmin the right way. It is still valid.

(full text after the bump)

Posted by Tom Limoncelli in Career Advice

Twice in the last few months I've been asked, essentially, how do I keep system administration interesting? (One prefaced the question with, "Tom, you've been a sysadmin for 2 decades, right?" Thank you for making me feel old.)

There are a few things I do to keep it interesting:

  • Do things you enjoy. Each year on January 1st I spend a good couple of hours doing nothing but thinking about my career, where I want it to go, and what I enjoy. What I enjoy changes over time. It used to be unpacking and playing (um... I mean... finding the business value, sir) with new hardware. For a while it was learning new languages. Now it is coding and automating processes.
  • Don't do things you don't enjoy. Often to be a more "senior" sysadmin you have to go into management. If you've started to hate your job, and you've recently taken on more management responsibilities, get out. There are plenty of companies that have a career ladder that lets you stay technical. The companies that don't have that kind of system really need to be put out of business. We can help that happen by making sure the best sysadmins leave those companies.
  • Diversity. Do a lot of different things. Not all at once, but over years. Since I've been a sysadmin I've gone through phases: my VMS years, my Solaris years, my network jockey years, my PSO years, my small-organization years, my big deployment years, my Python years, etc.

But most of all, I keep things interesting by always working to make myself obsolete. I automate and document my job so that other people can do it, and then I let them do it, and then my manager moves me to other things. I strongly believe in "automate/document the parts of your job that you hate". It helps eliminate them quickly. In fact, that may be why I discovered I dislike management... I can't automate it.

It is also important to work at a company you believe in. Even my least liked job (a small company in the telecom billing industry) was a "cause" that I believed in: we were making the world a better place by showing that we could replace some god-awful bad IT with modern stuff and help change an industry. If you are working at a company "to pay the bills" and not because you believe in what they do, get out fast.

Now that I've come to the end of this blog post, it has dawned on me that I've forgotten the #1 thing that "keeps it interesting" for me: interacting with sysadmins outside my current company! That's why I go to so many user groups and conferences, especially conferences like Usenix LISA and LOPSA-NJ PICC. That's why I started a user group when there wasn't one near me. The best way to "keep things fresh" is to be in places where new ideas are shared.

So that's how I kept things interesting.

P.S. The next Usenix LISA is Nov 7-12, 2010 in San Jose, CA.

Posted by Tom Limoncelli in Career Advice