February 2013 Archives

https://plus.google.com/u/0/101281951565093176572/posts/XALRuBSdgqP

From the organizers:

An impressive number of registrations over the past few days has prompted us to extend early bird pricing through Monday, March 4th. Save as much as $75 over at-the-door pricing by registering before 11:59pm Monday evening!

If you're visiting Seattle from out of town, don't forget to make your hotel reservations by phone and be sure to mention the conference to receive a discounted room rate and parking:

We also hope you'll join us Thursday, March 14th as the Seattle Area System Administrators Guild (SASAG) hosts a welcome reception sponsored by Silicon Mechanics in the Governor's Room at the Hotel Deca. There will be light refreshments and lots of people to connect with. The Governor's Room is conveniently located next to the District Lounge, purveyors of stronger libations. The reception is from 7pm to 9pm but the doors won't shut until much later.

Seriously, folks, if you are anywhere near the Pacific Northwest, go to this conference!

Posted by Tom Limoncelli in Conferences

I often recommend the book The Design and Implementation of the FreeBSD Operating System by Kirk McKusick and George V. Neville-Neil as the best way to learn about Unix. It teaches all the parts of the Unix kernel (process tables, file systems, network stacks, etc.) and the algorithms used. A sysadmin gains keen insights into what is going on, which helps them design new systems and debug running systems. It is an excellent textbook and teaches OS theory and concepts along with the narrative of how FreeBSD works.

However, because it has "FreeBSD" in the title, people often ask if there is a Linux version. The truth is that 99% of the book overlaps with Linux's way of doing things.

Is there a real equivalent book for Linux? I asked recently on HN. The answer I got was Linux Kernel Development by Robert Love, but I have not yet read it. From reading the Table of Contents it seems to be less textbook, more for the practitioner; less OS theory, more directly for developers. (This is not a criticism... that's exactly what the title would lead one to expect.)

Can anyone who has read both give a comparison?

Thanks, Tom

Posted by Tom Limoncelli in Poll or Question

Just seen on Google+: the LISA '13 call for participation is open:

https://www.usenix.org/blog/lisa-13-call-participation-opens

Extended abstracts, papers, experience reports, and proposals for talks, workshops, and tutorials due: Thursday, April 30, 2013, 11:59 p.m. PDT

If you haven't attended LISA before, be sure to check out papers and videos from past LISA events. USENIX members help support open access to conference papers and videos of paper presentations.

New in 2013! LISA Labs: New this year is a "hack space" available for informal mini-presentations by seasoned professionals, participation in live experiments, tutoring, and mentoring. This will bring a hands-on component to the conference, where attendees can investigate new technologies, apply what they have learned, and interact with other attendees in a participatory technical setting. Send ideas to [email protected]

The LISA '13 co-chairs will provide early feedback on ideas for papers, talks, or conference activities. Beat the deadlines--email [email protected] now.

Posted by Tom Limoncelli

In an effort to help the less technical community understand what Wikimedia Foundation's systems administrators do, Sumana Harihareswara wrote some very interesting blog posts. They're interesting to technical people too.

It is particularly interesting how she expresses the value of what we do to the Wikimedia managers and donors. There's also some information in there about how Wikimedia Foundation Ops uses Puppet, Nagios, and Ganglia.

They're both worth reading. Enjoy!

Posted by Tom Limoncelli in Technical Management

One week left to get early bird pricing! http://www.casitconf.org/casitconf13

The conference is in Seattle, WA, March 15-16, 2013. Don't miss it!

Posted by Tom Limoncelli

IT systems have many parts. Each needs to be upgraded or patched. The old way to handle this is to align all the individual release schedules so that you can make a "big release" that gets tested as a unit, and released as a unit. You can do this when things change at a sane rate.

Now more things are changing and the rate is much faster. We also have less control. Operating systems have frequent patches. There are urgent security patches that need to roll out "immediately". Applications have frequent updates; many even upgrade themselves. Our PCs have firmware updates for the BIOS, the keyboard, the IPMI controller, the mouse (yes, my damn mouse needed a flash update recently!). There is no way we can align all these release schedules, test as a unit, and release it as a whole.

The situation is the same or worse for web services. The whole point of a Service Oriented Architecture (SOA) is that each piece is loosely coupled and can be upgraded at its own schedule. If every service you depend on is upgrading out from under you, it isn't possible to align schedules.

The old best practice of aligned release schedules is becoming less and less relevant.

I'm not saying that this is good or bad. I'm saying this is the new reality that we live under. In the long term it is probably for the best.

My question for the readers of this blog is: What are the new tools and best practices you use that address this new paradigm?

Reverting in "git"

I'm slowly learning "git". The learning curve is hard at first and gets better as time goes on. (I'm also teaching myself Mercurial, so let's not start a 'which is better' war in the comments).

Reverting a file can be a little confusing in git because git uses a different model than, say, Subversion. You are in a catch-22 because to learn the model you need to know the terminology. To learn the terminology you need to know the model. I think the best explanations I've read so far have been in the book Pro Git, written by Scott Chacon and published by Apress. Scott put the entire book up online, and for that he deserves a medal. You can also buy a dead-tree version.

How far back do you want to revert a file? To how it was the last time you did a commit? The last time you did a pull? Or to how it is on the server right now (which might be neither of those)?

Revert to how it was when I did my last "git commit":

git checkout HEAD -- file1 file2 file3

Revert to how it was when I did my last "pull":

git checkout FETCH_HEAD -- file1 file2 file3

Revert to how it is on the server right now:

git fetch
git checkout FETCH_HEAD -- file1 file2 file3

How do these work?

The first thing you need to understand is that HEAD is a name for the commit your workspace is based on, which is normally the state as of the last time you did "git commit" on the current branch.

FETCH_HEAD is an alias for the last time you did a "git fetch". "git fetch" pulls the latest changes from the server, but hides them away. It does not merge them into your workspace. "git merge" merges the recently fetched changes into your current workspace. "git pull" is simply a fetch followed by a merge. I didn't know about "git fetch" for a long time; I happily used "git pull" all the time.
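To see the difference between HEAD and FETCH_HEAD for yourself, here is a minimal sketch that builds a throwaway "server" (a bare repository) and two clones in a temporary directory. The names alice, bob, and server.git and the demo identity are made up for illustration; nothing here touches your real repositories.

```shell
#!/bin/sh
# Sketch only: a throwaway "server" and two clones, all in a temp directory.
set -e
demo=$(mktemp -d)
cd "$demo"

# Identity for the demo commits (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q --bare server.git              # stands in for "the server"
git clone -q server.git alice 2>/dev/null  # silence the empty-repository warning
(cd alice && echo "version 1" > file1 && git add file1 &&
  git commit -q -m "version 1" && git push -q origin HEAD)

git clone -q server.git bob                # bob's HEAD is now "version 1"
(cd alice && echo "version 2" > file1 &&
  git commit -q -a -m "version 2" && git push -q origin HEAD)

cd bob
echo "scratch edit" > file1                # an uncommitted change to throw away

git checkout HEAD -- file1                 # back to bob's last commit
after_head=$(cat file1)

git fetch -q                               # download, but do not merge
git checkout FETCH_HEAD -- file1           # take the server's copy of the file
after_fetch=$(cat file1)

echo "HEAD:       $after_head"
echo "FETCH_HEAD: $after_fetch"
```

The first line printed should show version 1 (the last commit in that clone) and the second should show version 2 (what the server had at the time of the fetch). Note that checking out a file from FETCH_HEAD changes only that file; the branch itself has not been merged.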

You can set up aliases in your ~/.gitconfig file. They act exactly like real git commands. Here are the aliases I have:

[alias]
  br = branch
  st = status
  co = checkout
  revert-file = checkout HEAD --
  revert-file-server = checkout FETCH_HEAD --
  lg = log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative

This means I can do git br instead of "git branch", saving me a lot of typing. git revert-file file1 file2 file3 is just like the first example above. git revert-file-server is a terrible name, but it basically reverts files to the way they were at the last fetch, like the second example above. git lg outputs a very pretty log of recent changes (I stole that from someone who probably stole it from someone else. Don't ask me how it works).

To add these aliases on your system, find or add an [alias] stanza in your ~/.gitconfig file and add them there.
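If you would rather not edit the file by hand, "git config --global" writes the same [alias] entries for you. The sketch below is only an illustration: it points HOME at a temporary directory so that trying it out leaves your real configuration alone. Drop the HOME and XDG_CONFIG_HOME lines to change your actual ~/.gitconfig.

```shell
#!/bin/sh
# Sketch: define aliases with "git config" instead of editing ~/.gitconfig.
set -e

# Demo safety net (an assumption of this sketch, not something git requires):
# use a throwaway HOME so the real config is untouched.
HOME=$(mktemp -d); export HOME
unset XDG_CONFIG_HOME

git config --global alias.br branch
git config --global alias.revert-file 'checkout HEAD --'

# Read one alias back to confirm what was stored.
stored=$(git config --global --get alias.revert-file)
echo "revert-file = $stored"
```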

Posted by Tom Limoncelli in Technical Tips

Metcalfe's law

Metcalfe's law states that the value of a telecommunications network is proportional to the square of the number of connected users of the system (n^2).

Robert M. Metcalfe, the inventor of Ethernet, originally meant it to apply to devices on a network that could communicate with each other. It isn't sufficient to be on the same network if they speak incompatible protocols. It isn't sufficient to speak compatible protocols if they aren't connected.

A more plainspoken way to state Metcalfe's law is that each new user added to a network makes the network more than one unit more useful.

A simpler way to understand this law is: "The first person to buy a fax machine was a fool." Imagine how useless it would be to be the only person in the world with a fax machine. You can't send anyone a fax because nobody else owns a fax machine. When two people owned fax machines the utility or usefulness became a lot more, assuming those two people needed to communicate. When 100 people owned fax machines it was far more than 50 times as useful as when two people owned them, because now each of the 100 people had 99 other people to communicate with: 9,900 sender-receiver pairs. Maybe not all 9,900 pairs would be used, but the network had a lot more potential than when only a few people had fax machines.
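The pair counting in the fax example can be checked with a little arithmetic. This sketch assumes every user can reach every other user; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: count who-can-fax-whom pairs for a network of n machines,
# assuming every machine can reach every other machine.

# Ordered pairs (sender, receiver): each of n users can reach n-1 others.
ordered_pairs() { echo $(( $1 * ($1 - 1) )); }

# Distinct links: the same count with each pair of users counted once.
distinct_links() { echo $(( $1 * ($1 - 1) / 2 )); }

for n in 1 2 10 100; do
  echo "n=$n ordered=$(ordered_pairs "$n") links=$(distinct_links "$n")"
done
```

Going from 2 users to 100 multiplies the number of users by 50 but the number of pairs by 4,950, which is the superlinear growth the law describes.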

When Metcalfe invented Ethernet very few computers were connected to each other. Communicating between computers usually meant writing data to media such as tape or a "disk pack" and physically moving the media to the other computer. This was often done even if the computers were right next to each other. Think about all the old movies you've seen where computers have tape drives spinning big loops of magnetic tapes. That's how data got between computers.

Whether Metcalfe's law is exaggerated or misapplied to other things, the general point is correct: Linear growth in the number of users creates superlinear growth in the network's usefulness.

I think every sysadmin should understand this law. I think we "get it" as far as the literal sense: We get that more connected computers is more useful. We gain huge satisfaction when we add a device to our network; especially if it is one that previously couldn't connect such as when WiFi is added to a home thermostat, television, or phone. We get that more compatibility within our network is more useful. We are frustrated when two software systems cannot talk with each other; we get huge satisfaction when vendors provide standard interfaces, APIs, and file formats so that more things are compatible.

On the social level Metcalfe's law applies as well. If you belong to a local or national user group, gaining more members isn't just a matter of pride. Every additional member adds to the potential utility of the group. Every active member adds utility superlinearly. If you are a member of such a group, getting your friends and co-workers to join (or getting current members to be active participants) benefits you and all other members more than you'd think.

Matt Simmons of the Standalone Sysadmin blog asked on the LOPSA-Tech mailing list about labeling network cables in a datacenter, which brought up a number of issues.

He wrote:

So, my current situation is that I'm working in a datacenter with 21 racks arranged in three rows, 7 racks long. We have one centralized distribution switch and no patch panels, so everything is run to the switch which lives in the middle, roughly. It's ugly and non-ideal and I hate it a bunch, but it is what it is. And it looks a lot like this.

Anyway, so given this really suboptimal arrangement, I want to be able to more easily identify a particular patch cable because, as you can imagine, tracing a wire is no fun right now.

He wanted advice as to whether the network cables should be labeled with exactly what the other end is connected to, including hostname and port number, or use a unique ID on each cable so that as they move around they don't have to be relabeled.

We write about this in the Data Centers chapter of The Practice of System and Network Administration but I thought I'd write a bit more for this blog.

My reply is after the jump...


Posted by Tom Limoncelli in Technical Tips

Sheeri K. Cabral's talk from LCA2013 is now available online:

"The Finer Art of Being a Senior Sysadmin"

The video is 17 minutes long and makes a lot of references to a blog post I wrote last September.

It is a great talk and well worth watching!

Posted by Tom Limoncelli in Conferences

Realizing that I've recommended a lot of books lately, I thought I'd list them here for others to benefit. I'm not saying I've read them all and these are the best, but these are the ones I've read and found useful.

Management:

Layout and web design:

Self-help:

  • Improve your memory: Page-a-Minute Memory Book by Harry Lorayne
  • Go from depressed to feeling better: The Feeling Good Handbook by David D. Burns (Note: this book is rather big but don't worry... the first 2 chapters are the good part. Read them, then pick and choose from the rest of the chapters.)

Posted by Tom Limoncelli in Personal Growth

You do not want to miss this conference! http://casitconf.org/

  • You will learn how to automate the configuration of all your systems when Nathen Harvey teaches you Chef or Garrett Honeycutt teaches you Puppet.
  • You'll stay one step ahead of the game by learning IPv6 from Owen DeLong, the man that teaches IPv6 so well you'll thank him 128 times.
  • The wizard of PowerShell himself, Steven Murawski, will teach you how to automate anything in Windows.
  • You'll fix things once and they'll stay fixed after Stuart Kendrick teaches you how to do Root Cause Analysis.
  • You'll learn how to translate "geek" to "manager-speak" and other tips in Navigating the Business World by the internationally recognized experts Nicole Forsgren Velasquez and Carolyn Rowland.
  • Don Crawley will teach you so many secrets of Customer Service that you'll be able to say "no" to users and they'll thank you.
  • Last but not least, David N. Blank-Edelman (who happens to be this year's keynote speaker) will surprise and delight you (and make music play out your printer queue) in his tutorial "Over the Edge System Administration". He'll also help make it easier to try out new technologies in his tutorial "Build A SysAdmin Sandbox".

But most of all: Go to Cascadia because the attendee you meet while waiting on line at lunch has a suggestion on how to fix that thing your boss was complaining about that is so awesome you'll get a promotion. It's called "networking" and I don't mean TCP/IP.

Sign up today! Click on the big, friendly "Register Now" button on the home page. http://casitconf.org/

Posted by Tom Limoncelli in Community, Conferences

Users tend to be concerned with what a system does (features, functionality) and sysadmins tend to be concerned with the operational aspects of a system. I just noticed this great Wikipedia page that lists "Non-functional requirements" of a system.

Broadly, functional requirements define what a system is supposed to do whereas non-functional requirements define how a system is supposed to be. Functional requirements are usually in the form of "system shall do <requirement>", while non-functional requirements are "system shall be <requirement>".

I could see myself using this as a tool for jogging my memory when I'm trying to think of all the aspects of a system that I need to be concerned with either operationally or when writing requirements.

Check it out: http://en.wikipedia.org/wiki/List_of_system_quality_attributes

Posted by Tom Limoncelli in Professionalism

If you use the Ganeti command line you probably have used gnt-instance list and gnt-node list. In fact, most of the gnt-* commands have a list subcommand. Here are some things you probably didn't know.

Part 1: Change what "list" outputs

Unhappy with how verbose gnt-instance list is? The -o option lets you pick which fields are output. Try this to just see the name:

gnt-instance list -o name

I used to use awk and tail and other Unix commands to extract just the name or just the status. Now I use -o name,status to get exactly the information I need.

I'm quite partial to this set of fields:

gnt-instance list --no-headers -o name,pnode,snodes,admin_state,oper_state,oper_ram

The --no-headers flag means just output the data, no column headings.

What if you like the default fields that are output but want to add others to them? Prepend a + to the option:

gnt-node list --no-headers -o +group,drained,offline,master_candidate 

This will print the default fields plus the node group and the three main status flags nodes have: whether the node is drained (no instances can move onto it), whether it is offline (the node is essentially removed from the cluster), and whether or not it can be a master.

How does one find the list of all the fields one can output? Use the list-fields subcommand. For each gnt-* command it lists the fields that are available with that list command. That is, gnt-instance list-fields shows a different set of names than gnt-node list-fields.

Putting all this together I've come up with three bash aliases that make my life easier. They print a lot of information but (usually) fit it all on an 80-character wide terminal:

alias i='gnt-instance list --no-headers -o name,pnode,snodes,admin_state,oper_state,oper_ram | sed -e '\''s/.MY.DOMAIN.NAME//g'\'''
alias n='gnt-node list --no-headers -o +group,drained,offline,master_candidate | sed -e '\''s/.MY.DOMAIN.NAME//g'\'''
alias j='gnt-job list | tail -n 90 | egrep --color=always '\''^|waiting|running'\'''

(Change MY.DOMAIN.NAME to the name of your domain.)

Part 2: Filter what's output

The -F option has got to be the least-known feature of the Ganeti command line tools. It lets you restrict what nodes or instances are listed.

List the instances that are using more than 3 virtual CPUs:

gnt-instance list -F 'oper_vcpus > 3'

List the instances that have more than 6G of RAM (otherwise known as "6144 megabytes"):

gnt-instance list -F 'be/memory > 6144'

The filtering language can handle complex expressions. It understands and, or, ==, <, > and all the operations you'd expect. The ganeti(7) man page explains it all.

Which nodes have zero primary instances? Which have none at all?

gnt-node list --filter 'pinst_cnt == 0'
gnt-node list -F 'pinst_cnt == 0 and sinst_cnt == 0'

Strings must be quoted with double-quotes. Since the entire formula is in single-quotes this looks a bit odd but you'll get used to it quickly.

Which instances have node "fred" as their primary?

gnt-instance list --no-headers -o name -F 'pnode == "fred" '

(I included a space between " and ' to make it easier to read. It isn't needed otherwise.)
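The quoting is easier to trust once you see what the shell actually hands to the command. This sketch needs no Ganeti at all: show_args is a made-up helper that prints each argument it receives in brackets, and the node name is just the example from above.

```shell
#!/bin/sh
# Sketch: how the shell passes a quoted filter to a command.
# show_args is a hypothetical helper that prints one argument per line.
show_args() { for a in "$@"; do printf '[%s]\n' "$a"; done; }

# Single quotes outside, double quotes inside: the filter arrives as one
# argument with its double quotes intact.
show_args -F 'pnode == "fred"'

# To build the filter from a shell variable, switch to double quotes
# outside and escape the inner ones.
node=fred
filter="pnode == \"$node\""
show_args -F "$filter"
```

Both calls should print the same two arguments, which is what a gnt-* command would receive after -F.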

Which nodes are master candidates?

gnt-node list --no-headers -o name -F 'role == "C" '

Do you find typing gnt-cluster getmaster too quick and easy? Try this command to find out who the master is:

gnt-node list --no-headers -o name -F 'role == "M" '

Like most gnt-* commands it must be run on the master, so be sure to use gnt-cluster getmaster to find out who the master is and run the command there.

If you use the "node group" feature of Ganeti (and you probably don't) you can find out which nodes are in node group foo:

gnt-node list -o name -F 'group == "foo" '

and which instances have primaries that are in group foo:

gnt-instance list --no-headers -o name -F 'pnode.group == "foo" '

It took me forever to realize that, since snodes is a list, one has to use in instead of ==. Here's a list of all the instances whose secondary is in node group "bar":

gnt-instance list --no-headers -o name -F '"bar" in snodes.group'

("snodes" is plural, "pnode" is singular.)

To recap:

  1. The following commands have a list-fields subcommand, and their list subcommand accepts the -o and -F options: gnt-node, gnt-instance, gnt-job, gnt-group, gnt-backup.
  2. -o controls which fields are output when using the list subcommand.
  3. -F specifies a filter that controls which items are listed.
  4. The field names used with -o and -F are different for each gnt-* command.
  5. Use the list-fields subcommand to find out what fields are available for a command.
  6. The filtering language is documented in ganeti(7). i.e. view with: man 7 ganeti
  7. The man pages for the individual gnt-* commands give longer explanations of what each field means.
  8. In bash, filters have to be in single quotes so that the shell doesn't interpret <, >, double-quotes, and other symbols as bash operators.

Enjoy!

Posted by Tom Limoncelli in Ganeti, Technical Tips

Something happened at home today that reminded me of something I used to do when I worked at Bell Labs.

My rule was simple. If a machine in the computer room wasn't labeled, I was allowed to power it off. No warning. Click. No power.

If I logged into a machine as root and the prompt didn't include the hostname, the only command I was interested in typing was "halt".

Both of these rules came from the same source: If sloppy system administration was going to lead to errors and downtime, I wanted that downtime to happen during the day when we can fix it instead of late at night when we should be asleep.

(Of course, if a machine didn't have the hostname in its root prompt that also meant our configuration management system wasn't running on the machine which is a security violation. Therefore halting the machine, as far as I was concerned, was solving a security issue. But I digress...)

When I would explain this rule to people often they would ask, "Would you do that even if it was an important machine?"

"Ah ha!", I would exclaim, "My definition of 'important machine' includes that it is properly labeled! You see, if something is important we take good care of it. We protect it. One way we do that is to label it front, back, and in the root prompt."

Sometimes I would get a look of horror.

I never actually had to turn off a machine. I presume that if I did the owner would have come running soon after. I'd say something like, "Thank god you're here! I was able to make a label for this machine and I need to ask you what its name is!" If they asked, "How did you know it was my machine?" I would have said, "Well, once it is labeled I'll be able to send email to the owner and ask!"

I did once threaten to power off a machine. It was a new machine and someone was standing there loading the operating system. "Hey! No fair! This machine was brand new! I was going to label it!"

"Amazingly enough", I pointed out, "machines can be labeled before you load the OS too."

[I wouldn't let him type until the machine was labeled.]

Labels are a very basic safety precaution. They prevent human error.

Labeling machines is obvious, or is it? I visit other people's data centers (or "computer closets") and find tons of unlabeled equipment. "It's ok, I know which machine is which." they say. That's just an accident waiting to happen! Without properly labeled machines it is just too easy to accidentally power off the wrong machine, disconnect the wrong cable, and so on.

Isn't this simple professionalism?

You don't label things for Today You. Today You is smart, knows what's going on, and got a good night's rest. You label things for Tomorrow You plus Other People. Tomorrow You may be related to Today You but I assure you they are different people. Tomorrow You didn't get a good night's sleep. Tomorrow You was away for a few weeks and now can't believe how similar all those machines look. Tomorrow You left for a better job and is now someone else trying to figure out what the fark is going on.

Other People need labels on machines for all the obvious reasons so I won't bore you. However, when I visit other people's computer rooms (and I do get invited on many tours) they often say that the lack of labels is OK because "they're a solo sysadmin." Let's debunk that right now. Nobody is totally solo. When you call into the office from 3,000 miles away and ask the secretary, office manager, janitor, or CEO to go into the computer room to powercycle a machine, you are no longer a solo sysadmin. Things must be labeled.

In my time management classes I talk about delegating work to other people. Usually someone laughs. "Delegate? What planet are you on? I can't delegate anything!" Of course you can't if you haven't labeled things.

I prefer to write about big networks, big data centers, and big sysadmin teams. To them, all of the above is obvious. It's a waste of time to write this, right? Sadly something happened to me in my non-work life that reminded me about my "no label, no power" rule.

To be honest I haven't touched actual server hardware in years. All my servers are in remote data centers, often in countries I've never been to, with highly skilled datacenter techs doing all the physical work. I've never seen or touched them.

However, if I were to change jobs and found myself dealing with hardware and small computer rooms again, the first thing I would do is put a big sign on the wall that says:

"This computer room is for important computers only. Important computers are labeled front and back. Unimportant computers will be powered off with no warning. -The Management"

Isn't that reasonable?

Posted by Tom Limoncelli in Professionalism

 