
Recently in Technical Tips Category

DNS as Code

StackOverflow has open sourced the DNS management system we've been using for years. I was the original author, and Craig Peterson has been writing most of the code lately. A blog post about the system is now live: Introducing DnsControl - "DNS as Code" has Arrived

My favorite part of DNSControl? We can switch between DNS providers faster than cyberthreats can take them down.

Check it out if you manage DNS zones, use CloudFlare or Fast.ly CDNs, or just like to read Go (golang) code.

Tom

Posted by Tom Limoncelli in Technical Tips

Today I learned that you can't copy a Mac application's plist by just copying the file. However, you can export the plist and import it on a new machine:

Step 1: Exit the app.

To make sure the file is stable.

Step 2: Export the plist data:

$ defaults export info.colloquy ~/info.colloquy.backup

To find the name of the plist (info.colloquy in this example), look in ~/Library/Preferences. Use the filename but strip off the .plist suffix. If an app has multiple plists, I assume you need to export each of them individually.
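If an app really does have several plists, a short loop handles them all. This is just a sketch; the info.colloquy prefix is from this example, so adjust the glob for your app:

cd ~/Library/Preferences
for f in info.colloquy*.plist; do
  domain="${f%.plist}"
  defaults export "$domain" ~/"$domain".backup
done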

Step 3: Copy the backup file to the new machine

I like to either copy it to Dropbox and wait for it to sync on the other machine, or scp it to my VPS and then scp it down to the new machine.

Step 4: Import the plist data:

$ defaults import info.colloquy ~/info.colloquy.backup

Step 5: Start the app and make sure it worked.

Because we're adults and we check our work.
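If you want a quick spot-check before even launching the app (just a sketch), dump the imported domain and eyeball a few values:

$ defaults read info.colloquy | head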

Posted by Tom Limoncelli in Technical Tips

I have two accounts on GitHub: Personal and work. How do I access both from the same computer without getting them confused? I have two different ssh keys and use .ssh/config to do the right thing. Some bash aliases also help.

Why?

Why is it important to me that I keep these two accounts separate?

First, certain repos can only be accessed from one account or the other. Public repos can be accessed from either, but private repos are not so lucky.

Second, I want the commit logs for different projects to reflect whether I am "Tom the person" or "Tom the employee". My work-related repos shouldn't be littered with my personal email address. That would be ugly and, more importantly, it would mean that a coworker searching for my commits would have to do two searches: one for each of my names. My username is different at my work account ("tlimoncelli" vs. "tal")... how could I expect a coworker to know that?

Lastly, when I contribute to a third-party project it is a very intentional decision whether I am contributing as a StackOverflow employee or as myself. I want bug reports and other notifications sent to one email address or the other depending on the situation. This future-proofs things so that when I change jobs (no current plans, BTW), I'll stop receiving notifications from projects that I'm involved in for purely work-related reasons.

What doesn't work?

You can't upload the same SSH public key to two GitHub accounts. GitHub uses the key to determine which account you are accessing, so the keys must be unique. If you were able to upload the same key to both accounts, GitHub would have to guess your intention for each repo, and that guess would often be wrong. You can upload multiple keys to each account, but no key may be shared between the accounts.

I could simply use two separate machines, a work machine and a personal machine, each with different SSH keys. However, then I would have to switch machines whenever I change which project I'm contributing to, and expecting me to carry around two laptops is silly. That isn't to say some companies shouldn't adopt such a policy, especially ones with higher security requirements, but that's not my situation.

What do I do?

I set up a fake hostname in .ssh/config to indicate which SSH key to use.

Here's an example:

git clone git@github.com:StackExchange/httpunit.git
git clone git@work-github.com:StackExchange/httpunit.git
              ^^^^^^^^^^^^^^^ What??? That's not a machine!

There is no such machine as home-github.com or work-github.com. However, if you look at my .ssh/config file you'll find a "Host" entry that sets the actual hostname and a bunch of parameters, including which SSH key to use:

# Personal GitHub account:
Host home-github.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_ed25519-homegithub
  PreferredAuthentications publickey
  PasswordAuthentication no
  IdentitiesOnly yes

# Work GitHub account:
Host work-github.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_ed25519-workgithub
  PreferredAuthentications publickey
  PasswordAuthentication no
  IdentitiesOnly yes

On any machine where I use Git, I simply make sure the SSH keys specified by the two IdentityFile entries exist.

When someone gives me an SSH-based Git URL, I manually doctor the hostname by adding the "home-" or "work-" prefix. After that, everything just works.

If I forget to edit the hostname, I have a default set so that it uses my work key. Originally I set it up so that using the undoctored hostname would fail. That way I'd get an error and be forced to remember to doctor the hostname. However, I found that interfered with systems (usually install scripts) that didn't let me edit the hostname.
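The post doesn't show that default, but it is presumably just one more Host entry matching the undoctored name and pointing at whichever key should win. A sketch, assuming the same work key shown above:

# Default: undoctored github.com URLs fall through to the work key.
Host github.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_ed25519-workgithub
  PreferredAuthentications publickey
  PasswordAuthentication no
  IdentitiesOnly yes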

Commands like go get are unaffected by this system since they use anonymous HTTPS, not SSH.

Setting the user.email correctly

The other problem I had was properly configuring git user.name and user.email. My global ~/.gitconfig file sets user.name because I am always Tom Limoncelli. However, I leave user.email unset in that file to force me to set it per-repo.

I've set up Bash aliases to let me easily set the email address:

alias gitmeHome='git config user.email [email protected]'
alias gitmeWork='git config user.email [email protected]'

After I do "git clone", I need to remember to cd into that repo and run either gitmeHome or gitmeWork. If I forget, I get this nice reminder:

$ git commit filename
[master 5667341] yes
 Committer: Tom Limoncelli <[email protected]>
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 dirname/filename

With the bash aliases in place, setting user.email is hassle-free.

One more thing.

This last bit of advice isn't required to make ssh keys work, but I find it useful.

I've found it greatly simplifies my life to have the exact same ~/.ssh/config and ~/.gitconfig file on all my machines.

Manually adding a new Host entry to ~/.ssh/config on the proper machines is a pain. It is easier to copy the entire master file (which I keep in git, of course). If a particular machine doesn't need that Host entry, who cares? Either I won't use it, or it will fail if I do use it. My Host entries are carefully constructed so that they work if the right keys are on the machine, and fail otherwise.

Having the same ~/.gitconfig on all my machines is a little trickier. It contains the settings that are the same everywhere. Overrides are set in the repo-specific .git/config file. As you saw above, a setting like user.email is critical, so I have no default in the global file; thus the per-repo setting is required.

Posted by Tom Limoncelli in Technical Tips

Someone recently asked me how the flags in "flag flips" are implemented in actual software. There are a couple different techniques that I've seen.

  • Command-line flags: Normal command line flags. To change a flag, you have to kill and restart the process. These flags are usually implemented in libraries that come with the language, though I have a preference for gflags.

  • A/B flags: When doing A/B testing, a cookie is set with a random number (often a value 0-999). The code then uses the number to partition the tests: A=0-499, B=500-999. To enroll 1 percent of all users, one might use A=0-4 and B=5-9 and everyone else gets the default behavior. Often these ranges are set in command-line flags so they are easy to change.

  • Dynamic flags are more difficult: Some sites use a PAXOS-based system so that all processes get the same flag change at the same time. You have one thread call a blocking "return when the value changed" call so you can stash the new value and start using it right away. Google uses Chubby internally, the GIFEE equivalent is called ZooKeeper, and CoreOS has open sourced etcd.

  • More complex systems: Facebook has an interesting system, which we discuss in The Practice of Cloud System Administration (Chapter 11), that permits per-user settings such as "don't show a certain feature to people who work for TechCrunch."

The person who wrote to me also asked how I would change many machines at the same time. For example, on January 1st all your web pages must switch from Imperial to Metric for all dimensions displayed to the user. How would this be managed?

Well, first you'd have to decide how to deal with computations that are already in flight. I would redefine the problem to be: "all new API calls after midnight will have the new default". The API dispatcher should check the clock and read all flags, which are then passed to the function. That way the flags are invariant. It also makes it easier to write unit tests since the flag-setting mechanism is abstracted out of the main code. Also, I say "default" because before midnight you'll want to be able to call the API both ways for testing purposes. By having the dispatcher calculate the default based on the date/time, and letting that trickle down to the derived value, you permit the caller to get the old behavior if needed.

If readers want to recommend libraries that implement these in various languages, please feel free to post links in the comments.

Posted by Tom Limoncelli in Technical Tips

Every sysadmin knows that you can protect a server through cryptographic or other means, but if someone has physical access, "all bets are off". Right? With physical access they can do physical damage (smash it with a hammer) or pull out a hard disk and read the bits directly. Even security systems that are highly respected (I'm looking at you, Kerberos!) are an "all bets are off" situation if someone gets the private key through physical access.

Sadly we forget this when it comes to smartphones. We'll plug our phones into any darn USB charger we find... especially when we are desperate! Those Pokémon ain't gonna catch themselves!

Have we forgotten that our phone is a computer and the USB port gives better access than sitting at a server's console?

This article by Alexey Komarov was a very painful reminder of just how much access a USB port gives to an attacker. USB is a vector for malware and spying. Not just that, but USB is how we upgrade firmware on most phones. The commands to upload and activate new firmware are almost entirely unprotected. Giving USB access is providing unrestricted ability to install new software and firmware. That's crazy!

I was recently reminded of this when I plugged into a USB charging port on an airplane. My iPhone popped up a window asking if I wanted to trust the device. Wait... what?? Why is that power charging port not "charge-only"? Why is it trying to make a data connection to my phone? Oh, it turns out that I could play my iPhone's music through the airplane's audio system. (This, of course, is a stupid feature... the airplane headphones aren't nearly as good as what comes with my phone.)

The airline makes this feature available only in the business/first class cabin. I don't believe in conspiracy theories, but if I were a state that wanted to hijack the phones of important business people, politicians, and government officials, these are the USB ports that I'd be subverting.

So... what can you do?

  1. Use charge-only USB cables. These simply pass on the power wires but not the data wires. They are 100% effective against bad actors. The downside: when you do need data, you need to carry a different set of cables. Available in USB 2.0.

  2. Use a USB-condom. This is a device that plugs in between your normal cable and the computer and blocks the data lines. The downside is that you now have a second device to carry around. The upside is that your phone will charge faster! The PortaPow brand has an extra little circuit that tells the device to go into fast-charge mode! I love this feature! (Available for USB 2.0, as a 5-pack, or built into a USB 3.0 cable.) In the PortaPow product line, make sure it mentions "3rd Generation"; otherwise it may be an older model that is specific to Apple or Android but not both.

  3. Use a USB cable with a "data switch". This cable is normally power-only, which is what you want 90% of the time. However, there is a button ("Data Transfer Protection On/Off Switch") you can press that will enable data. An LED indicates the mode. This kind of cable is much safer and more secure, plus more convenient for users. It follows the security principle that if you make the defaults what you want users to do, they're more likely to follow your security policy. Available in Micro USB.

I recommend that all IT departments give out USB cables with "Data Transfer Protection On/Off Switch" as their default. Include one with every new laptop or mobile device that you hand out. For a tiny additional cost you get a lot of benefit.

The USB condoms are useful when you need to support a variety of USB connector types or cable lengths, since it requires that you use your own cable. I keep one in my travel bag. I also put a few in the datacenter so that when someone is tempted to charge their phone by plugging it into one of our servers, we can instead hand them a condom. No matter what type of connector is on their phone, they can use the condom because it plugs into the server's standard USB port.

Lastly, these devices make great gifts for the holidays. Even the geek who has everything probably hasn't thought of this!

Notes:

  • PortaPow
  • Monoprice
  • Thanks to Scott Hazen Mueller for alerting me to the Komarov article!

Posted by Tom Limoncelli in Technical Tips

One of the things my team at StackOverflow does is maintain the CI/CD system which builds all the software we use and produce. This includes the Stack Exchange Android App.

Automating the CI/CD workflow for Android apps is a PITA. The process is full of trips and traps. Here are some notes I made recently.

First, [this is the paragraph where I explain why CI/CD is important. But I'm not going to write it because you should know how important it is already. Plus, Google definitely knows already. That is why the need to write this blog post is so frustrating.]

And therefore, there are two important things that vendors should provide that make CI/CD easy for developers:

  • Rule 1: Builds should work from the command line on a multi-user system.
    1. Builds must work from a script, with no UI popping up. A CI system only has stdin/stdout/stderr.
    2. Multiuser systems protect files that shouldn't be modified from being modified.
    3. The build process should not rely on the internet. If it must download a file during the build, then we can't do builds if that resource disappears.
  • Rule 2: The build environment should be reproducible in an automated fashion.
    1. We must be able to create the build environment on a VM, tear it down, and build it back up again. We might do this to create dozens (or hundreds) of build machines, or we might delete the build VM between builds.
    2. This process should not require user interaction.
    3. It should be possible to automate this process, in a language such as Puppet or Chef. The steps should be idempotent.
    4. This process should not rely on any external systems.

Android builds can be done from the command line. However, the process itself updates files in the build area, and creating the build environment simply cannot be automated without repackaging all of the files (something I'm not willing to do).

Here are my notes from creating a CI/CD system using TeamCity (a commercial product comparable to Jenkins) for the StackOverflow mobile developers:

Step 1. Install Java 8

The manual way:

CentOS has no pre-packaged Oracle Java 8 package. Instead, you must download it and install it manually.

Method 1: Download it from the Oracle web site. Pick the latest release, 8uXXX where XXX is a release number. (Be sure to pick "Linux x64" and not "Linux x86").

Method 2: Use the above web site to figure out the URL, then use this code to automate the downloading: (H/T to this SO post)

# cd /root
# wget --no-cookies --no-check-certificate --header \
    "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" \
    "http://download.oracle.com/otn-pub/java/jdk/8u102-b14/jdk-8u102-linux-x64.rpm"

Dear Oracle: I know you employ more lawyers than engineers, but FFS please just make it possible to download that package with a simple curl or wget. Oh, and the fact that the certificate is invalid means that if this did come to a lawsuit, people would just claim that a MITM attack forged their agreement to the license.

Install the package:

# yum localinstall jdk-8u102-linux-x64.rpm

...and make a symlink so that our CI system can specify JAVA8_HOME=/usr/java and not have to update every individual configuration.

# ln -sf /usr/java/jdk1.8.0_102 /usr/java/jdk

We could add this package to our YUM repo, but the benefit would be negligible, plus whether or not the license permits this is questionable.

EVALUATION: This step violates Rule 2 above because the download process is manual. It would be better if Oracle provided a YUM repo. In the future I'll probably put it in our local YUM repo. I'm sure Oracle won't mind.

Step 2. Party like it's 2010.

The Android tools are compiled for 32-bit Linux. I'm not sure why. I presume it is because they want to be friendly to the few developers out there that still do their development on 32-bit Linux systems.

However, I have a few other theories: (a) The Android team has developed a time machine that lets them travel back to 2010, because I happen to know for a fact that Google moved to 64-bit Linux internally around 2011; they created teams of people to find and eliminate any 32-bit Linux hosts. Therefore the only way the Android team could actually still be developing on 32-bit Linux is if they have either hidden their machines from their employer, or they have a time machine. (b) There is no "b". I can't imagine any other reason, and I'm jealous of their time machine.

Therefore, we install some 32-bit libraries to gain backwards compatibility. We do this and pray that the other builds happening on this host won't get confused. Sigh. (This is one area where containers would be very useful.)

# yum install -y glibc.i686 zlib.i686 libstdc++.i686

EVALUATION: B-. Android should provide 64-bit binaries.

Step 3. Install the Android SDK

The SDK has a command-line installer. The URL is obscured, making it difficult to automate this download. However, you can find the current URL by reading this web page, then clicking on "Download Options", and then selecting Linux. The last time we did this, the URL was: https://dl.google.com/android/android-sdk_r24.4.1-linux.tgz

You can install this in 1 line:

cd /usr/java && tar xzpvf /path/to/android-sdk_r24.4.1-linux.tgz

EVALUATION: Violates Rule 2 because it is not in a format that can easily be automated. It would be better to have this in a YUM repo. In the future I'll probably put this tarfile into an RPM with an install script that untars the file.

Step 4. Install/update the SDK modules.

Props to the Android SDK team for making an installer that works from the command line. Sadly it is difficult to figure out which modules should be installed. Once you know the modules you need, specifying them on the command line is "fun"... which is my polite way of saying "ugly."

First I asked the developers which modules they need installed. They gave me a list, which was wrong. It wasn't their fault. There's no history of what got installed. There's no command that shows what is installed. So there was a lot of guess-work and back-and-forth. However, we finally figured out which modules were needed.

The command to list all modules is:

/usr/java/android-sdk/tools/android list sdk -a

The modules we happened to need are:

  1- Android SDK Tools, revision 25.1.7
  3- Android SDK Platform-tools, revision 24.0.1
  4- Android SDK Build-tools, revision 24.0.1
  6- Android SDK Build-tools, revision 23.0.3
  7- Android SDK Build-tools, revision 23.0.2
  9- Android SDK Build-tools, revision 23 (Obsolete)
 19- Android SDK Build-tools, revision 19.1
 29- SDK Platform Android 7.0, API 24, revision 2
 30- SDK Platform Android 6.0, API 23, revision 3
 39- SDK Platform Android 4.0, API 14, revision 4
141- Android Support Repository, revision 36
142- Android Support Library, revision 23.2.1 (Obsolete)
149- Google Repository, revision 32

If that list looks like it includes a lot of redundant items, you are right. I don't know why we need 5 versions of the build tools (one of which is marked "obsolete") and 3 versions of the SDK. However, I do know that if I remove any of those, our builds break.

You can install these with this command:

/usr/java/android-sdk/tools/android update sdk \
    --no-ui --all --filter 1,3,4,6,7,9,19,29,30,39,141,142,149

However there's a small problem with this. Those numbers might be different as new packages are added and removed from the repository.

Luckily there is a "name" for each module that (I hope) doesn't change. However the names aren't shown unless you specify the -e option:

# /usr/java/android-sdk/tools/android list sdk -a -e

The output looks like:

Packages available for installation or update: 154
----------
id: 1 or "tools"
     Type: Tool
     Desc: Android SDK Tools, revision 25.1.7
----------
id: 2 or "tools-preview"
     Type: Tool
     Desc: Android SDK Tools, revision 25.2.2 rc1
 ...
 ...

Therefore a command that will always install that set of modules would be:

/usr/java/android-sdk/tools/android update sdk --no-ui --all \
    --filter tools,platform-tools,build-tools-24.0.1,\
    build-tools-23.0.3,build-tools-23.0.2,build-tools-23.0.0,\
    build-tools-19.1.0,android-24,android-23,android-14,\
    extra-android-m2repository,extra-android-support,\
    extra-google-m2repository

Feature request: The name assigned to each module should be listed in the regular listing (without the -e) or the normal listing should end with a note: "For details, add the -e flag."

EVALUATION: Great! (a) Thank you for the command-line tool. The docs could be a little bit better (I had to figure out the -e trick) but I got this to work. (b) Sadly, I can't automate this with Puppet/Chef because they have no way of knowing if a module is already installed, therefore I can't make an idempotent installer. Without that, the automation would blindly re-install the modules every time it runs, which is usually twice an hour. (c) I'd rather have these individual modules packaged as RPMs so I could just install the ones I need. (d) I'd appreciate a way to list which modules are installed. (e) update should not re-install modules that are already installed, unless a --force flag is given. What are we, barbarians?

Step 5. Install license agreements

The software won't run unless you've agreed to the license. According to Android's own website you do this by asking a developer to do it on their machine, then copy those files to the CI server. Yes. I laughed too.

EVALUATION: There's no way to automate this. In the future I will probably make a package out of these files so that we can install them on any CI machine. I'm taking suggestions on what I should call this package. I think android-sdk-lie-about-license-agreements.rpm might be a good name.

Step 6. Fix stupidity.

At this point we thought we were done, but the app build process was still breaking. Sigh. I'll save you the long story, but basically we discovered that the build tools want to be able to write to /usr/java/android-sdk/extras.

It isn't clear if they need to be able to create files in that directory or write within the subdirectories. Fuck it. I don't have time for this shit. I just did:

chmod 0775 /usr/java/android-sdk/extras
chown $BUILDUSER /usr/java/android-sdk
chown -R $BUILDUSER /usr/java/android-sdk/extras

("$BUILDUSER" is the username that does the compiles. In our case it is teamcity because we use TeamCity.)

Maybe I'll use my copious spare time some day to figure out if the -R is needed. I mean... what sysadmin doesn't have tons of spare time to do science experiments like that? We're all just sitting around with nothing to do, right? In the meanwhile, -R works so I'm not touching it.

EVALUATION: OMG. Please fix this, Android folks! Builds should not modify themselves! At least document what needs to be writable!

Step 7. All done!

At this point the CI system started working.

Some of the steps I automated via Puppet, the rest I documented in a wiki page. In the future when we build additional CI hosts Puppet will do the easy stuff and we'll manually do the rest.

I don't like having manual steps but at our scale that is sufficient. At least the process is repeatable now. If I had to build dozens of machines, I'd wrap all of this up into RPMs and deploy them. However then the next time Android produces a new release, I'd have to do a lot of work wrapping the new files in an RPM, testing them, and so on. That's enough effort that it should be in a CI system. If you find that you need a CI system to build the CI system, you know your tools weren't designed with automation in mind.

Hopefully this blog post will help others going through this process.

If I have missed steps, or if I've missed ways of simplifying this, please post in the comments!

P.S. Dear Android team: I love you folks. I think Android is awesome and I love that you name your releases after desserts (though I was disappointed that "L" wasn't Limoncello... but that's just me being selfish). I hope you take my snark in good humor. I am a sysadmin who wants to support his developers as best he can, and fixing these problems with the Android SDK would really help. Then we can make the most awesome Android apps ever... which is what we all want. Thanks!

Posted by Tom Limoncelli in DevOps, Technical Tips

Wouldn't it be nice if you could write a program that could reach into an Apache config file (or an AptConf file, or an /etc/aliases file, Postfix master.cf, sshd/ssh config, sudoers, Xen conf, yum or other) make a change, and not ruin the comments and other formatting that exists?

That's what Augeas permits you to do. If a config file's format has been defined in the Augeas "lens" language, you can use Augeas to parse the file and pull out the data you want, plus add, change, or delete elements. When Augeas saves the file it retains all comments and formatting. Any changes you make retain the formatting you'd expect.

Augeas can be driven from a command-line tool (augtool) or via the library. You can use the library from Ruby, Puppet, and other systems. There is a project to rewrite Puppet modules so that they use Augeas (augeasproviders.com/providers)
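For example, here is a tiny augtool session. This is only a sketch (it assumes the stock Sshd lens and the default /files tree on your machine), but it shows the flavor of changing one value in sshd_config without touching anything else in the file:

$ augtool
augtool> set /files/etc/ssh/sshd_config/PermitRootLogin no
augtool> save
Saved 1 file(s)
augtool> quit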

Version 1.5 of Augeas was released this week. The number of formats it understands (lenses) has increased (see the complete list here). You can also define your own lens, but that requires an understanding of parsing concepts (get out your old CS textbook; you'll need a refresher). That said, I've found friendly support via their mailing list when I've gotten stuck writing my own lens.

The project homepage is http://augeas.net/ and the new release was just announced on the mailing list (link).

Posted by Tom Limoncelli in Technical Tips

It makes me sad to see people type more than they have to. With these aliases, you reduce the 4 most common commands to 2 letter abbreviations:

git config --global alias.co checkout
git config --global alias.br branch
git config --global alias.ci commit
git config --global alias.st status

NOTE: This updates your ~/.gitconfig file and adds aliases "co", "br", "ci", and "st".

If you collaborate with others, git pull makes a messy log. Instead, always type git pull --rebase --ff-only. This will make the merge history a lot more linear when possible, otherwise it falls back to the normal pull behavior. Of course, if you set this alias git p is all you need to remember:

git config --global alias.p "pull --rebase --ff-only"

These last aliases pretty-print the output of git log five different ways. They make the logs colorful, beautiful, and much more useful. To be honest, I haven't spent the time to review the git manual to figure out how they work. I don't care. I copied them from someone else, who copied them from someone else. They work great. Thanks to the anonymous person that gave them to me. These aliases will help you love git logs:

git config --global alias.lg "log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative"
git config --global alias.lgg "log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative --name-only"
git config --global alias.ll 'log --pretty=format:"%C(yellow)%h%Cred%d\\ %Creset%s%Cblue\\ [%cn]" --decorate --numstat'
git config --global alias.ld 'log --pretty=format:"%C(yellow)%h\\ %C(green)%ad%Cred%d\\ %Creset%s%Cblue\\ [%cn]" --decorate --date=short --graph'
git config --global alias.ls 'log --pretty=format:"%C(green)%h\\ %C(yellow)[%ad]%Cred%d\\ %Creset%s%Cblue\\ [%cn]" --decorate --date=relative'

To install all these aliases:

Option A: Run the above commands. A copy has been placed at this link so you can download the script and run it.

Option B: Copy them out of my .gitconfig which you can access via this link

Option C: If you trust me, and you shouldn't, you can run:

curl https://raw.githubusercontent.com/TomOnTime/tomutils/master/gitstuff/install-fav-git-aliases.sh.txt | sh -x

Posted by Tom Limoncelli in Technical Tips

A program I wrote that worked for quite some time started failing. It turns out someone tried to use it to process a file with text encoded as UTF16. The file came from a Windows system and, considering things like UoW, this situation is just going to start happening more frequently.

Golang has a great package for dealing with various UTF encodings. That said, it still took me a few hours to figure out how to make an equivalent of ioutil.ReadFile(). I wrapped up what I learned and made it into a module. Everything should just work like magic.

  • Instead of using os.Open(), use utfutil.OpenFile().
  • Instead of ioutil.ReadFile(), use utfutil.ReadFile().

The module is available on Github: https://github.com/TomOnTime/utfutil

Posted by Tom Limoncelli in Technical Tips

Someone recently commented that with Github it is "a pain if you want to have a work and personal identity."

It is? I've had separate work and personal Github accounts for years. I thought everyone knew this trick.

When I clone a URL like git@github.com:TomOnTime/tomutils.git I simply change github.com to either home-github.com or work-github.com. Then I have my ~/.ssh/config file set with those two names configured to use different keys:

# TomOnTime
Host home-github.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_rsa-githubpersonal
  PreferredAuthentications publickey
  PasswordAuthentication no
  IdentitiesOnly yes

# tlimoncelli
Host work-github.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/id_rsa-githubwork
  PreferredAuthentications publickey
  PasswordAuthentication no
  IdentitiesOnly yes

I also have things set up so that if I leave the name alone, my work-owned machines default to the work key, and my personal machines default to my personal key.

As for the web user interface, rather than switching between accounts by logging out and logging back in all the time, I simply use Chrome's multi-user feature. Each user profile has its own cookie jar and maintains its own set of bookmarks, color themes, and so on. One user is my "work" profile. It is green (work==money==green), has bookmarks that are work-related, and is logged into my work Github account. The other is my "home" profile. It is blue (I live in a blue house), has my personal bookmarks, and is logged into my personal Github account.

Having each profile be a very different color makes it very easy to tell which profile I am in. This prevents me from accidentally using my work profile for personal use or vice-versa.

I know some people do something similar by using different browsers but I like this a lot more.

Once I set this up, using multiple accounts on Github was easy!

Posted by Tom Limoncelli in Technical Tips

There are many ways to specify scheduled items. Cron has 10 8,20 * 8 1-5 and iCalendar has RRULE and Roaring Penguin brings us REMIND. There's a new cross-platform DSL called Schyntax, created by my Stack Overflow coworker Bret Copeland.

The goal of Schyntax is to be human readable, easy to pick up, and intuitive. For example, to specify every hour from 900 UTC until 1700 UTC, one writes hours(9..17)

What if you want to run every five minutes during the week, and every half hour on weekends? Group the sub-conditions in curly braces:

{ days(mon..fri) min(*%5) } { days(sat..sun) min(*%30) }

It is case-insensitive, whitespace-insensitive, and always UTC.

You can read more examples in his blog post: Schyntax Part 1: The Language. In Schyntax Part 2: The Task Runner (aka "Schtick") you'll see how to create a scheduled task runner using the Schyntax library.

You can easily use the library from JavaScript and C#. Bret is open-sourcing it in hopes that other implementations are created. Personally I'd love to see a Go implementation.

The Schyntax syntax and sample libraries are on Github.

Posted by Tom Limoncelli in Technical Tips, Time Management

TOML vs. JSON

[This is still only draft quality but I think it is worth publishing at this point.]

Internally at Stack Exchange, Inc. we've been debating the value of certain file formats: YAML, JSON, INI and the new TOML format just to name a few.

[If you are unfamiliar with TOML, it is Tom's Obvious, Minimal Language. "Tom", in this case, is Tom Preston-Werner, founder and former CEO of GitHub. The file format has not yet reached version 1.0 and is still changing. However, I do like it a lot. Also, the name of the format IS MY FREAKIN' NAME, which is totally awesome. --Sincerely, Tom L.]

No one format is perfect for all situations. However while debating the pros and cons of these formats something did dawn on me: one group is for humans and another is for machines. The reason there will never be a "winner" in this debate is that you can't have a single format that is both human-friendly and machine-friendly.

Maybe this is obvious to everyone else but I just realized:

  1. The group that is human-friendly is easy to add comments to, tolerant of ambiguity, and often weakly typed (differentiating only between ints and strings, if that).

  2. The group that is machine-friendly makes it difficult (or impossible) to add comments, is less forgiving about formatting, and is often strongly typed.

As an example of being unforgiving about formatting, JSON doesn't permit a comma on the last line of a list.

This is valid JSON:

{
   "a": "apple", 
   "alpha": "bits", 
   "j": "jax"
}

This is NOT valid JSON:

{
   "a": "apple", 
   "alpha": "bits", 
   "j": "jax",
}

Can you see the difference? Don't worry if you missed it because it just proves you are a human being. The difference is the "j" line has a comma at the end. This is forbidden in JSON. This catches me all the time because, well, I'm human.

It also distracts me because diffs are a lot longer as a result. If I add a new value, such as "p": "pebbles" the diff looks very different:

$ diff x.json  xd.json 
4c4,5
<    "j": "jax"
---
>    "j": "jax",
>    "p": "pebbles"

However if JSON did permit a trailing comma (which it doesn't), the diffs would look shorter and be more obvious.

$ diff y.json yd.json 
4a5
>    "p": "pebbles",

This is not just a personal preference. This has serious human-factors consequences in an operational environment. It is difficult to safely operate a large complex system, and one of the ways we protect ourselves is by diff'ing versions of configuration files. We don't want to be visually distracted by little things like having to mentally de-dup the "j" line.

The other difference is around comments. One camp permits them and another camp doesn't. In operations often we need to be able to temporarily comment out a few lines, or include ad hoc messages. Operations people communicate by leaving breadcrumbs and todo items in files. Rather than commenting out some lines I could delete them and use version control to bring them back, but that is much more work. Also, often I write code in comments for the future. For example, as part of preparation for a recent upgrade, we added the future configuration lines to a file but commented them out. By including them, they could be proofread by coworkers. It was suggested that if we used JSON we would simply add a key to the data structure called "ignore" and update the code to ignore any hashes with that key. That's a lot of code to change to support that. Another suggestion was that we add a key called "comment" with a value that is the comment. This is what a lot of JSON users end up doing. However the comments we needed to add don't fit into that paradigm. For example we wanted to add comments like, "Ask so-and-so to document the history of why this is set to false" and "Keep this list sorted alphabetically". Neither of those comments could be integrated into the JSON structures that existed.

On the other hand, strictly formatted formats like JSON are, in theory, faster to parse. Supporting ambiguity slows things down and leads to other problems. In the case of JSON, it is so widely supported that that alone is a good reason to use it.

Some formats have typed data, others assume all data are strings, others distinguish between integer and string but go no further. YAML, if you implement the entire standard, has a complex way of representing specific types and even supports repetition with pointers. All of that turns YAML's beautifully simple format into a nightmare unsuitable for human editing.

I'm not going to say "format XYZ is the best and should be used in all cases" however I'd like to summarize the attributes of each format:

*  Quality \ Format                    JSON     YAML                TOML        INI
M  Formal standard                     YES      YES                 soon        no
M  Strongly typed                      YES      YES                 string/int  no
M  Easy to implement the
   entire standard                     YES      no                  YES         YES
H  Awesome name!                       no       no                  YES         no
H  Permits comments                    no       start of line only  YES         usually
H  diffs neatly                        no       YES (I think)       YES         YES
H  Can be programmatically updated
   without losing format or comments   yes-ish  NO                  soon        NO

The * column indicates if this quality is important for machines (M) or humans (H). NOTE: This chart is by no means complete.

Personally I'm trying to narrow the file formats in our system down to two: one used for machine-to-machine communication (that is still human readable), and the other that is human-generated (or at least human-updated) for machine consumption (like configuration files). (Technically there's a 3rd need: Binary format for machine-to-machine communication, such as ProtoBufs or CapnProto.)

I'm very optimistic about TOML and look forward to seeing it get to a 1.0 standard. Of course, the fact that I am "Tom L." sure makes me favor this format. I mean, how could I not like that, eh?

Update: 2015-07-01: Updated table (TOML is typed), and added row for "Awesome name".

Recently we were having the most difficult time planning what should have been a simple upgrade. There is a service we use to collect monitoring information (scollector, part of Bosun). We were making a big change to the code, and the configuration file format was also changing.

The new configuration file format was incompatible with the old format.

We were concerned with a potential Catch-22 situation. Which do we upgrade first, the binary or the configuration file? If we put the new RPM in our Yum repo, machines that upgrade to this package will not be able to read their configuration file and that's bad. If we convert everyone's configuration file first, any machine that restarts (or if the daemon is restarted) will find the new configuration file and that would also be bad.

The configuration files (old and new) are generated by the same configuration management system that deploys the new RPMs (we use Puppet at Stack Exchange, Inc.). So, in theory we could specify particular RPM package versions and make sure that everything happens in a coordinated manner. Then the only problem would be newly installed machines, which would be fine because we could pause that for an hour or two.

But then I realized we were making a lot more work for ourselves by ignoring the old Unix adage: If you change the file format, change the file name. The old file was called scollector.conf; the new file would be scollector.toml. (Yes, we're using TOML).

Now that the new configuration file would have a different name, we simply had Puppet generate both the old and new file. Later we could tell it to upgrade the RPM on machines as we slowly roll out and test the software. By doing a gradual upgrade, we verify functionality before rolling out to all hosts. Later we would configure Puppet to remove the old file.

This reminds me of the fstab situation in Solaris many years ago. Solaris 1.x had an /etc/fstab file just like Linux does today. However, Solaris 2.x radically changed the file format (mostly for the better). They could have kept the filename the same, but they followed the adage, and for good reason. Many utilities and home-grown scripts manipulate the /etc/fstab file. They would all have to be rewritten. It is better for them to fail with a "file not found" error right away than to plow ahead and modify the file incorrectly.

This technique, of course, is not required if a file format changes in an upward-compatible way. In that case, the file name can stay the same.

I don't know why I hadn't thought of that much earlier. I've done this many times before. However the fact that I didn't think of it made me think it would be worth blogging about it.

Posted by Tom Limoncelli in Technical Tips

Someone asked me in email for advice about how to move many machines to a new corporate standard. I haven't dealt with desktop/laptop PC administration ("fleet management") in a while, but I explained this experience and thought I'd share it on my blog:

I favor using "the carrot" over "the stick". The carrot is making the new environment better for the users so they want to adopt it, rather than using management fiat or threats to motivate people. Each has its place.

The more people feel involved in the project the more likely they are to go along with it. If you start by involving typical users by letting them try out the new configuration in a test lab or even loaning them a machine for a week, they'll feel like they are being listened to and will be your partner instead of a roadblock.

Once I was in a situation where we had to convert many PCs to a corporate standard.

First we made one single standard PC. We let people try it out and find problems. We resolved or found workarounds to any problems or concerns raised.

At that point we had a rule: all new PCs would be built using the standard config. No regressions. The number of standard PCs should only increase. If we did that and nothing more, eventually everything would be converted as PCs only last 3 years.

That said, preventing any back-sliding (people installing PCs with the old configuration by mistake, out of habit, or wanting an "exception") was a big effort. The IT staff had to be vigilant. "No regressions!" was our battlecry. Management had to have a backbone. People on the team had to police ourselves and our users.

We knew waiting for the conversion to happen over 3 years was much too slow. However before we could accelerate the process, we had to get those basics correct.

The next step was to convert the PCs of people that were willing and eager. The configuration was better, so some people were eager to convert. Updates happened automatically. They got a lot of useful software pre-installed. We were very public about how the helpdesk was able to support people with the new configuration better and faster than the old configuration.

Did some people resist? Yes. However there were enough willing and eager people to keep us busy. We let those "late adopters" have their way. Though, we'd mentally prepare them for the eventual upgrade by saying things like (with a cheerful voice), "Oh, we're a late adopter! No worries. We'll see you in a few months." By calling them "late adopter" instead of "resistor" or "hard cases" it mentally reframed the issue as them being "eventual" not "never".

Some of our "late adopters" volunteered to convert on their own. They got a new machine and didn't have a choice. Or, they saw that other people were happy with the new configuration and didn't want to be left behind. Nobody wants to be the only kid on the block without the new toy that all the cool kids have.

(Oh, did I mention the system for installing PCs the old way is broken and we can't fix it? Yeah, kind of like how parents tell little kids the "Frozen" disc isn't working and we'll have to try again tomorrow.)

Eventually those conversions were done and we had the time and energy to work on the long tail of "late adopters". Some of these people had verified technical issues such as software that didn't work on the new system. Each of these could be many hours or days helping the user make the software work or finding replacement products. In some cases, we'd extract the user's disk into a Virtual Machine (p2v) so that it could run in the old environment.

However, eventually we had to get rid of the last few hold-outs. The support cost of the old configuration was $x, and if there are 100 remaining machines, $x/100 per machine isn't a lot of money. When there are 50 remaining machines the cost is $x/50 each. Eventually the cost is $x/1, and that makes that last machine very, very expensive. The faster we can get to zero, the better.

We announced that unconverted machines would be unsupported after date X, and would stop working (the file servers wouldn't talk to them) by date Y. We had to get management support on X and Y, and a commitment to not make any exceptions. We communicated the dates broadly at first, then eventually only the specific people affected (and their manager) received the warnings. Some people figured out that they could convince (trick?) their manager into buying them a new PC as part of all this... we didn't care as long as we got rid of the old configuration. (If I were doing this today, I'd use 802.1X to kick old machines off the network after date Z.)

One excuse we could not tolerate was "I'll just support it myself". The old configuration didn't automatically receive security patches and "self-supported machines" were security problems waiting to happen. The virtual machines were enough of a risk.

Speaking of which... the company had a loose policy about people taking home equipment that was discarded. A lot of kids got new (old) PCs. We were sure to wipe the disks and be clear that the helpdesk would not assist them with the machine once disposed. (In hindsight, we should have put a sticker on the machine saying that.)

Conversion projects like this pop up all the time. Sometimes it is due to a smaller company being bought by a larger company, a division that didn't use centralized IT services adopting them, or moving from an older OS to a newer OS.

If you are tasked with a similar conversion project you'll find you need to adjust the techniques you use depending on many factors. Doing this for 10 machines, 500 machines, or 10,000 machines all require adjusting the techniques for the situation.

If you manage server farms instead of desktop/laptop PC fleets similar techniques work.

Posted by Tom Limoncelli in Technical Tips

Short version: My mailing list server no longer generates bounce messages for unknown accounts, thus eliminating the email backscatter it generates.

Longer version:

I have a host set up exclusively for running mailing lists using Mailman, and battling spam has been quite a burden. I finally 'gave up' and made all the lists "members only". Luckily that is possible with the email lists being run there. If I had any open mailing lists, I wouldn't have been so lucky. The result of this change was that it eliminated all spam and I was able to disable SpamAssassin and other measures put in place. SpamAssassin had been using more and more CPU time and was letting more and more spam through.

That was a few years ago.

However, then the problem became Spam Backscatter. Spammers were sending to nearly every possible username in hopes of getting through. Each of these attempts resulted in a bounce message being sent to the (forged) email address the attempt claimed to come from. It got to the point where 99% of the email traffic on the machine was these bounces. The host was occasionally being blocked as punishment for generating so many bounces. Zero of these bounces were "real"... i.e. each bounce was going to an address that didn't actually send the original message and didn't care about the contents of the bounce message.

These unwanted bounce messages are called "Spam Backscatter".

My outgoing mail queue was literally filled with these bounce messages, being re-tried for weeks until Postfix would give up. I changed Postfix to delete them after a shorter amount of time, but the queue was still getting huge.

This weekend I updated the system's configuration so that it just plain doesn't generate bounces to unknown addresses on the machine. While this is something you absolutely shouldn't do for a general purpose email server (people mistyping the addresses of your users would get very confused) doing this on a highly specialized machine makes sense.
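The post doesn't show the exact change, but with Postfix the usual way to get this behavior is to reject unknown recipients during the SMTP conversation instead of accepting the message and bouncing it later; the sending server (often the spammer's own) then owns any failure notice. A minimal sketch of that approach in main.cf, where the lookup table name is hypothetical:

# Reject mail for unknown recipients at RCPT TO time rather than
# accepting it and generating a bounce afterward.
local_recipient_maps = hash:/etc/postfix/valid_recipients
unknown_local_recipient_reject_code = 550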

I can now proudly say that for the last 48 hours the configuration has worked well. The machine is no longer a source of backscatter pollution on the internet. The mail queue is empty. It's a shame my other mail servers can't benefit from this technique.

Posted by Tom Limoncelli in System News, Technical Tips

I have some PDFs that have to be reviewed in Adobe Reader, because they include "comments" that OS X "Preview" can't display and edit.

This alias has saved me hours of frustration:

alias reader='open -a /Applications/Adobe\ Reader.app'

Now I can simply type "reader" instead of, say, "cat", and view the PDF:

reader Limoncelli_Ch13_jh.pdf

For those of you that are unfamiliar with Adobe Acrobat Reader, it is Adobe's product for distributing security holes to nearly every computer system in the world. It is available for nearly every platform, which makes it a very convenient way to assure that security problems can be distributed quickly and globally. Recently Adobe added the ability to read and display PDFs. Previously I used Oracle Java to make sure all my systems were vulnerable to security problems, but now that Reader can display PDFs, they're winning on the feature war. I look forward to Oracle's response since, as I've always said, when it comes to security, the free market is oh so helpful.

Posted by Tom Limoncelli in Rants, Technical Tips

How not to use Cron

A friend of mine told me of a situation where a cron job took longer to run than usual. As a result the next instance of the job started running and now they had two cronjobs running at once. The result was garbled data and an outage.

The problem is that they were using the wrong tool. Cron is good for simple tasks that run rarely. It isn't even good at that. It has no console, no dashboard, no dependency system, no API, no built-in way to have machines run at random times, and it's a pain to monitor. All of these issues are solved by CI systems like Jenkins (free), TeamCity (commercial), or any of a zillion other similar systems. Not that cron is all bad... just pick the right tool for the job.

Some warning signs that a cron job will overrun itself: If it has any dependencies on other machines, chances are one of them will be down or slow and the job will take an unexpectedly long time to run. If it processes a large amount of data, and that data is growing, eventually it will grow enough that the job will take longer to run than you had anticipated. If you find yourself editing longer and longer crontab lines, that alone could be a warning sign.

I tend to only use cron for jobs that have little or no dependencies (say, only depend on the local machine) and run daily or less. That's fairly safe.

There are plenty of jobs that are too small for a CI system like Jenkins but too big for cron. So what are some ways to prevent this problem of cron job overrun?

It is tempting to use locks to solve the problem. Tempting but bad. I once saw a cron job that paused until it could grab a lock. The problem with this is that when the job overran there was now an additional process waiting to run. They ended up with zillions of processes all waiting on the lock. Unless the job magically started taking less time to run, all the jobs would never complete. That wasn't going to happen. Eventually the process table filled and the machine crashed. Their solution (which was worse) was to check for the lock and exit if it existed. This solved the problem but created a new one. The lock jammed and now every instance of the job exited. The processing was no longer being done. This was fixed by adding monitoring to alert if the process wasn't running. So, the solution added more complexity. Solving problems by adding more and more complexity makes me a sad panda.

The best solution I've seen is to simply not use cron when doing frequent, periodic, big processes. Just write a service that does the work, sleeps a little bit, and repeats.

while true ; do
   process_the_thing
   sleep 600
done

Simple. Yes, you need a way to make sure that it hasn't died, but there are plenty of "watcher" scripts out there. You probably have one already in use. Yes, it isn't going to run precisely n times per hour, but usually that's not needed.

You should still monitor whether or not the process is being done. However you should monitor whether results are being generated rather than if the process is running. By checking for something that is at a high level of abstraction (i.e. "black box testing"), it will detect if the script stopped running or the program has a bug or there's a network outage or any other thing that could go wrong. If you only monitor whether the script is running then all you know is whether the script is running.
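For example, here is a trivial freshness check that watches the result instead of the process. This is only a sketch: the output path and threshold are made up, and stat -c %Y is the GNU/Linux form (macOS uses stat -f %m):

#!/bin/bash
# Alert if the result file has not been updated recently enough.
OUTPUT=/var/data/process_the_thing.out   # hypothetical result file
MAXAGE=1800                              # seconds
now=$(date +%s)
mtime=$(stat -c %Y "$OUTPUT" 2>/dev/null || echo 0)
if (( now - mtime > MAXAGE )); then
  echo "CRITICAL: $OUTPUT is stale or missing"
  exit 2
fi
echo "OK: $OUTPUT updated $(( now - mtime ))s ago"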

And before someone posts a funny comment like, "Maybe you should write a cron job that restarts it if it isn't running". Very funny.

Posted by Tom Limoncelli in Technical Tips

I write a lot of small bash scripts. Many of them have to run on MacOS as well as FreeBSD and Linux. Sadly MacOS comes with a bash 3.x which doesn't have many of the cooler features of bash 4.x.

Recently I wanted to use read's "-i" option, which doesn't exist in bash 3.x.

My Mac does have bash 4.x but it is in /opt/local/bin because I install it using MacPorts.

I didn't want to list anything but "#!/bin/bash" on the first line because the script has to work on other platforms and on other people's machines. "#!/opt/local/bin/bash" would have worked for me on my Mac but not on my Linux boxes, FreeBSD boxes, or friend's machines.

I finally came up with this solution. If the script detects it is running under an old version of bash it looks for a newer one and exec's itself with the new bash, reconstructing the command line options correctly so the script doesn't know it was restarted.

#!/bin/bash
# If old bash is detected. Exec under a newer version if possible.
if [[ $BASH_VERSINFO < 4 ]]; then
  if [[ $BASH_UPGRADE_ATTEMPTED != 1 ]]; then
    echo '[Older version of BASH detected.  Finding newer one.]'
    export BASH_UPGRADE_ATTEMPTED=1
    export PATH=/opt/local/bin:/usr/local/bin:"$PATH":/bin
    exec "$(which bash)" --noprofile "$0" """$@"""
  else
    echo '[Nothing newer found.  Gracefully degrading.]'
    export OLD_BASH=1
  fi
else
  echo '[New version of bash now running.]'
fi

# The rest of the script goes below.
# You can use "if [[ $OLD_BASH == 1]]" to
# to write code that will work with old
# bash versions.

Some explanations:

  • $BASH_VERSINFO returns just the major release number; much better than trying to parse $BASH_VERSION.
  • export BASH_UPGRADE_ATTEMPTED=1 Note that the variable is exported. Exported variables survive "exec".
  • export PATH=/opt/local/bin:/usr/local/bin:"$PATH":/bin We prepend a few places that the newer version of bash might be. We append /bin because if it isn't found anywhere else, we want the current bash to run. We know bash exists in /bin because of the first line of the script.
  • exec "$(which bash)" --noprofile "$0" "$@"
    • exec This means "replace the running process with this command".
    • $(which bash) finds the first command called "bash" in the $PATH.
    • "$(which bash)" By the way... this is in quotes because $PATH might include spaces. In fact, any time we use a variable that may contain spaces we put quotes around it so the script can't be hijacked.
    • --noprofile We don't want bash to source /etc/profile, ~/.bash_profile, and similar startup files.
    • "$0" The name of the script being run.
    • """$@""" The command line arguments will be inserted here with proper quoting so that if they include spaces or other special chars it will all still work.
  • You can comment out the "echo" commands if you don't want it to announce what it is doing. You'll also need to remove the last "else" since else clauses can't be empty.
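
For example, here's one hedged way to use that flag to degrade gracefully around read's "-i" option mentioned earlier (the prompt text and variable names are just examples):

if [[ $OLD_BASH == 1 ]]; then
  # bash 3.x: read has no -i, so we can't pre-fill an answer.
  read -e -p "Hostname: " ANSWER
else
  # bash 4.x: pre-fill the answer so the user can just press Enter.
  read -e -i "$DEFAULT_HOSTNAME" -p "Hostname: " ANSWER
fi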

Enjoy!

Posted by Tom Limoncelli in Technical Tips

People say things like, "Can you just send me a copy of data?"

If people are taking your entire database as a CSV file and processing it themselves, your API sucks.

(Overheard at an ACM meeting today)

Posted by Tom Limoncelli in Technical Tips

[ This is a guest post from Dan O’Boyle, who I met at a LOPSA-NJ meeting. I asked him to do a guest post about this subject because I thought the project was something other schools would find useful. ]

I’m a systems engineer for a moderately sized school district in NJ.  We own a number of different devices, but this article is specifically about the AcerOne line of netbooks.  I was recently tasked with finding a way to breathe new life into about 500 of these devices.  The user complaints about these models ranged from “constant loss of wireless connectivity” to the ever-descriptive “slow”.  The units have 1 GB of RAM, and our most recent image build had them domain-joined, running Windows 7N 32-bit.

These machines were already running a very watered-down Windows experience.  I considered what the typical user experience was: they would boot the device, log in to Windows, log in to Chrome (via Google Apps for Education accounts) and then begin their browsing experience.  Along the way they would lose their wireless connection (due to a possibly faulty Windows driver), experience CPU and memory bottlenecks due to antivirus and other background Windows processes, and generally have a bad time.  The worst part was that I couldn’t see a way to streamline this experience short of removing Windows.  It turns out that was exactly the solution we needed.

Chromium OS is the open source version of Google’s ChromeOS operating system. The project provides instructions on how to build your own distro and has a fairly responsive development community.  Through the community, I was able to find information on two major build distributors: Arnold the Bat and Hexxah.  Hexxah’s builds seem to get a bit less attention than Arnold’s, so after testing both I decided to use one of Arnold’s most recent builds.

The AcerOnes took the build without issue.  A few gotchas to be aware of are hard drive size, unique driver needs, and method of deployment.  Before I describe those problems, I’ll need to explain a bit about our planned method of deployment.

Individual Device configuration:

Configuring the OS on one device took about an hour from download to tweaking.  After copying the build to a USB stick, I installed it to the local HDD of my AcerOne.  I noticed that the wireless card was not detected by default.  This is typically due to a driver issue, and can often be solved by adding drivers to the /lib/firmware directory.  With the wireless card up and running, I added Flash/Java/PDF/MP3 support with this script.  (Note that the script is listed as working with Hexxah’s builds but it also works with Arnold’s.  The default password on Arnold’s builds is “password”.)

Deployment:

Finally, I was ready to try cloning my machine for distribution.  My first successful attempt was using Clonezilla to make a local Clonezilla repo on USB.  This was effective, but it wasn’t pretty.  To distribute this build out to multiple buildings I needed to boot the ISO created by Clonezilla over PXE, and given that some of my AcerOnes had 2 GB of RAM and some only had 1, many of the devices wouldn’t be able to load the ISO locally into RAM to perform the install.

The next attempt I made was using FOG.  FOG was able to capture the image and store it on a PXE server.  FOG boots machines into a small Linux kernel, then issues commands through that kernel to perform disk operations.  This method would work even on my 1 GB machines.  At this point I discovered the hard disk problem mentioned earlier: I had originally built my image on a 250 GB HDD, but some of my machines only had a 160 GB drive.  Even though the image is much smaller than that (about 4 GB), FOG decided that the smaller HDD wouldn’t be able to handle the image and refused to deploy.  This can be solved by ensuring that your build machine has a smaller HDD than any machine you intend to deploy to.

Final Deploy time:

Overall I was able to take the one-hour configure time it took me to set up one machine and cut it down to about 5 minutes for a technician in the field.  Stored information about the wireless networks I pre-configured on the master device seems to be in a protected area of the disk that FOG couldn’t read.  The end result is that a technician must image a unit, then enter the wireless key information after it’s deployed.

The user experience on the new “ChromiumBooks” has been right on target so far. The devices boot in about 40 seconds. Most of that time is the hardware boot process. Once that is complete ChromiumOS still loads in under 8 seconds. Users are immediately able to login to their Google Apps for Education accounts and begin browsing.

The Linux driver for the wifi cards seems to be more stable than the Windows driver, and I have far fewer reports of “wifi drop offs”.

Overall, getting rid of Windows has been great for these devices.

If you liked this story, or want to shoot me some questions feel free to find me at www.selfcommit.com.

Posted by Dan O'Boyle in Technical Tips

Hey fellow sysadmins! Please take 5 minutes to make sure your DNS servers aren't open to the world for recursive queries. They can be used as amplifiers in DDOS attacks.

https://www.isc.org/wordpress/is-your-open-dns-resolver-part-of-a-criminal-conspiracy/

The short version of what you need to do is here.

Posted by Tom Limoncelli in Technical Tips

Reverting in "git"

I'm slowly learning "git". The learning curve is hard at first and gets better as time goes on. (I'm also teaching myself Mercurial, so let's not start a 'which is better' war in the comments).

Reverting a file can be a little confusing in git because git uses a different model than, say, SubVersion. You are in a catch-22 because to learn the model you need to know the terminology. To learn the terminology you need to know the model. I think the best explanations I've read so far have been in the book Pro Git, written by Scott Chacon and published by Apress. Scott put the entire book up online, and for that he deserves a medal. You can also buy a dead-tree version.

How far back do you want to revert a file? To how it was the last time you did a commit? The last time you did a pull? Or to how it is on the server right now (which might be neither of those)?

Revert to like it was when I did my last "git commit":

git checkout HEAD -- file1 file2 file3

Revert to like it was when I did my last "pull":

git checkout FETCH_HEAD -- file1 file2 file3

Revert to like it is on the server right now:

git fetch
git checkout FETCH_HEAD -- file1 file2 file3

How do these work?

The first thing you need to understand is that HEAD is an alias for the last time you did "git commit".

FETCH_HEAD is an alias for the last time you did a "git fetch". "git fetch" pulls the latest changes from the server, but hides them away. It does not merge them into your workspace. "git merge" merges the recently fetched files into your current workspace. "git pull" is simply a fetch followed by a merge. I didn't know about "git fetch" for a long time; I happily used "git pull" all the time.

You can set up aliases in your ~/.gitconfig file. They act exactly like real git commands. Here are the aliases I have:

[alias]
  br = branch
  st = status
  co = checkout
  revert-file = checkout HEAD --
  revert-file-server = checkout FETCH_HEAD --
  lg = log --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit --date=relative

This means I can do git br instead of "git branch", saving me a lot of typing. git revert-file file1 file2 file3 is just like the first example above. git revert-file-server is a terrible name, but it reverts files to how they were as of the last fetch. git lg outputs a very pretty log of recent changes (I stole that from someone who probably stole it from someone else. Don't ask me how it works).

To add these aliases on your system, find or add an [alias] stanza in your ~/.gitconfig file and add them there.
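
Alternatively, git can edit ~/.gitconfig for you; for example, these two commands add two of the aliases above:

git config --global alias.st status
git config --global alias.revert-file 'checkout HEAD --'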

Posted by Tom Limoncelli in Technical Tips

Matt Simmons of the Standalone Sysadmin blog asked about labeling network cables in a datacenter on the LOPSA-Tech mailing list which brought up a number of issues.

He wrote:

So, my current situation is that I'm working in a datacenter with 21 racks arranged in three rows, 7 racks long. We have one centralized distribution switch and no patch panels, so everything is run to the switch which lives in the middle, roughly. It's ugly and non-ideal and I hate it a bunch, but it is what it is. And it looks a lot like this.

Anyway, so given this really suboptimal arrangement, I want to be able to more easily identify a particular patch cable because, as you can imagine, tracing a wire is no fun right now.

He wanted advice as to whether the network cables should be labeled with exactly what the other end is connected to, including hostname and port number, or with a unique ID on each cable so that as cables move around they don't have to be relabeled.

We write about this in the Data Centers chapter of The Practice of System and Network Administration but I thought I'd write a bit more for this blog.

My reply is after the bump...


Posted by Tom Limoncelli in Technical Tips

If you use the Ganeti command line you have probably used gnt-instance list and gnt-node list. In fact, most of the gnt-* commands have a list subcommand. Here are some things you probably didn't know.

Part 1: Change what "list" outputs

Unhappy with how verbose gnt-instance list is? The -o option lets you pick which fields are output. Try this to just see the name:

gnt-instance list -o name

I used to use awk and tail and other Unix commands to extract just the name or just the status. Now I use -o name,status to get exactly the information I need.

I'm quite partial to this set of fields:

gnt-instance list --no-headers -o name,pnode,snodes,admin_state,oper_state,oper_ram

The --no-headers flag means just output the data, no column headings.

What if you like the default fields that are output but want to add others to them? Prepend a + to the field list:

gnt-node list --no-headers -o +group,drained,offline,master_candidate 

This will print the default fields plus the node group and the three main status flags nodes have: is it drained (no instances can move onto it), offline (the node is essentially removed from the cluster), and whether or not the node can be a master.

How does one find the list of all the fields one can output? Use the list-fields subcommand. For each gnt-* command it lists the fields that are available with that list command. That is, gnt-instance list-fields shows a different set of names than gnt-node list-fields.

Putting all this together I've come up with three bash aliases that make my life easier. They print a lot of information but (usually) fit it all on an 80-character wide terminal:

alias i='gnt-instance list --no-headers -o name,pnode,snodes,admin_state,oper_state,oper_ram | sed -e '\''s/.MY.DOMAIN.NAME//g'\'''
alias n='gnt-node list --no-headers -o +group,drained,offline,master_candidate | sed -e '\''s/.MY.DOMAIN.NAME//g'\'''
alias j='gnt-job list | tail -n 90 | egrep --color=always '\''^|waiting|running'\'''

(Change MY.DOMAIN.NAME to the name of your domain.)

Part 2: Filter what's output

The -F option has got to be the least-known feature of the Ganeti command line tools. It lets you restrict which nodes or instances are listed.

List the instances that are using more than 3 virtual CPUs:

gnt-instance list -F 'oper_vcpus > 3'

List the instances that have more than 6G of RAM (otherwise known as "6144 megabytes"):

 gnt-instance list -F 'be/memory > 6144'

The filtering language can handle complex expressions. It understands and, or, ==, <, > and all the operations you'd expect. The ganeti(7) man page explains it all.

Which nodes have zero primary instances? Which have none at all?

gnt-node list --filter 'pinst_cnt == 0'
gnt-node list -F 'pinst_cnt == 0 and sinst_cnt == 0'

Strings must be quoted with double-quotes. Since the entire formula is in single-quotes this looks a bit odd but you'll get used to it quickly.

Which instances have node "fred" as their primary?

gnt-instance list --no-header -o name  -F  'pnode == "fred" '

(I included a space between " and ' to make it easier to read. It isn't needed otherwise.)
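
If you run that kind of query a lot, a small bash function saves typing (the name on_node is just an example):

# Hypothetical helper: list the instances whose primary node is the given node.
on_node () { gnt-instance list --no-headers -o name -F "pnode == \"$1\"" ; }

Then on_node fred lists the instances whose primary node is fred.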

Which nodes are master candidates?

gnt-node list --no-headers -o name -F 'role == "C" '

Do you find typing gnt-cluster getmaster too quick and easy? Try this command to find out who the master is:

gnt-node list --no-headers -o name -F 'role == "M" '

Like most gnt-* commands it must be run on the master, so be sure to use gnt-cluster getmaster to find out who the master is and run the command there.

If you use the "node group" feature of Ganeti (and you probably don't) you can find out which nodes are in node group foo:

gnt-node list -o name -F 'group == "foo" '

and which instances have primaries that are in group foo:

gnt-instance list --no-header -o name -F 'pnode.group == "foo"'

It took me forever to realize that, since snodes is a list, one has to use in instead of ==. Here's a list of all the instances whose secondary is in node group "bar":

gnt-instance list --no-header -o name  -F '"bar" in snodes.group'

("snodes" is plural, "pnode" is singular")

To recap:

  1. The following commands have a list-fields subcommand and list accepts -o and -F options: gnt-node, gnt-instance, gnt-job, gnt-group, gnt-backup.
  2. -o controls which fields are output when using the list subcommand.
  3. -F specifies a filter that controls which items are listed.
  4. The field names used with -o and -F are different for each gnt-* command.
  5. Use the list-fields subcommand to find out what fields are available for a command.
  6. The filtering language is documented in ganeti(7). i.e. view with: man 7 ganeti
  7. The man pages for the individual gnt-* commands give longer explanations of what each field means.
  8. In bash, filters have to be in single quotes so that the shell doesn't interpret <, >, double-quotes, and other symbols as bash operators.

Enjoy!

Posted by Tom Limoncelli in Ganeti, Technical Tips

What's wrong with this as a way to turn a hostname into a FQDN?

FQDN=$(getent hosts "$SHORT" | awk '{ print $2 }')

Answer: getent can return multiple lines of results. This only happens if the system is configured to check /etc/hosts before DNS and if /etc/hosts lists the hostname multiple times. There may be other ways this can happen too, but that's the situation that bit me. Of course, there shouldn't be multiple repeated lines in /etc/hosts but nothing forbids it.

As a result you can end up with FQDN="hostname.dom.ain hostname.dom.ain" which, and I'm just guessing here, is going to cause problems elsewhere in your script.

The solution is to be a bit more defensive and only take the first line of output:

FQDN=$(getent hosts "$SHORT" | head -1 | awk '{ print $2 }')

Of course, there is still error-checking you should do, but I'll leave that as an exercise to the reader. (Hint: You can check if $? is non-zero; you can also check if FQDN is the null string.)
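
Here's a minimal sketch of that extra checking (names are just examples). Note that in a pipeline $? reflects the last command (awk here), so the empty-string test is the one that actually catches lookup failures:

FQDN=$(getent hosts "$SHORT" | head -1 | awk '{ print $2 }')
if [ -z "$FQDN" ]; then
  echo "unable to resolve $SHORT" >&2
  exit 1
fi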

Technically the man page covers this situation. It says the command "gathers entries" which, being plural, is a very subtle hint that the output might be multiple lines. I bet most people reading the man page don't notice this. It would be nice if the man page explicitly warned the user that the output could be multiple lines long.

P.S. I'm sure the first comment will be a better way to do the same thing. I look forward to your suggestions.

Posted by Tom Limoncelli in Technical Tips

Here's a good strategy to improve the reliability of your systems: Buy the most expensive computers, storage, and network equipment you can find. It is the really high-end stuff that has the best "uptime" and "MTBF".

Wait... why are you laughing? There are a lot of high-end, fault-tolerant, "never fails" systems out there. Those companies must be in business for a reason!

Ok.... if you don't believe that, let me try again.

Here's a good strategy to improve the reliability of your systems: Any time you have an outage, find who caused it and fire that person. Eventually you'll have a company that only employs perfect people.

Wait... you are laughing again! What am I missing here?

Ok, obviously those two strategies won't work. System administration is full of examples of both. At the start of "the web" we achieved high uptimes by buying Sun E10000 computers costing megabucks because "that's just how you do it" to get high performance and high uptimes. That strategy lasted until the mid-2000s. The "fire anyone that isn't perfect" strategy sounds like something out of an "old school" MBA textbook. There are plenty of companies that seem to follow that rule.

We find those strategies laughable because the problem is not the hardware or the people. Hardware, no matter how much or how little you pay for it, will fail. People, no matter how smart or careful, will always make some mistakes. Not all mistakes can be foreseen. Not all edge cases are cost effective to prevent!

Good companies have outages and learn from them. They write down those "lessons learned" in a post-mortem document that is passed around so that everyone can learn. (I've written about how to do a decent postmortem before.)

If we are going to "learn something" from each outage and we want to learn a lot, we must have more outages.

However (and this is important) you want those outages to be under your control.

If you knew there was going to be an outage in the future, would you want it at 3am Sunday morning or 10am on a Tuesday?

You might say that 3am on Sunday is better because users won't see it. I disagree. I'd rather have it at 10am on Tuesday so I can be there to observe it, fix it, and learn from it.

In school we did this all the time. It is called a "fire drill". During the first fire drill of the school year we usually did a pretty bad job. However, the second one was much better. The hope is that if there is a real fire it will be after we've gotten good at it.

Wouldn't you rather just never have fires? Sure, and when that is possible let me know. Until then, I like fire drills.

Wouldn't you rather have computer systems that never fail? Sure, and when that's possible let me know. Until then I like sysadmin fire drills.

Different companies call them different things. Jesse Robins at Twitter calls them "GameDay" exercises. John Allspaw at Etsy refers to "resilience testing" in his new article on ACM Queue. Google calls them something else.

The longer you go without an outage, the more rusty you get. You actually improve your uptime by creating outages periodically so that you don't get rusty. It is better to have a controlled outage than waiting for the next outage to find you out of practice.

Fire drills don't have to be visible to the users. In fact, they shouldn't be. You should be able to fail over a database to the hot spare without user-visible effects.

Systems that are fault tolerant should be periodically tested. Just like you test your backups by doing an occasional full restore (don't you?), you should periodically fail over that database server, web server, RAID system, and so on. Do it in a controlled way: plan it, announce it, make contingency plans, and so on. Afterwards, write up a timeline of what happened, what mistakes were made, and what can be done to improve things next time. For each improvement file a bug. Assign someone to hound people until the bugs are all closed. Or, if a bug is "too expensive to fix", have management sign off on that decision. I believe that being unwilling to pay to fix a problem ("allocate resources" in business terms) is equal to saying "I'm willing to take the risk that it won't happen." So make sure they understand what they are agreeing to.

Most importantly: have the right attitude. Nobody should be afraid to be mentioned in the "lessons learned" document. Instead, people should be rewarded, publicly, for finding problems and taking responsibility to fix them. Give a reward, even a small one, to the person who fixes the most bugs filed after a fire drill. Even if the award is a dorky certificate to hang on their wall, a batch of cookies, or getting to pick which restaurant we go to for the next team dinner, it will mean a lot. Receiving this award should be something that can be listed on the person's next performance review.

The best kind of fire drill tests cross-team communication. If you can involve 2-3 teams in planning the drill you have the potential to learn a lot more. Does everyone involved know how to contact each other? Is the conference bridge big enough for everyone? If the managers of all three teams have to pretend to be unavailable during the outage, are the three teams able to complete the drill?

My last bit of advice is that fire drills need management approval. The entire management chain needs to be aware of what is happening and understand the business purpose of doing all this.

John's article has a lot of great advice about explaining this to management, what push-back you might expect, and so on. His article, Fault Injection in Production, is so well written even your boss will understand it. (ha ha, a little boss humor there)

[By the way... ACM Queue is really getting 'hip' lately by covering these kinds of "DevOps" topics. I highly recommend visiting queue.acm.org periodically.]

A co-worker watched me type the other day and noticed that I use certain Unix commands for purposes other than those they are intended for. Yes, I abuse Unix commands.

Posted by Tom Limoncelli in Best of Blog, Technical Tips

Queue Magazine (part of ACM) has published my description of OpenFlow. It's basically "the rant I give at parties when someone asks me to explain OpenFlow and why it is important". I hope that people actually involved in OpenFlow standardization and development forgive me for my simplifications and possibly sloppy use of terminology but I think the article does a good job of explaining OF to people that aren't involved in networking:

OpenFlow: A Radical New Idea in Networking

I hope that OpenFlow is adopted widely. It has some cool things in it.

Enjoy.

Tom

Posted by Tom Limoncelli in Technical Tips

A co-worker of mine recently noticed that I tend to use rsync in a way he hadn't seen before:

rsync -avP --inplace $FILE_LIST desthost:/path/to/dest/.

Why the "slash dot" at the end of the destination?

I do this because I want predictable behavior, and the best way to achieve that is to make sure the destination is a directory that already exists. I can't be assured that /path/to/dest/ exists, but I know that if it exists then "." will exist. If the destination path doesn't exist, rsync makes a guess about what I intended, and I don't write code that relies on "guesses". I would rather the script fail in a way I can detect (shell variable $?) than have it "guess what I meant", which is difficult to detect.
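
For example, here's a minimal sketch of failing loudly (the paths are hypothetical; $FILE_LIST is left unquoted on purpose so it can expand to multiple files, as in the command above):

if ! rsync -avP --inplace $FILE_LIST desthost:/path/to/dest/. ; then
  echo "rsync to desthost failed (does the destination directory exist?)" >&2
  exit 1
fi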

What? rsync makes a guess? Yes. rsync changes its behavior depending on a number of factors:

  • is there one source file or multiple source files?
  • is the destination a directory, a file, or doesn't exist?

There are many permutations there. You can eliminate most of them by having a destination directory end with "slash dot".

For example:

  • Example A: rsync -avP file1 host:/tmp/file
  • Example B: rsync -avP file1 file2 host:/tmp/file

Assume that host:/tmp/file exists. In that case, Example A copies the file and renames it in the process. Example B will fail because rsync's author (and I think this is the right decision) decided that it would be stupid to copy file1 to /tmp/file and then copy file2 over it. This is the same behavior as the Unix cp command: If there are multiple files being copied then the last name on the command line has to be a directory otherwise it is an error. The behavior changes based on the destination.

Let's look at those two examples if the destination name doesn't exist:

  • Example C: rsync -avP file1 host:/tmp/santa
  • Example D: rsync -avP file1 file2 host:/tmp/santa

In these examples assume that /tmp/santa doesn't exist. Example C is similar to Example A: rsync copies the file to /tmp/santa, i.e. it renames it as it copies. In Example D, however, rsync will assume you want it to create the directory so that both files have some place to go. The behavior changes due to the number of source files.

Remember that debugging, by definition, is more difficult than writing code. Therefore, if you write code that relies on the maximum of your knowledge, you have, by definition, written code that is beyond your ability to debug.

Therefore, if you are a sneaky little programmer and use your expertise in the arcane semantics and heuristics of rsync, congrats. However, if one day you modify the script to copy multiple files instead of one, or if the destination directory doesn't exist (or unexpectedly does exist), you will have a hard time debugging the program.

How might a change like this happen?

  • Your source file is a variable $SOURCE_FILES and occasionally there is only one source file. Or the variable represents one file but suddenly it represents multiple.
  • The script you've been using for years gets updated to copy two files instead of one.
  • Over time the list of files that need to be copied shrinks and shrinks and suddenly is just single file that needs to be copied.
  • Your destination directory goes away. In the example that my coworker noticed, the destination was /tmp. Well, everyone knows that /tmp always exists, right? I've seen it disappear due to typos, human errors, and broken install scripts. If /tmp disappeared I would want my script to fail.

It is good rsync hygiene to end destinations with "/." if you intend it to be a directory that exists. That way it fails loudly if the destination doesn't exist since rsync doesn't create the intervening subdirectories. I do this in scripts and on the command line. It's just a good habit to get into.

Tom

P.S. One last note. Much of the semantics described above change if you add the "-R" option. They don't get more consistent, they just become different. If you use this option make sure you do a lot of testing to be sure you cover all these edge cases.

Posted by Tom Limoncelli in Technical Tips

I don't think I really understood SSH "Agent Forwarding" until I read this in-depth description of what it is and how it works:

http://www.unixwiz.net/techtips/ssh-agent-forwarding.html

In fact, I admit I had been avoiding using this feature because it adds a security risk and it is best not to use something risky without knowing the internals of why it is risky.

Now that I understand it and can use it, I find it saves me a TON of time. Highly recommended (when it is safe to use, of course!)

Tom

Posted by Tom Limoncelli in Technical Tips


ACM Queue is hosting an online programming competition on its website from January 15 through February 12, 2012.

Using either Java, C++, C#, Python, or JavaScript, code an AI to compete against other participants' programs in a territory-capture game called "Coercion".

The competition is open to everyone.

Details at: http://queue.acm.org/icpc/

Posted by Tom Limoncelli in Technical Tips

Yesterday on the SysAdvent calendar Aleksey Tsalolikhin has an article about configuration management. It includes a comparison of how to do the same task in various languages: bash, CFEngine, Chef, and Puppet. Seeing how the languages differ is very interesting!

SysAdvent: December 19 - Configuration Management

Posted by Tom Limoncelli in Technical Tips

A great explanation about "yield" followed by a discussion of coroutines and more:

In the sequel, he goes into even more detail and then uses all the information to write an operating system in Python.

Posted by Tom Limoncelli in Python, Technical Tips

Fabric is a new tool for ssh'ing to many hosts. It has some nice properties, such as lazy execution. You write the description of what is to be done in Python and Fabric takes care of executing it on all the machines you specify. Once you've used it a bunch of times you'll accumulate many "fab files" that you can re-use. You can use it to create large systems too. The API is simple but powerful.

The tutorial gives you a good idea of how it works: http://docs.fabfile.org/en/1.2.2/tutorial.html

It is written using the Paramiko module which is my favorite way to do SSH and SSH-like things from Python.

The Fabric homepage is: http://www.fabfile.org

Thanks to Joseph Kern for this tip!

Posted by Tom Limoncelli in Technical Tips

The Google flags parser (available for Python and C++) is very powerful. I use it for all my projects at work (of course) and since it has been open sourced, I use it for personal projects too.

While I support open source 100% I rarely get to submit much code into other people's projects (I contribute to documentation more than code... go figure). So, even though it is only a few lines of new code, I do want to point out that the 1.6 release of the Python library has actual code from me.

One of the neat features of this flags library is that you can specify a file to read the flags from. That is, if your command line is too long, you can stick all or some of the flags in a file and specify "--flagfile path/to/file.flags" to have them treated as if you had put them on the command line. Imagine having one flags file that you use in production and another that points the server at a test database, uses a higher level of debug verbosity, and enables beta features. You can specify multiple files, even with overlapping flags, and it does the right thing, keeping the last value.
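
For instance, a hypothetical sketch (the program name, flag names, and file names are made up for illustration):

$ cat test.flags
# flags for local testing
--db_host=db-test.example.com
--verbosity=2
$ ./myserver.py --flagfile=test.flags --port=8080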

My patch was pretty simple. I discovered, through a painful incident, that flagfiles were silently skipped if they were not readable. No warning, no error message. (You can imagine that my discovery was during a frantic "why is this not working???" afternoon.) Anyway... now you get an error instead and the program stops (in Python terms... it raises an exception). I think the unit tests are bigger than the actual code, but I'm glad the patch was accepted.  I hope nobody was depending on this bug as a "feature". Seriously... nobody would turn off flags via "chmod 000 filename.flags", right? So far I haven't gotten any complaints.

Anyway... if you write code in C++ or Python I highly recommend you give gflags a try. Both are available under the New BSD License on Google Code:

Enjoy!

--Tom

Posted by Tom Limoncelli in Technical Tips

  1. On a Mac, if you SHIFT-CLICK the green dot on a window it opens it as wide and tall as possible (instead of the application-defined behavior)

  2. Even though "ls -l" displays a files permissions as "-rw-r--r--", you can't use "-rw-r--r--" in a chmod command. This is probably one of the most obvious but overlooked UI inconsistencies in Unix that nobody has fixed after all these years. Instead we force people to learn octal and type 0644. Meanwhile every book on Unix/Linux spends pages explaining octal just for this purpose. Time would have been better spent contributing a patch to chmod.

  3. If a network problem always happens 300 seconds after an event (like a VPN coming up or a machine connecting to the network) the problem is ARP, which has to renew every 300 seconds. Similarly, if it times out after exactly 2 hours, the problem is your routing system which typically expires routes after 2 hours of not hearing them advertised.

  4. Git rocks. I should have converted from SubVersion to Git years ago. Sadly I like the name SubVersion better. I hear Hg / Mercurial is better than Git, but Git had better marketing.

  5. Keep all your Unix "dot files" in sync with http://wiki.eater.org/ocd (and I'm not just saying that because my boss wrote it).

  6. People that use advanced Python-isms should not complain when I use features that have been in bash forever and, in fact, were in /bin/sh before most of us knew how to read.

  7. Years ago IETF started telling protocol inventors to avoid using broadcasts and use "local multicast" instead because it will help LAN equipment vendors scale to larger and larger LANs. If your LAN network vendor makes equipment that goes south when there is a lot of multicast traffic because it is "slow path'ed" through the CPU, remind them that They're Doing It Wrong.

  8. The best debugging tool in the world is "diff". Save the output to /tmp/old. As you edit your code, write the output to /tmp/new, then do "diff /tmp/old /tmp/new". When you see the change you want, you know you are done. Alternatively, edit /tmp/old to look like the output you want. You've fixed the bug when diff has no output. (A minimal sketch of this workflow appears after this list.)

  9. Attend your local sysadmin conference. Regional conferences are your most cost effective career accelerator. You will learn technical stuff that will help you retain your job, do your job better, get promoted, or find a new job. Plus, you'll make local friends and contacts that will help you more than your average call to a vendor tech support line. There are some great ones in the Seattle and NJ/NY/Philly area all listed here.
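
Here's a rough sketch of the diff workflow from item 8 (the program name is hypothetical):

./myprog > /tmp/old        # capture the current output
vi myprog                  # make a change
./myprog > /tmp/new        # capture the new output
diff /tmp/old /tmp/new     # no output means no difference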

Posted by Tom Limoncelli in Community, Technical Tips

True story:

My first job out of college we made our own patch cables. Usually we'd make them "on demand" as needed for a new server or workstation. My (then) boss didn't want to buy patch cables even though we knew that we weren't doing a perfect job (we were software people, eh?). Any time we had a flaky server problem it would turn out to be the cable... usually one made by my (then) boss. When he left the company the first policy change we made was to start buying pre-made cables.

That was during the days of Category 3 cables. You can make a Category 3 cable by hand without much skill. With Category 5 and 6 the tolerances are so tight that just unwinding a pair too far (for example, to make it easier to crimp) will result in enough interference that you'll see errors. It isn't just about "having the right tools". An ohmmeter isn't the right testing tool. You need to do a series of tests that are well beyond simple electrical connectivity.

That's why it is so important to make sure the cables are certified. It isn't enough to use the right parts; you need to test each run to verify that it will really work. There are people who will install cable in your walls and not do certification. Some will tell you they certified it but they really just plugged a computer in at each end; that's not good enough. I found the best way to know the certification was really done is to have them produce a book of printouts, one from each cable analysis. Put it in the contract: no book, no payment. (And as a fun trick... the next time you do have a flaky network connection, check the book and often you'll find that cable just barely passed. You might not know how to read the graph, but you'll see the line dip closer to the "pass" line than on the other graphs.)

If your boss isn't convinced, do the math. Calculate how much you are paid in 10 minutes and compare that to the price of the pre-made cable.

Posted by Tom Limoncelli in Technical Tips

I needed a way to backup a single server to a remote hard disk. There are many scripts around, and I certainly could have written one myself, but I found Duplicity and now I highly recommend it:

http://duplicity.nongnu.org

Duplicity uses librsync to generate incremental backups that are very small. It generates the backups, GPG-encrypts them, and then sends them to another server by all the major methods: scp, ftp, sftp, rsync, etc. You can back up starting at any directory, not just at mountpoints, and there is a full language for specifying files you want to exclude.

Installation: The most difficult part is probably setting up your GPG keys if you've never set them up before. (Note: you really, really need to protect the private key. It is required for restores. If you lose your machine due to a fire, and don't have a copy of the private key somewhere, you won't be able to do a restore. Really. I burned mine on a few CDs and put them in various hidden places.)

The machine I'm backing up is a virtual machine in a colo. They don't offer backup services, so I had to take care of it myself. The machine runs FreeBSD 8.0-RELEASE-p4 and it works great. The code is very portable: Python, GPG, librsync, etc. Nothing digs into the kernel or raw devices or anything like that.

I wrote a simple script that loops through all the directories that I want backed up, and for each one runs:

duplicity --full-if-older-than 5W --encrypt-key="${PGPKEYID}" $DIRECTORY scp://myarchives@mybackuphost/$BACKUPSET$dir

The "--full-if-older-than 5W" means that it does an incremental backup, but a full back every 35 days. I do 5W instead of 4W because I want to make sure no more than 1 full backup happens every billing cycle. I'm charged for bandwidth and fear that two full dumps in the same month may put me over the limit.

My configuration: I'm scp'ing the files to another machine, which has a cheap USB2.0 1T hard disk. I set it up so that I can ssh from the source machine to the destination machine without need of a password ("PubkeyAuthentication yes"). In the example above "myarchives" is the username that I'm doing the backup to, and "mybackuphost" is the host. Actually I just specify the hostname and use a .ssh/config entry to set the default username to be "myarchives". That way I can specify "mybackuphost" in other shell scripts, etc. SSH aliases FTW!

Restores: Of course, I don't actually care about backups. I only care about restores. When restoring a file, duplicity figures out which full and incremental backups need to be retrieved and decrypted. You just specify the date you want (default "the latest") and it does all the work. I was impressed at how little thinking I needed to do.

After running the system for a few days it was time to do a restore to make sure it all worked.

The restore syntax is a little confusing because the documentation didn't have a lot of examples. In particular, the most common restore situation is not restoring the full backupset, but "I mess up a file, or think I messed it up, so I want to restore an old version (from a particular date) to /tmp to see what it used to look like."

What confused me: 1) You specify the path to the file (or directory) but you don't list the path leading up to the mountpoint (or directory) that was backed up. In hindsight that is obvious, but it caught me. What saved me was that when I listed the files, they were displayed without the mountpoint. 2) You have to be very careful to specify where you put the backup set. You specify that on the command line as the source, and you specify the file to be restored in the "--file-to-restore" option. You can't specify the entire thing on the command line and expect duplicity to guess where to split it.

So that I don't have to re-learn the commands at a time when I'm panicking because I just deleted a critical file, I've made notes about how to do a restore. With some changes to protect the innocent, they look like:

Step 1. List all the files that are backed up to the "home/tal" area:

duplicity list-current-files scp://mybackuphost/directoryname/home/tal

To list what they were like on a particular date, add: --restore-time "2002-01-25"

Step 2. Restore a file from that list (not to the original place):

duplicity restore --encrypt-key=XXXXXXXX --file-to-restore=path/you/saw/in/listing scp://mybackuphost/directoryname/home/tal /tmp/restore

Assuming the old file was in "/home/tal/path/to/file" and the backup was done of "/home/tal", you need to specify --file-to-restore as "path/to/file", not "/home/tal/path/to/file". You can list a directory to get all of its files. The /tmp/restore argument should be a directory that already exists.

To restore the files as of a particular date, add: --restore-time "2002-01-25"

Conclusion: Duplicity is a great piece of engineering. It is very fast, both because it makes good use of librsync to keep the backups small, and because it stores indexes of which files were backed up so that the entire backup doesn't have to be read just to get a file list. The backup data is split across many small files, so not a lot of temp space is required on the source machine. The tools are very easy to use: they do all the machinations about full and incremental sets, so you can focus on what to back up and what to restore.

Caveats: Like any backup system, you should do a "firedrill" now and then and test your restore procedure. I recommend you encapsulate your backup process in a shell script so that you do it the same way every time.

I highly recommend Duplicity.

http://duplicity.nongnu.org

Posted by Tom Limoncelli in Technical Tips

Google Forms

Someone asked me how I did my survey in a way that the data went to a Google spreadsheet automatically. The forms capability is built into the spreadsheet system. You can even do multi-page forms with pages selected based on previous answers.

More info here

Posted by Tom Limoncelli in Technical Tips

A coworker debugged a problem last week that inspired me to relay this bit of advice:

Nothing happens at "random times". There's always a reason why it happens.

I once had an ISDN router that got incredibly slow now and then. People on the far side of the router lost service for 10-15 seconds every now and then.

The key to finding the problem was timing how often the problem happened. I used a simple once-a-second "ping" and logged the times that the outages happened.
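
Something like this rough sketch does the job (the address is an example; -W 1 is the Linux iputils per-probe timeout, other pings use different flags):

# Probe once a second; log a timestamp each time a probe fails.
while sleep 1 ; do
  ping -c 1 -W 1 192.0.2.1 >/dev/null 2>&1 || date '+%F %T outage'
done >> /tmp/outages.log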

Visual inspection of the numbers turned up no clues. It looked random.

I graphed how far apart the outages happened. The graph looked pretty random, but there were runs that were always 10 minutes apart.

I graphed the outages on a timeline. That's where I saw something interesting. The outages were exactly 10 minutes apart PLUS at other times. I wouldn't have seen that without a graph.

What happens every 10 minutes and at other times too? In this case, the router recalculated its routing table every time it got a route update. The route updates came from its peer router exactly every 10 minutes, plus any time an ISDN link went up or down. The times I was seeing a 10-minute gap were when we went an entire 10 minutes with no ISDN links going up or down. With so many links, and with home users intermittently using their connections, it was pretty rare to go the full 10 minutes with no updates. However, by graphing it, the periodic outages became visible.

I've seen other outages that happened 300 seconds after some other event: a machine connects to the network, etc. A lot of protocols do things in 300 second (5 minute) intervals. The most common is ARP: A router expires ARP entries every 300 seconds. Some vendors extend the time any time they receive a packet from the host, others expire the entry and send another ARP request.

What other timeouts have you found to be clues of particular bugs? Please post in the comments!

Posted by Tom Limoncelli in Technical Tips

CSS Positioning

I admit it. I use tables for positioning in HTML. It is easy and works.

However, I just read this excellent tutorial on CSS positioning and I finally understand what the heck all that positioning stuff means.

http://www.barelyfitz.com/screencast/html-training/css/positioning/

I promise not to use tables any more.

I highly recommend this tutorial.

Posted by Tom Limoncelli in Technical Tips

Being a long-time "vi" user I find that I am constantly surprised by the little (and not-so-little) enhancements vim has added. One of them is the "inner" concept.

Any vi user knows that "c" starts a c change and the next keystroke determines what will be changed. "cw" changes from where the cursor is until the end of the word. For example, "c$" chances from where the cursor is to the end of the line. Think of a cursor movement command, type it after "c" and you are pretty sure that you will change from where the cursor is to.... wherever you've directed.

"d" works the same way. "dw" deletes word. "d$" deletes to the end of the line. "d^" deletes to the beginning of the line ("^"? "$"? gosh, whoever invented this stuff must have known a lot about regular expressions).

VIM adds the concept of "inner text". Text is structured. We put things in quotes, in parentheses, between mustaches (that's "{" and "}") and so on. The text between those two things is the "inner text".

So suppose we have:

<span style="clean">Interesting quote here.</span>

but we want to change the style from "clean" to "unruly". Move the cursor anywhere between the quotes and type ci then a quote (read that as "change inner quote"). VIM will seek out the opening and closing quotes that surround the cursor, and the next stuff you type will replace the text between them.

It works for all three kinds of quotes (single, double, and backtick), it works for all the various braces: ( { and <. You can type the opening or the closing brace, they both do the same thing.

Therefore you can move the cursor to the word "style" in the above example and type "ci<" to change everything within that tag.

I find this particularly useful when editing python code. I'm often using ci' to change a single quoted string.

If there is an "inner", you'd expect there is an "outer" too, right? (How many of you tried typing co" to see if it worked?) Well, there is an there isn't.

In VIM the opposite of "inner" is "block". A block is kind of special. It don't just include the opening and closing elements plus sometimes a the space or two that follow. Given this text:

  • The quick <span class="foo">brown</span> fox.

If the cursor is in the <span> element, "ca<" will replace the entire element from the < all the way to the >. The whitespace after the element is also included for text-related objects like change a word (caw) and change a sentence (cas).

Not having to move the cursor to the beginning of an element to change the entire thing is a great time saver. It is these little enhancements that make using VIM so much more pleasant than using VI.

Give it a try!

More information about this is in the "Text Objects" section of Michael Jakl's excellent VIM tutorial.

--Tom

P.S. My second favorite thing about VIM? gVIM (The graphical version of VIM) preserves TABs when you use the windowing system to cut and paste.

Posted by Tom Limoncelli in Technical Tips

Remember when you were a little kid and had a clubhouse? Did you let someone in only if they knew "the secret knock"? Lately people have talked about various implementations for doing that with ssh. The technique, called "Port Knocking" permits SSH if someone has touched various ports recently. For example, someone has to ping the machine, then telnet to port 1234, then for the next 2 minutes they can ssh in.

This can be difficult to implement securely, as this video demonstrates: http://www.youtube.com/watch?v=9IrCgCKrv8U

IBM's Developerworks recently posted an article about tightening SSH security. The topic also came up on the mailing list for the New Jersey LOPSA chapter.

I had an idea that I haven't seen published before.

I have a Unix (FreeBSD 8.0) system that is live on the open internet and it is so locked down that I don't permit passwords. To SSH to the machine you have to have pre-arranged SSH keys set up for "passwordless" connections. However, it does not run a firewall because it is literally running with no ports open (except ssh). There is nothing to firewall.

Problem: What if I am stuck and need to log in remotely with a password?

Most of the portknocking techniques I've seen leverage the firewall running on the system. I didn't want to enable a firewall, so I came up with this.

Idea #1: A CGI script to grant access.

Connect to a particular URL and it runs sshd on port 9999 with a special configuration file that permits passwords:

/etc/ssh/sshd_config:

PasswordAuthentication no
PermitEmptyPasswords no
PermitRootLogin no
UsePAM no

/etc/ssh/sshd_config-port9999:

Port 9999
AllowAgentForwarding no
AllowTcpForwarding no
GatewayPorts no
LoginGraceTime 30
MaxAuthTries 3
X11Forwarding no
PermitTunnel no
PasswordAuthentication no
PermitEmptyPasswords no
PermitRootLogin yes
UsePAM yes

Translation: if someone is going to get special access on port 9999, they can't use it to set up tunnels or gateways. It is just for quick access; enough to fix your SSH keys.

The CGI script essentially runs:

/usr/sbin/sshd -p 9999 -d

Which permits a single login session on port 9999.
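
A minimal sketch of such a CGI wrapper (hypothetical; a real one would add authentication and logging, and sshd must be run as root, e.g. via sudo):

#!/bin/sh
# Emit a CGI response, then start a one-shot sshd in the background
# using the special port-9999 configuration shown above.
echo "Content-type: text/plain"
echo ""
echo "Port 9999 is open for one login attempt."
/usr/sbin/sshd -f /etc/ssh/sshd_config-port9999 -d >/dev/null 2>&1 &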

Try #2:

FreeBSD defaults to an inetd that uses Tcpwrappers.

So, try #2 was similar to #1 but appended info to /etc/hosts.allow so that the person had to come from the same IP address as the recent web connection. The problem with that is that sometimes people connect to the web via proxies, and adding the proxy to the hosts.allow list isn't going to help.

Try #3:

We all know that you can't run two daemons on the same port number, right? Wrong.

You can have multiple daemons listening on the same port number if they are listening on different interfaces. If two conflict, the connection goes to the "most specific" listening daemon.

What does that mean? That means you can have sshd with one configuration listening on *.22 (any interface, port 22) and another listening on 10.10.10.10.22 (port 22 of the interface configured for address 10.10.10.10). But I only have one interface, you say? I disagree. You have 127.0.0.1 plus your primary IP address, plus any IPv6 addresses. Heck, even if you really only had one IP address, "*" and a specific address can both be listening to port 22 at the same time.

That's what the "*" on "netstat -l" means. "Any interface."

So, back to our port knocking configuration.

The normal port 22 sshd runs with a configuration that disables all passwords (only permits SSH keys).

/etc/ssh/sshd_config:

Port 22
ListenAddress 0.0.0.0
ListenAddress ::
PAMAuthenticationViaKBDInt no
PasswordAuthentication no
PermitEmptyPasswords no
PermitRootLogin no
UsePAM no

And the CGI script enables a sshd with this configuration:

/etc/ssh/sshd_config-permit-passwords:

Port 22
ListenAddress 64.23.178.12
ListenAddress fe80::5154:ff:fe25:1234
PAMAuthenticationViaKBDInt no
PasswordAuthentication no
PermitEmptyPasswords no
PermitRootLogin no
UsePAM yes

The wrapper simply runs:

/usr/sbin/sshd -d -f /etc/ssh/sshd_config-permit-passwords

That's all there is to it!

Posted by Tom Limoncelli in Technical Tips

Google Chrome supports multiple profiles. The feature is just hidden until it is ready for prime-time. It is really easy to set up on the Linux and Windows version of Chrome. On the Mac it takes some manual work.

I'm sure eventually the Mac version will have a nice GUI to set this up for you. In the meanwhile, I've written a script that does it:

chrome-with-profile-1.0.tar.gz

Tom

Posted by Tom Limoncelli in Mac OS X, Technical Tips

xed 2.0.2 released!

xed is a perl script that locks a file, runs $EDITOR on the file, then unlocks it.

It also checks to see if the file is kept under RCS control. If not, it offers to make it so. RCS is a system that retains a history of a file. It is the predecessor to GIT, SubVersion, CVS and such. It doesn't store the changes in a central repository; it comes from a long-gone era before servers and networks. It simply stores the changes in a subdirectory called "RCS" in the same directory as the file. (and if it can't find that directory, it puts the information in the same directory as the file: named the same as the file with ",v" at the end.)
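
The core idea, roughly sketched (this is not the actual xed source, just the RCS commands it wraps):

co -l "$file"           # check the file out of RCS and lock it
${EDITOR:-vi} "$file"   # edit
ci -u "$file"           # check the change in; leave an unlocked working copy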

[More about this little-known tool after the jump.]

Posted by Tom Limoncelli in Technical Tips

I wrote about upgrading to IPv6 in the past, but I have more to say.

The wrong way: I've heard a number of people say they are going to try to convert all of their desktops to IPv6. I think they are picking the wrong project. While it is a tempting project, I think it is a mistake (well-intentioned, but not a good starter project). Don't try to convert all your desktops and servers to IPv6 as your first experiment. There's rarely any immediate value in it (which annoys management), it is potentially biting off more than you can chew (which annoys you), and mistakes affect people that you have to see in the cafeteria every day (which annoys coworkers).

Instead, copy the success stories I've detailed below. Some use an "outside -> in" plan; others pick something of "strategic value".

Story 1: Work from the outside -> in

The goal here is to get the path from your ISP to your load balancer to use IPv6; let the load balancer translate to IPv4 for you. The web servers themselves don't need to be upgraded; leave that for phase 2.

It is a small, bite-sized project that is achievable. It has a real, tangible value that you can explain to management without being too technical: "The coming wave of IPv6-only users will have faster access to our web site. Without this, those users will have slower access to our site due to the IPv4/v6 translators that ISPs are setting up as a bandaid." That is an explanation that a non-technical executive will understand.

It also requires only modest changes: a few routers, some DNS records, and so on. It is also a safe place to make changes because your external web presence has a good dev -> qa -> production infrastructure that you can leverage to test things properly (it does, right?).

Technically this is what is happening:

At many companies web services are behind a load balancer or reverse proxy.

ISP -> load balancer -> web farm

If your load balancer can accept IPv6 connections but send out IPv4 connections to the web farm, you can offer IPv6 service to external users just by enabling IPv6 on the first few hops into your network: the path to your load balancer. As each web server becomes IPv6-ready, the load balancer no longer needs to translate for that host. Eventually your entire web farm is native IPv6. Doing this gives you a throttle to control the pace of change. You can make small changes, one at a time, testing along the way.

The value of doing it this way is that it gives customers IPv6 service early, and requires minimal changes on your site. We are about 280 days away from running out of IPv4 addresses. Around that time ISPs will start to offer home ISP service where IPv6 is "normal" and attempts to use IPv4 will result in packets being NATed at the carrier level. Customers in this situation will get worse performance for sites that aren't offering their services over IPv6. Speed is very important on the web. More specifically, latency is important.

[Note: Depending on where the SSL cert lives, that load balancer might need to do IPv6 all the way to the frontends. Consult your load balancer support folks.]

Sites that are offering their services over IPv6 will be faster for new customers. Most CEOs can understand simple, non-technical, value statements like, "new people coming onto the internet will have better access to our site" or "the web site will be faster for the new wave of IPv6-only users."

Of course, once you've completed that and shown that the world didn't end, developers will be more willing to test their code under IPv6. You might need to enable IPv6 on the path to the QA lab or some other place. That's another bite-sized project. Another path will be requested. Then another. Then the desktop LAN that the developers use. Then it makes sense to do it everywhere. Small incremental roll-outs FTW!

During Google's IPv6 efforts we learned that this strategy works really well. Most importantly, we learned that it turned out to be pretty easy and inexpensive. Is the IPv6 code in routers stable? Well, we're now sending YouTube traffic over IPv6. If you know of a better load test for the IPv6 code on a router, please let me know! (Footnote: "Google: IPv6 is easy, not expensive", subtitled "Engineers say upgrading to next-gen Internet is inexpensive, requires small team")

Story 2: Strategic Upgrades

In this story we are more "strategic".

Some people run into their boss's office and say, "OMG we have to convert everything to IPv6". They want to convert the routers, the DNS system, the DHCP system, the applications, the clients, the desktops, the servers.

These people sound like crazy people. They sound like Chicken Little claiming that the sky is falling.

These people are thrown out of their boss's office.

Other people (we'll call these people "the successful ones") go to their boss and say, "There's one specific thing I want to do with IPv6. Here's why it will help the company."

These people sound focused and determined. They usually get funding.

Little does the boss realize that this "one specific thing" requires touching many dependencies. That includes the routers, the DNS system, the DHCP system, and so on. Yes, the same list of things that the "crazy" person was spouting off about.

The difference is that these people got permission to do it.

According to a presentation I saw them give in 2008, Comcast found their "one thing" to be: set-top box management. Every set-top box needs an IP address so Comcast can manage it. That's more IP addresses than they could reasonably get from ARIN. So, they used IPv6. If you get internet service from Comcast, the set-top box on your TV set is IPv6 even though the cable modem sitting next to it providing your internet service is IPv4. They had to get IPv6 working for anything that touches the management of their network: provisioning, testing, monitoring, billing. Wait, billing? Well, if you are touching the billing system, you are basically touching a lot of things. Ooh, shiny dependencies. (There used to be a paper about this at http://www.6journal.org/archive/00000265/01/alain-durand.pdf but the link isn't working. I found this interview with the author but not the original paper.)

Nokia found their "one thing" to be: power consumption. Power consumption, you say? Their phones waste battery power by sending out pings to "keep the NAT session alive". By switching to IPv6 they didn't need to send out pings. No NAT, no need to keep the NAT session alive. Their phones can turn off their antenna until they have data to send. That saves power. In an industry where battery life is everything, any CxO or VP can see the value. A video from Google's IPv6 summit details Nokia's success in upgrading to IPv6.

Speaking of phones, T-Mobile's next generation handset will be IPv6-only. Verizon's LTE handsets are required to do IPv6. If you have customers that access your services from their phone, you have a business case to start upgrading to IPv6 now.

In the long term we should be concerned with converting all our networks and equipment to IPv6. However the pattern we see is that successful projects have picked "one specific thing to convert", and let all the dependencies come along for the ride.

Summary:

In summary, IPv6 is real and real important. We are about a year away from running out of IPv4 addresses, at which point ISPs will start offering IPv6 service with translation for access to IPv4-only web sites. IPv6 deployment projects seem to be revealing two successful patterns and one unsuccessful one. The unsuccessful pattern is to scream that the sky is falling and ask for permission to upgrade "everything". The successful patterns tend to be one of:

  • Find one high-value (to your CEO) reason to use IPv6: There are no simple solutions but there are simple explanations. Convert just that one thing and keep repeating the value statement that got the project approved. There will be plenty of dependencies and you will end up touching many components of your network. This will lead the way to other projects.
  • Work from the "outside -> in": A load balancer that does IPv6<->IPv4 translation will let you offer IPv6 to external customers now, gives you a "fast win" that will bolster future projects, and gives you a throttle to control the speed at which services get native support.

I'd love to hear from readers about their experiences with convincing management to approve IPv6 projects. Please post to the comments section!

-Tom

P.S. Material from the last Google IPv6 conference is here: http://sites.google.com/site/ipv6implementors/2010/agenda

Posted by Tom Limoncelli in Technical Tips

A friend of mine who is an old-time Unix/Linux user asked me for suggestions on how to get used to Mac OS X.

The first mistake that Unix users make when they come to OS X is that they try to use X Windows (X11) because it is what they are used to. My general advice: Don't use X windows. Switching between the two modes is more work for your hands. Stick with the non-X11 programs until you get used to them. Soon you'll find that things just "fit together" and you won't miss X11.

Terminal is really good (continued lines copy and paste correctly! resize the window and the text reformats perfectly!). I only use X11 when I absolutely have to. Oh, and if you do use X11 and find it to be buggy, install the third-party X replacement called XQuartz (sadly you'll have to re-install it after any security or other updates).

Now that I've convinced you to stick with the native apps, here's why:

  1. pbcopy <file

Stashes the contents of "file" into your paste buffer.

  2. pbpaste >file

Copies the paste buffer to stdout (here, redirected into "file").

  3. pbpaste | sed -e 's/foo/bar/g' | pbcopy

Changes all occurrences of "foo" to "bar" in the paste buffer.

  1. "open" emulates double-clicking on an icon.

    open file.txt

If you had double-clicked on file.txt, it would have brought it up in TextEdit, right? That's what happens with "open file.txt". If you want to force another application, use "-a":

open -a /Applications/Microsoft\ Office\ 2008/Microsoft\ Word.app file.txt

Wonder how to start an ".app" from Terminal? "open" it, the equivalent of double-clicking:

open /Applications/Microsoft\ Office\ 2008/Microsoft\ Word.app

Want to find a directory via "cd"ing around on the Terminal, but once you get there you want to use the mouse?

cd /foo/bar/baz
open .

I use this so much I have an alias in my .bash_profile:

alias here="open ."

Now after "find"ing and searching and poking around, once I get to the right place I can type "here" and be productive with the mouse.

  5. Want to use a Unix command on a file you see on the desktop? Drag the icon onto the Terminal.

type: od (space) -c (space)

Then drag an icon onto that Terminal. The entire path appears on the command line. If the path has spaces or other funny things the text will be perfectly quoted.

  6. Dislike the File Open dialog? Type "/" and it will prompt you to type the path you are seeking. Name completion works in that prompt. Rock on, File Dialog!

  7. Word processors, spreadsheets, video players and other applications that work with a file put an icon of that file in the title bar. That isn't just to be pretty. The icon is useful. CMD-click on it to see the path to the file. Select an item in that path and that directory is opened in the Finder.

That icon in the title bar is draggable too! Want to move the file to a different directory? You don't have to poke around looking for the source directory so you can drag it to the destination directory. Just drag the icon from the title bar to the destination directory. The app is aware of the change too. Lastly, drag the icon from the title bar into a Terminal window. It pastes the entire path to the file just like in tip 5.

  8. If you want to script the things that Disk Utility does, use "hdiutil" and "diskutil". You can script ripping a DVD and burning it onto another one with "hdiutil create", then "diskutil eject", then "hdiutil burn". (A rough sketch appears after this list.)

  9. rsync for Mac OS has an "-E" option that copies all the weird Mac OS file attributes including resource forks. ("rsync -avPE src host:dest")

  3. "top" has stupid defaults. I always run "top -o cpu". In fact, put this in your profile:

    alias top='top -o cpu'

  11. For more interesting ideas, read the man pages for:

    screencapture mdutil say dscl dot_clean /usr/bin/util pbcopy pbpaste open diskutil hdiutil
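Here's that rough sketch for tip 8. The device name and image path are assumptions; check "man hdiutil" and "man diskutil" for the exact options on your OS release:

    hdiutil create -srcdevice /dev/disk2 ~/dvdcopy   # image the source DVD to ~/dvdcopy.dmg
    diskutil eject /dev/disk2                        # eject the source disc
    # ...insert a blank DVD, then:
    hdiutil burn ~/dvdcopy.dmg                       # burn the image onto the blank disc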

Enjoy!

P.S. others have recommended this list: http://superuser.com/questions/52483/terminal-tips-and-tricks-for-mac-os-x

Posted by Tom Limoncelli in Mac OS X, Technical Tips

I try not to use this blog to flog my employer's products but I just used the open source "Google Command Line" program and I have to spread the word... this really rocked.

I wanted to upload a bunch of photos to Picasa. I didn't want to sit there clicking on the web interface to upload each one, I didn't want to import them into iPhoto and then use the Picasa plug-in to upload them. I just wanted to get them uploaded.

Google CL to the rescue! It is a program that lets you access many google properties from the command line. It works on Mac, Linux and Windows. After a (rather simple) install process I was ready to give it a try.

Here's the command line that I typed:

$ google picasa create --title "2010-08-09-Hobart-Tasmania-SAGE-AU" ~/Desktop/PHOTOS-AU/*

I was expecting it to ask me for a username and password, but I was surprised when my web browser popped up instead and asked me to authorize this script to have permission to log in (just like third-party apps that authenticate against Google). Back at the command line I pressed "return" to continue. The upload began and finished a few minutes later.

In addition to Picasa, the command can also access Blogger, YouTube, Docs, Contacts and Calendar.

Posted by Tom Limoncelli in Technical Tips

Google App Inventor

At the SAAD-NYC event last night I explained how Google App Inventor lets you make apps for Android phones without knowing how to program. It was beta tested "mainly in schools with groups that included sixth graders, high school girls, nursing students and university undergraduates who are not computer science majors."

He said, "Why haven't you written about this amazing thing on your blog?"

I dunno! So here. I'm mentioning it now.

(I think the NY Times article is the best overview.)

Happy, Jim?

Posted by Tom Limoncelli in Technical Tips

Oh, so that's how they get such amazing speed on a web server! The SEDA paper: http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf

In the future, all servers will work like this.

Well worth reading.

Posted by Tom Limoncelli in Technical Tips

(Reposting this announcement from Dan)

Fellow SysAds etc.-

First, I'd like to make sure you are all aware of the Configuration Management Summit next week in Boston on June 24 (details are at http://www.usenix.org/events/config10/). The first Configuration Management Summit aims to bring together developers, power users, and new adopters of open source configuration management tools for automating system administration. Configuration management is a growth area in the IT industry, and open source solutions, with cost savings and an active user community, are presenting a serious challenge to today's "big vendor" products. Representatives from Bcfg2, Cfengine, Chef, and Puppet will all be participating in the summit - this will be a valuable opportunity if you have been contemplating a configuration management solution for your systems.

There is also a special one-day training on Cfengine being taught by Mark Burgess on June 25 (details are at http://www.usenix.org/events/config10/#tut_cfengine). This class might be a review session for anyone on this mailing list, but it will also offer useful insights for people who are not new to Cfengine. Additionally, if you have colleagues who need to come up to speed on Cfengine quickly, this class will be an excellent opportunity for them to learn Cfengine directly from the author.

If you are interested in either event, you can register at http://www.usenix.org/events/confweek10/registration/ (and if you have questions, you can email me directly). I hope to see you in Boston!

Daniel Klein
Education Director
USENIX

Posted by Tom Limoncelli in Conferences, Technical Tips

Lance Albertson wrote up a great description of how Ganeti Virtualization Manager performed under pressure during a power outage:

Nothing like a power outage gone wrong to test a new virtualization cluster. Last night we lost power in most of Corvallis and our UPS & Generator functioned properly in the machine room. However we had an unfortunate sequence of issues that caused some of our machines to go down, including all four of our ganeti nodes hosting 62 virtual machines went down hard. If this had happened with our old xen cluster with iSCSI, it would have taken us over an hour to get the infrastructure back in a normal state by manually restarting each VM.

But when I checked the ganeti cluster shortly after the outage, I noticed that all four nodes rebooted without any issues and the master node was already rebooting virtual machines automatically and fixing all of the DRBD block devices.

Ganeti is a management layer that makes it easy to set up large clusters of Xen or KVM (or other) virtualized machines. He has also written a great explanation of what Ganeti is and its benefits.

I use Ganeti for tons of projects at work.

Posted by Tom Limoncelli in Technical Tips

Dear readers in the United States,

I'm sorry. I have some bad news.  That tiny computer closet that has no cooling will overheat next weekend.

Remember that you aren't cooling a computer room, you are extracting the heat.  The equipment generates heat and you have to send it somewhere. If it stays there, the room gets hotter and hotter.

For the past few months you've been lucky.  That room benefited from the fact that the rest of the building was relatively chilly. The heat was drawn out to the rest of the building. During the winter, each weekend the heat was turned off (or down) and your uninsulated computer room leaked heat to the rest of the building. Now it's springtime, nearly summer.  The building A/C is on during the week. When it shuts down for the weekend the building is hot; hotter than your computer room.  The leaking that you were depending on is not going to happen.

Last weekend the temperature of your computer room got warm on Saturday and hot on Sunday. However, it was ok.

This weekend it will get hot on Saturday and very hot on Sunday. It will be ok.

However, next weekend is Memorial Day weekend. The building's cooling will be off for three days. Saturday will be hot. Sunday will be very, very hot. Monday will be hot enough to kill a server or two.

If you have some cooling, Monday you'll discover that it isn't enough.  Or the cooling system will be overloaded and any weak, should-have-been-replaced, fan belts will finally snap.

How did we get into this situation?

Telecom closets traditionally didn't have any cooling because they had no active components: just a room where wires connect to wires. That changed in the 1990s when phone systems changed. Now that telecom closet has a PBX and an equipment rack. If there is an equipment rack, why not put some PC servers into it? If there is one rack, why not another rack? By adding one machine at a time, you never realize how overloaded the system has gotten.

Even if you have proper cooling, I bet you have more computers in that room than you did last year.

So what can you do now to prevent this problem?
  • Ask your facilities person to check the belts on your cooling system.
  • Set up monitoring so you'll be alerted if the room gets above 33 degrees C. (You probably don't have time to buy an environmental monitor, but chances are your router and certain servers have a temperature gauge on or near the hottest part of the equipment. It is most likely hotter than 33 degrees C during normal operation, but you can detect if it goes up relative to a baseline; see the sketch after this list.)
  • Clean (remove dust from) the air vent screens, the fans, and any drives. That dust makes every mechanical component work harder. More stress == more likely to break.
  • Inventory the equipment in the room and shut off the unused equipment. (I bet you find at least one server.)
  • Inventory the equipment and rank by priority what you can power off if the temperature gets too high.
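For the monitoring bullet above, here is the kind of check I mean. The hostname, community string, and OID are placeholders; temperature OIDs are vendor-specific, so look yours up in the device's MIB:

    snmpget -v2c -c public router1.example.com 1.3.6.1.4.1.99999.1.2.3   # read the temperature sensor
    # Record a baseline during normal operation and alert when the reading
    # climbs well above it, not just when it crosses an absolute number.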
If you do have a system that overheats, remember that you can buy or rent temporary cooling systems very easily.

I don't generally make product endorsements, but at a previous company we had an overheating problem and it was cheaper and faster to buy a Sunpentown 9000 BTU unit at Walmart than to wait around for a rental. In fact, it was below my spending limit to purchase two and tell the CFO after the fact. I liked the fact that it self-evaporated the water that accumulated; I needed to exhaust hot air, not hot air and water.

Most importantly, be prepared. Have monitoring in place. Have a checklist of what to shut down in what order.

Good luck! Stay cool!

Tom

P.S. I wrote about this 2 years ago.

Posted by Tom Limoncelli in Technical Tips

Previously I wrote about the Google Apps shortname service which lets you set up a tinyurl service for your enterprise.

The article implies that the service can be used without using the FQDN. This is not true. In other words, I had said that "go.example.com/lunch" could be shortened to "go/lunch".

There is a workaround that makes it work. It is difficult to configure, but I've set up a Community Wiki on ServerFault.com that explains all the steps. As a wiki, I hope people can fill in the items I left blank, particularly specific configuration snippets for ISC BIND, Windows DHCP server, Linux DHCP clients, and so on.

The new article is here: How to set up Google ShortName service for my domain, so that the FQDN isn't needed

Posted by Tom Limoncelli in Technical Tips

Update 2010-01-26: There is a follow-up article to this here

Update 2009-12-20: Enabling the service wasn't working for a few days. It is now working again. It does not require Premier service. Any Google Apps customer should be able to use it.

Where I work we have a service called "go" which is a tinyURL service. The benefit of it being inside our domain is huge. Since "go" (the shortname) is found in our DNS "search path", you can specify "go" links without entering the FQDN.

That means you can enter "go/payroll" in your browser to get to the payroll system and "go/lunchmenu" to find out what's for lunch today. That crazy 70-char long URL that is needed to get to that third-party web-based system we use? I won't name the vendor, but let me just say that I now get there via "go/expense".
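The pieces behind it are nothing exotic. Roughly (example.com and the CNAME target are placeholders for illustration):

    ; in the example.com zone file: a short hostname for the redirector
    go      IN  CNAME   shortname-server.example.com.

    # in /etc/resolv.conf on the clients (usually handed out via DHCP)
    search example.com

With those two pieces, "go" resolves as go.example.com, so "go/payroll" works without typing the FQDN.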

Posted by Tom Limoncelli in Technical Tips

SysAdvent has begun!

SysAdvent has started its second year.  SysAdvent is a project to count down the 24 days leading to Christmas with a sysadmin tip each day.  Last year Jordan Sissel wrote all 24 days (amazing job, dude!). This year he has enlisted guest bloggers to help out. You might see a post of mine one of these days.

While I don't celebrate the holiday that the event is named after, I'm glad to participate.

Check out this and last year's postings on the SysAdvent Blog: sysadvent.blogspot.com


Posted by Tom Limoncelli in Community, Technical Tips

Last week I mentioned that if you have a service that requires a certain SLA, it can't depend on things with a lesser SLA.

My networking friends balked and said that this isn't a valid rule for networks. I think that violations of this rule are so rare they are hard to imagine. Or, better stated, networking people do this so naturally that it is hard to imagine violating this rule.

However, here are 3 from my experience:

Matt Simmons interviews me about "Design Patterns for System Administrators".

This is a tutorial that I've never taught before. You can see it first at LISA 2009 in November.

In case you missed it, Matthew Sacks interviewed me about my other LISA tutorial. That tutorial also has a lot of new material.

Sysadmins have a love-hate relationship with shared libraries. They save space, they make upgrades easier, and so on.  However, they also cause many problems.  Sometimes they cause versioning problems (Windows DLLs), security problems, and (at least when they were new) performance problems.  I won't go into detail, just mention them on a technical email list and you'll get an earful.

Here's one example that hits me a lot. On my Linux box, if I run an update of Firefox, my current Firefox browser keeps running. However, the next time it needs to load a shared library, it loads the upgraded version, which is incompatible, and Firefox goes bonkers and/or crashes. On the Mac and Windows this doesn't happen because the installer waits for you to close any Firefox instances before continuing.

Google Chrome browser does its updates in the background while you use it. The user doesn't have to wait for any painful upgrade notification. Instead, the next time they run Chrome they are simply told that they are now running the newest release. I call this a "parent-friendly" feature because the last time I visited my mom much of her software had been asking to be upgraded for months.  I wish it could have just upgraded itself and kept my mom's computer more secure. ACM has an article by the Chrome authors about why automatic upgrades are a key security issue. (with graphs of security attacks vs. upgrade statistics)

However, if Google Chrome upgrades itself in place, how does it keep running without crashing? Well, it turns out, they use a technique called the LinuxZygote.  The libraries they need are loaded at startup into a process which then fork()s any time they need, for example, a renderer. The Zygote pattern is usually done for systems that have a slow startup time. However, they claim that in their testing there was no performance improvement. They do this to make the system more stable.

Read the (brief) article for more info: LinuxZygote


Posted by Tom Limoncelli in Technical Tips

(I'm setting up Debian PPC on an old PowerBook G4.)

The installation went really well.  I downloaded the stable 5.0.2 DVD image and burned it onto a DVD from my Mac (note: Safari warned that the file system might be corrupted, but I ran "md5" on the .iso and the output matched what the web site said it should be). It booted without incident and I was able to go through the entire installation without a hitch.  I am cheating a little since I'm not doing a multi-boot.  I hear that is more difficult.

When the machine booted the first time I was able to log in!  Sadly, the touchpad wasn't working, and there was only so much I could do from the keyboard.

Using TAB and SPACEBAR I was able to navigate around a little.  Sometimes I would get into a corner where neither TAB nor SPACEBAR was much help.

Luckily you can always log out of an X11 session by pressing CTRL-OPTION-BACKSPACE. Warning: this zaps the entire X11 window session.  All your apps are instantly killed. You are logged out.  Don't press it unless you mean it.  (And, yes, the keyboard sequence is an homage to CTRL-ALT-DEL).  While this wasn't the best option, sometimes it was all I had.

To fix these problems I thought the best thing to do would be to SSH to it from another machine.  The default Debian configuration doesn't include openssh-server, just the -client.  This is wise from a security standpoint, but wasn't helping me fix the machine.

From the initial login screen I was able to set up a "Failsafe" xterm window.  From there I could become root.  "apt-get install ssh" tried to do the right thing, but it couldn't access the DVD drive.

"ls /dev" wasn't showing very much.  No /dev/sd* or hd* or sr0 (CD-ROM) at all.  This was distressing.  My touchpad wasn't working, my CD-ROM (well, DVD) wasn't showing up.

I couldn't load new packages if the DVD didn't work.  I couldn't fix the machine if I couldn't SSH in.  Ugh.

I searched a lot of web sites for information about how to fix this and nearly gave up.

Finally I remembered that in the old days zapping the "PRAM" fixed a heck of a lot of problems.  The PRAM is a battery-backed bit of RAM (or NVRAM) that stores a few critical settings like boot parameters and such.  To zap the PRAM, you boot while holding these four keys: Command, Option, P and R.  It takes some practice.

After zapping the PRAM Debian booted and the mouse and touchpad magically worked.  When I logged in, I could see that the DVD was working.  "apt-get install ssh" worked without a hitch.  The DVD had automatically been detected and mounted.  I was impressed!

"ls /dev" now showed many, many more devices.

Later I installed SSH ("apt-get install ssh"), configured my SSH keys so I can log in easily from my primary computer, and even added the Ethernet MAC address to my DHCP server so that the machine always gets the same IP address.
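For anyone curious what that last step looks like: if your DHCP server is ISC dhcpd, the reservation is a small host block like this (the hostname, MAC address, and IP address here are made up):

    host powerbook {
        hardware ethernet 00:11:22:33:44:55;   # the laptop's Ethernet MAC address
        fixed-address 192.168.1.50;            # the address it should always receive
    }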

To be honest, I don't know if zapping the PRAM fixed it or if it was just the reboot.  udevd may not have started before (I forgot to check).  Either way, I was very happy that things worked.  I started up a web browser, went to www.google.com and when it came up it felt like home.


Posted by Tom Limoncelli in Technical Tips

You know that here at E.S. we're big fans of monitoring. Today I saw on a mailing list a post by Erinn Looney-Triggs, who wrote a module for Nagios that uses dmidecode to gather a Dell's serial number and then uses Dell's web API to determine whether the machine is near the end of its warranty period. I think that's an excellent way to prevent what can be a nasty surprise.
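If you want to try the first half of that trick by hand, dmidecode can read the serial number (the Dell "service tag") straight out of the SMBIOS data; run it as root:

    dmidecode -s system-serial-number   # prints the serial number the warranty lookup needs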

Link to the code is here: Nagios module for Dell systems warranty using dmidecode (Github project here)

What unique things do you monitor for on your systems?

Posted by Tom Limoncelli in Technical Tips

Google has enabled IPv6 for most services but ISPs have to contact them and verify that their IPv6 is working properly before their users can take advantage of this.

I'm writing about this to spread the word.  Many readers of this blog work at ISPs and hopefully many of them have IPv6 rolled out, or are in the process of doing so.

Technically here's what happens:  currently, DNS lookups of www.google.com return A records (IPv4) and no AAAA records (IPv6).  If you run an ISP that has rolled out IPv6, Google will add you (your DNS servers, actually) to a whitelist on Google's DNS servers.  After that, your users' DNS queries for www.google.com will return both A and AAAA records.
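You can check what answers your resolver is getting from any machine with dig; the point is to compare the two record types:

    dig www.google.com AAAA +short   # empty output means no AAAA records are being returned to you
    dig www.google.com A +short      # the normal IPv4 answer, for comparison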

What's the catch?  The catch is that they are enabling it on a per-ISP basis. So, you need to badger your ISP about this.

Why not just enable it for all ISPs?  There are some OSs that have default configurations that get confused if they see an AAAA record yet don't have full IPv6 connectivity.  In particular, if you have IPv6 enabled at your house, but your ISP doesn't support IPv6, there is a good chance that your computer isn't smart enough to know that having local IPv6 isn't the same as IPv6 connectivity all the way across the internet.  Thus, it will send out requests over IPv6 which will stall as the packets get dropped by the first non-IPv6 router (your ISP).

Thus, it is safer to send AAAA records only to users of ISPs that are known to really support IPv6.  Eventually this kind of thing won't be needed, but for now it is a "better safe than sorry" measure.  Hopefully if a few big sites do this then the internet will become "safe" for IPv6 and everyone else won't need to take such measures.

If none of this makes sense to you, don't worry. It is really more important that your ISP understands.  Though, as a system administrator it is a good idea to get up to speed on the issues.  I can recommend 2 great books:
The Google announcement and FAQ is here: Google announces "Google over IPv6". Slashdot has an article too.
Everyone from Slashdot to people I talk with on the street is shocked, shocked, shocked by the report in the New York Times that TXTing costs carriers almost nothing, even though they've been raising the price dramatically.  (SMS is "Short Message Service", the technical name for what Americans call "TXTing" and what the rest of the world calls "SMS".)

People have asked me, "Is this true?" (it is) so I thought this would be a good time to explain how all of this works.

The phone system uses a separate network for "signaling", i.e. messages like "set up a phone call between +1-862-555-1234 and +353(1)555-1234".  The fact that it is a separate network is partly for security: when signaling was "in band" it was possible for phone users to play the right tones and act just like an operator (see Phreaking).  It is also for speed; one wants absolute priority for signaling data.

The protocol is called "SS7" (Signaling System 7).  Like most telco protocols it is difficult to parse and ill-defined.  This is how telcos keep new competition from starting.  They hype SS7 as something so complicated that only rocket scientists could ever understand it.  Of course, it is an ITU standard, so it isn't a secret how it works.  You just have to pay a lot of money to get a copy of the standard. In fact, once Cisco had a working SS7 software stack, the downfall of Lucent/AT&T/others was only years away.  Heck, Cisco published a book demystifying SS7.  It turns out the emperor had no clothes and Cisco wanted everyone to know.  SS7 is big and scary, but only as bad as most protocols. I guess SMTP or SNMP would be scary too if you had never seen a protocol before. (Remember that non-audio networks are still "new" to the telecom world, or at least to their executives.)

SS7 is all about setting up "connections".  When I dial a number, SS7 packets are sent out that query databases to translate the phone number I want to dial to a physical address to connect to, then an SS7 query goes out to request that all the phone switches from point A to point B allocate bandwidth and start letting audio through.  The nomenclature dates back to what was used when phone calls were set up by ladies sitting in front of switchboards.

What makes international dialing work is that there are SS7 gateways between all the carriers.  They don't charge each other for this bandwidth because it is just the cost of doing business.  The logs of what calls are actually made are used to create billing records, and the carriers do charge each other for the actual calls.  Thus, there is no charge for the SS7 packets between AT&T and O2 (O2 is a big cell provider in Europe), but O2 does back-bill AT&T for the phone call that was made. (This is called "settlement" and my previous employer processed 80% of the world's settlement records on behalf of the phone companies.)

Setting up a connection for an SMS would be silly.  An entire connection for just a 160-character message?  No way.  That's more trouble than it is worth.  Therefore, SMS is the only service where the actual service is delivered over SS7 itself.  The 160-character limit comes from the size limit of the underlying SS7 message.

However, the phone companies don't really do anything for free.  The SMS records are used to construct billing data and the companies certainly do back-bill each other for SMS carried by each other's networks.  If you SMS from AT&T to O2, there is settlement going on after the fact. However, SMS between two AT&T customers has no real cost.

"Multimedia SMS" (photos) are not sent over SS7, though SS7 is used to setup/teardown the connection just like a phone call.  If they were smart they'd use SS7 to just transmit an email address and then send the photo over the internet.  It would probably be cheaper.  (Though, when has a telco has a well-run email system?  Sigh.)

So, SMS is "free" because it rides on the back of pre-existing infrastructure.  The "cost" is due to the false economics created to "extract value" out of the system (i.e. "charge money").

If they were doing it all from scratch, they could probably run it all over the internet for "free" too.  Heck, it wouldn't be much bandwidth even if people learned to type 100x faster.

Why was SMS permitted to use SS7 unlike any other service? The real reason, I'm told, wasn't entirely technical.  It was due to the fact that the telcos thought that nobody would actually use the service. Little did they know that it would catch on among teens and then spread!

More info:

Posted by Tom Limoncelli in Technical Tips

Amazon's Kindle

I got a demo of Amazon's Kindle the other day and was very impressed. I hadn't realized that it had a built-in cellphone-based data connection so you could always download more content. The speed was a little slow, but for reading a book I think it was perfect. I'm considering getting one.

Today I got email from Amazon reminding me that if I shill for them on my blog, readers can get a $100 discount. You just have to apply for an Amazon credit card and use this link.

Do I feel bad about shilling for Amazon? Well, not if it gets my readers a $100 discount. It is a product that friends of mine are happy with and I'm impressed by the demos I've seen.

Posted by Tom Limoncelli in Technical Tips

April Showers bring May Flowers. What does May bring? Three-day weekends that make A/C units fail!

This is a good time to call your A/C maintenance folks and have them do a check-up on your units. Check for loose or worn belts and other problems. If you've added more equipment since last summer your unit may now be underpowered. Remember that if your computers consume 50 kW of power, they turn essentially all of it into heat, and your A/C has to be able to remove that same 50 kW of heat from the room. That's the laws of physics speaking; I didn't invent that rule. Every watt that goes into the equipment has to be pumped back out again as heat.
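As a rough back-of-the-envelope check (numbers rounded):

    50 kW x 3,412 BTU/hr per kW  =  ~170,000 BTU/hr of heat
    170,000 BTU/hr / 12,000 BTU/hr per "ton" of cooling  =  ~14 tons of cooling capacity

If your unit is rated for much less than that, you already know how the long weekend ends.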

Why do A/C units often fail on a 3-day weekend? During the week the office building has its own A/C. The computer room's A/C only has to remove the heat generated by the equipment in the room. On the weekends the building's A/C is powered off and now the 6 sides (4 walls, floor and ceiling) of the computer room are getting hot. Heat seeps in. Now the computer room's A/C unit has more work to do.

A 3-day weekend is 84 hours (Friday 6pm until Tuesday 6am). That's a lot of time to be running continuously. Belts wear out. Underpowered units overheat and die. Unlike a home A/C unit which turns on for a few minutes out of every hour, a computer-room A/C unit ("industrial unit") runs 12-24 hours out of every day. Industrial cooling costs more because it is an entirely different beast. Try waving your arms for 5 minutes per hour vs. 18 hours a day.

Most countries have a 3-day weekend in May. By the 2nd or 3rd day the A/C unit is working as much as a typical day during the summer. If it is about to break, this is the weekend it will break.

To prevent a cooling emergency make sure that your monitoring system is also watching the heat and humidity of your room. There are many SNMP-accessible units for less than $100. Dell recommends that machines not run in a room hotter than 35C. I generally recommend that your monitoring system alert you at 33C; if you see no sign of it improving on its own in the next 30 minutes, start powering down machines. If that doesn't help, power them all off. (The Practice of System and Network Administration has tips about creating a "shutdown list".)

Having the ability to remotely power off machines can save you a trip to the office. Most Linux systems have a "poweroff" command that is like "halt" but does the right thing to tell the motherboard to literally power off. If the server doesn't have that feature (because you bought it in the 1840s?), shutting it down and leaving it sitting at a "press any key to boot" prompt generates little heat compared to a machine that is actively processing.

If powering off the non-critical machines isn't enough, shut down critical equipment, but not the equipment involved in letting you access the monitoring systems (usually the network equipment). That way you can bring things back up remotely. Of course, as a last resort you'll need to power off those bits of equipment too.
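The shutdown list doesn't have to be fancy, either. A minimal sketch (the hostnames are made up) that walks the list from least critical to most critical:

    # power off expendable machines first; stop when the temperature recovers
    for host in build01 batch02 test03; do
        ssh root@$host poweroff
    done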

Having cooling emergency? Cooling units can be rented on an emergency basis to help you through a failed cooling unit, or to supplement a cooling unit that is underpowered. There are many companies looking to help you out with a rental unit.

If you have a small room that needs to be cooled (a telecom closet that now has a rack of machines), I've had good luck with a $300-600 unit available at Walmart. For that price it isn't great, but I can buy one in less than an hour without having to wait for management to approve the purchase. Heck, for that price you can buy two and still be below the spending limit of a typical IT manager.

The Sunpentown 1200 and the Amcor 12000E are models that one can purchase for about $600 that re-evaporate any water condensation and exhaust it with the hot air. Not having to empty a bucket of water every day is worth the extra cost. These units are intended for home use, so don't try to use one as a permanent solution. (Not that I didn't use one for more than a year at a previous employer. Ugh.)

They have one flaw: after a power outage they default to being off. I guess that is typical of a consumer unit. Be sure to put a big sign on it that explains exactly what to do to turn it back on after a power outage. (The sign I made says step by step what buttons to press, and what color each LED should be if it is running properly. I then had a non-system administrator test the process.)

In summary: test your A/C units now. Monitor them, especially on the weekends. Be ready with a backup plan if your A/C unit breaks. Do all this and you can prevent an expensive and painful meltdown.

Posted by Tom Limoncelli in Best of Blog, Technical Tips

HostDB 1.002 released!

A few years ago I released HostDB, my simple system for generating DNS domains. The LISA paper that announced it was called: HostDB: The Best Damn host2DNS/DHCP Script Ever Written.

I just released 1.002 which adds some new features that make it easier to generate MX records for domain names with no A records, and not generate NS records for DNS masters. Other bug fixes and improvements are included.

HostDB is released under the GPL, supported on the HostDB-fans mailing list, and supported by the community. This recent release includes patches contributed by Sebastian Heidl.

HostDB 1.002 is now available for download.

Posted by Tom Limoncelli in Technical Tips

Managing Xen instances is a drag, so my buddies in the Google Zürich office built a system for managing them. Now life is great! The team I manage has put Xen clusters all over the world, all managed with Ganeti. It rocks. I'm proud to see it is now available to everyone under the GPLv2 license.

When I first heard the name, I thought it sounded like a new kind of Italian dessert. But what do you expect from a guy with a last name like "Limoncelli"?

Posted by Tom Limoncelli in Technical Tips

Hardware didn't use to have passwords. Your lawnmower didn't have a password, your car didn't have a password, and your waffle iron didn't have a password.

But now things are different. Hardware is much smarter and now often requires a password. Connecting to the console of a Cisco router asks for a password. A Toyota Prius has an all-software entry system.

Posted by Tom Limoncelli in Technical Tips

There is an anti-spam technique called "greylisting" which has almost completely eliminated spam from my main server. What's left still goes through my SpamAssassin and Amavis-new filters, but they have considerably less work to do.

The technique is more than a year old, but I only installed a greylist plug-in recently and I'm impressed at how well it works. I hope that by writing this article, other people who have procrastinated will decide to install a greylist system.
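For example, if your MTA is Postfix, the usual plug-in is the postgrey policy daemon. A minimal sketch (the port is the Debian default; package names, paths, and ports vary by distribution):

    # after installing postgrey, add the policy check to /etc/postfix/main.cf:
    smtpd_recipient_restrictions =
        permit_mynetworks,
        reject_unauth_destination,
        check_policy_service inet:127.0.0.1:10023
    # then: postfix reload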

Posted by Tom Limoncelli in Technical Tips

If you write to a file that is SUID (or SGID) the SUID (and SGID) bits on the file are removed as a security precaution against tampering (unless uid 0 is doing the writing).

(See FreeBSD 5.4 source code, sys/ufs/ffs/ffs_vnops.c:739)
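You can watch the bit get cleared yourself. A quick demo as a non-root user (the comments describe what you should see on FreeBSD; most Linux systems behave the same way):

    $ touch demo && chmod 4755 demo
    $ ls -l demo        # note the "s": -rwsr-xr-x
    $ echo data >> demo
    $ ls -l demo        # the setuid bit is gone: -rwxr-xr-x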

Posted by Tom Limoncelli in Technical Tips

The Jifty buzz

Everyone who has seen me speak knows that I love RT for tracking user requests. I was IMing with the author of RT today and he said that for his next product he realized he should first write a good tool that lets him build AJAXy applications without having to do all the work manually. He's done that, and it's called Jifty. Now he's building apps based on it. The first one has as many features as RT but 1/10th the code base. Awesome! Sounds like Jifty is going to be a big hit! (You can find Jifty in CPAN already.)

Oh, and what's the new app called? Hiveminder.

Let the rumors fly! :-)

Posted by Tom Limoncelli in Technical Tips

It's obvious but I didn't think of one particular reason why until the end of this journey.

Read more...

Posted by Tom Limoncelli in Technical Tips

techtarget.com reports:
The problem is, directing cold air is like trying to herd cats. Air is unpredictable. Your cooling unit is sucking in air, cooling it and then throwing it up through a perforated floor. But you have little control over where that air is actually ending up.
Two different vendors are promoting more aggressive cooling systems for modern racks.

Posted by Tom Limoncelli in Technical Tips

Ars Technica has an excellent article about MSH.

If you love perl and/or bash, you'll be interested in reading this tutorial. It gives some excellent examples that explain the language.

Posted by Tom Limoncelli in Technical Tips

"When I see a person I don't recognize in the office, I always smile, stop, introduce myself, and ask for the person's name. I then ask to read it off his ID badge "to help me remember it. I'm a visual learner." New people think I'm being friendly. I'm really checking for trespassers."
This and other great tips can be found here.

Posted by Tom Limoncelli in Technical Tips

A while back I recommended BlastWave as a great source of pre-built binaries for Solaris. Their service has saved me huge amounts of time.

Sadly, they are running low on funds. It's expensive to keep a high-profile web site like this up and running. Corporate donors are particularly needed.

I just donated $50. I hope you consider donating to them too. Otherwise, in less than 48 hours, they may have to shut down.

Posted by Tom Limoncelli in Technical Tips

Solaris package tip

Since I'm more of an OS X/FreeBSD/Linux person lately, I've gotten a bit out of touch with Solaris administration. I was quite pleasantly surprised to find CSW - Community SoftWare for Solaris, which includes hundreds of pre-built packages for Solaris. More importantly, it provided the three I really needed and didn't have time to build. :-)

The system is really well constructed. I highly recommend it to everyone. Give this project your support!

Posted by Tom Limoncelli in Technical Tips
