Recently in Google Category

Today is my last day at Google. After 7 years, I'm looking forward to doing nothing for a while, writing a book or two (oh yeah, I have a big announcement: I've signed 2 book contracts! More info soon!), and getting married.

Please, no speculation on why I'm leaving. I was at Bell Labs 7 years too. It's just time.

(Fun fact: I found a draft of a "goodbye message" I wrote. The file's datestamp was Nov 10, 2010.)

The annoying thing about job hunting is that you usually have to take random days off from your current job, claiming "something came up", taking vacation days, or faking sick days. It is disruptive to coworkers, especially if you are in a team environment with lots of meetings. This time will be different: I'll be free to go at my own pace. (I'm looking only in NYC at this time.)

Officially I'm taking 4 days of vacation so that my last day is April 1st. Yes, my last day is April Fools' Day. This is not a PR stunt to promote the April Tools RFCs book, but wouldn't that have been hilarious if it were?

Tom

My 60-minute talk on Ganeti from the Usenix LISA '12 conference has been posted:

https://www.usenix.org/conference/lisa12/ganeti-your-private-virtualization-cloud-way-google-does-it

Ganeti is a cluster-based virtual server management tool built on top of existing virtualization technologies such as Xen or KVM and other open source software. Ganeti takes care of disk creation, migration, OS installation, shutdown, and startup, and can be used to preemptively move a virtual machine off a physical machine that is starting to get sick. It doesn't require a big expensive SAN, complicated networking, or a lot of money. The project is used around the world by many organizations; it is sponsored by Google and hosted at http://code.google.com/p/ganeti.
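To give a rough feel for the kind of day-to-day operation the talk covers, here is a minimal Python sketch of preemptively evacuating a node that looks sick. It is not from the talk; it simply shells out to the standard gnt-instance command line, and the exact flags and output fields can vary between Ganeti versions, so treat it as an illustration rather than a tested script.

#!/usr/bin/env python
# Illustrative only: drain a "sick" Ganeti node by live-migrating every
# instance whose primary node it is. Flags/fields may differ by version.
import subprocess
import sys

def instances_on(node):
    # Ask Ganeti for "instance-name primary-node" pairs, one per line.
    out = subprocess.check_output(
        ["gnt-instance", "list", "--no-headers", "-o", "name,pnode"])
    return [name for name, pnode in
            (line.split() for line in out.decode().splitlines())
            if pnode == node]

def evacuate(node):
    for inst in instances_on(node):
        # Live-migrate the instance to its (DRBD) secondary node.
        subprocess.check_call(["gnt-instance", "migrate", "-f", inst])
        print("migrated " + inst)

if __name__ == "__main__":
    evacuate(sys.argv[1])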

Thanks to Usenix for making these videos available to the public for free!

Posted by Tom Limoncelli in Google

The photos look like "IBM meets Willy Wonka's Chocolate Factory".

For the first time, the company has invited cameras inside its top-secret facility in North Carolina. Our tour guide is Google's senior vice president, Urs Hoelzle, who's in charge of building and running all of Google's data centers. 'Today we have 55,200 servers on this floor. Each of these servers is basically like a PC, except somewhat more powerful.'

The Wired article by Steven Levy:

http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/

The Google announcement:

http://googleblog.blogspot.nl/2012/10/googles-data-centers-inside-look.html

Walk through it using StreetView:

Video Tour:

Posted by Tom Limoncelli in Google

[A Google SRE is a Site Reliability Engineer: one of the sysadmins who manage the servers that provide Google's internal and external services, i.e. not the ones who do office IT like helpdesk and maintaining the printers. This is now what some people call "devops", but it preceded the term by a few years.]

Someone recently asked me if Google's SRE position has changed over the years. The answer is 'yes and no'.

Yes, the job has changed because there is more diversity in the kind of work that SREs do. Google has more products and therefore more SRE teams. Each team is unique, but we all work under the same mission, executive management, and best practices. All teams strive to use the same best practices for release engineering, operational improvements, debugging, monitoring, and so on. At the same time, since each SRE team is responsible for a different product with different needs, you'll find each one has unique priorities. I do like the fact that there is so much sharing of tools; something one team invents usually helps all teams. My team might find X is a priority while others don't: we make a tool that makes X better and share it; soon everyone is using it.

On the other hand, no, the job hasn't changed because the skill-set we look for in new recruits hasn't changed: an understanding of the internals of networking, Unix, storage, and security, plus the ability to automate what you do.

Another thing that hasn't changed is that SREs generally don't work at the physical layer, but we must understand it. The products we manage run in datacenters around the world, and we don't get to visit them in person. You don't spend time cabling up new machines, configuring network ports, or fighting with a vendor over which replacement part needs to be shipped. We have awesome datacenter technicians who take care of all that (note: since we build our own machines, even the way we handle repairs is different). The project I'm on has machines in countries I've never been to. News reporters tend not to understand this: I'm in the NYC office, and I find it adorable to read articles by misguided reporters who assume their Gmail messages are kept at the corner of 14th and 8th Ave.

On the subject of what we look for when recruiting new SREs: we look for experience with scale (number of machines, quantity of disk, number of queries per second). Other companies don't have the scale we have, so we can't expect candidates to have experience with our kind of scale; instead we look for people who have the potential to step up to it. We write our own software stack (web server, RPC, etc.), so we can't look for people who have experience with those tools; instead we look for people who are technical enough to learn, use, and debug them.

At our scale we can't do very much manually. A "1 in a million" task that would be done manually at most companies has to be automated at Google because it probably happens 100 times a day. Therefore, SREs spend half their time writing code to eliminate what they do the other half of their day. When they "automate themselves out of a job" it is cause for celebration and they get to pick a new project. There are always more projects.
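As a purely hypothetical illustration of what "automating yourself out of a job" looks like (the names and APIs below are made up, not Google code), a runbook step that a human would perform occasionally elsewhere becomes a small function that a sweep runs continuously across the fleet:

# Hypothetical sketch only -- not Google code. The idea: a "once in a
# million" manual step ("if a task's queue backs up, drain it and file a
# ticket") becomes code once it happens hundreds of times a day.
import logging

QUEUE_LIMIT = 10000  # made-up threshold for this sketch

def check_and_repair(task, monitoring, ticketing):
    """Automate one formerly-manual runbook step for a single task."""
    depth = monitoring.queue_depth(task)      # assumed monitoring API
    if depth <= QUEUE_LIMIT:
        return False
    logging.warning("%s queue depth %d, draining", task.name, depth)
    task.drain()                              # stop sending it new work
    ticketing.file(component=task.component,  # assumed ticketing API
                   summary="auto-drained %s" % task.name)
    return True

def sweep(fleet, monitoring, ticketing):
    # Run the check across every task instead of waiting for a page.
    return sum(check_and_repair(t, monitoring, ticketing) for t in fleet)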

If you are interested in learning about what kind of internal systems Google has, I highly recommend reading some of the "classic" papers that Google has published over the years. The most useful are:

Those last few papers are recent enough that most people aren't aware of them. In fact, I, myself, only recently read the Dremel paper.

While the papers focus on how the systems work, remember that there are SREs behind each one of them keeping them running. To run something well you must understand its internals.

You might also want to read them just because understanding these concepts is a useful education in itself.

Posted by Tom Limoncelli in Google

I'm so proud of my coworker Thomas Bushnell for giving an amazing talk at LinuxCon, the Linux Foundation's annual North American technical conference. For the first time, Google revealed details about how it manages thousands of Linux desktops. We start with Ubuntu LTS, add packages that secure it and let us manage it better, and ta-da!

Read the entire article!

Posted by Tom Limoncelli in Google