Recently in Google Category

Networking geeks: Google made a big announcement about BBR this week. Here's a technical deep-dive: (Hint: if you read ACM Queue like I keep telling you to, you'd have known about this before all your friends.)

Someone on Facebook asked me for an "explain it like I'm 5 years old" explanation. Here's my reply:

Short version: Google changed the TCP implementation (their network stack) and now your YouTube videos, Google websites, Google Cloud applications, etc. download a lot faster and smoother. Oh, and it doesn't get in the way of other websites that haven't made the switch. (Subtext: another feature of Google Cloud that doesn't exist at AWS or Azure. Nothing to turn on, no extra charge.)

ELI5 version: TCP tries to balance the need to be fast and fair. Fast... transmitting data quickly. Fair... don't hog the internet, share the pipe. Being fair is important. In fact, it is so important that most TCP implementations use a "back off" algorithm that results in you getting about 1/2 the bandwidth of the pipe... even if you are the only person on it. That's TCP's dirty little secret: it under-utilizes your network connection by as much as 50%.

Backoff schemes that use more than 1/2 the pipe tend to crowd out other people, thus are unfair. So, in summary, the current TCP implementations prioritize fairness over good utilization. We're wasting bandwidth.
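That "back off" behavior can be sketched in a few lines. This is a toy model of the classic additive-increase/multiplicative-decrease (AIMD) scheme, not any real TCP implementation; the capacity and growth numbers are made up for illustration. The point is the sawtooth: the instant the pipe fills, the sender drops back to half of it.

```python
def aimd(capacity=100, rounds=1000):
    """Toy AIMD loop: grow the congestion window by one segment
    per round, and cut it in half whenever the pipe is full
    (standing in for packet loss)."""
    cwnd = 1.0
    history = []
    for _ in range(rounds):
        history.append(min(cwnd, capacity))  # bandwidth actually used
        if cwnd >= capacity:   # "loss": we hit the pipe's limit
            cwnd /= 2          # multiplicative decrease: back off to half
        else:
            cwnd += 1          # additive increase: probe gently upward
    return history

h = aimd()
# Right after every backoff, the sender uses only half the pipe,
# even with nobody else on it; usage then climbs back up and repeats.
```

Running this shows the sawtooth oscillating between 50% and 100% of the pipe, which is the under-utilization the post is describing.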

Could we do better? Yes. There are better backoff algorithms but they are so much work that they are impractical. For years researchers have tried to make better schemes that are easy to compute. (As far back as the 1980s researchers built better and better simulations so they could experiment with different backoff schemes.)

Google is proposing a new backoff algorithm called BBR. It has reached the holy grail: it is both fast and fair. If a network pipe has only one user, they basically use the whole thing. If many users are sharing a pipe, it shares it fairly. You get more download speed over the same network. Not only that, it doesn't require changes to the internet, just the sender.

And here's the really amazing part: it works best if you implement BBR on both the client and the server, but it works pretty darn well if you only change the sender's software (i.e. Google updated their web frontends and you don't have to upgrade your PC). Wait! Even more amazing is that it doesn't ruin the internet if some people use it and some people use the old methods.
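To make the contrast with the backoff sawtooth concrete, here is a rough sketch of BBR's core idea: instead of reacting to loss, model the path by its maximum observed delivery rate (the bottleneck bandwidth) and its minimum observed round-trip time (the propagation delay), then keep about one bandwidth-delay product (BDP) of data in flight. This is only the model, not the actual algorithm (the real BBR also probes, paces, and cycles gains); the sample numbers are invented.

```python
def bbr_operating_point(delivery_rates, rtts):
    """Sketch of BBR's path model: the best the path can do is
    its highest observed delivery rate at its lowest observed RTT."""
    btl_bw = max(delivery_rates)   # bottleneck bandwidth estimate, bytes/sec
    rt_prop = min(rtts)            # propagation delay estimate, seconds
    bdp = btl_bw * rt_prop         # bytes to keep in flight: exactly one BDP
    return btl_bw, bdp

# Hypothetical measurements: ~1.25 MB/s bottleneck, ~40 ms propagation delay.
rate, inflight = bbr_operating_point(
    [1.2e6, 1.25e6, 1.1e6],       # bytes/sec
    [0.040, 0.052, 0.045],        # seconds
)
```

Sending at `rate` with `inflight` bytes outstanding fills the pipe without stuffing extra packets into router queues, which is why a lone BBR flow can use the whole link.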

They've been talking about it for nearly a year at conferences and such. Now they've deployed it in production. You get less "buffering.... buffering..." even on mobile connections. BBR is enabled "for free" for all Google Cloud users.
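If you run your own Linux servers you don't have to wait for anyone: BBR shipped in the mainline kernel starting with 4.9. These sysctl settings (in /etc/sysctl.conf or a drop-in file) enable it, paired with the fq packet scheduler as recommended when BBR was released:

```
# Pace outgoing packets with the fq qdisc (recommended alongside BBR)
net.core.default_qdisc = fq
# Use BBR congestion control for outbound TCP connections
net.ipv4.tcp_congestion_control = bbr
```

Remember the point above: only the sender needs this, so enabling it on a web server speeds up downloads for unmodified clients.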

With that explanation, you can probably read the ACM article a bit easier. Here's the link again:

Disclaimer: I don't own stock in Google, Amazon or Microsoft. I don't work for any of them. I'm an ex-employee of Google. I use GCP, AWS and Azure about equally (nearly zero).

Posted by Tom Limoncelli in Google

Today is my last day at Google. After 7 years I'm looking forward to doing nothing for a while, writing a book or two (oh yeah, I have a big announcement: I've signed 2 book contracts! More info soon!), and I'm getting married.

Please, no speculation on why I'm leaving. I was at Bell Labs 7 years too. It's just time.

(FunFact: I found a draft of a "goodbye message" I wrote. The file's datestamp was Nov 10, 2010.)

The annoying thing about job hunting is that usually you have to take random days off from your current job claiming "something came up" or taking vacation days or faking sick days. It is disruptive to coworkers, especially if you are in a team environment with lots of meetings. This time will be different: I'll be free to go at my own pace. (I'm looking only in NYC at this time.)

Officially I'm taking 4 days of vacation so that my last day is April 1st. Yes, my last day is April Fools Day. This is not a PR stunt to promote the April Fools RFCs book, but wouldn't it have been hilarious if it were?


My 60-minute talk on Ganeti from the Usenix LISA '12 conference has been posted:

Ganeti is a cluster virtual server management software tool built on top of existing virtualization technologies such as Xen or KVM and other Open Source software. Ganeti takes care of disk creation, migration, OS installation, shutdown, startup, and can be used to preemptively move a virtual machine off a physical machine that is starting to get sick. It doesn't require a big expensive SAN, complicated networking, or a lot of money. The project is used around the world by many organizations; it is sponsored by Google and hosted at
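For a flavor of what that looks like day to day, here are a few illustrative Ganeti commands (flags vary by version, and the instance/node names here are made up; see the talk for real workflows):

```
# Create a new instance with DRBD-replicated disks -- no SAN needed
gnt-instance add -t drbd -n node1:node2 -o debootstrap+default web1

# Live-migrate it to its secondary node, e.g. before maintenance
# on a physical machine that is starting to get sick
gnt-instance migrate web1

# See every instance in the cluster and where it is running
gnt-instance list
```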

Thanks to Usenix for making these videos available to the public for free!

Posted by Tom Limoncelli in Google

The photos look like "IBM meets Willy Wonka's Chocolate Factory".

For the first time, the company has invited cameras inside its top secret facility in North Carolina. Our tour guide is Google's senior vice president, Urs Hoelzle, who's in charge of building and running all of Google's data centers. 'Today we have 55,200 servers on this floor. Each of these servers is basically like a PC, except somewhat more powerful.'

The Wired article by Steven Levy:

The Google announcement:

Walk through it using StreetView:

Video Tour:

Posted by Tom Limoncelli in Google

[A Google SRE is a Site Reliability Engineer: that is, one of the sysadmins who manage the servers that provide Google's internal and external services, i.e. not the ones who do office IT like helpdesk and maintaining the printers. This is roughly what some people now call "devops", but it preceded the term by a few years.]

Someone recently asked me if Google's SRE position has changed over the years. The answer is "yes and no".

Yes, the job has changed because there is more diversity in the kind of work that SREs do. Google has more products and therefore more SRE teams. Each team is unique, but we all work under the same mission, executive management, and best practices. All teams strive to use the same best practices for release engineering, operational improvements, debugging, monitoring, and so on. At the same time, since each SRE team is responsible for a different product with different needs, each one has its own priorities. I do like the fact that there is so much sharing of tools; something one team invents usually helps all teams. My team might find X is a priority while others don't: we make a tool that makes X better and share it; soon everyone is using it.

On the other hand, no, the job hasn't changed because the skill-set we look for in new recruits hasn't changed: understanding of the internals of networking, Unix, storage, and security, plus the ability to automate what you do.

Another thing that hasn't changed is that SREs generally don't work at the physical layer, but we must understand the physical layer: the product(s) we manage are run from datacenters around the world, and we don't get to visit them personally. You don't spend time cabling up new machines, configuring network ports, or fighting with a vendor over which replacement part needs to be shipped. We have awesome datacenter technicians that take care of all that (note: since we build our own machines, even the way we handle repairs is different). The project I'm on has machines in countries I've never been to. News reporters tend not to understand this. I'm in the NYC office, and I find it adorable to read articles by misguided reporters who assume their Gmail messages are kept at the corner of 14th Street and 8th Avenue.

On the subject of what we look for when recruiting new SREs: we look for experience with scale (number of machines, quantity of disk, number of queries per second). Other companies don't have the scale we have, so we can't expect candidates to have experience with our kind of scale; instead we look for people that have the potential to step up to our scale. We write our own software stack (web server, RPC, etc) so we can't look for people that have experience with those tools; instead we look for people that are technical enough to be able to learn, use, and debug them.

At our scale we can't do very much manually. A "1 in a million" task that would be done manually at most companies has to be automated at Google because it probably happens 100 times a day. Therefore, SREs spend half their time writing code to eliminate what they do the other half of their day. When they "automate themselves out of a job" it is cause for celebration and they get to pick a new project. There are always more projects.

If you are interested in learning about what kind of internal systems Google has, I highly recommend reading some of the "classic" papers that Google has published over the years. The most useful are:

Those last few papers are recent enough that most people aren't aware of them. In fact, I, myself, only recently read the Dremel paper.

While the papers focus on how the systems work, remember that there are SREs behind each one of them keeping them running. To run something well you must understand its internals.

You might also want to read them just because understanding these concepts is a useful education itself.

Posted by Tom Limoncelli in Google

I'm so proud of my coworker Thomas Bushnell for giving an amazing talk at LinuxCon, the Linux Foundation's annual North American technical conference. For the first time Google revealed details about how Google manages thousands of Linux desktops. We start with Ubuntu LTS, add packages that secure it and let us manage it better, and ta-da!

Read the entire article!

Posted by Tom Limoncelli in Google
