Awesome Conferences

Has the job of a Google SRE changedover the years?

[A Google SRE is a Site Reliability Engineer. That is, the sysadmin that manage the servers that provide Google's internal and external services. i.e. not the ones that do office IT like helpdesk and maintaining the printers. This is now what some people call "devops" but it preceded the term by a few years.]

Someone recently asked me if Google's SRE position has change over the years. The answer is 'yes and no'.

Yes, the job has changed because there is more diversity in the kind of work that SREs do. Google has more products and therefore more SRE teams. Each team is unique but we all work under the same mission, executive management, and best practices. All teams strive to use the same best practices for release engineering, operational improvements, debugging, monitoring, and so on. Yes, since each SRE team is responsible for a different product with different needs, you'll find each one can be unique priorities. I do like the fact that there is so much sharing of tools; something one team invents usually helps all teams. My team might find X is a priority while others don't: we make a tool that makes X better and share it; soon everyone is using it.

On the other hand, no, the job hasn't changed because the skill-set we look for in new recruits hasn't changed: understanding of the internals of networking, unix, storage and security with the ability to automate what you do.

Another thing that hasn't changed is that SREs generally don't work at the physical layer but we must understand the physical layer: The product(s) we manage are run from datacenters around the world and we don't get to visit them personally. You don't spend time cabling up new machines, configuring network ports, or fighting with a vendor over which replacement part needs to be shipped. We have awesome datacenter technicians that take care of all that (note: since we build our own machines even the way we handle repairs is different). The project I'm on has machines that are in countries I've never been to. News reporters tend to not understand this.... I'm in the NYC office and I think it is adorable to read articles written by misguided reporters that assume their Gmail messages are kept at the corner of 14th and 8th Ave..

On the subject of what we look for when recruiting new SREs: we look for experience with scale (number of machines, quantity of disk, number of queries per second). Other companies don't have the scale we have, so we can't expect candidates to have experience with our kind of scale; instead we look for people that have the potential to step up to our scale. We write our own software stack (web server, RPC, etc) so we can't look for people that have experience with those tools; instead we look for people that are technical enough to be able to learn, use, and debug them.

At our scale we can't do very much manually. A "1 in a million" task that would be done manually at most companies has to be automated at Google because it probably happens 100 times a day. Therefore, SREs spend half their time writing code to eliminate what they do the other half of their day. When they "automate themselves out of a job" it is cause for celebration and they get to pick a new project. There are always more projects.

If you are interested in learning about what kind of internal systems Google has, I highly recommend reading some of the "classic" papers that Google has published over the years. The most useful are:

Those last few papers are recent enough that most people aren't aware of them. In fact, I, myself, only recently read the Dremel paper.

While the papers focus on how the systems work, remember that there are SREs behind each one of them keeping them running. To run something well you must understand its internals.

You might also want to read them just because understanding these concepts is a useful education itself.

Posted by Tom Limoncelli in Google

No TrackBacks

TrackBack URL: