Awesome Conferences

September 2012 Archives

I'll be doing a 30-minute talk about how to convince your boss to take IPv6 seriously at the Australia IPv6 Summit. I'll be presenting via video conference.

If you are in Australia and/or are concerned with IPv6, please attend this awesome conference. Registration is still open!

Info here: http://www.ipv6.org.au/summit/program.php

Posted by Tom Limoncelli in Speaking

This came up in discussion recently. Here's how I differentiate between a junior and senior sysadmin:

  1. A senior person understands the internal workings of the systems he/she administers and debugs issues from a place of science, not guessing or rote memorization.

  2. A senior person has enough experience to know a problem's solution because he or she has seen and fixed it before (but is smart enough to check that assumption since superficial symptoms can be deceiving).

  3. A senior person automates their way out of problems rather than "working harder". They automate themselves out of a job constantly so they can be re-assigned to more interesting projects.

But most importantly...

A senior person demonstrates technical leadership by creating the processes that other people can follow, thereby enabling delegation and multiplying their effectiveness. Maybe the senior person is the only one technical enough to work out the procedure for replacing a bad disk on a server, but they document it in a way that lets less experienced people do the task. Maybe the senior person is the only one technical enough to set up a massive monitoring system, but they document how to add new devices so that everyone can add to what is monitored. They multiply their effectiveness because they use their knowledge not to do the work themselves, but to make it possible for an army of people to do the work instead. Good documentation is the first step toward automating a process, so by working out the process they start the "guesswork -> repeatable -> automated" life-cycle that repetitive tasks should follow.

The old way is to maintain your "status" by hoarding information: you are the only person who knows how to do things, and that is your power base. The new way is to maintain your "status" by sharing information. Everyone looks up to you because it is your documentation that taught them how to do their jobs. As they learn by following documentation you wrote, they get better at what they do, and soon they are senior too. But now you are the senior person who helped everyone get to where they are today. In terms of corporate status, there is nothing better than that!

Tom

Posted by Tom Limoncelli in Career Advice

Weathering the Unexpected by Kripa Krishnan, Google

For the first time ever, Google discusses its "DiRT" (Disaster Recovery Test) procedure. This is the week of hell in which systems are taken down with little or no notice to verify that all the failure-protection systems work.

Oh yeah... and the funny sidebar at the end was written by me :-)

Enjoy!

P.S. (I take credit for cajoling Kripa into writing the article. I think she did a bang-up job! Go Kripa!!)

Posted by Tom Limoncelli in Technical Management

If you live near Princeton, Trenton or New Brunswick and haven't been to the New Jersey chapter of LOPSA meetings then... what are you waiting for? Seriously, folks! They have free pizza! What could be better than pizza, soda, and geekery?

I'll be the speaker at the Thu, Oct 4, 2012 meeting. My topic will be: "Deploying IPv6 in the enterprise: How to convince your boss to approve your big plan"

The New Jersey chapter is the only LOPSA chapter that hosts its own annual conference. It's a great bunch of folks and I encourage you all to attend.

Information and directions are on their website.

Posted by Tom Limoncelli in Speaking

Spiceworks interview

I've been interviewed on Spiceworks. Read it here:

Check it out!

Interesting trivia: The IT department where my S.O. works uses Spiceworks' ticket system.

Posted by Tom Limoncelli in Book News

I'll be the speaker at the Thursday, October 11, 2012 NYLUG meeting in Manhattan (Chelsea, 14th and 9th ave).

http://nylug.org

I'll be talking about the Ganeti open source project which I'm involved in.

The title of the talk will be: "Ganeti Virtualization Management: Improving the Utilization of Your Hardware and Your Time"

If you are in or near NYC, I hope to see you there! Seating is limited. Please RSVP.

http://nylug.org

Tom

Posted by Tom Limoncelli in Speaking

LOPSA-LA has a dinner on Tuesday, October 16, 2012. I'll be in the area for MacTech and they've asked me to give an after-dinner talk about Time Management.

  • When: Tue, Oct 16, 7pm - 9pm.
  • Location: Sheraton Universal Hotel's Californias Restaurant (333 Universal Hollywood Drive, Universal City, CA 91608)

  • Topic: Time Management for Sysadmins: Impossible or are other people to blame?

  • In this talk I'll explain why the fact that you can't manage your time effectively is everyone else's fault, not yours. I'll blame Darwin, your boss, your users, and maybe even your mom. There are a few solutions, which I'll discuss briefly. Then I'll take Q&A.

If you are in the Los Angeles area I hope to see you there!

More info here: http://www.lopsala.org

Posted by Tom Limoncelli in Speaking

I've gotten a lot of positive feedback about The Limoncelli Test. So much so that Peter Grace and I have put all the material on a website called http://www.OpsReportCard.com.

We hope to add resources that help you achieve these 32 points of enlightenment, but for now it is mostly the same as The Test. We're also considering selling an ebook based on the material. Post to the comments section here if you like that idea.

We hope you enjoy it!

http://www.OpsReportCard.com

Tom

Posted by Tom Limoncelli

I moderated a discussion with Jesse Robbins, Kripa Krishnan, and John Allspaw about Learning to Embrace Failure. This is the first time you'll see Google reveal what they've been doing since 2006. Read the entire discussion in the new issue of ACM Queue magazine: Resilience Engineering: Learning to Embrace Failure

Participants include Jesse Robbins, the architect of GameDay at Amazon, where he was officially called the Master of Disaster. Robbins used his training as a firefighter in developing GameDay, following similar principles of incident response. He left Amazon in 2006 and founded the Velocity Web Performance and Operations Conference, the annual O'Reilly meeting for people building at Internet scale. In 2008, he founded Opscode, which makes Chef, a popular framework for infrastructure automation. Running GameDay operations on a slightly smaller scale is John Allspaw, senior vice president of technological operations at Etsy. Allspaw's experience includes stints at Salon.com and Friendster before joining Flickr as engineering manager in 2005. He moved to Etsy in 2010. He also recently took over as chair of the Velocity conference from Robbins. Google's equivalent of GameDay is run by Kripa Krishnan, who has been with the program almost from the time it started six years ago. She also works on other infrastructure projects, most of which are focused on the protection of users and their data.

The full article is here: http://queue.acm.org/detail.cfm?id=2371297

This is the 2nd of 3 articles on the subject that I'm involved with. Part 1 was published last week. Part 3 is kind of a surprise and will be out in less than a month. Watch my blog for the announcement.

Posted by Tom Limoncelli in Technical Management

Earlier today, the RIPE NCC (Réseaux IP Européens Network Coordination Centre) announced it is down to its last "/8" worth of IPv4 addresses. This means that it is no longer possible to obtain new IPv4 addresses in Europe, the former USSR, or the Middle East, ...

http://arstechnica.com/information-technology/2012/09/europe-officially-runs-out-of-ipv4-addresses/

I'll be doing my "Convince your boss to deploy IPv6" talk at the New Jersey LOPSA chapter meeting next month. That's Thursday, Oct 4th, near Princeton, NJ.

Posted by Tom Limoncelli in IPv6

Here's a good strategy to improve the reliability of your systems: Buy the most expensive computers, storage, and network equipment you can find. It is the really high-end stuff that has the best "uptime" and "MTBF".

Wait... why are you laughing? There are a lot of high-end, fault-tolerant, "never fails" systems out there. Those companies must be in business for a reason!

Ok.... if you don't believe that, let me try again.

Here's a good strategy to improve the reliability of your systems: Any time you have an outage, find who caused it and fire that person. Eventually you'll have a company that only employs perfect people.

Wait... you are laughing again! What am I missing here?

Ok, obviously those two strategies won't work, yet system administration is full of examples of both. At the start of "the web" we achieved high uptime by buying Sun E10000 computers costing megabucks because "that's just how you do it" to get high performance and availability. That strategy lasted until the mid-2000s. The "fire anyone that isn't perfect" strategy sounds like something out of an "old school" MBA textbook. There are plenty of companies that seem to follow that rule.

We find those strategies laughable because the problem is not the hardware or the people. Hardware, no matter how much or how little you pay for it, will fail. People, no matter how smart or careful, will always make some mistakes. Not all mistakes can be foreseen. Not all edge cases are cost effective to prevent!

Good companies have outages and learn from them. They write down those "lessons learned" in a post-mortem document that is passed around so that everyone can learn. (I've written about how to do a decent postmortem before.)

If we are going to "learn something" from each outage and we want to learn a lot, we must have more outages.

However (and this is important) you want those outages to be under your control.

If you knew there was going to be an outage in the future, would you want it at 3am Sunday morning or 10am on a Tuesday?

You might say that 3am on Sunday is better because users won't see it. I disagree. I'd rather have it at 10am on Tuesday so I can be there to observe it, fix it, and learn from it.

In school we did this all the time. It is called a "fire drill". We usually did a pretty bad job on the first fire drill of the school year, but the second one went much better. The hope was that if there were a real fire, it would happen after we had gotten good at the drill.

Wouldn't you rather just never have fires? Sure, and when that is possible let me know. Until then, I like fire drills.

Wouldn't you rather have computer systems that never fail? Sure, and when that's possible let me know. Until then I like sysadmin fire drills.

Different companies call them different things. Jesse Robbins, who created them at Amazon, calls them "GameDay" exercises. John Allspaw at Etsy refers to "resilience testing" in his new article in ACM Queue. Google calls them something else.

The longer you go without an outage, the rustier you get. You actually improve your uptime by creating outages periodically so that you don't get rusty. It is better to have a controlled outage than to wait for the next outage to find you out of practice.

Fire drills don't have to be visible to the users. In fact, they shouldn't be. You should be able to fail over a database to the hot spare without user-visible effects.

Systems that are fault tolerant should be periodically tested. Just as you test your backups by doing an occasional full restore (don't you?), you should periodically fail over that database server, web server, RAID system, and so on. Do it in a controlled way: plan it, announce it, make contingency plans, and so on. Afterwards, write up a timeline of what happened, what mistakes were made, and what can be done to improve things next time. For each improvement, file a bug. Assign someone to hound people until all of those bugs are closed. Or, if a bug is "too expensive to fix", have management sign off on that decision. I believe that being unwilling to pay to fix a problem ("allocate resources" in business terms) is the same as saying "I'm willing to take the risk that it won't happen." So make sure they understand what they are agreeing to.
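
As a concrete illustration, here is a minimal sketch of the kind of harness you could run during such a drill. It is plain Python 3 using only the standard library; the health-check URL, polling interval, and drill length are hypothetical placeholders, not anything from a specific tool. It polls a service while you perform the planned failover and records a timestamped timeline you can paste straight into the write-up.

    #!/usr/bin/env python3
    """Fire-drill harness sketch: poll a service during a planned failover
    and record a timestamped timeline for the post-drill write-up.
    The URL, interval, and duration below are hypothetical examples."""

    import time
    import urllib.request

    SERVICE_URL = "http://www.example.com/healthz"  # hypothetical health-check endpoint
    POLL_INTERVAL = 5          # seconds between checks
    DRILL_DURATION = 15 * 60   # total length of the drill, in seconds

    def check_once(url):
        """Return (ok, detail) for a single health check."""
        try:
            with urllib.request.urlopen(url, timeout=4) as resp:
                return resp.status == 200, "HTTP %d" % resp.status
        except Exception as exc:               # any failure counts as "down"
            return False, str(exc)

    def main():
        timeline = []                          # (timestamp, ok, detail) tuples
        end = time.time() + DRILL_DURATION
        while time.time() < end:
            ok, detail = check_once(SERVICE_URL)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            timeline.append((stamp, ok, detail))
            print("%s  %s  %s" % (stamp, "UP  " if ok else "DOWN", detail))
            time.sleep(POLL_INTERVAL)

        # Summarize user-visible impact for the post-mortem write-up.
        failures = [entry for entry in timeline if not entry[1]]
        print("\n%d of %d checks failed during the drill." % (len(failures), len(timeline)))

    if __name__ == "__main__":
        main()

If the failover really is invisible to users, the summary should show zero failed checks; every DOWN line in the timeline is a bug to file.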

Most importantly: have the right attitude. Nobody should be afraid to be mentioned in the "lessons learned" document. Instead, people should be rewarded, publicly, for finding problems and taking responsibility for fixing them. Give a reward, even a small one, to the person who fixes the most bugs filed after a fire drill. Even if the award is a dorky certificate to hang on their wall, a batch of cookies, or getting to pick which restaurant the team goes to for the next dinner, it will mean a lot. Receiving the award should be something that can be listed on the person's next performance review.

The best kind of fire drill tests cross-team communication. If you can involve two or three teams in planning the drill, you have the potential to learn a lot more. Does everyone involved know how to contact each other? Is the conference bridge big enough for everyone? If the managers of all three teams have to pretend to be unavailable during the outage, can the teams still complete the drill?

My last bit of advice is that fire drills need management approval. The entire management chain needs to be aware of what is happening and understand the business purpose of doing all this.

John's article has a lot of great advice about explaining this to management, what push-back you might expect, and so on. His article, Fault Injection in Production, is so well written that even your boss will understand it. (Ha ha, a little boss humor there.)

[By the way... ACM Queue is really getting 'hip' lately by covering these kinds of "DevOps" topics. I highly recommend visiting queue.acm.org periodically.]

AT&T Survey

I got a survey from AT&T Wireless that asked a lot of questions comparing my experiences with WiFi and 3G on my AT&T mobile phone.

If I were to reverse-engineer what they were getting at, either (a) they want to figure out why I dislike WiFi so they can fix those problems and encourage people to move traffic off their over-stressed 3G network, or (b) they need data to back up their coming campaign to bad-mouth WiFi and tell everyone to pay for their over-priced 3G.

Based on the tone of the questions, I really think it is "b".

I'm so glad I no longer work in the telecom world.

http://www.mactech.com/conference/sessions

I'll be speaking on Thursday. Don't miss this great conference, October 17-19, 2012 in Los Angeles.

Posted by Tom Limoncelli in Conferences

The coursework would be very focused on understanding the internals of each layer of the stack. To make a comparison to the auto industry: your training wouldn't result in you being a mechanic who can follow the manufacturer's manual; you would be the person who can write the manual, because that's how well you would understand how the car works.

But the real change I'd like to see is how the labs are done.

  • When you enter the school they give you 12 virtual machines on their VMware cluster (or Ganeti cluster).

  • In Phase 1 you go through a progression that ends with turning those machines into 2 load balancers, 3 web servers, a replicated database, a monitoring host, etc. (This is done as a series of labs that start with setting up one web server and build up to this final configuration.)

  • At that point the CS department turns on a traffic generator and your system now gets a steady stream of traffic. There is a leader-board showing who has the best uptime.

  • In Phase 2 you set up dev and QA clones of what you did in Phase 1, but you do it with Puppet or cfengine (see the sketch after this list for the style these tools use). Eventually those tools have to manage your live system too, and you have to make that change while the system is in use.

  • Once you have a dev -> qa -> live system, your uptime stats become 20% of your grade.

  • Another element I'd like is a point at which everyone has to run someone else's system using only the operational documentation that its creator left behind.

  • There might be another point at which the best student's cluster is cloned to create a web hosting system that provides real service to the community. Students would run it cooperatively, maintaining everything from the software to the operational docs.
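
To make the Phase 2 bullet concrete, here is a toy sketch of the idempotent, declarative style that tools like Puppet and cfengine are built around: you describe the state a machine should be in, and the tool converges the machine to that state no matter how many times it is run. This is plain Python, not either tool's actual language, and the file paths and contents are hypothetical examples.

    #!/usr/bin/env python3
    """Toy illustration of the idempotent, declarative style used by
    configuration-management tools such as Puppet and cfengine.
    The resources below (paths and contents) are hypothetical examples."""

    import os

    # Desired state: each entry says what SHOULD be true, not what steps to run.
    DESIRED_FILES = {
        "/tmp/demo-motd": "Welcome to the lab cluster.\n",
        "/tmp/demo-ntp.conf": "server time.example.com iburst\n",
    }

    def ensure_file(path, contents):
        """Converge one file to its desired contents. Returns True if a change was made."""
        if os.path.exists(path):
            with open(path) as f:
                if f.read() == contents:
                    return False              # already correct: do nothing
        with open(path, "w") as f:            # create or repair the file
            f.write(contents)
        return True

    def converge():
        changed = 0
        for path, contents in DESIRED_FILES.items():
            if ensure_file(path, contents):
                print("fixed: %s" % path)
                changed += 1
        print("converged; %d change(s) made" % changed)

    if __name__ == "__main__":
        converge()   # safe to run repeatedly: a second run reports zero changes

Running the same description against the dev, QA, and live clusters is what makes the hand-off between phases practical: the description, not the individual machines, becomes the thing you maintain.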

By the time you got your degree, you'd not only know the technical side of system administration, you'd also have the practical experience that would make you extremely valuable in the market.

Update: To be clear, there should be gobs and gobs of theory. I would want the above to be the labs that match the theory. For example, operating-system theory matched with the Linux kernel as an example; the theory of autonomic computing with cfengine/Puppet as an example; and so on.

Posted by Tom Limoncelli in Career Advice

A co-worker watched me type the other day and noticed that I use certain Unix commands for purposes other than what they were intended for. Yes, I abuse Unix commands.

I'm proud to announce that TM4SA has been selected to be featured on this year's O'Reilly Back-to-School Special.

The special runs this week only, from Sept 4th to the 11th. Save up to 50% on books, videos and courses.

To receive the discount start shopping using this link http://oreil.ly/SUPaaT or use discount code "B2S2".

Happy savings to all students and non-students alike!

Posted by Tom Limoncelli in Book News
