May 2014 Archives

Good Reads, May 2014

A summary of the interesting articles I've found this month.

What is Site Reliability Engineering? An interview with Ben Treynor (Google VP, Site Reliability Engineering) -- SRE isn't just a new name for system administration, it is an entirely new business philosophy.

Distributed Systems and the End of the API -- APIs are like assembly language. Nobody programs in assembly language any more. So what's the high-level equivalent?

Big Cable says broadband investment is flourishing, but their own data says it's falling -- Remember folks, these are the companies that keep telling the media that people don't want gigabit broadband.

The Unreasonable Effectiveness of Checklists -- Checklists are awesome... and save lives.

ILLIAC I Programming Manual (1956) [pdf] -- How did those 5-ton room-sized computers work? Read the manual.

Non-technical read of the month:

Rat Park (comic) -- In comic form, Stuart McMillen explains a drug experiment that should have stopped The War On Drugs.

I have a new book coming out!

It is called "The Practice of Cloud System Administration" and it will be out in November 2014. If you want to be the first to get news about it, click here and tell me your email address. I promise I barely have time to send one email a month.

This month I learned:

  • The reason I couldn't find a bash script that did a git rebase only if it wouldn't cause a mess is that you don't need a script, just the --ff-only flag. You can set an alias so that git p does it:

    git config --global alias.p "pull --rebase --ff-only"

  • I got an acceptance letter for one of my proposals at Velocity NYC.

  • Jim Steinman, who wrote Meat Loaf's Bat Out of Hell, also produced one of my favorite songs, The Sisters of Mercy's "This Corrosion".

Posted by Tom Limoncelli in Good Reads

[I emailed these comments to NIST last week. I've never read NIST standards documents before, so my response may be entirely naive, but since it is my tax dollars at work, I thought I'd put in my two cents.]

Subject: Draft SP 800-160 Comments

I read with great interest the DRAFT Systems Security Engineering: An Integrated Approach to Building Trustworthy Resilient Systems http://csrc.nist.gov/publications/PubsDrafts.html#800-160

I'd like to comment on two sections, "2.3.4 Security Risk Management" and "Chapter 3: Lifecycle".

2.3.4 Security Risk Management

This discusses ways to deal with risk: Avoid, Accept, Mitigate, Transfer. This is a very traditional view of risk. It would be more forward-thinking for the document to explain that avoiding a risk can increase risk over time, because people get out of practice at handling the situation when it does arise. This means avoiding a risk adds risk!

For example, consider the typical situation where, sadly, an upgrade process causes an outage. There are two paths one can take: (1) Avoidance: In the future, avoid any and all upgrades. (2) Increase: Do the process weekly until it can be done seamlessly and all staff are fully trained at doing it.

The military understands this. They conduct drills constantly to maintain high skill levels and to flush out problems in their processes.

I realize that 2.3.4 is describing a different kind of risk, but I'm sure some kind of analog can be found. For example, there are certain risks that we accept. If we accept a risk in only one situation, we get out of practice at handling it when that risk materializes. However, if we take that same kind of risk in many places, it becomes more apparent how to handle it, we can mass-produce the situation, and that leads to dealing with it better. For example, I used to keep private keys unencrypted on web servers because this was an acceptable risk. Eventually I had done that so many times that it became worthwhile to establish a key-store that would let me mass-produce the process of distributing private keys. I can now change private keys globally very quickly. The system has better logging and such, which lets me track key use better and make smarter decisions. For example, I am now less likely to let a key be used past its expiration date. If I had avoided the risk, I would never have been led to a better way to manage it.

Chapter 3: Lifecycle

This section is very complete on the topics it covers but information is lacking with respect to the most important part of a system's lifecycle: upgrades.

The Heartbleed event was a stark reminder that one of the most important parts of security is the ability to confidently upgrade a system rapidly and frequently. When Heartbleed was announced, most people were faced with the following fears: Fear that an upgrade would break something else. Fear that it wasn't well-understood how to upgrade that thing. Fear that the vendor didn't have an upgrade for that thing. Many of my coworkers were told, "Gosh, the last time we upgraded (that system) it didn't go well so we've been avoiding it ever since!"

If an enemy really wanted to destroy the security of the systems that NIST wants to protect, all he has to do is convince everyone to stop upgrading the software. The system will eventually crumble without requiring an actual attack!

This chapter is about "CREATING CREATING CREATING upgrading AND DISPOSAL" of the system. It should be about "creating UPGRADING UPGRADING UPGRADING and disposal" of the system.

Software is not a bicycle. A bicycle is purchased once and all future maintenance is done to retain its initial state. Software is ever changing. It is installed once and forever upgraded.

My concern is that Draft SP 800-160 treats technology systems like bicycles, not like software. This document must discourage this attitude. More and more, all systems are fundamentally software, even if they externally appear to be hardware. It has been said that the Boeing 787 Dreamliner is a software product that happens to have wings. I recently inventoried my house and discovered that the majority of the "hardware" in it is more software than hardware! I recently had to install a firmware upgrade for my PC's mouse!

To be specific, upgrades should be rapid (fast to happen once they've begun), frequent (happen periodically), and prompt (low lead-time between when a vulnerability is published and when the upgrade can start). All three of those attributes are important.
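The three attributes above can each be measured from upgrade records. Here is a minimal sketch, assuming a hypothetical `Upgrade` record with three timestamps; the class, field, and function names are illustrative and do not come from the NIST draft.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Upgrade:
    """Hypothetical record of one upgrade (all field names are assumptions)."""
    fix_published: datetime  # when the vendor published the fix
    started: datetime        # when we began rolling it out
    finished: datetime       # when the rollout completed


def promptness(u: Upgrade) -> timedelta:
    """Lead time between publication of the fix and the start of the upgrade."""
    return u.started - u.fix_published


def rapidity(u: Upgrade) -> timedelta:
    """How long the upgrade took once it had begun."""
    return u.finished - u.started


def mean_interval(upgrades: list[Upgrade]) -> timedelta:
    """Average gap between consecutive upgrade starts (a measure of frequency)."""
    if len(upgrades) < 2:
        raise ValueError("need at least two upgrades to measure an interval")
    starts = sorted(u.started for u in upgrades)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)
```

Smaller values are better for all three. Tracking them separately shows whether the delay is in getting started, in doing the work, or in how rarely the work is done at all.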

Upgrading a system doesn't happen by accident. It requires planning from the start. Upgradability must be designed in. Each of the 11 phases documented in Chapter 3 should encourage making future upgrades seamless.

For example:

  1. Stakeholder Requirements Definition: Should include "non-functional requirements" (http://en.wikipedia.org/wiki/Non-functional_requirement) including the ability to do upgrades rapidly, frequently, and promptly.
  2. Requirements Analysis: Should include measuring how rapid, frequent, and prompt upgrades are.
  3. Architectural Design: Some designs are easier to upgrade than others. For example, service-oriented architectures (SOA) are easier to upgrade than monolithic systems. Firmware is easier to upgrade than ROMs. ... etc. ...

This chapter should also include a discussion of the "smaller batches" principle. If we do upgrades once a year, the changes in that upgrade are a long, long list: a large batch. If something breaks, we do not know which change caused the problem. If we do upgrades frequently, the "smaller batches" of change mean the source of problems can be identified more easily. Ideally an upgrade happens after each individual change, thus making it possible to pinpoint the problem immediately. While this frequency may sound unrealistic, many systems are now designed that way. For example, Etsy has documented their success with this system (and other companies will soon be publishing similar reports).
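To make the batch-size argument concrete, here is a rough sketch of the search needed after a broken upgrade. It assumes a single change introduced the failure, that the failure persists once that change is applied, and that a hypothetical `is_broken` callback can test the system with any prefix of the batch applied. None of these names come from the NIST draft or from Etsy.

```python
from typing import Callable, Sequence


def find_culprit(
    changes: Sequence[str],
    is_broken: Callable[[Sequence[str]], bool],
) -> tuple[str, int]:
    """Bisect an ordered batch of changes to find the one that broke things.

    Returns (culprit, number_of_test_runs). Assumes the system works with no
    changes applied and is broken with the whole batch applied.
    """
    if not changes:
        raise ValueError("an empty batch cannot contain a culprit")
    lo, hi = 0, len(changes)  # invariant: changes[:lo] works, changes[:hi] is broken
    runs = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        runs += 1
        if is_broken(changes[:mid]):
            hi = mid  # the culprit is among the first `mid` changes
        else:
            lo = mid  # the culprit comes later in the batch
    return changes[hi - 1], runs
```

With a batch of one change the loop never runs: the culprit is known with zero extra test runs. A yearly batch of a few hundred changes needs roughly log2(n) test runs even in this idealized case, and each run may itself be a slow deployment to a test environment.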

Problems related to upgrades are a risk that is mitigated NOT by avoiding upgrades, but by doing them more frequently. The smaller batches principle demonstrates that. When people do upgrades more frequently they develop skills around them, see more opportunities to optimize the process, and generally automate the process. If we are more confident in our ability to do upgrades, we are less likely to live with older, broken software. Lastly, it reduces what I consider to be the single biggest security hole in any system: lead time before a fix can be installed. When a vendor publishes a security-related update, delaying its deployment widens the window of vulnerability.

Thank you for considering my feedback.

Posted by Tom Limoncelli in Industry

I was only 7 months old when Neil Armstrong became the very first man to walk on the moon. I don't remember it very well.

Today I was reminded that most of what we see of the moon landings are highlights. 10-second little clips. I would like to know what the entire 8 days were like. I'm sure there are audio and video recordings of the entire thing. All of NASA's recordings are public domain, so they must be available somewhere.

Here's my thought for a product. A kit that includes audio and video recordings and other stuff to help you re-live the entire 8 day experience. An audio recording that we would listen to in real time, along with TV inserts of broadcasts as they happened. Plus 1960s recipes and other stuff so a group of people could simulate the entire thing. A group of people could go on "a vacation to 1969" and spend a week living like it was July 1969.

Yes, 8+ days is a very long time but imagine if:

  1. It was done near some other vacation place and they arrange it so that at key times you are near a TV to watch the news. Some days would be more "sit at the TV watching the action" and other days would be unrelated activities but everyone would watch the nightly news together at 6pm to see what Walter Cronkite was telling everyone.

or

  2. They make a simulator so that you are Neil Armstrong, or at least the Flight Director, going through the motions for all 8 days.

or

  3. YouTube could livestream all the audio/video for 9 days straight and everyone could just tune in. All over the world people would "play along", making it a shared experience everyone could enjoy. (It would be like The Yule Log, only a week+ long event that we do every July).

I haven't put a lot of thought into this. There are many logistical challenges. Plus, it could be extremely expensive to do it right. That's why I think a kit that lets people do it themselves during the summer would make more sense.

Anyway... I want to put it out there in case anyone has comments or thoughts about how to make it happen.

Tom

I'm the guest on this week's "Ops All The Things!" podcast. We talk about time management and all sorts of things. Check it out!

http://www.opsallthethings.com/podcast/006-time-management

Posted by Tom Limoncelli in Time Management

Hi! I'd like to buy an IP-KVM switch, please.

"Sure! We got plenty."

Now wait... I have some very specific requirements.

"Shoot."

First, I want it to connect via some kind of pod or something that I can only buy from you. If there is any interoperability between vendors, I'm going to be very upset. I want full vendor lock-in.

"No worries, sir. We have a variety of pods, all highly proprietary. I assure you they won't work with any other vendor. Heck, some of them don't even work with our own products! In fact, if you are switching from another brand we send you a box of bandaids since we know you'll need them after changing all those cables."

How thoughtful! Next issue... I want you to stop making firmware updates in about 6 months. 7 at the most. I don't care if the next Heartbleed only affects KVM switches and permits hackers to get in and set my machine room on fire. No. Firmware. Updates.

"But sir! What if..."

Did you hear me??? No firmware updates! These things connect to my servers at "the bios level"... whatever the f--- marketing people mean by that. As you know every security-related feature and service on a Windows or Linux box has the caveat that "all bets are off" if someone has physical access to the machine. These IP-KVM switches basically give remote people physical access. I don't want any risks! I want to be 100% sure about whether or not people will be able to break into my machines!

"Ok, sir, I'll make sure we stop making firmware updates shortly after you receive the product."

Good. Ok, now one more thing. You tell me that there's no client software on my end because it uses Java. I want to make sure that we're perfectly clear about this. There are many versions of Java. I want to make sure that your system requires me to use a version of Java that is incompatible with the Java that is installed on my machine.

"Sir, I hate to brag but I think we've really out-done ourselves in that department. First, we require a version of Java that is so old, James Gosling himself would be shocked."

Tell me more....

"Next, we give you a choice: If you install the latest version of Java, our code is rejected because we don't include the new security profiles stuff that is required. If you downgrade to an older version, your machine basically stops functioning."

oh yes! I like it! I like it! What else do you have?

"Our Java support on the Mac is so bad, Oracle has basically done our job for us. No changes needed on our part."

Wow! You really thought this all through!

"Well, sir, I hate to brag but we have one more feature that I think is the cherry on top. We only support Java on web browsers that you don't use. Chrome? Never heard of it!"

Good show! IE6 forever! Thank you!

"We're happy to serve, sir."

Great! Now would you sucker-punch me and leave me bleeding?

"That's all taken care of by our billing department."

Posted by Tom Limoncelli in Rants

My 5-year prediction

I don't make many predictions. However I think two technologies are going to be huge within the next five years.

  • DACs: I'm not saying Bitcoin will be big (though it could be), I'm saying that the underlying technology is revolutionary and may become one of the basic data management systems we use in places where today we need a neutral third party. That would be things like: DNS registrations, the stock market, and so on. More info here.
  • CRDTs/CALM: I've been talking about these since 2009, but Chas Emerick's new article makes me confident they're ripe to become very popular very soon. The article is heavy on theory. If you want to see it in action, do the Firebase tutorial.

I hope to write more about these in the coming months. For now I just want to put it out there.

Tom

Posted by Tom Limoncelli in Rants

You've probably seen experiments where a mouse gets cheese as a reward for pulling a lever. If he or she receives the cheese right away, the brain associates work (pulling the lever) with reward (the cheese) and it motivates the mouse. They want to do more work. It improves job satisfaction.

If the mouse receives the cheese a month later, the brain won't associate the work with the reward. A year later? Fuggedaboutit!

Now imagine you are a software developer, operations engineer, or system administrator working on a software project. The software is released every 6 months. The hard work you do gets a reward every 6 months. Your brain isn't going to associate the two.

Now imagine monthly or weekly releases. The interval between work and reward is shorter. The association is stronger. Motivation and job satisfaction go up.

Now imagine using a continuous build/test system. You see the results of your work in the form of "test: pass" or "test: fail". Instant gratification.

Now imagine using a continuous deploy system. Every change triggers a battery of tests which, if they pass, result in the software being launched into production. The interval is reduced to hours, possibly minutes.

I think that's pretty cool.
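The continuous deploy loop described above boils down to a gate: run every test against the change and launch only if all of them pass. Here is a minimal sketch, assuming tests are plain functions that return True on success. The name `deploy_if_green` and the callback shapes are made up for illustration and are not taken from any particular CI product.

```python
from typing import Callable, Iterable


def deploy_if_green(
    change: str,
    tests: Iterable[Callable[[str], bool]],
    deploy: Callable[[str], None],
) -> list[str]:
    """Run every test against a change; deploy only if all of them pass.

    Returns the names of the failing tests (an empty list means it shipped).
    """
    failures = [test.__name__ for test in tests if not test(change)]
    if not failures:
        deploy(change)  # the reward arrives minutes after the work
    return failures
```

Either way the developer gets feedback on this one change right away: a list of failing tests, or the change running in production.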

Posted by Tom Limoncelli in DevOps

You've probably seen this report:

HealthCare.Gov Looks Like A Bargain Compared With State Exchanges.

The Federal Healthcare Exchange was able to do the job much cheaper than the state-run exchanges. Ironically the states that benefitted the most were those that refused to participate and therefore were served by the Federal exchange.

Personally I think that the insurance companies that got 8.1 million signups should be billed for the cost of those web sites. The bill should include a note saying, "Covered costs: $0. Your responsibility: $X billion." Hilarious, right? (I know, I know... don't quit your day job.)

But we, as sysadmins, know the cost-saving power of centralized IT. Build a system once and use it for as many users as possible. It just plain makes sense.

Now that people are seeing proof that economies of scale save money in healthcare, imagine other ways we could reduce the cost of medical care in the U.S. without affecting the quality. Who would have predicted that? Oh yeah... Anyone paying attention!

Posted by Tom Limoncelli in Rants

Slides from LOPSA-East

I've uploaded my slides from "Top 5 Time Management Tips for SysAdmins" to SlideShare. They apply to developers too.

Enjoy.

Posted by Tom Limoncelli in LOPSA-East

I'll be teaching tutorials. I'm also on the organizing committee. More info soon. Visit the conference site for details: http://lopsa-east.org

Posted by Tom Limoncelli in Appearances