Comments on NIST Draft SP 800-160

[I emailed these comments to NIST last week. I've never read NIST standards documents before, so my response may be entirely naive, but since it is my tax dollars at work, I thought I'd put in my two cents.]

Subject: Draft SP 800-160 Comments

I read with great interest the DRAFT Systems Security Engineering: An Integrated Approach to Building Trustworthy Resilient Systems http://csrc.nist.gov/publications/PubsDrafts.html#800-160

I'd like to comment on two sections, "2.3.4 Security Risk Management" and "Chapter 3: Lifecycle".

2.3.4 Security Risk Management

This discusses ways to deal with risk: Avoid, Accept, Mitigate, Transfer. This is a very traditional view of risk. It would be more foreward thinking for the document to explain that avoiding risk increases risk over time when it causes people to get out of practice on how to handle the situation when it does arise. This means avoiding a risk adds risk!

For example consider the typical situation where, sadly, an upgrade process causes an outage. There are two paths one can take: (1) Avoidance: In the future avoid any any and all upgrades. (2) Increase: Do the process weekly until the process can be done seamlessly and all staff are fully trained at doing it.

The military understands this. They conduct drills constantly to maintain high skill levels and to flush out processes

I realize that 2.3.4 is describing a different kind of risk, but I'm sure there is some kind of analog can be found. For example, there are certain risks that we accept. If we accept it in one situation, we get out of practice on how to handle when that risk fails. However if we take that same kind of risk in many places, it becomes more apparent how to handle it better, mass-produce the situation, and that leads to dealing with it better. For example I used to keep private keys unencrypted on web servers because this was an acceptable risk. Eventually I had done that so many times that it became worthwhile to establish a key-store that would let me mass produce the process of distributing private keys. I can now change private keys globally very quickly. The system has better logging and such, which lets me track key use better and make smarter decisions. For example, I now am less likely to let a key be used past its expiration date. If I had avoided the risk, I would never had lead to a better way to manage the risk.

Chapter 3: Lifecycle

This section is very complete on the topics it covers but information is lacking with respect to the most important part of a system's lifecycle: upgrades.

The Heartbleed event was a stark reminder that one of the most important parts of security is the ability to confidently upgrade a system rapidly and frequently. When Heartbleed was announced, most people were faced with the following fears: Fear that an upgrade would break something else. Fear that it wasn't well-understood how to upgrade that thing. Fear that the vendor didn't have an upgrade for that thing. Many of my coworkers were told, "Gosh, the last time we upgraded (that system) it didn't go well so we've been avoiding it ever since!"

If an enemy really wanted to destroy the security of the systems that NIST wants to protect, all he has to do is convince everyone to stop upgrading the software. The system will eventually crumble without requiring an actual attack!

This chapter is about "CREATING CREATING CREATING upgrading AND DISPOSAL" of the system. It should be about "creating UPGRADING UPGRADING UPGRADING and disposal" of the system.

Software is not a bicycle. A bicycle is purchased once and all future maintenance is done to retain its initial state. Software is ever changing. It is installed once and forever upgraded.

My concern is that Draft SP 800-160 treats technology systems like bicycles, not like software. This document must discourage this attitude. More and more all systems are fundamentally software, even if they externally appear to be hardware. It has been said that the Boeing 787 Dreamliner is a software product that happens to have wings. I recently inventoried my house and discovered that the majority of the "hardware" in my house is more software than hardware! I recently had to install a firmware upgrade for my PC's mouse!

To be specific, upgrades should be rapid (fast to happen once they've begun), frequent (happen periodically), and prompt (low lead-time between when a vulnerability is published and when the upgrade can start). All three of those attributes are important.

Upgrading a system doesn't happen by accident. It requires planning from the start. Upgradability must be designed in. Each of the 11 phases documented in Chapter 3 should encourage making future upgrades seamless.

For example:

  1. Stakeholder Requirements Definition: Should include "non-functional requirements" (http://en.wikipedia.org/wiki/Non-functional_requirement) including the ability to do upgrades rapidly, frequently, and promptly.
  2. Requirements Analysis: Should include measuring the rapid/frequent/promptness of upgrades.
  3. Architectural Design: Some designs are easier to upgrade than others. For example SOA architectures are easier to upgrade than monolithic systems. Firmware is easier to upgrade than ROMs. ... etc. ...

This chapter should also include a discussion of the "smaller batches" principle. If we do upgrades once a year, the changes in that upgrade is a long, long list: A large batch. If something breaks, we do not know which change caused the problem. If we do upgrades frequently, the "smaller batches" of change means the source of problems can be identified easier. Ideally an upgrade happens after each individual change, thus making it possible to pinpoint the problem immediately. While this frequency may sound unrealistic, many systems are now designed that way. For example.com Etsy has documented their success with this system (and other companies will soon be publishing similar reports).

Problems related to upgrades are a risk that is mitigated by NOT avoiding, but by doing it more frequently. The smaller batches principle demonstrates that. When people do upgrades more frequently they develop skills around it, see more opportunities to optimize the process, and generally automate the process. If we are more confident in our ability to do upgrades, we are less likely to live with older, broken, software. Lastly it reduces what I consider to be the single biggest security hole in any system: lead time before a fix can be installed. When a vendor publishes a security-related update, delaying its deployment widens the window of vulnerability.

Thank you for considering my feedback.

Posted by Tom Limoncelli in Industry

No TrackBacks

TrackBack URL: http://everythingsysadmin.com/cgi-bin/mt-tb.cgi/1763

3 Comments | Leave a comment

I agree with your thoughtful feedback Tom. I've recently had to go through an Avoid, Accept, Mitigate or Transfer risk assessment with our security team. The approach didn't sit right with me and you're points about avoidance and acceptance increasing risk over time put things in perspective. The three upgrade requirements you highlight, rapidly, frequently, and promptly, also ring true based on my experience. I've encountered, and still do, groups who will not upgrade critical systems more than twice a year out of fear of breaking the systems. Productive conversations about improving their processes and systems are difficult and hard to come by.

In short, good for you for providing quality feedback, Tom. Hopefully NIST gives it due consideration.

I'm late to this, but I think you misunderstand risk avoidance.

Done properly, risk avoidance entails removing the risk entirely, not sticking your head in the sand. In your example about patching, avoidance would not mean to stop patching, but more likely to eliminate the need to patch.

Let's say java patching caused the problem. If java was not necessary for the system to perform its mission, then flash could be removed, thereby avoiding the risk of flash patching causing an outage.

Avoiding patching is more of a risk acceptance strategy, essentially stating that one is willing to accept the risk posed by a specific lack of patching rather than risk outage time.

Mitigating problems caused by patching can be done several ways, such as testing patches before deployment, or havnig a rollback procedure in place.

Ugh, replace java and flash as needed for the post to make sense.

Guess that's what happens when I make blog comments while on a conference call.

Leave a comment