[I emailed these comments to NIST last week. I've never read NIST standards documents before, so my response may be entirely naive, but since it is my tax dollars at work, I thought I'd put in my two cents.]
Subject: Draft SP 800-160 Comments
I read with great interest the DRAFT Systems Security Engineering: An Integrated Approach to Building Trustworthy Resilient Systems http://csrc.nist.gov/publications/PubsDrafts.html#800-160
I'd like to comment on two sections, "2.3.4 Security Risk Management" and "Chapter 3: Lifecycle".
2.3.4 Security Risk Management
This discusses ways to deal with risk: Avoid, Accept, Mitigate, Transfer. This is a very traditional view of risk. It would be more foreward thinking for the document to explain that avoiding risk increases risk over time when it causes people to get out of practice on how to handle the situation when it does arise. This means avoiding a risk adds risk!
For example consider the typical situation where, sadly, an upgrade process causes an outage. There are two paths one can take: (1) Avoidance: In the future avoid any any and all upgrades. (2) Increase: Do the process weekly until the process can be done seamlessly and all staff are fully trained at doing it.
The military understands this. They conduct drills constantly to maintain high skill levels and to flush out processes
I realize that 2.3.4 is describing a different kind of risk, but I'm sure there is some kind of analog can be found. For example, there are certain risks that we accept. If we accept it in one situation, we get out of practice on how to handle when that risk fails. However if we take that same kind of risk in many places, it becomes more apparent how to handle it better, mass-produce the situation, and that leads to dealing with it better. For example I used to keep private keys unencrypted on web servers because this was an acceptable risk. Eventually I had done that so many times that it became worthwhile to establish a key-store that would let me mass produce the process of distributing private keys. I can now change private keys globally very quickly. The system has better logging and such, which lets me track key use better and make smarter decisions. For example, I now am less likely to let a key be used past its expiration date. If I had avoided the risk, I would never had lead to a better way to manage the risk.
Chapter 3: Lifecycle
This section is very complete on the topics it covers but information is lacking with respect to the most important part of a system's lifecycle: upgrades.
The Heartbleed event was a stark reminder that one of the most important parts of security is the ability to confidently upgrade a system rapidly and frequently. When Heartbleed was announced, most people were faced with the following fears: Fear that an upgrade would break something else. Fear that it wasn't well-understood how to upgrade that thing. Fear that the vendor didn't have an upgrade for that thing. Many of my coworkers were told, "Gosh, the last time we upgraded (that system) it didn't go well so we've been avoiding it ever since!"
If an enemy really wanted to destroy the security of the systems that NIST wants to protect, all he has to do is convince everyone to stop upgrading the software. The system will eventually crumble without requiring an actual attack!
This chapter is about "CREATING CREATING CREATING upgrading AND DISPOSAL" of the system. It should be about "creating UPGRADING UPGRADING UPGRADING and disposal" of the system.
Software is not a bicycle. A bicycle is purchased once and all future maintenance is done to retain its initial state. Software is ever changing. It is installed once and forever upgraded.
My concern is that Draft SP 800-160 treats technology systems like bicycles, not like software. This document must discourage this attitude. More and more all systems are fundamentally software, even if they externally appear to be hardware. It has been said that the Boeing 787 Dreamliner is a software product that happens to have wings. I recently inventoried my house and discovered that the majority of the "hardware" in my house is more software than hardware! I recently had to install a firmware upgrade for my PC's mouse!
To be specific, upgrades should be rapid (fast to happen once they've begun), frequent (happen periodically), and prompt (low lead-time between when a vulnerability is published and when the upgrade can start). All three of those attributes are important.
Upgrading a system doesn't happen by accident. It requires planning from the start. Upgradability must be designed in. Each of the 11 phases documented in Chapter 3 should encourage making future upgrades seamless.
- Stakeholder Requirements Definition: Should include "non-functional requirements" (http://en.wikipedia.org/wiki/Non-functional_requirement) including the ability to do upgrades rapidly, frequently, and promptly.
- Requirements Analysis: Should include measuring the rapid/frequent/promptness of upgrades.
- Architectural Design: Some designs are easier to upgrade than others. For example SOA architectures are easier to upgrade than monolithic systems. Firmware is easier to upgrade than ROMs.
This chapter should also include a discussion of the "smaller batches" principle. If we do upgrades once a year, the changes in that upgrade is a long, long list: A large batch. If something breaks, we do not know which change caused the problem. If we do upgrades frequently, the "smaller batches" of change means the source of problems can be identified easier. Ideally an upgrade happens after each individual change, thus making it possible to pinpoint the problem immediately. While this frequency may sound unrealistic, many systems are now designed that way. For example.com Etsy has documented their success with this system (and other companies will soon be publishing similar reports).
Problems related to upgrades are a risk that is mitigated by NOT avoiding, but by doing it more frequently. The smaller batches principle demonstrates that. When people do upgrades more frequently they develop skills around it, see more opportunities to optimize the process, and generally automate the process. If we are more confident in our ability to do upgrades, we are less likely to live with older, broken, software. Lastly it reduces what I consider to be the single biggest security hole in any system: lead time before a fix can be installed. When a vendor publishes a security-related update, delaying its deployment widens the window of vulnerability.
Thank you for considering my feedback.