November 2013 Archives

The proposal deadline for LOPSA's Cascadia IT conference has been extended to 9 DEC.

http://casitconf.org/casitconf14/call-for-proposals/

Posted by Tom Limoncelli

99 percent of all monitoring that I see being done is done wrong.

Most people think of monitoring like this:

  • Step 1: Something goes down.
  • Step 2: I am alerted.
  • Step 3: I fix the problem as fast as I can.
  • Step 4: I get a pat on the back if I was able to fix it "really fast". (i.e. faster than the RTO)

If that's how you think of monitoring, then you are ALWAYS going to have down time. You've got down time "baked into" your process!

Here's how we should think about monitoring:

  • Step 1: I get an alert that something is "a bit off"; something that needs to be fixed or else there will be an outage.
  • Step 2: I fix the thing faster than the "or else".
  • Step 3: Done. Boom. That's it. There is no Step 3. #dropthemic
  • Step 4: We haven't had a user-visible outage in so long that I can't remember what RTO stands for.

The difference here is that in the second scenario there was no down time "baked into" the process. Down time was the exception, not the rule. We have moved from being reactive to proactive.

How do you get there?

  • Step 1: Delete all alerts in your monitoring system.
  • Step 2: Each time there is a user-visible outage, determine what indicators would have predicted that outage.
  • Step 3: Update your monitoring system to collect those indicators and alert as needed.
  • Step 4: Repeat. Eventually you will only have alerts that prevent outages, not respond to outages.

Obviously Step 1 is not tenable. Ok, Ok. You might declare all existing alerts to be legacy; all new alerts can have names that are prefixed "minorityreport" or something. All the legacy alerts should be eliminated over time.

You also need to start designing systems so that they are survivable. A component failure should not equal a user-visible outage. Stop designing networks with single uplinks: have 2 links and alert when the first one goes down. Stop setting up non-redundant storage: use RAID 1 or higher and alert when one disk has failed. Stop using ICMP pings to determine if a web server is down: alert when page-load times are unusually high or when the SQL database it depends on is getting slow; monitor RAM usage or disk space or 404s or just about anything else. Put all web servers behind a load balancer and alert when the number of replicas is dangerously low.

See the pattern?
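Here is a minimal sketch of that pattern in Python. It is an illustration, not the API of any real monitoring product: the `RedundancyCheck` class, its field names, and the `"OK"`/`"ALERT"`/`"OUTAGE"` labels are all made up for this example. The idea is that the alert fires when the spare capacity is gone, while users are still being served.

```python
from dataclasses import dataclass


@dataclass
class RedundancyCheck:
    """Hypothetical description of one redundant component."""
    name: str
    healthy: int  # units working right now (links, disks, replicas)
    needed: int   # minimum units required to keep serving users


def evaluate(check: RedundancyCheck) -> str:
    """Classify a component by how much spare capacity it has left."""
    spare = check.healthy - check.needed
    if spare < 0:
        return "OUTAGE"  # users are already affected; the alert came too late
    if spare == 0:
        return "ALERT"   # still serving, but the next failure is an outage
    return "OK"


if __name__ == "__main__":
    checks = [
        RedundancyCheck("uplinks", healthy=1, needed=1),       # one of two links down
        RedundancyCheck("raid1-disks", healthy=2, needed=1),   # both mirrors fine
        RedundancyCheck("web-replicas", healthy=2, needed=3),  # too late
    ]
    for check in checks:
        print(check.name, evaluate(check))
```

The interesting state is the middle one: the service is up, nobody has noticed anything, and you have until the next failure to fix it.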

You might be saying to yourself, "But Tom! I can't predict every precursor to an outage!" No, you can't, and that's a bug. That is a bug with the entire freakin' multi-billion dollar software industry. Any time you have a user-visible outage and you can't think of a way to detect it ahead of time: file a bug.

If it is an internally-written service, file a feature request to expose the right indicators so your monitoring system can see them.

If it is a commercial product, demand that they get involved with the post-mortem process, understand what led to the outage, and update their product to expose indicators that let you do your job the right way. If they refuse, send them a link to this blog post... not that it will change their mind but I can use the hits.

That should be "the new normal" for operations. Anything else is literally encouraging failure.

P.S. Oh ok... sigh... yes, you can still ping your web server if that makes you feel better. Better yet, set up Pingdom or a similar system, but do it as a last resort, not the first line of defense.

Posted by Tom Limoncelli

Evi Nemeth Update

Posted by Tom Limoncelli in Evi Nemeth

Ben Cotton wrote up a summary of my Evil Genius 101 tutorial: https://www.usenix.org/blog/evil-genius-101

Thanks for the great summary, Ben!

(Ben blogs at FunnelFiasco)

Posted by Tom Limoncelli in LISA

A user recently asked for a lot of disk space. Not just a lot of disk space, but growing at an astounding rate per month. (Not big for some places, but bigger than my current employer was used to providing). It was an archive that would start large and grow in leaps and bounds. It had to be actual disk (not tape or other off-line technology) because the data would be accessed constantly.

He joked that what he really wanted was infinite disk space. I replied, "I can give you infinite storage." and I wasn't joking.

He told me to prove it so I explained:

Your data will start large and grow quickly. If I can install storage capacity faster than you need it, that is indistinguishable from actually having infinite storage. I can afford it because the equipment will come from your budget, so it isn't my problem (directly), and if this project is as valuable to the company as you say it is, your management will gladly fund it. (And if they don't fund it, then we know the value of the project is not what you think it is.)

To stay ahead of your capacity demands I need to know your predicted growth rate. You'll have to provide predictions based on engineering estimates, but I'll have monitoring that will give you "ground truth" as confirmation and will help you make better predictions.

The key is communication and cooperation. The biggest barriers to a project are often funding and capacity planning. If you can solve those two issues everything else falls into place and we can do anything.

That was true when sysadmins were providing megabytes of storage in the 90s and it is true when providing petabytes of storage today.
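For the curious, here is a rough Python sketch of the "ground truth" side of that conversation. It assumes the monitoring system can hand you (day, terabytes used) samples, fits a straight line to them, and asks whether the disks will fill up before a new order could arrive. The function names, the sample numbers, and the linear-growth assumption are all mine, for illustration only; real growth "in leaps and bounds" may need a better model.

```python
def growth_per_day(samples):
    """Least-squares slope of (day, used_tb) samples, in TB per day."""
    n = len(samples)
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        raise ValueError("need samples from at least two different days")
    covariance = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    return covariance / spread


def days_until_full(samples, capacity_tb):
    """Days from the latest sample until usage reaches capacity_tb."""
    rate = growth_per_day(samples)
    if rate <= 0:
        return float("inf")  # not growing, so it never fills up
    _, latest_used = samples[-1]
    return (capacity_tb - latest_used) / rate


def must_order_now(samples, capacity_tb, lead_time_days):
    """True if the disks would fill before new hardware could be installed."""
    return days_until_full(samples, capacity_tb) <= lead_time_days


if __name__ == "__main__":
    # Made-up measurements: (day number, TB in use)
    measured = [(0, 100.0), (30, 130.0), (60, 160.0)]
    print(days_until_full(measured, capacity_tb=200.0))
    print(must_order_now(measured, capacity_tb=200.0, lead_time_days=45))
```

Comparing the measured slope against the user's engineering estimate is the part that keeps both of you honest.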

Posted by Tom Limoncelli in Technical Management

This just in... I'll be having office hours on Thursday from 2-3:30pm at LISA. Stop by for one-on-one time management counseling.

It isn't listed yet on the website but will be soon: https://www.usenix.org/conference/lisa13/hack-space

Posted by Tom Limoncelli

If you can't make it to LISA this year but want to see my devops-tastic "Evil Genius 101" class, you can buy the livestream: https://www.usenix.org/conference/lisa13/video/usenix-training-video-stream-half-day-lisa-13-evil-genius-101

You can watch many different LISA presentations livestreamed here: https://www.usenix.org/conference/lisa13/live-streaming

Posted by Tom Limoncelli

Tom will be teaching 2 tutorials, including the all-new Evil Genius 101 half-day class, and doing a book signing.

  Tuesday AM: Half-day tutorial: Advanced Time Management: Team Efficiency Updated!
  Tuesday PM: Half-day tutorial: Evil Genius 101 New!
  Thursday, 1-1:30PM: Book Signing in Exhibit Hall C
  Thursday, 2-3:30PM: "Time Management Office Hours" (one-on-one time management counseling) New!
  Friday, 9-10:30AM: Guru Session "Time Management for Sysadmins" (Harding Room)

Hospitals are the mainframe of the medical industry.

Computers used to be rare and expensive. Every bit, every CPU cycle needed to be carefully groomed, petted, and softly whispered sweet things to, protected and managed. The best way to do this was to make one big computer, the central mainframe, and have everyone worship it like a god, accessed only through 24x80 text-only glass video tubes.

Then came PCs. PCs are so cheap you can waste CPU cycles on silly things like... ease-of-use features, applications that enable communication between people, graphical user interfaces, games, surfing the web, etc.

Medical equipment used to be rare and expensive. Every device required a highly trained specialist to maintain it, operate it, realize and interpret the results.

Medical equipment isn't like that any more. It is computer-controlled, usually self-maintaining, and sometimes even analyzes the results for you. It can be mobile, even personal. There are mobile MRI machines. You can put a FitBit on your arm. "Urgent care" facilities are popping up all over the place. Doctors are self-organizing mini medical centers that focus on particular aspects of care.

The mainframe industry ignored PCs as long as they could, then fought tooth and nail to avoid them, then either adapted to the inevitable future kicking and screaming like little babies or went out of business.

Hospitals are trying to figure out if they will be kicking and screaming like little babies and adapt, or ignore the issue for as long as possible and go out of business.

Good luck to them. I hope they figure something out because... you know... it affects our health and life... not just their balance sheets.

Posted by Tom Limoncelli

I've been talking about SDN and OpenFlow for a while. It is slowly becoming a reality. This article is one of the warning signs: "Here's What Happened When Cisco Lost A $1 Billion Deal With Amazon"

Let me put the financial impact into more down-to-earth terms.

How does Cisco make money? Well, you buy a switch or router and that's good. Then you buy more and that's good too. Then you grow so large that the routing table has gotten too big to be calculated by the CPU/RAM on all the old equipment. Therefore to buy the next device you also have to buy upgrades for all previous devices. It's like instead of paying for 1 item you pay for 1+N*M, where N is the number of legacy devices and M is the cost of upgrading each one. When N is small this is barely noticed but when N is large... oh it's good to be a Cisco salesperson.

The reason it is so good is that the customer can't buy that CPU/RAM upgrade from anyone but Cisco. These are specialized CPU modules. Being the only supplier you are locked in. They can extract inflated prices because your only choice is Cisco or throw out the entire network and buy from someone else. That's even more expensive.

Yes, you don't need to upgrade everything for every new device. Yes, there are ways to grow a network that minimize route table growth on all equipment. This is a general economic trend; don't get pedantic. On average, adding new devices leads to new upgrades that you can't get anywhere else.
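To make the 1+N*M arithmetic concrete, here is a small Python sketch. It assumes the worst case from the caveat above: every new purchase forces an upgrade of every device you already own. It takes N as the number of existing devices and M as the upgrade cost per device, measured in units of "one new device". The function names and the sample numbers are invented for illustration.

```python
def purchase_cost(existing_devices, upgrade_cost):
    """Cost of buying one more device: the device itself (1) plus N*M in upgrades."""
    return 1 + existing_devices * upgrade_cost


def total_cost(device_count, upgrade_cost):
    """Cumulative cost of growing from zero to device_count devices, one at a time."""
    return sum(purchase_cost(n, upgrade_cost) for n in range(device_count))


if __name__ == "__main__":
    m = 0.5  # made-up number: an upgrade costs half as much as a new device
    for n in (0, 10, 100):
        print(n, purchase_cost(n, m))
    print(total_cost(10, m))
```

Under those assumptions the price of each purchase grows linearly with the size of the network, so the total bill grows roughly with the square of it.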

Then came software defined networking (SDN).

Spoiler alert: With SDN you'll be able to buy your network hardware and network OS from two different vendors.

With SDN the routing table is calculated by an external system that does the calculations for all devices and uploads the results to each device. Each device, therefore, is cheap to make. Cost scales linearly with the number of ports. As the network grows the external system that does the calculations needs to get more beefy. The network elements do not. They just keep running.

This external route calculator is a Linux box running either open source or proprietary OpenFlow software. You manage it like you manage any server. It can be a virtual machine running in your private VM system that you keep allocating more RAM and vCPUs or it can be physical boxes that you upgrade with parts that have dozens of companies competing to make. There's no lock-in. Heck, you can even change software vendors and not have to throw away any old hardware.
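As a toy illustration of "one external system does the calculations for all devices", here is a Python sketch. It is not OpenFlow and not any real controller's API: the topology format and function names are made up, and it uses a plain breadth-first search (fewest hops) where a real system would use richer metrics. The point is only that the tables for every switch come out of one program running on an ordinary server.

```python
from collections import deque


def next_hops(topology, source):
    """Breadth-first search from source; returns {destination: first hop}."""
    table = {}
    queue = deque()
    for neighbor in topology[source]:
        if neighbor not in table:
            table[neighbor] = neighbor  # directly connected
            queue.append(neighbor)
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:
            if neighbor != source and neighbor not in table:
                table[neighbor] = table[node]  # reuse the first hop toward node
                queue.append(neighbor)
    return table


def compute_all_tables(topology):
    """Central calculation: one forwarding table per switch."""
    return {switch: next_hops(topology, switch) for switch in topology}


if __name__ == "__main__":
    # Hypothetical three-switch network: a - b - c
    topology = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    for switch, table in compute_all_tables(topology).items():
        # A real controller would push each table to its device here.
        print(switch, table)
```

When the network grows, it is `compute_all_tables` that needs more CPU and RAM; the switches just receive a finished table.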

That's why Cisco is afraid. The amount of money made on a sale is about to go from 1+N*M to 1. That should make Cisco afraid.

Why do I sound so confident? Because we've seen this in the past. A big Cisco switch is like a mainframe and the world of desktop computers is coming to destroy it. Network equipment is the last place in this industry where you are required to buy the hardware and the software from the same company.

In the bad old days you had to buy your network NICs and switches from the same vendor. They had you locked in. Once open protocols came about, you could buy a NIC from anyone and a switch from anyone; companies that didn't adjust their business model went out of business. (and suddenly NICs were built into motherboards! at last!).

Let's talk about mainframes. In the bad bad old old days you had to buy your hardware, OS and applications from the same company. An IBM mainframe ran an IBM operating system and 90% of the applications you could buy for it came from IBM.

In the 1980s C and Unix made a radical change to this... you could write software once and without too much effort get it to run on any Unix or Unix-like operating system. This was the "open systems" movement. You still had a hardware+os lock, but the hardware+os+applications lock was broken. Sun would sell you SPARC+Solaris, HP would sell you H9xx+HPUX, IBM would sell you RS6000+AIX and you could move applications between them. People today forget how radical it was to be able to port software to another OS by recompiling it instead of having to rewrite it from scratch.

In the 1990s/2000s Linux made a radical change to that... you didn't even have to buy the OS from the vendor. This toppled Sun, HP and IBM who had a difficult time adjusting to the concept of selling hardware (Intel/AMD chips on generic motherboards) and leaving the OS to the customer to provide. IBM is a different company now. HP focuses on... service or something (still not sure). Sun went out of business and sold their assets to some guy with an airplane fetish.

We consumers will be the winners. A new era of competition will happen at the hardware level. The "smarts" run on cheap Linux servers, and competition is hot there, driving down prices. The real fun will be the new competition spawned when network elements become a commodity. "Commodity" means lower prices and the reduction of profits.

There will also be competition on the software side. With SDN you can change your software without throwing away your hardware. (Could you imagine someone trying to sell you a replacement IOS for your current Cisco hardware today? You can't, because the hardware is not 'open'.) I'm excited about two particular areas of competition. On the capacity side: Mathematically it should be possible to calculate 1,000 route tables simultaneously instead of doing 1,000 individual calculations. On the optimization side: The blob of data sent to the network elements can be optimized multiple ways. I look forward to seeing optimizations for size and speed of execution, just like C compilers compete.

Decomposing networking into its basic elements (hardware, route calculation, etc) enables competition on a finer granularity thus opening the space to new companies and new ideas. Competition is good.

What is happening right now is Cisco is trying to decide if they want to follow a path like Sun, HP, or IBM... or can they find an entirely new path?

We live in interesting times!

Posted by Tom Limoncelli

 