Stop monitoring whether or not your service is up!

99 percent of all monitoring that I see being done is done wrong.

Most people think of monitoring like this:

  • Step 1: Something goes down.
  • Step 2: I am alerted.
  • Step 3: I fix the problem as fast as I can.
  • Step 4: I get a pat on the back if I was able to fix it "really fast" (i.e., faster than the RTO).

If that's how you think of monitoring, then you are ALWAYS going to have down time. You've got down time "baked into" your process!

Here's how we should think about monitoring:

  • Step 1: I get an alert that something is "a bit off"; something that needs to be fixed or else there will be an outage.
  • Step 2: I fix the thing faster than the "or else".
  • Step 3: Done. Boom. That's it. There is no Step 3. #dropthemic
  • Step 4: We haven't had a user-visible outage in so long that I can't remember what RTO stands for.

The difference here is that in the second scenario there was no down time "baked into" the process. Down time was the exception, not the rule. We have moved from being reactive to proactive.

How do you get there?

  • Step 1: Delete all alerts in your monitoring system.
  • Step 2: Each time there is a user-visible outage, determine what indicators would have predicted that outage.
  • Step 3: Update your monitoring system to collect those indicators and alert as needed.
  • Step 4: Repeat. Eventually you will only have alerts that prevent outages, not respond to outages.

Obviously Step 1 is not tenable. Ok, Ok. You might declare all existing alerts to be legacy; all new alerts can have names that are prefixed "minorityreport" or something. All the legacy alerts should be eliminated over time.

You also need to start designing systems so that they are survivable. A component failure should not equal a user-visible outage. Stop designing networks with single uplinks: have 2 links and alert when the first one goes down. Stop setting up non-redundant storage: use RAID 1 or higher and alert when one disk has failed. Stop using ICMP pings to determine if a web server is down: monitor that page-load times are unusually high, that the SQL database it depends on is getting slow, monitor RAM usage or disk space or 404s or just about anything else. Put all web servers behind a load balancer and monitor if the number of replicas is dangerously low.
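To make the pattern concrete, here is a minimal sketch of a redundancy check. It is only an illustration, not something from a particular monitoring product: the `RedundantGroup` class, the `redundancy_status` function, and the status names are all hypothetical. The idea is that you alert on the loss of spare capacity, while users are still being served.

```python
from dataclasses import dataclass


@dataclass
class RedundantGroup:
    name: str
    total: int       # components provisioned (links, disks, replicas)
    healthy: int     # components currently working
    min_needed: int  # fewest healthy components that still serve users


def redundancy_status(group: RedundantGroup) -> str:
    """Classify a redundant group by how much spare capacity is left."""
    if group.healthy < group.min_needed:
        return "outage"    # users are affected; the alert came too late
    if group.healthy == group.min_needed:
        return "at_risk"   # no spare left: the next failure is an outage
    if group.healthy < group.total:
        return "degraded"  # something failed, but spare capacity remains
    return "ok"


# Hypothetical examples: one of two uplinks is down, two of eight replicas are down.
uplinks = RedundantGroup("uplinks", total=2, healthy=1, min_needed=1)
web = RedundantGroup("web replicas", total=8, healthy=6, min_needed=4)

print(uplinks.name, redundancy_status(uplinks))
print(web.name, redundancy_status(web))
```

In this sketch "at_risk" is the alert that matters: fix it faster than the "or else". A group that reports "at_risk" even when everything is healthy has no redundancy at all, which is a design problem rather than a monitoring problem.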

See the pattern?

You might be saying to yourself, "But Tom! I can't predict every precursor to an outage!" No, you can't, and that's a bug. That is a bug with the entire freakin' multi-billion dollar software industry. Any time you have a user-visible outage and you can't think of a way to detect it ahead of time: file a bug.

If it is an internally written service, file a feature request to expose the right indicators so your monitoring system can see them.

If it is a commercial product, demand that they get involved with the post-mortem process, understand what led to the outage, and update their product to expose indicators that let you do your job the right way. If they refuse, send them a link to this blog post... not that it will change their mind but I can use the hits.

That should be "the new normal" for operations. Anything else is literally encouraging failure.

P.S. Oh ok... sigh... yes, you can still ping your web server if that makes you feel better. Better yet, set up Pingdom or a similar system, but do it as a last resort, not the first line of defense.

Posted by Tom Limoncelli

11 Comments

Alerting when something is "a bit off" in a complex system is a great way to have a high rate of false positives and give everyone pager deafness.

Great post as usual, thanks Tom.

SteveT - this needs to be applied with a bit of logic and data, of course. And also with the understanding that alerting thresholds evolve with the system. I generally define "a bit off" by looking at a graph (1 week or 1 month) leading up to an incident, and finding the threshold at which the metric went from normal variance to pre-incident.

If the pager starts going off too often without an actual incident, the answer is simple: tune the threshold until the false positives go away (or even better, in the case of something like 500s, fix the problem...)
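The threshold-picking described above is done by eye from a graph. As a rough stand-in, here is a sketch that derives a starting threshold from a known-good period using mean plus a few standard deviations. That statistical rule is an assumption of this example, not the commenter's method, and the function names and sample numbers are hypothetical.

```python
import statistics


def pick_threshold(normal_samples, sigmas=3.0):
    """Return mean + sigmas * stddev of samples from a known-good period."""
    mean = statistics.mean(normal_samples)
    spread = statistics.pstdev(normal_samples)
    return mean + sigmas * spread


def is_a_bit_off(value, threshold):
    """True when a reading is above what normal variance would explain."""
    return value > threshold


# Hypothetical page-load times (ms) from a quiet week.
quiet_week_ms = [100, 102, 98, 101, 99]
threshold_ms = pick_threshold(quiet_week_ms)

print(is_a_bit_off(110, threshold_ms))
print(is_a_bit_off(103, threshold_ms))
```

Tuning away false positives then means raising `sigmas` or recomputing the threshold from a more recent quiet period.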

I think the idea here (as it should be with ANY monitoring system) is to only alert on things that require action. The canonical example of a pre-failure alert is configuring disk alerts for 80% full (which is "a bit off", but not causing an outage yet), rather than 100% full (which may mean your server falls over).

If you configure your alerts and thresholds sensibly your monitoring system should not produce noise: every message should be directly tied to an actionable event.
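The disk example above could be sketched like this. The 80% figure comes from the comment; the function names are hypothetical, and a real check would live inside whatever monitoring system you already run.

```python
import shutil


def disk_needs_attention(used_bytes, total_bytes, threshold=0.80):
    """True once usage reaches the warning threshold (80% by default)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes >= threshold


def check_path(path, threshold=0.80):
    """Check the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return disk_needs_attention(usage.used, usage.total, threshold)


if check_path("."):
    print("disk is a bit off: clean up before it fills")
```

The point is that the alert fires while there is still time to act, not when the server has already fallen over.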

SteveT - it can be possible to build a system where you can easily tell which bits will eventually cause an outage if they're a little off. Say, service-oriented architecture, clustered services, alerts when nodes fail.

My team has been focusing on a service-oriented query layer held together with message passing and a reactive event-driven back. We also prefer services that embrace failure - anything that would render a service unusable automatically reboots the executable.

Event-driven node fails? Nobody cares, I'll fix it tomorrow. Query node fails? No big deal, the other nodes have it covered, and it will restart itself. Query node stays failing? Then I get a call and fix it before any user ever knows.
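A sketch of the "stays failing" rule, assuming health checks are recorded as a list of booleans. The function name and the default of three consecutive failures are hypothetical choices, not details from my setup.

```python
def should_page(recent_checks, max_consecutive_failures=3):
    """Page only if the most recent checks all failed.

    recent_checks is a list of booleans, oldest first; True means healthy.
    """
    if len(recent_checks) < max_consecutive_failures:
        return False  # not enough history to call it a persistent failure
    return not any(recent_checks[-max_consecutive_failures:])


# A node that failed once and restarted itself does not page anyone.
print(should_page([True, False, True, True]))
# A node that keeps failing after its restarts does.
print(should_page([True, False, False, False]))
```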

I have loads left to learn, but I think Tom nailed it. I get occasional calls where the problem is gone before I even check, but customers almost never see a failure, no pager deafness because calls are rare and always valid, and I sleep easy even when I carry the phone. Life is good.

•Step 1: Delete all alerts in your monitoring system

At a previous employer, over time the unresolved alerts would grow, either because they were acknowledged or at the bottom of the list and never looked at, etc.

Every couple of months or so we would delete all the alerts (just from the "active" table, history was left untouched). It was a good exercise to see what really did need fixing, since the broken stuff would alert again after a cycle or two.

SteveT: I think you missed the first step:

Delete all alerts in your monitoring system.

Then you also missed steps 2 and 3, which involve creating only alerts that would have predicted a real outage. If you get these steps right, then you should have no false positives; every alert should be a precursor to an eventual outage.

Step 2: create an alert that would have predicted a real outage. A good alert would be "server is powered on". Given enough time, that precursor will always lead to an outage.

Thank you, Captain Obvious!

Or implement something like Chaos Monkey.

Let's hope your employer appreciates proactive no-downtime system administration, too.

You might wanna ponder on some details for a while:

If you're flooded with useless "ping dropped once" alerts, then the problem isn't ICMP monitoring itself but a well-meaning, poorly working configuration.

1. Try to define different classes of monitoring:

  • Availability (pings, RAID LUN failed)
  • Redundancies (one uplink is gone, a RAID disk has failed, 5/8 web servers failed)
  • Capacity/Performance (web server is starting to page to disk, latencies are rising)
  • Businessy stuff: actually delivering what end-users need

2. USE redundant components, but NEVER without monitoring them at sub-component level.
You can be held liable for stuff like that.

If you care, look at my #opennebulaconf talk. I'll not link it, I'm not here to spam :)
