99 percent of all the monitoring I see is being done wrong.
Most people think of monitoring like this:
- Step 1: Something goes down.
- Step 2: I am alerted.
- Step 3: I fix the problem as fast as I can.
- Step 4: I get a pat on the back if I fixed it "really fast" (i.e., faster than the RTO).
If that's how you think of monitoring, then you are ALWAYS going to have downtime. You've got downtime "baked into" your process!
Here's how we should think about monitoring:
- Step 1: I get an alert that something is "a bit off"; something that needs to be fixed or else there will be an outage.
- Step 2: I fix the thing faster than the "or else".
- Step 3: Done. Boom. That's it. There is no Step 3. #dropthemic
- Step 4: We haven't had a user-visible outage in so long that I can't remember what RTO stands for.
The difference is that the second scenario has no downtime "baked into" the process. Downtime is the exception, not the rule. We've moved from being reactive to being proactive.
How do you get there?
- Step 1: Delete all alerts in your monitoring system.
- Step 2: Each time there is a user-visible outage, determine what indicators would have predicted that outage.
- Step 3: Update your monitoring system to collect those indicators and alert as needed. (There's a sketch of one such check below.)
- Step 4: Repeat. Eventually you will only have alerts that prevent outages, not respond to outages.
Obviously Step 1 is not tenable. Ok, Ok. You might instead declare all existing alerts to be legacy; all new alerts get names prefixed with "minorityreport" or something. Eliminate the legacy alerts over time.
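To make Steps 2 and 3 concrete: suppose a past outage was "the database's disk filled up." The indicator that would have predicted it isn't "the disk is 100% full", it's the growth rate. Here's a rough sketch of what that alert could look like, written as a Nagios-style check in Python. The mountpoint, state-file location, and thresholds are made-up placeholders; adapt the idea to whatever your monitoring system actually runs.

```python
#!/usr/bin/env python3
"""Nagios-style check: warn when a filesystem is *trending* toward full,
long before it is actually full. Paths and thresholds are made-up examples."""
import json
import os
import shutil
import sys
import time

STATE_FILE = "/var/tmp/check_disk_trend.state"  # hypothetical scratch file
WARN_HOURS = 48                                 # hypothetical thresholds
CRIT_HOURS = 12

def main(mountpoint="/var/lib/mysql"):          # hypothetical mountpoint
    usage = shutil.disk_usage(mountpoint)
    now = time.time()

    # Load the previous sample so we can estimate the growth rate,
    # then save the current sample for next time.
    previous = None
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            previous = json.load(f)
    with open(STATE_FILE, "w") as f:
        json.dump({"time": now, "used": usage.used}, f)

    if (previous is None
            or usage.used <= previous["used"]
            or now - previous["time"] < 60):
        print("OK: no usable growth trend yet")
        return 0

    bytes_per_sec = (usage.used - previous["used"]) / (now - previous["time"])
    hours_until_full = usage.free / bytes_per_sec / 3600.0

    if hours_until_full < CRIT_HOURS:
        print(f"CRITICAL: {mountpoint} projected full in {hours_until_full:.1f} hours")
        return 2
    if hours_until_full < WARN_HOURS:
        print(f"WARNING: {mountpoint} projected full in {hours_until_full:.1f} hours")
        return 1
    print(f"OK: {mountpoint} projected full in {hours_until_full:.1f} hours")
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```

The point isn't this particular script; it's that the alert fires days before the "or else", not after.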
You also need to start designing systems so that they are survivable. A component failure should not equal a user-visible outage.
- Stop designing networks with single uplinks: have two links and alert when the first one goes down.
- Stop setting up non-redundant storage: use RAID 1 or higher and alert when one disk has failed.
- Stop using ICMP pings to decide whether a web server is down: alert when page-load times are unusually high, when the SQL database it depends on is getting slow, when RAM usage, disk space, 404 rates, or just about anything else looks wrong.
- Put all web servers behind a load balancer and alert when the number of healthy replicas is dangerously low.
See the pattern?
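In case it helps, here's what "alert when the number of healthy replicas is dangerously low" might look like as another Nagios-style Python check. The hostnames, the health endpoint, and the floor of two healthy backends are all placeholder assumptions; most load balancers and monitoring systems can hand you this number directly, which is even better.

```python
#!/usr/bin/env python3
"""Nagios-style check: alert when the pool of healthy web servers behind the
load balancer gets dangerously small, before users notice anything.
Hostnames, port, and path are made-up examples."""
import sys
import urllib.request

BACKENDS = ["web1.example.com", "web2.example.com", "web3.example.com"]
HEALTH_URL = "http://{host}:8080/healthz"   # hypothetical health endpoint
MIN_HEALTHY = 2                             # hypothetical floor

def is_healthy(host, timeout=3):
    try:
        with urllib.request.urlopen(HEALTH_URL.format(host=host), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def main():
    healthy = [h for h in BACKENDS if is_healthy(h)]
    msg = f"{len(healthy)}/{len(BACKENDS)} backends healthy"
    if not healthy:
        print(f"CRITICAL: {msg}; users are (or soon will be) seeing errors")
        return 2
    if len(healthy) < MIN_HEALTHY:
        print(f"WARNING: {msg}; fix it before the next one dies")
        return 1
    print(f"OK: {msg}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```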
You might be saying to yourself, "But Tom! I can't predict every precursor to an outage!" No, you can't, and that's a bug. That is a bug with the entire freakin' multi-billion dollar software industry. Any time you have a user-visible outage and you can't think of a way to detect it ahead of time: file a bug.
If it is an internally written service, file a feature request to expose the right indicators so your monitoring system can see them.
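What does "expose the right indicators" look like? Here's a minimal sketch, assuming the service can serve a little JSON over HTTP; the metric names (queue depth, query latency, cache hit ratio) are invented for illustration. Whatever format your monitoring system prefers, the point is the same: publish the numbers that would have warned you.

```python
#!/usr/bin/env python3
"""Sketch: an internally written service exposing its own leading indicators
over HTTP so the monitoring system can scrape them. Metric names, the port,
and the choice of indicators are illustrative, not prescriptive."""
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Whatever numbers would have predicted your last outage.
INDICATORS = {
    "work_queue_depth": 0,     # jobs waiting to be processed
    "db_query_ms_p95": 0.0,    # recent SQL latency
    "cache_hit_ratio": 1.0,    # a falling ratio means trouble is coming
}
LOCK = threading.Lock()

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        with LOCK:
            body = json.dumps(INDICATORS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

def record(name, value):
    """Call this from the service's normal code paths."""
    with LOCK:
        INDICATORS[name] = value

def start_metrics_server(port=8126):  # port is an arbitrary example
    server = HTTPServer(("", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Then a check like the ones above can read /metrics and alert on the trend instead of the outage.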
If it is a commercial product, demand that the vendor get involved in the post-mortem process, understand what led to the outage, and update the product to expose indicators that let you do your job the right way. If they refuse, send them a link to this blog post... not that it will change their mind, but I can use the hits.
That should be "the new normal" for operations. Anything else is literally encouraging failure.
P.S. Oh ok... sigh... yes, you can still ping your web server if that makes you feel better. Better yet, set up Pingdom or a similar external check, but treat it as a last resort, not your first line of defense.