I assume you have some kind of automated monitoring system that watches over your servers, networks and services. Service monitoring is important to a functioning system. It isn't a service if it isn't monitored. If there is no monitoring then you're just running software.
Monitoring "is it down?" is reactive. It is better than no monitoring at all, but all it tells you is that there is already a problem. Monitoring is better when it predicts the future and prevents problems.
An analog radio (one with an old-fashioned vacuum tube) sounds great at first, but you hear more static when the tube starts to wear out. Then the tube dies and you hear nothing. If you change the tube when it starts to degrade, you'll never have a dead radio. (Assume, of course, you change the tube when your favorite radio show isn't on.)
A transistor radio, on the other hand, is digital. It plays and plays and plays and then stops. Now, during your favorite song, you have to repair it.
At Bell Labs someone called this the "run run run dead" syndrome of digital electronics.
How can we monitor computers and networks in a way that makes it more like analog electronics? There are some simple tricks we can use when monitoring to be "more like analog."
One trick is to stop monitoring "is it up?" and monitor "how fast is it?" instead. Don't measure "can I ping the server", measure "ping response time" and alert if the replies are very slow (where "no reply" is veeeerrrry slow). Better yet, don't measure ping, measure the service's performance directly: measure web-site latency, measure time-to-first-byte-received, send a test email to a relay and measure how long it takes to come back.
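Here is a minimal sketch of that idea in Python. It times how long a web request takes to return its first byte and treats "no reply" as infinitely slow, so one threshold check covers both "slow" and "down." The function names, the URL you pass in, and the 300 ms and 1000 ms thresholds are all assumptions for illustration; pick numbers that match your own service.

```python
import math
import time
import urllib.request

VERY_SLOW_MS = math.inf  # "no reply" is treated as infinitely slow


def measure_latency_ms(url, timeout_s=5.0, fetch=urllib.request.urlopen):
    """Return time to first byte in milliseconds, or VERY_SLOW_MS if there is no reply."""
    start = time.monotonic()
    try:
        with fetch(url, timeout=timeout_s) as response:
            response.read(1)  # wait for the first byte of the body
    except OSError:  # URLError and socket timeouts are both OSError subclasses
        return VERY_SLOW_MS
    return (time.monotonic() - start) * 1000.0


def classify(latency_ms, warn_ms=300.0, page_ms=1000.0):
    """Map a latency to an alert level. The thresholds are hypothetical."""
    if latency_ms >= page_ms:
        return "page"
    if latency_ms >= warn_ms:
        return "warn"
    return "ok"
```

Because a dead server comes back as an infinite latency, it lands in the "page" bucket without any separate "is it up?" rule.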
We can also monitor things that portend (predict) future failures. For example, measure the QPS (queries per second) our website is receiving. If there is a sharp increase, that's an indication that we might have problems if it increases much further. How many monitoring packages can do that kind of calculus?
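The "calculus" does not have to be fancy. As a rough sketch, assuming we already collect QPS readings at regular intervals, we can compare the newest reading against the average of the few readings before it. The function name, the 1.5x ratio, and the five-sample baseline are made-up values, not recommendations.

```python
from statistics import mean


def sharp_increase(qps_samples, ratio_threshold=1.5, baseline_size=5):
    """Return True if the latest QPS reading is far above the recent baseline.

    qps_samples is a list of readings taken at equal intervals, oldest first.
    """
    if len(qps_samples) < baseline_size + 1:
        return False  # not enough history to judge yet
    baseline = mean(qps_samples[-(baseline_size + 1):-1])
    latest = qps_samples[-1]
    return baseline > 0 and latest / baseline >= ratio_threshold
```

For example, `sharp_increase([100, 102, 98, 101, 99, 180])` flags the jump to 180, while a drift from 99 to 105 does not.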
We can monitor our Internet connection and keep historical data. Draw a graph that shows usage over the last year and notice the trajectory as the line goes up and up. Anyone can eyeball the graph and predict that in 3 months we'll be out of capacity. Luckily it takes 2 months to get more capacity approved, paid for, ordered, and deployed. We have predicted the future. Without good monitoring, we find ourselves with an overloaded connection, and the only thing we can predict is 2 months of unhappy users complaining and complaining.
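Eyeballing the graph can be automated. The sketch below fits a straight line (ordinary least squares) to monthly usage figures and estimates how many months remain until the line crosses our capacity. It assumes growth is roughly linear, which is a simplification, and the function names, the sample numbers, and the one-month safety margin are invented for the example.

```python
def months_until_full(usage_by_month, capacity):
    """Estimate months from the latest sample until usage reaches capacity.

    usage_by_month holds one reading per month, oldest first, in the same
    units as capacity. Returns None if usage is flat or shrinking.
    """
    n = len(usage_by_month)
    if n < 2:
        raise ValueError("need at least two months of history")
    mean_x = (n - 1) / 2
    mean_y = sum(usage_by_month) / n
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in enumerate(usage_by_month))
    slope /= sum((x - mean_x) ** 2 for x in range(n))
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    month_full = (capacity - intercept) / slope
    return max(0.0, month_full - (n - 1))


def should_order_now(months_left, lead_time_months=2, margin_months=1):
    """True when the remaining time is within the ordering lead time plus a margin."""
    return months_left is not None and months_left <= lead_time_months + margin_months


# Hypothetical year of usage on a link with a capacity of 110 (any unit).
usage = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
months_left = months_until_full(usage, capacity=110)
```

With these made-up numbers the line grows by 5 per month, so `months_left` comes out at 3 and `should_order_now(months_left)` says it is time to start the 2-month ordering process.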
What is an outage? To me an outage is something that is customer-facing. The users felt pain. Either the VPN system went down, or the VPN system was too slow to be usable. More formally, an outage is any time we miss our SLA. The problem is that a lot of people don't work in an environment that has written SLAs so we must invent them ad hoc.
After an outage has been resolved and we've had time to calm down, take a moment to look back at the situation and figure out a monitoring rule that would have predicted this problem. Add (or update) two sets of rules. The first set detects the outage. If such a rule exists already, maybe it needs to be fine-tuned or somehow updated. Maybe the SLA changed and we didn't update the rule. Or we don't have an SLA, and we need to make the rule tighter or looser to match customer expectations of what our SLA should be. The second set of rules should collect data that would help us prevent that particular outage in the future. This second group might be very specific to particular links, settings, or components.
Eventually we grow our monitoring system not based on what we think is good, but what we've learned over time is good. Write comments into the configuration to list what inspired the rule and document what you think should be done if this rule gets triggered. When it does get triggered, update this documentation with what you learned. This makes your monitoring system evolve and grow like a Wiki.
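Every monitoring system has its own configuration syntax, so treat the following as a sketch of the habit rather than of any real product. It models a rule as a small Python record that carries its own history: what inspired it, what to do when it fires, and what we learned each time it did. The class, the field names, the rule name, and the condition string are all hypothetical, and the parenthesized text is placeholder wording for you to fill in.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    name: str
    condition: str    # in whatever syntax your monitoring system uses
    inspired_by: str  # the outage or near miss that led to this rule
    what_to_do: str   # first steps for whoever receives the alert
    lessons_learned: list = field(default_factory=list)


vpn_rule = AlertRule(
    name="vpn-login-latency",
    condition="login_time_seconds > 10 for 5 minutes",
    inspired_by="(describe the outage that led to this rule)",
    what_to_do="(what you think should be done when this triggers)",
)

# After the rule fires for real, record what actually helped.
vpn_rule.lessons_learned.append("(what we learned the first time it fired)")
```

In a system with plain-text configuration files, the same three pieces of information fit just as well in comments next to the rule.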
Over time we'll have an optimal monitoring system for our environment.
As computers have become cheaper we use more redundancy to make them more reliable. We don't rely on a single disk, we put them in redundant (non-striped) RAID sets. We don't depend on one router, but we use VRRP so one can fail and packets still get through. We don't have one web server, we have a group of web servers behind a load balancer. We don't have one load balancer, we buy them in pairs and use an active-active or active-passive configuration.
When we have N+1 redundancy, things are more like analog than digital. If we have 20 machines behind a web load balancer, we don't panic if one goes down; we monitor that "at least 80% are up". That is more like an analog system. Two web servers going down is like static on the radio. Our monitoring should reflect this. We can measure the time a particular query takes to complete and alert depending on what we see, relative to a limit of x ms. Below x ms and steady: fine. Below x but rising quickly: time to add more web servers. Nearly x for extended periods of time: better order more web servers. Higher than x for an extended amount of time: steal web servers from other services. Analog.
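Both checks are easy to express in code. The sketch below assumes we already have a count of healthy servers and a recent window of query times; "extended" is approximated as "every sample in the window." The function names, the 90% "nearly x" cutoff, and the 30% "rising quickly" cutoff are assumptions made up for this example.

```python
def pool_ok(servers_up, servers_total, min_percent_up=80):
    """True if at least min_percent_up percent of the pool is serving traffic."""
    return servers_up * 100 >= servers_total * min_percent_up


def capacity_action(latencies_ms, limit_ms, near_fraction=0.9, rise_fraction=0.3):
    """Suggest an action from a window of query times (oldest first).

    limit_ms plays the role of x in the text above.
    """
    if min(latencies_ms) > limit_ms:
        return "steal servers from other services"  # above x the whole window
    if min(latencies_ms) >= near_fraction * limit_ms:
        return "order more servers"                 # nearly x the whole window
    if latencies_ms[-1] - latencies_ms[0] >= rise_fraction * limit_ms:
        return "add servers"                        # below x but climbing fast
    return "ok"
```

With 20 servers, `pool_ok(18, 20)` is true (static on the radio) while `pool_ok(15, 20)` is not. With a limit of 100 ms, a window of `[40, 55, 70, 85]` returns "add servers" even though no single sample has crossed the limit.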
The beauty of N+1 redundancy is that it decouples component failure from outages. In the old days, a component failure equalled an outage. A disk died, and the file server was down all day as we restored data from backups. Now our disks are in a RAID set and a single one going south just means a hot spare will be used to get us back to N+1. We can stop monitoring "is it down?" and instead monitor that each RAID group is at N+1 redundancy, that the entire RAID chassis has at least X hot spares in the pool, and that the data is accessible. (X is based on how far we are from the hardware. If we are a consultant who visits the site once a month, more is better. If we work in the building, fewer are needed.)
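As a sketch, those checks might look like the following. How the disk counts are collected depends entirely on your RAID hardware or software, so this example assumes something else has already gathered them. The class, the field names, and the message wording are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RaidGroup:
    name: str
    disks_needed: int   # N: the minimum number of disks to hold the data
    disks_healthy: int  # disks currently working in the group


def raid_findings(groups, hot_spares, min_hot_spares):
    """Return a list of problems; an empty list means everything is at N+1."""
    findings = []
    for group in groups:
        if group.disks_healthy < group.disks_needed:
            findings.append(f"{group.name}: below N, data at risk")
        elif group.disks_healthy == group.disks_needed:
            findings.append(f"{group.name}: at N, redundancy lost")
    if hot_spares < min_hot_spares:
        findings.append(f"only {hot_spares} hot spares, want {min_hot_spares}")
    return findings
```

Here `min_hot_spares` is the X from the text: a consultant who visits monthly would set it higher than someone who works in the building. The "is the data accessible?" check is left out because it depends on how the storage is used.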
It can be difficult to adjust your thinking to be more analog than digital. Start with one small change, like monitoring average queue length on a router port, or time to complete a web query. Once you learn how to do this with your monitoring system for one aspect or service, doing it again and again becomes easier.
Then...you'll be better at predicting the future.