A friend of mine told me of a situation where a cron job took longer to run than usual. As a result the next instance of the job started running and now they had two cronjobs running at once. The result was garbled data and an outage.
The problem is that they were using the wrong tool. Cron is good for simple tasks that run rarely. It isn't even good at that. It has no console, no dashboard, no dependency system, no API, no built-in way to have machines run at random times, and its a pain to monitor. All of these issues are solved by CI systems like Jenkins (free), TeamCity (commercial), or any of a zillion other similar systems. Not that cron is all bad... just pick the right tool for the job.
Some warning signs that a cron job will overrun itself: If it has any dependencies on other machines, chances are one of them will be down or slow and the job will take an unexpectedly long time to run. If it processes a large amount of data, and that data is growing, eventually it will grow enough that the job will take longer to run than you had anticipated. If you find yourself editing longer and longer crontab lines, that alone could be a warning sign.
I tend to only use cron for jobs that have little or no dependencies (say, only depend on the local machine) and run daily or less. That's fairly safe.
There are plenty of jobs that are too small for a CI system like Jenkins but too big for cron. So what are some ways to prevent this problem of cron job overrun?
It is tempting to use locks to solve the problem. Tempting but bad. I once saw a cron job that paused until it could grab a lock. The problem with this is that when the job overran there was now an additional process waiting to run. They ended up with zillions of processes all waiting on the lock. Unless the job magically started taking less time to run, all the jobs would never complete. That wasn't going to happen. Eventually the process table filled and the machine crashed. Their solution (which was worse) was to check for the lock and exit if it existed. This solved the problem but created a new one. The lock jammed and now every instance of the job exited. The processing was no longer being done. This was fixed by adding monitoring to alert if the process wasn't running. So, the solution added more complexity. Solving problems by adding more and more complexity makes me a sad panda.
The best solution I've seen is to simply not use cron when doing frequent, periodic, big processes. Just write a service that does the work, sleeps a little bit, and repeats.
while true ; do process_the_thing sleep 600 done
Simple. Yes, you need a way to make sure that it hasn't died, but there are plenty of "watcher" scripts out there. You probably have one already in use. Yes, it isn't going to run precisely n times per hour, but usually that's not needed.
You should still monitor whether or not the process is being done. However you should monitor whether results are being generated rather than if the process is running. By checking for something that is at a high level of abstraction (i.e. "black box testing"), it will detect if the script stopped running or the program has a bug or there's a network outage or any other thing that could go wrong. If you only monitor whether the script is running then all you know is whether the script is running.
And before someone posts a funny comment like, "Maybe you should write a cron job that restarts it if it isn't running". Very funny.