A coworker debugged a problem last week that inspired me to relay this bit of advice:
Nothing happens at "random times". There's always a reason why it happens.
I once had an ISDN router that became incredibly slow at irregular intervals. People on the far side of the router lost service for 10-15 seconds each time.
The key to finding the problem was timing how often the problem happened. I used a simple once-a-second "ping" and logged the times that the outages happened.
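If you want to try the same thing, here is a minimal sketch of a once-a-second probe and a helper that turns the raw samples into a list of outages. The function names are made up for illustration, and the ping flags assume the Linux iputils version of ping (-c for count, -W for a timeout in seconds); other platforms spell these differently.

```python
import subprocess
import time


def host_is_up(host: str) -> bool:
    """Send one ping and report whether a reply came back.

    Assumes Linux iputils ping: -c is the packet count and -W is the
    reply timeout in seconds. Adjust the flags for other platforms.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host: str, count: int):
    """Probe roughly once a second and return (timestamp, ok) samples."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), host_is_up(host)))
        time.sleep(1)
    return samples


def find_outages(samples):
    """Collapse (timestamp, ok) samples into (start, duration) outages.

    An outage starts at the first failed sample and ends at the next
    successful one. A failure run still open at the end is dropped.
    """
    outages = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp
        elif ok and start is not None:
            outages.append((start, timestamp - start))
            start = None
    return outages
```

Because each loop iteration sleeps for a second on top of the time the ping itself takes, the probes drift a little. That is fine for spotting 10-second outages, but log the real timestamps rather than counting iterations.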
Visual inspection of the numbers turned up no clues. It looked random.
I graphed how far apart the outages happened. The graph looked pretty random, but there were runs that were always 10 minutes apart.
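The same check can be done without a graph by bucketing the gaps and counting them. This is only a sketch: the function names and the sample timestamps are invented for illustration, not taken from the original logs.

```python
from collections import Counter


def gaps_between(starts):
    """Return the time between each outage and the one before it."""
    ordered = sorted(starts)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def gap_histogram(gaps, bucket=10):
    """Count gaps per bucket of `bucket` seconds, keyed by the lower edge."""
    return Counter(int(gap // bucket) * bucket for gap in gaps)


# Made-up outage start times, in seconds since the log began.
starts = [0, 600, 1200, 1340, 1800, 2400, 2415, 3000]
histogram = gap_histogram(gaps_between(starts))
print(histogram.most_common(1))  # [(600, 3)]
```

A bucket that collects far more gaps than its neighbors is the numeric version of those runs on the graph.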
I graphed the outages on a timeline. That's where I saw something interesting. The outages were exactly 10 minutes apart PLUS at other times. I wouldn't have seen that without a graph.
What happens every 10 minutes and at other times too? In this case, the router recalculated its routing table every time it got a route update. The route updates came from its peer router exactly every 10 minutes, plus any time an ISDN link went up or down. The 10-minute gaps I was seeing were the times we went an entire 10 minutes with no ISDN links going up or down. With so many links, and with home users connecting intermittently, it was pretty rare to go the full 10 minutes with no updates. However, graphing the outages made the periodic ones visible.
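A timeline is one way to spot a fixed-period event hiding among irregular ones. Another, if you already suspect a period, is to fold the timestamps modulo that period: the periodic events all land on the same phase while the others scatter. This is a different technique from the graph described above, sketched here with a hypothetical `fold` function and invented timestamps.

```python
from collections import Counter


def fold(starts, period=600, bucket=5):
    """Count outages per phase bucket after folding times modulo `period`.

    Keys are the lower edge of each phase bucket, in seconds.
    """
    return Counter(int((t % period) // bucket) * bucket for t in starts)


# Made-up data: a periodic event at 37 s past every 10-minute mark,
# plus two unrelated outages at 1340 s and 2452 s.
starts = [37, 637, 1237, 1340, 1837, 2437, 2452, 3037]
print(fold(starts).most_common(1))  # [(35, 6)]
```

If no bucket stands out, try other candidate periods (60, 300, 3600 seconds and so on). Clock drift on either side smears the peak across neighboring buckets, so a wider bucket can help on long logs.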
I've seen other outages that happened 300 seconds after some other event, such as a machine connecting to the network. A lot of protocols do things in 300-second (5-minute) intervals. The most common is ARP: many routers expire ARP entries after 300 seconds. Some vendors extend the timer whenever they receive a packet from the host; others expire the entry and send another ARP request.
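When you have two logs, one of outages and one of some suspected trigger, measuring the delay from each outage back to the most recent trigger makes a fixed lag like 300 seconds easy to spot. A minimal sketch, with a made-up function name and made-up timestamps:

```python
from bisect import bisect_right


def lags_after(triggers, outages):
    """For each outage, return seconds since the most recent trigger.

    Outages with no earlier trigger are skipped.
    """
    ordered = sorted(triggers)
    lags = []
    for outage in sorted(outages):
        index = bisect_right(ordered, outage)
        if index:
            lags.append(outage - ordered[index - 1])
    return lags


# Made-up timestamps in seconds: hosts joining the network, and outages.
triggers = [100, 1000, 5000]
outages = [50, 400, 1300, 5301]
print(lags_after(triggers, outages))  # [300, 300, 301]
```

Lags that cluster tightly around one value point to a timer. Lags spread all over the place suggest the two logs are unrelated.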
What other timeouts have you found to be clues of particular bugs? Please post in the comments!
Another classic interval is 49.7 days, or 2^32 milliseconds. Timer wraps cause all kinds of interesting buggy behaviors. That one is a little more maddening to track, though most mature code has timer wrap protection or 64-bit counters. A 64-bit millisecond counter takes more than 584 million years to wrap, and I have yet to see any system with that kind of uptime.
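The arithmetic, and the usual shape of the bug, can be sketched in a few lines. The function names are hypothetical, and Python integers do not wrap on their own, so the 32-bit counter is simulated by reducing modulo 2^32.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24
WRAP = 2 ** 32  # a 32-bit counter holds values 0 .. WRAP - 1


def days_until_wrap(bits: int) -> float:
    """Days before a millisecond counter of `bits` bits wraps to zero."""
    return 2 ** bits / MS_PER_DAY


def naive_expired(start: int, now: int, timeout: int) -> bool:
    """Buggy check: compares absolute tick values, so it breaks at the wrap."""
    deadline = (start + timeout) % WRAP  # the deadline itself wraps
    return now >= deadline


def safe_expired(start: int, now: int, timeout: int) -> bool:
    """Wrap-safe check: compares elapsed time, computed modulo 2**32.

    Correct as long as the true elapsed time is under 2**32 ms.
    """
    elapsed = (now - start) % WRAP
    return elapsed >= timeout


print(round(days_until_wrap(32), 1))  # 49.7

# A 300 ms timer started 100 ms before the wrap, checked 50 ms later.
start = WRAP - 100
now = WRAP - 50
print(naive_expired(start, now, 300))  # True (fires 250 ms early)
print(safe_expired(start, now, 300))   # False
```

The naive version works for 49.7 days at a stretch and then misfires for any timer that straddles the wrap, which is why these bugs look random until you line them up against uptime.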