Testbed Monitor to the rescue.
The Testbed Monitor is a custom, flexible tool that continually tracks essential processes and services, in parallel, across all the systems in the testbed to detect failures. It runs on a management host rather than on the testbed itself, because we can't assume the testbed will stay up. When it detects a problem, the Monitor notifies the user through script output on the console and by email, and it tracks known problems to ensure it doesn't notify the user about the same problem repeatedly. Flags control which conditions it checks. Optionally, it can 'snarf' forensic data from the testbed to a network share when it finds problems. The Monitor returns 0 to the shell if no problems were found and 1 if one or more problems were found.
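As a rough illustration of the behavior described above, here is a minimal Python sketch of one monitoring pass: checks run in parallel across hosts, a problem that is already known is not reported again, and the return value follows the 0/1 shell convention. The names `run_checks`, `Check`, `known` and `notify` are hypothetical and not taken from the actual Monitor; forgetting a problem once it clears is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical check signature: takes a host name, returns None when the
# host is healthy or a short problem description when it is not.
Check = Callable[[str], Optional[str]]


def run_checks(
    hosts: List[str],
    checks: Dict[str, Check],
    known: Set[Tuple[str, str]],
    notify: Callable[[str], None],
    max_workers: int = 8,
) -> int:
    """Run every check on every host in parallel; return 0 or 1."""
    jobs = [(host, name, check) for host in hosts for name, check in checks.items()]

    def run_one(job: Tuple[str, str, Check]) -> Optional[str]:
        host, _name, check = job
        try:
            return check(host)
        except Exception as exc:  # a crashing check counts as a problem
            return f"check failed: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, jobs))

    problems = 0
    for (host, name, _check), result in zip(jobs, results):
        key = (host, name)
        if result is None:
            # Assumption: a cleared problem is forgotten, so it would be
            # reported again if it comes back later.
            known.discard(key)
            continue
        problems += 1
        if key not in known:  # only notify about problems not seen before
            known.add(key)
            notify(f"{host}: {name}: {result}")

    return 1 if problems else 0
```

A caller would pass `print` (or a function that also sends email) as `notify` and hand the result to `sys.exit()` so the shell sees 0 or 1.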
If you design your test infrastructure right, the Monitor will be useful for manual testers as well as in automation. A manual user launches the Monitor before they start their testing. The execution harness launches the Monitor at the beginning of the test batch. The reservation system uses the Monitor to verify system readiness.
The Monitor has three general use cases:
1) Initial - Check the testbed once and exit. This is intended to check for existing testbed issues.
2) Initial Continuous - Check the testbed continuously until killed, including initial conditions.
3) Continuous - Check the testbed continuously until killed, ignoring initial conditions.
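One way to picture the difference between the three use cases is the sketch below. It is only an interpretation: it assumes that "ignoring initial conditions" means problems already present at the first poll are treated as a baseline and never reported. The names `Mode`, `monitor`, `poll` and `max_polls` are hypothetical, and `max_polls` exists only so the example can terminate, whereas the real tool runs until killed.

```python
import time
from enum import Enum
from typing import Callable, Optional, Set


class Mode(Enum):
    INITIAL = "initial"
    INITIAL_CONTINUOUS = "initial-continuous"
    CONTINUOUS = "continuous"


def monitor(
    mode: Mode,
    poll: Callable[[], Set[str]],
    notify: Callable[[str], None],
    interval: float = 30.0,
    max_polls: Optional[int] = None,
) -> int:
    """poll() returns the set of problem descriptions present right now."""
    first = poll()
    known = set(first)
    found = False

    if mode is not Mode.CONTINUOUS:
        # Initial and Initial Continuous report what is already broken.
        for problem in sorted(first):
            notify(problem)
        found = bool(first)

    if mode is Mode.INITIAL:
        return 1 if found else 0

    polls = 1
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = poll()
        for problem in sorted(current - known):  # new problems only
            notify(problem)
            found = True
        known |= current
        polls += 1

    return 1 if found else 0
```

In this sketch, Continuous mode stays silent about a service that was already down when monitoring began but reports anything that breaks afterwards.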
The issues to monitor vary by product. Common things to monitor are:
2) Required services
3) Daemon processes
4) Log entries
5) Bounces (reboots)
6) Server protocols (ICMP, NFS, etc.)
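To make a few of these checks concrete, here is a small Python sketch of three helpers: one scans log lines for problem patterns, one detects a bounce by noticing that a host's uptime has gone backwards between polls, and one probes a TCP port. All function names and the uptime-based approach are assumptions made for illustration. The TCP probe is a stand-in for a protocol check and is not an ICMP ping or an NFS request.

```python
import re
import socket
from typing import Iterable, List, Optional


def find_log_problems(lines: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the log lines that match any of the given regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [
        line.rstrip("\n")
        for line in lines
        if any(rx.search(line) for rx in compiled)
    ]


def detect_bounce(previous_uptime: Optional[float], current_uptime: float) -> bool:
    """A host has rebooted if its uptime is lower than at the last poll."""
    return previous_uptime is not None and current_uptime < previous_uptime


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

How the log lines and uptime values are collected from each testbed system (for example over SSH) is left out, since it depends on the product and environment.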
The only effective way to deal with the increasing complexity of distributed systems is increasing automation. The Testbed Monitor is one of the tools you need to tame your distributed beast.