Tuesday, February 22, 2011

The Testbed Monitor

With the increasing prevalence of distributed, clustered, and cloud products, monitoring the status of your testbed for problems is more important than ever.  A distributed system is a set of computers, networking, and software working together as a single system.  It's very difficult, if not impossible, to monitor the condition of a distributed system competently by hand.  Transitory events can be missed.  Even obvious defects can go unnoticed.

Testbed Monitor to the rescue.

The Testbed Monitor is a flexible custom tool that detects testbed problems by continually tracking essential processes and services, in parallel, across all the systems in the testbed.  It runs on a management host rather than on the testbed itself, since we can't assume the testbed will stay up.  When it detects a problem, the Monitor notifies the user through console output and email, and it tracks known problems to ensure it doesn't spam the user with repeat notifications.  Flags control which conditions are checked.  Optionally, it can 'snarf' forensic data from the testbed to a network share when it finds problems.  The Monitor returns 0 to the shell if no problems were found and 1 if one or more problems were found.
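
To make that description concrete, here is a minimal Python sketch of the core idea: run every check against every host in parallel, notify only about problems not already known, and map the result to a shell exit code.  The function names (run_checks, report_new, exit_code) and the shape of a "check" (a function that takes a host name and returns True when healthy) are my assumptions for illustration, not the actual Monitor's interface.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks(hosts, checks):
    """Run each check on each host in parallel.

    `checks` maps a check name to a function(host) -> bool (True = healthy).
    Returns the set of (host, check_name) pairs that failed.
    """
    jobs = [(host, name, fn) for host in hosts for name, fn in checks.items()]

    def run(job):
        host, name, fn = job
        try:
            ok = bool(fn(host))
        except Exception:
            # A check that blows up is treated as a failure, not a crash.
            ok = False
        return host, name, ok

    workers = max(1, min(32, len(jobs)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run, jobs))
    return {(host, name) for host, name, ok in results if not ok}


def report_new(problems, known, notify=print):
    """Notify only about problems not seen before, then remember them."""
    new = sorted(problems - known)
    for host, name in new:
        notify(f"PROBLEM {host}: {name}")
    known.update(problems)
    return new


def exit_code(problems):
    """0 if no problems were found, 1 if one or more were found."""
    return 1 if problems else 0
```

In a real tool, notify would also send email, and the check functions would reach out to the testbed over the network from the management host.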

If you design your test infrastructure right, the Monitor will be useful for manual testers as well as in automation.  A manual user launches the Monitor before they start their testing.  The execution harness launches the Monitor at the beginning of the test batch.  The reservation system uses the Monitor to verify system readiness.

The Monitor has three general use cases:

1) Initial - Check the testbed once and exit.  This is intended to check for existing testbed issues.
2) Initial Continuous - Check the testbed continuously until killed, including initial conditions.
3) Continuous - Check the testbed continuously until killed, ignoring initial conditions.
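
The difference between the three modes comes down to how the first pass is treated.  A rough Python sketch, assuming a hypothetical check() callable that returns the set of problems currently visible on the testbed (the Mode names, the monitor function, and the max_cycles and sleep parameters are mine, added so the loop can be exercised without running forever):

```python
import time
from enum import Enum


class Mode(Enum):
    INITIAL = "initial"
    INITIAL_CONTINUOUS = "initial-continuous"
    CONTINUOUS = "continuous"


def monitor(mode, check, notify=print, interval=60.0,
            max_cycles=None, sleep=time.sleep):
    """Poll `check()` according to `mode`; return 1 if anything was reported.

    With max_cycles=None the continuous modes loop until the process is killed.
    """
    known = set()
    found_any = False
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        problems = set(check())
        if cycle == 0 and mode is Mode.CONTINUOUS:
            # Baseline: problems that already exist at startup are ignored.
            known |= problems
        new = problems - known
        for problem in sorted(new):
            notify(problem)
        known |= new
        found_any = found_any or bool(new)
        cycle += 1
        if mode is Mode.INITIAL:
            break  # check once and exit
        sleep(interval)
    return 1 if found_any else 0
```

Seeding the known-problem set on the first pass is one simple way to get the "ignoring initial conditions" behavior; the same set is what keeps the continuous modes from reporting the same problem on every cycle.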

The issues to monitor vary by product.  Common things to monitor are:

1) Cores
2) Required services
3) Daemon processes
4) Log entries
5) Bounces (reboots)
6) Server protocols (ICMP, NFS, etc.)
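
A few of the checks in the list above reduce to very small pieces of logic once the raw data has been collected from a host.  The Python sketch below shows three of them as pure functions; how the uptime, process list, and log lines are gathered is product-specific and left out, and the function names and default log patterns are illustrative assumptions only.

```python
import re


def bounced(previous_uptime, current_uptime):
    """A host whose uptime went backwards since the last poll has rebooted."""
    return current_uptime < previous_uptime


def missing_processes(required, running):
    """Return the required daemon names that are not in the running list."""
    return sorted(set(required) - set(running))


def suspicious_log_lines(lines, patterns=(r"\bpanic\b",
                                          r"\bsegfault\b",
                                          r"\bout of memory\b")):
    """Return the log lines that match any of the (case-insensitive) patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [line for line in lines if any(rx.search(line) for rx in compiled)]
```

Keeping the decision logic separate from the data collection makes each check easy to test without a live testbed.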

The only effective way to deal with the increasing complexity of distributed systems is increasing automation.  The Testbed Monitor is one of the tools you need to tame your distributed beast.
