How we do IT: System Monitoring

System Monitoring here means the automatic checking of technical systems and alerting when things go wrong.

Many services and devices at WBS are monitored 24/7 by a product called Nagios (http://nagios.org/). This is a free, open-source product that is extremely flexible and powerful, with new, free extensions and add-ons being created and shared all the time by the worldwide community. When a problem is detected with a WBS service, the issue is highlighted on a webpage viewable by WBS Solutions staff and, in the case of a serious issue, automatic text messages are sent to members of the WBS Systems and Support teams.

Some of the things we check at WBS include:

● That websites such as my.wbs, www.wbs and Webmail are all available

● That servers are not running out of disk space, or running short of memory

● That it is possible to connect to our staff email system and download (POP) messages
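
To give a flavour of what one of these checks looks like, here is a minimal sketch of a disk space check written in Python. It is not the check we actually run at WBS. It only follows the standard Nagios plugin convention, where a check prints one line of status text and reports its result as a number: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The function names and the 80% and 90% thresholds are assumptions made up for this illustration.

```python
import shutil

# Standard Nagios plugin result codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_disk_usage(used_percent, warn=80.0, crit=90.0):
    """Turn a used-space percentage into a (status code, message) pair.

    The warn/crit thresholds are illustrative defaults, not WBS settings.
    """
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Measure how full the filesystem holding `path` is and classify it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # If we cannot even measure the disk, say so rather than guess.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    used_percent = 100.0 * usage.used / usage.total
    return classify_disk_usage(used_percent, warn, crit)


def main():
    """Print the status line and return the code a plugin would exit with."""
    status, message = check_disk("/")
    print(message)
    return status
```

A real plugin would finish by calling `sys.exit(main())` so that Nagios receives the result code. Keeping the threshold logic in its own small function means it can be tested without needing a nearly full disk to hand.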

Automatic monitoring is an essential element in the provision of reliable systems. Done well, it can identify technical problems before they affect a service, or alert us quickly when they do. Often an alert (for example, a server running short of disk space) allows us to act before end users notice anything. It is even possible to configure Nagios to resolve some problems automatically, such as restarting a server that has hit a problem from which it cannot recover on its own.
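
Automatic fixes of this kind are done with what Nagios calls event handlers: commands that Nagios runs when the state of a check changes, passing in the current state (such as OK or CRITICAL), whether that state is still being retried (SOFT) or has been confirmed (HARD), and the number of the current attempt. The sketch below shows one possible decision rule in Python. The function name and the choice to act on the third soft attempt are assumptions for illustration, not our actual configuration, and the restart command itself is deliberately left out.

```python
def should_restart(state, state_type, attempt, soft_attempt_trigger=3):
    """Decide whether an event handler should attempt a restart.

    state       -- check state as text, e.g. "OK", "WARNING", "CRITICAL"
    state_type  -- "SOFT" while Nagios is still retrying, "HARD" once confirmed
    attempt     -- number of the current check attempt
    soft_attempt_trigger -- hypothetical setting: which soft retry triggers
                            an early restart, before anyone is alerted
    """
    if state != "CRITICAL":
        # Nothing is badly wrong, so leave the service alone.
        return False
    if state_type == "HARD":
        # The problem is confirmed; restarting is a last resort worth trying.
        return True
    # Still retrying: only step in on the chosen attempt, not on every blip.
    return state_type == "SOFT" and attempt == soft_attempt_trigger
```

For example, `should_restart("CRITICAL", "SOFT", 1)` is false because a single failed check may just be a blip, while `should_restart("CRITICAL", "SOFT", 3)` and `should_restart("CRITICAL", "HARD", 4)` are both true.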

All information about the checks is stored and can be queried later. This is useful for trend analysis or for identifying students whose excuses for late assignment submissions might not be the whole truth (the modern equivalent of ‘the hamster ate my homework’, I suppose). If nothing else, we can get some lovely graphs and tables:

[Figure: Nagios report for the POP check]

So monitoring is perfect then? Well, not quite.

Monitoring is a balance. The checks themselves take some resources and so, if they were overdone, could cause more problems than they solve. Adding any extra service to a technical system risks introducing an incompatibility that could cause an issue where there was none before. The KISS (Keep It Simple, Stupid) principle is one held dear by most in IT.

Additionally, some things just can’t be monitored at this time. An example would be submitting an assignment through my.wbs. Although it would be possible to program Nagios to do this (such complex checks are called Synthetic Transactions), it would require members of the WBS Programme Administration teams to be continually setting up modules for these fake submissions, something clearly impossible (although if there are any volunteers for such a job, please let me know).

So, if you do have a technical problem, please let us know via http://help.wbs.ac.uk or email help@wbs.ac.uk. Hopefully we will already have been alerted to it, but if not we’ll be very grateful to hear from you, and we will see whether Nagios’ flexibility and extensibility can be used to pick the problem up automatically next time.

So don’t have IT nightmares... Nagios is always watching over us and helping to keep our systems available and running smoothly.