Tutorial #2: check_load
In this tutorial, we will discuss important basic features that are present in nearly every check. These include command line processing, metric evaluation with scalar contexts, status line formatting and logging.
The check_load plugin resembles the one found in the standard Nagios plugins collection. It checks the system load average against thresholds.
Data acquisition
First, we will subclass Resource to generate metrics for the 1, 5, and 15 minute load averages.
import argparse
import logging
import subprocess

import nagiosplugin

# Assumed here: the library captures messages sent to the 'nagiosplugin' logger.
_log = logging.getLogger('nagiosplugin')


class Load(nagiosplugin.Resource):
    """Domain model: system load.

    Determines the system load parameters and (optionally) cpu count.
    The `probe` method returns the three standard load average numbers.
    If `percpu` is true, the load average will be normalized.

    This check requires Linux-style /proc files to be present.
    """

    def __init__(self, percpu=False):
        self.percpu = percpu

    def cpus(self):
        _log.info('counting cpus with "nproc"')
        cpus = int(subprocess.check_output(['nproc']))
        _log.debug('found %i cpus in total', cpus)
        return cpus

    def probe(self):
        _log.info('reading load from /proc/loadavg')
        with open('/proc/loadavg') as loadavg:
            # The first three fields are the 1, 5 and 15 minute averages.
            load = loadavg.readline().split()[0:3]
        _log.debug('raw load is %s', load)
        cpus = self.cpus() if self.percpu else 1
        load = [float(l) / cpus for l in load]
        for i, period in enumerate([1, 5, 15]):
            yield nagiosplugin.Metric('load%d' % period, load[i], min=0,
                                      context='load')
check_load has two modes of operation: the load averages may either be taken as read from the kernel or normalized by the number of CPUs. Accordingly, the Load() constructor has a parameter to switch normalization on.
In the Load.probe() method, the check reads the load averages from the /proc filesystem and extracts the first three values. For each value, a Metric object is yielded. Each metric has a generated name (“load1”, “load5”, “load15”) and a value. We don’t declare a unit of measure since load averages are unitless. All metrics share the same context “load”, which means that the thresholds for all three values will be the same.
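The parsing and normalization step can be tried without a live /proc filesystem. The following is a minimal sketch using only the standard library; the helper names parse_loadavg and name_metrics and the sample line are hypothetical and not part of nagiosplugin.

```python
def parse_loadavg(line, cpus=1):
    """Return the 1, 5 and 15 minute load averages from a /proc/loadavg line.

    `cpus` is the divisor for per-CPU normalization (1 means no normalization).
    """
    if cpus < 1:
        raise ValueError('cpus must be at least 1')
    fields = line.split()[0:3]          # remaining fields are not load averages
    return [float(field) / cpus for field in fields]


def name_metrics(values):
    """Pair each value with the metric name used in this tutorial."""
    return {'load%d' % period: value
            for period, value in zip([1, 5, 15], values)}


sample = '0.44 0.36 0.28 1/234 5678'    # made-up sample line
print(name_metrics(parse_loadavg(sample)))
print(name_metrics(parse_loadavg(sample, cpus=4)))
```

Keeping the parsing separate from the file access makes the normalization easy to check with fixed input.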
Note
Determining the number of CPUs is a little bit messy (the example calls the external nproc command) and deserves an extra method. Resource classes may encapsulate arbitrarily complex measurement logic as long as they define a Resource.probe() method that returns a list of metrics or, as here, yields them. In the code example shown above, we sprinkle in some logging statements that take effect when the check is called with an increased logging level (discussed below).
Evaluation
The check_load plugin should accept warning and critical ranges and determine if any load value is outside these ranges. Since this kind of logic is standard for most Nagios/Icinga plugins, nagiosplugin provides a generalized context class for it: the ScalarContext class, which accepts a warning and a critical range as well as a template to present metric values in a human-readable way.
When ScalarContext is sufficient, it may be configured during instantiation right in the main function. A first version of the main function looks like this:
def main():
    argp = argparse.ArgumentParser(description=__doc__)
    argp.add_argument('-w', '--warning', metavar='RANGE', default='',
                      help='return warning if load is outside RANGE')
    argp.add_argument('-c', '--critical', metavar='RANGE', default='',
                      help='return critical if load is outside RANGE')
    argp.add_argument('-r', '--percpu', action='store_true', default=False)
    args = argp.parse_args()
    check = nagiosplugin.Check(
        Load(args.percpu),
        nagiosplugin.ScalarContext('load', args.warning, args.critical))
    check.main()
Note that the context name “load” is referenced by all three metrics returned by the Load.probe method.
This version of check_load is already functional:
$ ./check_load.py
LOAD OK - load1 is 0.11
| load15=0.21;;;0 load1=0.11;;;0 load5=0.18;;;0

$ ./check_load.py -c 0.1:0.2
LOAD CRITICAL - load15 is 0.22 (outside 0.1:0.2)
| load15=0.22;;0.1:0.2;0 load1=0.11;;0.1:0.2;0 load5=0.2;;0.1:0.2;0
# exit status 2

$ ./check_load.py -c 0.1:0.2 -r
LOAD OK - load1 is 0.105
| load15=0.11;;0.1:0.2;0 load1=0.105;;0.1:0.2;0 load5=0.1;;0.1:0.2;0
In the first invocation (lines 1–3), check_load reports only the first load value, which looks a bit arbitrary. In the second invocation (lines 5–8), we set a critical threshold. The range specification is parsed automatically according to the Nagios plugin API, and the first metric that lies outside the range is reported. In the third invocation (lines 10–12), we request normalization, and this time all values fit in the range.
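To make the range syntax more concrete, here is a simplified sketch of how a specification such as 0.1:0.2 can be interpreted. It follows the threshold format of the Nagios plugin guidelines ("10" means 0 to 10, "10:" means 10 and above, "~:10" means up to 10, and a leading "@" inverts the match). It is an illustration only, not the nagiosplugin implementation; the function names are hypothetical, and treating an empty specification as "no threshold" is an assumption of this sketch.

```python
import math


def parse_range(spec):
    """Parse a Nagios-style range into (start, end, invert)."""
    spec = spec.strip()
    invert = spec.startswith('@')
    if invert:
        spec = spec[1:]
    if ':' in spec:
        start_text, end_text = spec.split(':', 1)
    else:
        start_text, end_text = '', spec     # "10" is shorthand for "0:10"
    if start_text == '~':
        start = -math.inf                   # "~" stands for negative infinity
    elif start_text == '':
        start = 0.0
    else:
        start = float(start_text)
    end = float(end_text) if end_text else math.inf
    if start > end:
        raise ValueError('start %s must not exceed end %s' % (start, end))
    return start, end, invert


def is_outside(value, spec):
    """Return True if `value` should raise an alert for range `spec`."""
    if not spec:
        return False    # assumption of this sketch: empty means no threshold
    start, end, invert = parse_range(spec)
    inside = start <= value <= end
    return inside if invert else not inside


# Values taken from the second invocation shown above.
for name, value in [('load15', 0.22), ('load1', 0.11), ('load5', 0.2)]:
    print(name, value, 'outside' if is_outside(value, '0.1:0.2') else 'ok')
```

With these inputs only load15 falls outside the range, which matches the metric named in the CRITICAL status line.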
Result presentation
Although we now have a running check, the output is not as informative as it could be. The first line of output (status line) is very important since the information presented therein should give the admin a clue what is going on. We want the first line to display:
- a load overview when there is nothing wrong
- which load value violates a threshold, if applicable
- which threshold is being violated, if applicable.
The last two points are already covered by the Result default implementation, but we need to tweak the summary to display a load overview as stated in the first point:
class LoadSummary(nagiosplugin.Summary):
    """Status line conveying load information.

    We specialize the `ok` method to present all three figures in one
    handy tagline. In case of problems, the single-load texts from the
    contexts work well.
    """

    def __init__(self, percpu):
        self.percpu = percpu

    def ok(self, results):
        qualifier = 'per cpu ' if self.percpu else ''
        return 'loadavg %sis %s' % (qualifier, ', '.join(
            str(results[r].metric) for r in ['load1', 'load5', 'load15']))
The Summary class has three methods which can be specialized: ok() to return a status line when there are no problems, problem() to return a status line when the overall check status indicates problems, and verbose() to generate additional output. All three methods get a set of Result objects passed in. In our code, the ok method uses the original metrics referenced by the result objects to build an overview like “loadavg is 0.19, 0.16, 0.14”.
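The string formatting used for the overview can be tried in isolation. This sketch replaces the Result objects with a plain dictionary of floats; the ok_line function is a hypothetical stand-in and not part of nagiosplugin.

```python
def ok_line(values, percpu=False):
    """Build the overview text from a mapping of metric name to value."""
    qualifier = 'per cpu ' if percpu else ''
    # Fix the order explicitly so the output does not depend on dict order.
    figures = ', '.join(
        str(values[name]) for name in ['load1', 'load5', 'load15'])
    return 'loadavg %sis %s' % (qualifier, figures)


values = {'load15': 0.14, 'load1': 0.19, 'load5': 0.16}
print(ok_line(values))                 # loadavg is 0.19, 0.16, 0.14
print(ok_line(values, percpu=True))    # loadavg per cpu is 0.19, 0.16, 0.14
```

Listing the metric names explicitly keeps the figures in 1, 5, 15 minute order regardless of the order in which the results arrive.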
Check setup
The last step in this tutorial is to put the pieces together:
@nagiosplugin.guarded
def main():
    argp = argparse.ArgumentParser(description=__doc__)
    argp.add_argument('-w', '--warning', metavar='RANGE', default='',
                      help='return warning if load is outside RANGE')
    argp.add_argument('-c', '--critical', metavar='RANGE', default='',
                      help='return critical if load is outside RANGE')
    argp.add_argument('-r', '--percpu', action='store_true', default=False)
    argp.add_argument('-v', '--verbose', action='count', default=0,
                      help='increase output verbosity (use up to 3 times)')
    args = argp.parse_args()
    check = nagiosplugin.Check(
        Load(args.percpu),
        nagiosplugin.ScalarContext('load', args.warning, args.critical),
        LoadSummary(args.percpu))
    check.main(verbose=args.verbose)


if __name__ == '__main__':
    main()
In the main() function we parse the command line parameters using the standard argparse.ArgumentParser class. Watch the Check object creation: its constructor can be fed with a variable number of Resource, Context, and Summary objects. In this tutorial, instances of our specialized Load and LoadSummary classes go in.
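To see what the parser hands to the check, the argument definitions can be exercised with an explicit argument list instead of the real command line. The sketch below repeats the options from main(); the build_parser helper and the description string are hypothetical additions for the demonstration.

```python
import argparse


def build_parser():
    """Return a parser with the same options as the tutorial's main()."""
    argp = argparse.ArgumentParser(description='check_load option demo')
    argp.add_argument('-w', '--warning', metavar='RANGE', default='',
                      help='return warning if load is outside RANGE')
    argp.add_argument('-c', '--critical', metavar='RANGE', default='',
                      help='return critical if load is outside RANGE')
    argp.add_argument('-r', '--percpu', action='store_true', default=False)
    argp.add_argument('-v', '--verbose', action='count', default=0,
                      help='increase output verbosity (use up to 3 times)')
    return argp


# Passing a list avoids touching sys.argv, which is handy for experiments.
args = build_parser().parse_args(['-w', '0.5', '-c', '0.1:0.2', '-r', '-vv'])
print(args.warning, args.critical, args.percpu, args.verbose)
# prints: 0.5 0.1:0.2 True 2
```

Note that action='count' turns repeated -v flags into an integer, which is the value passed on as verbose=args.verbose.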
We did not specialize a Context class to evaluate the load metrics. Instead, we use the supplied ScalarContext which compares a scalar value against two ranges according to the range syntax defined by the Nagios plugin API. The default ScalarContext implementation covers the majority of evaluation needs. Checks using non-scalar metrics or requiring special logic should subclass Context to fit their needs.
The check’s main() method runs the check, prints the check’s output including summary, log messages and performance data to stdout and exits the plugin with the appropriate exit code.
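The exit codes follow the usual Nagios plugin convention: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The following sketch shows how an overall state could be derived from several per-metric states, assuming that the most severe state wins. The names EXIT_CODES and overall_state are hypothetical and do not reflect the library's internals.

```python
# Conventional Nagios plugin exit codes.
EXIT_CODES = {'ok': 0, 'warning': 1, 'critical': 2, 'unknown': 3}


def overall_state(states):
    """Return the most severe state name; an empty list counts as 'ok'."""
    return max(states, key=EXIT_CODES.__getitem__, default='ok')


# load15 was critical in the second sample invocation, the others were fine.
state = overall_state(['critical', 'ok', 'ok'])
print(state, EXIT_CODES[state])    # critical 2
```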
Note the guarded() decorator in front of the main function. It helps the code outside Check to behave: in case of uncaught exceptions, it ensures that the exit code is 3 (unknown) and that the exception string is properly formatted. Additionally, logging is set up at an early stage so that even messages logged from constructors are captured and printed at the right place in the output (between status line and performance data).
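The idea behind such a guard can be illustrated with a few lines of standard-library code. This is a simplified stand-in, not the nagiosplugin implementation: the name guarded_sketch and the output format are hypothetical, and the early logging setup described above is left out.

```python
import functools
import sys


def guarded_sketch(func):
    """Turn any uncaught exception into exit code 3 (unknown)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Hypothetical format; the real decorator does its own formatting.
            print('UNKNOWN: %s: %s' % (type(exc).__name__, exc))
            sys.exit(3)
    return wrapper


@guarded_sketch
def broken_main():
    raise RuntimeError('cannot read /proc/loadavg')


# Catch SystemExit here only to show the exit code without ending the demo.
try:
    broken_main()
except SystemExit as exit_request:
    print('exit code would be', exit_request.code)
```

Without a guard, an uncaught exception would end the interpreter with exit code 1, which monitoring systems interpret as WARNING rather than UNKNOWN.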