Running a command on a Nagios alert (event handler)
Running a command on a remote server
Nagios can run a command when a monitored service becomes CRITICAL; the handler below acts during the soft retries and once more when the state turns hard.
In this example we restart CUPS when it is unavailable.
Nagios server configuration on the monitoring machine
- Enable the event handler in the service to monitor, and set the Nagios command to run:
sudoedit /etc/nagios3/conf.d/file.cfg
event_handler_enabled 1
event_handler restart-service!NTP
- Define the Nagios event handler command:
sudoedit /etc/nagios3/conf.d/eventhandlers.cfg
define command{
command_name restart-service
command_line /etc/nagios3/conf.d/eventhandlers/restart-service.sh "$SERVICESTATE$" "$SERVICESTATETYPE$" "$SERVICEATTEMPT$" "$HOSTADDRESS$" "$ARG1$" "$SERVICEDESC$"
}
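The order of the macros in command_line fixes the positional parameters the handler script receives, and $ARG1$ is whatever follows the `!` in the service's event_handler line. A minimal sketch of that mapping, using the values from the sample log line quoted in the script; the function name show_handler_args is hypothetical and not part of the setup:

```shell
#!/bin/sh
# Hypothetical helper: print each positional parameter next to the Nagios
# macro that fills it, following the order used in command_line.
show_handler_args() {
    printf 'SERVICESTATE=%s\n'     "$1"
    printf 'SERVICESTATETYPE=%s\n' "$2"
    printf 'SERVICEATTEMPT=%s\n'   "$3"
    printf 'HOSTADDRESS=%s\n'      "$4"
    printf 'ARG1=%s\n'             "$5"
    printf 'SERVICEDESC=%s\n'      "$6"
}

# Example values taken from the sample log line in the handler script.
show_handler_args CRITICAL HARD 3 192.168.0.9 cups CUPS
```

The fifth parameter (ARG1) is the service name that is later passed to NRPE, while the sixth (SERVICEDESC) is only used in log messages.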
- Create the actual script:
sudo mkdir -p /etc/nagios3/conf.d/eventhandlers
sudoedit /etc/nagios3/conf.d/eventhandlers/restart-service.sh
#!/bin/sh
#
# Event handler script that restarts a service on a remote machine via NRPE
# Taken from the Nagios documentation and
# http://www.techadre.com/sites/techadre.com/files/event_handler_script_0.txt
# Adapted by L.C. Karssen
# Time-stamp: <2010-09-14 15:24:33 (root)>
#
# Note: This script restarts the service on check attempts 1 and 2
# (while in a "soft" state), and once more if the service still
# falls into a "hard" error state on attempt 3.
#
date=$(date)
# Nagios command definition that calls this script:
# define command{
# command_name restart-service
# command_line /etc/nagios3/conf.d/eventhandlers/restart-service.sh
# "$SERVICESTATE$" "$SERVICESTATETYPE$" "$SERVICEATTEMPT$" "$HOSTADDRESS$"
# "$ARG1$" "$SERVICEDESC$"
# }
# To check restart
# sudo -u nagios /etc/nagios3/conf.d/eventhandlers/restart-service.sh CRITICAL SOFT 1 hostname NTP TIME
#
# max_check_attempts 3
#
echo "$date - Eventhandler run #1=${1} #2=${2} #3=${3} #4=${4} #5=${5} #6=${6}" >> /var/log/nagios3/eventhandlers.log
#Fri May 13 17:29:15 CEST 2011 - Eventhandler run #1=CRITICAL #2=HARD #3=3 #4=192.168.0.9 #5=cups #6=CUPS
# What state is the monitored service in?
case "$1" in
OK)
    # The service just came back up, so don't do anything...
    ;;
WARNING)
    # We don't really care about warning states, since the service is probably still running...
    ;;
UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do anything...
    ;;
CRITICAL)
    # The service appears to have a problem, so try restarting it.
    # Is this a "soft" or a "hard" state?
    case "$2" in
    # We're in a soft state, meaning that Nagios is in the middle of retrying the
    # check before it turns into a "hard" state and contacts get notified.
    SOFT)
        # With max_check_attempts 3 the soft attempts are 1 and 2; restart on
        # either. If the restart works, the next check results in a "soft"
        # recovery and no one gets notified. If the check still fails on the
        # 3rd attempt, the state type turns "hard" and contacts are notified.
        case "$3" in
        1|2)
            echo "Restarting service $6"
            # Call NRPE to restart the service on the remote machine
            /usr/lib/nagios/plugins/check_nrpe -H "$4" -c restart-service -a "$5"
            echo "$date - restart $6 (restart-service ${5}) - SOFT" >> /var/log/nagios3/eventhandlers.log
            ;;
        esac
        ;;
    # The service turned into a hard error without getting fixed by the
    # restarts above. Give it one last try.
    # Note: Contacts have already been notified of a problem with the service at this
    # point (unless you disabled notifications for this service)
    HARD)
        case "$3" in
        3)
            echo "Restarting $6 service..."
            # Call NRPE to restart the service on the remote machine
            /usr/lib/nagios/plugins/check_nrpe -H "$4" -c restart-service -a "$5"
            echo "$date - restart $6 (restart-service $5) - HARD" >> /var/log/nagios3/eventhandlers.log
            ;;
        esac
        ;;
    esac
    ;;
esac
sudo chmod +x /etc/nagios3/conf.d/eventhandlers/restart-service.sh
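The nested case statements in the script boil down to a small decision table. The sketch below restates it as a single function so the combinations can be checked without calling NRPE; the name handler_action is hypothetical, and the attempt numbers assume max_check_attempts 3 as noted in the script:

```shell
#!/bin/sh
# Hypothetical helper: print "restart" when the handler would call NRPE for
# the given state, state type and attempt, and "skip" otherwise.
# Assumes max_check_attempts is 3.
handler_action() {
    state=$1
    state_type=$2
    attempt=$3
    case "$state/$state_type/$attempt" in
        CRITICAL/SOFT/1|CRITICAL/SOFT/2) echo restart ;;
        CRITICAL/HARD/3)                 echo restart ;;
        *)                               echo skip ;;
    esac
}

for args in "CRITICAL SOFT 1" "CRITICAL HARD 3" "WARNING HARD 3" "OK HARD 3"; do
    # Word splitting of $args is intentional here.
    printf '%s -> %s\n' "$args" "$(handler_action $args)"
done
```

Only CRITICAL triggers a restart; OK, WARNING and UNKNOWN are logged by the script but otherwise ignored.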
- Create the log file:
sudo -u nagios touch /var/log/nagios3/eventhandlers.log
- Test the script:
sudo -u nagios /etc/nagios3/conf.d/eventhandlers/restart-service.sh CRITICAL SOFT 1 hostname NTP TIME
- Check the configuration and restart Nagios:
sudo -u nagios nagios3 -v /etc/nagios3/nagios.cfg && sudo invoke-rc.d nagios3 restart
NRPE agent configuration on the monitored machine
- Define the command to run (NRPE only substitutes $ARG1$ when command arguments are enabled, i.e. dont_blame_nrpe=1 in nrpe.cfg):
sudoedit /etc/nagios/nrpe_local.cfg
command[restart-service]=/usr/bin/sudo /usr/sbin/invoke-rc.d '$ARG1$' restart
- Allow the nagios user to run it without a password:
sudo visudo
Cmnd_Alias NAGIOS_EH = /usr/sbin/invoke-rc.d
nagios ALL=NOPASSWD: NAGIOS_EH
- Enable NRPE debugging, if you need to troubleshoot:
sudoedit /etc/nagios/nrpe.cfg
debug=1
- Restart NRPE:
sudo invoke-rc.d nagios-nrpe-server restart
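The restart-service command defined in this section hands whatever argument arrives over NRPE to invoke-rc.d under sudo. One way to narrow that down would be to check the name against a fixed list first. This is a sketch of such a check, not part of the original setup: the function name is_allowed_service and the list contents are assumptions, and wiring it into a wrapper script called from nrpe_local.cfg is left out.

```shell
#!/bin/sh
# Hypothetical check, not part of the original setup: accept only service
# names from a fixed list before handing them to invoke-rc.d.
ALLOWED_SERVICES="cups ntp"

is_allowed_service() {
    for name in $ALLOWED_SERVICES; do
        [ "$name" = "$1" ] && return 0
    done
    return 1
}

# Example: decide what a wrapper would do for two candidate arguments.
for candidate in cups "cups; reboot"; do
    if is_allowed_service "$candidate"; then
        echo "would restart: $candidate"
    else
        echo "rejected: $candidate"
    fi
done
```

The comparison is an exact string match, so an argument is accepted only if it is literally one of the listed names.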
Test
- Stop the service, or insert a dummy check command
- Check in the log that Nagios runs the event handler:
sudo tail -f /var/log/nagios3/eventhandlers.log
Fri May 13 17:37:08 CEST 2011 - Eventhandler run #1=CRITICAL #2=SOFT #3=1 #4=192.168.0.9 #5=cups #6=CUPS
Fri May 13 17:38:08 CEST 2011 - Eventhandler run #1=CRITICAL #2=SOFT #3=2 #4=192.168.0.9 #5=cups #6=CUPS
Fri May 13 17:38:08 CEST 2011 - restart CUPS (restart-service cups) - SOFT
Fri May 13 17:39:08 CEST 2011 - Eventhandler run #1=CRITICAL #2=HARD #3=3 #4=192.168.0.9 #5=cups #6=CUPS
Fri May 13 17:39:08 CEST 2011 - restart CUPS (restart-service cups) - HARD
Fri May 13 17:49:15 CEST 2011 - Eventhandler run #1=OK #2=HARD #3=3 #4=192.168.0.9 #5=cups #6=CUPS
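To see at a glance how often the handler actually restarted something, the restart lines can be counted separately from the "Eventhandler run" lines. A small sketch, assuming the log format shown above; the function name count_restarts is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count the restart lines in handler log text read
# from standard input. The ' - restart ' pattern matches the lines the
# handler writes after calling NRPE.
count_restarts() {
    grep -c ' - restart '
}

# Sample lines copied from the log excerpt above.
count_restarts <<'EOF'
Fri May 13 17:38:08 CEST 2011 - Eventhandler run #1=CRITICAL #2=SOFT #3=2 #4=192.168.0.9 #5=cups #6=CUPS
Fri May 13 17:38:08 CEST 2011 - restart CUPS (restart-service cups) - SOFT
Fri May 13 17:39:08 CEST 2011 - Eventhandler run #1=CRITICAL #2=HARD #3=3 #4=192.168.0.9 #5=cups #6=CUPS
Fri May 13 17:39:08 CEST 2011 - restart CUPS (restart-service cups) - HARD
EOF
```

Against the real file this would be called as count_restarts < /var/log/nagios3/eventhandlers.log.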
- Check that NRPE runs the command:
sudo tail -f /var/log/syslog | grep nrpe
May 13 17:39:08 galserver nrpe[4307]: Connection from 192.168.0.8 port 5803
May 13 17:39:08 galserver nrpe[4307]: Handling the connection...
May 13 17:39:08 galserver nrpe[4307]: Host is asking for command 'restart-service' to be run...
May 13 17:39:08 galserver nrpe[4307]: Running command: /usr/bin/sudo /usr/sbin/invoke-rc.d 'cups' restart
May 13 17:39:13 galserver nrpe[4307]: Command completed with return code 0 and output: Restarting Common Unix Printing System: cupsd.
May 13 17:39:13 galserver nrpe[4307]: Return Code: 0, Output: Restarting Common Unix Printing System: cupsd.
May 13 17:39:13 galserver nrpe[4307]: Connection from 192.168.0.8 closed.
- Disable NRPE debugging
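When the syslog is busy, it can help to print only the commands NRPE reports running. A sketch that assumes the "Running command:" line format from the excerpt above (captured here with debug=1); the function name nrpe_commands is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print only the command lines NRPE reports running,
# reading syslog text from standard input.
nrpe_commands() {
    sed -n 's/.*nrpe\[[0-9]*]: Running command: //p'
}

# Sample lines copied from the syslog excerpt above.
nrpe_commands <<'EOF'
May 13 17:39:08 galserver nrpe[4307]: Host is asking for command 'restart-service' to be run...
May 13 17:39:08 galserver nrpe[4307]: Running command: /usr/bin/sudo /usr/sbin/invoke-rc.d 'cups' restart
EOF
```

Against the live log this would be used as sudo cat /var/log/syslog | nrpe_commands.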