Running a command on a Nagios alarm (EventHandler)

Running a command on a remote server

When a service is monitored, Nagios can run a command as soon as the service first enters the CRITICAL state.

In this example the NTP service is restarted when the check detects that the system clock has drifted.

Nagios server configuration on the monitoring machine

Enable the event handler on the service to be monitored and set the Nagios command to run:

sudoedit /etc/nagios3/conf.d/file.cfg

#...
event_handler_enabled           1
event_handler                   restart-service!NTP
#...

Define the Nagios event handler command:

sudoedit /etc/nagios3/conf.d/eventhandlers.cfg

define command{
    command_name    restart-service
    command_line    /etc/nagios3/conf.d/eventhandlers/restart-service.sh "$SERVICESTATE$" "$SERVICESTATETYPE$" "$SERVICEATTEMPT$" "$HOSTADDRESS$" "$ARG1$" "$SERVICEDESC$"
}
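The six macros in command_line reach the script as positional parameters $1 to $6. The following is only a sketch to make that mapping visible; describe_args is a hypothetical helper name and is not part of the event handler itself.

```shell
#!/bin/sh
# Sketch: show how the command_line macros map to positional parameters.
# describe_args is a hypothetical helper, not part of the original script.
describe_args() {
    # $1=$SERVICESTATE$  $2=$SERVICESTATETYPE$  $3=$SERVICEATTEMPT$
    # $4=$HOSTADDRESS$   $5=$ARG1$              $6=$SERVICEDESC$
    printf 'state=%s type=%s attempt=%s host=%s service_arg=%s description=%s\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Same arguments as the manual test of the script on this page.
describe_args CRITICAL SOFT 1 hostname NTP TIME
```

Here $5 is the value given after the "!" in the event_handler line of the service (NTP in this example).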

Create the actual script:

sudo mkdir -p /etc/nagios3/conf.d/eventhandlers
sudoedit /etc/nagios3/conf.d/eventhandlers/restart-service.sh

#!/bin/sh
#
# Event handler script for restarting a service on a remote machine via NRPE
# Taken from the Nagios documentation and
# http://www.techadre.com/sites/techadre.com/files/event_handler_script_0.txt
# Adapted by L.C. Karssen
# Time-stamp: <2010-09-14 15:24:33 (root)>
#
# Note: With max_check_attempts set to 3, this script asks NRPE to restart the
#       service on check attempts 1 and 2 (in a "soft" state) and once more if
#       the service falls into a "hard" error state on attempt 3.
#

date=$(date)

# Command definition in /etc/nagios3/conf.d/eventhandlers.cfg
# (command_line is wrapped here; it is a single line in the actual file):
# define command{
#    command_name    restart-service
#    command_line    /etc/nagios3/conf.d/eventhandlers/restart-service.sh
#                       "$SERVICESTATE$" "$SERVICESTATETYPE$" "$SERVICEATTEMPT$" "$HOSTADDRESS$"
#                       "$ARG1$" "$SERVICEDESC$"
#    }
# To test the restart:
# sudo -u nagios /etc/nagios3/conf.d/eventhandlers/restart-service.sh CRITICAL SOFT 1 hostname NTP TIME
#
# max_check_attempts 3
#

echo "$date - Eventhandler run #1=${1} #2=${2} #3=${3} #4=${4} #5=${5} #6=${6}"  >> /var/log/nagios3/eventhandlers.log
#Fri May 13 17:29:15 CEST 2011 - Eventhandler run #1=CRITICAL #2=HARD #3=3 #4=192.168.0.9 #5=cups #6=CUPS

# What state is the monitored service in?
case "$1" in
OK)
        # The service just came back up, so don't do anything...
        ;;
WARNING)
        # We don't really care about warning states, since the service is probably still running...
        ;;
UNKNOWN)
        # We don't know what might be causing an unknown error, so don't do anything...
        ;;
CRITICAL)
        # Aha!  The service appears to have a problem - perhaps we should restart it...

        # Is this a "soft" or a "hard" state?
        case "$2" in

        # We're in a soft state, meaning that Nagios is in the middle of retrying the
        # check before it turns into a "hard" state and contacts get notified
        SOFT)
                # What check attempt are we on?
                case "$3" in
                # With max_check_attempts 3, attempts 1 and 2 are soft, so the restart
                # is requested as soon as the service first goes CRITICAL and again on
                # the retry. If the check still fails on the 3rd attempt, the state type
                # turns to "hard" and contacts are notified of the problem.
                # Hopefully the restart succeeds, so the next check results in a "soft"
                # recovery.  If that happens no one gets notified because we fixed the
                # problem!
                1|2)
                        printf 'Restarting service %s\n' "$6"
                        # Call NRPE to restart the service on the remote machine
                        /usr/lib/nagios/plugins/check_nrpe -H "$4" -c restart-service -a "$5"
                        echo "$date - restart $6 (restart-service ${5}) - SOFT"  >> /var/log/nagios3/eventhandlers.log
                        ;;
                esac
                ;;

        # The service somehow managed to turn into a hard error without getting fixed.
        # It should have been restarted by the code above, but for some reason it wasn't.
        # Let's give it one last try, shall we?
        # Note: Contacts have already been notified of a problem with the service at this
        # point (unless you disabled notifications for this service)
        HARD)
                case "$3" in

                3)
                        printf 'Restarting %s service...\n' "$6"
                        # Call NRPE to restart the service on the remote machine
                        /usr/lib/nagios/plugins/check_nrpe -H "$4" -c restart-service -a "$5"
                        echo "$date - restart $6 (restart-service $5) - HARD"  >> /var/log/nagios3/eventhandlers.log
                        ;;
                esac
                ;;
        esac
        ;;
esac

Make the script executable:

sudo chmod +x /etc/nagios3/conf.d/eventhandlers/restart-service.sh
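The nested case statements in the script boil down to a small decision table. The sketch below isolates that decision so it can be tried without contacting NRPE; should_restart is a hypothetical helper name, and the attempt numbers assume max_check_attempts 3 as noted in the script comments.

```shell
#!/bin/sh
# Sketch of the restart decision only; no NRPE call is made.
# should_restart is a hypothetical helper name.
# Assumes max_check_attempts is 3, as noted in the script comments.
should_restart() {
    state=$1
    state_type=$2
    attempt=$3
    # Only CRITICAL states trigger any action.
    [ "$state" = "CRITICAL" ] || return 1
    case "$state_type:$attempt" in
        SOFT:1|SOFT:2) return 0 ;;   # soft retries: request a restart
        HARD:3)        return 0 ;;   # one last try once the state is hard
        *)             return 1 ;;
    esac
}

for args in "CRITICAL SOFT 1" "CRITICAL HARD 3" "WARNING SOFT 1" "OK HARD 3"; do
    # Word splitting of $args is intentional here.
    if should_restart $args; then
        echo "$args -> restart"
    else
        echo "$args -> no action"
    fi
done
```

Keeping the decision in one function like this is a design choice of the sketch, not of the original script; it mainly makes the soft/hard logic easier to check by hand.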

Create the log file:

sudo -u nagios touch /var/log/nagios3/eventhandlers.log

Test the script:

sudo -u nagios /etc/nagios3/conf.d/eventhandlers/restart-service.sh CRITICAL SOFT 1 hostname NTP TIME

Check the configuration and restart Nagios:

sudo -u nagios nagios3 -v /etc/nagios3/nagios.cfg  && sudo invoke-rc.d nagios3 restart

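NRPE agent configuration on the monitored machine

Linux machine

Define the command to run. NRPE only passes the service name as $ARG1$ when command arguments are enabled (dont_blame_nrpe=1 in nrpe.cfg, and an NRPE build that supports command arguments):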

sudoedit /etc/nagios/nrpe_local.cfg

command[restart-service]=/usr/bin/sudo /usr/sbin/invoke-rc.d '$ARG1$' restart

Allow the nagios user to run it without a password:

sudo visudo

Cmnd_Alias NAGIOS_EH = /usr/sbin/invoke-rc.d

nagios      ALL=NOPASSWD: NAGIOS_EH
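Since the NRPE command hands $ARG1$ to invoke-rc.d under sudo, it may be worth limiting which services can be restarted. The following is a minimal sketch of such a check, not part of the original setup: the function names, the RESTART_CMD variable and the whitelist entries (ntp, cups) are all assumptions chosen for illustration. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch of a wrapper that only restarts whitelisted services.
# All names here are hypothetical; the service list is an example.
is_allowed_service() {
    case "$1" in
        ntp|cups) return 0 ;;
        *)        return 1 ;;
    esac
}

restart_if_allowed() {
    if is_allowed_service "$1"; then
        # RESTART_CMD defaults to a dry run; set it to the real command
        # (e.g. "/usr/bin/sudo /usr/sbin/invoke-rc.d") once tested.
        ${RESTART_CMD:-echo would run: invoke-rc.d} "$1" restart
    else
        echo "refusing to restart '$1'" >&2
        return 2
    fi
}

restart_if_allowed ntp
```

If a wrapper along these lines were used, the NRPE command and the sudoers alias would point at the wrapper instead of at invoke-rc.d directly.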

Enable NRPE debug logging if you need to troubleshoot:

sudoedit /etc/nagios/nrpe.cfg

debug=1

Restart the NRPE agent:

sudo invoke-rc.d nagios-nrpe-server restart

Test

Stop the service, or configure a dummy check command, to trigger the alarm.

Check in the log that Nagios runs the event handler:

sudo tail -f /var/log/nagios3/eventhandlers.log

Fri May 13 17:37:08 CEST 2011 - Eventhandler run #1=CRITICAL #2=SOFT #3=1 #4=192.168.0.9 #5=cups #6=CUPS
Fri May 13 17:38:08 CEST 2011 - Eventhandler run #1=CRITICAL #2=SOFT #3=2 #4=192.168.0.9 #5=cups #6=CUPS
Fri May 13 17:38:08 CEST 2011 - restart CUPS (restart-service cups) - SOFT
Fri May 13 17:39:08 CEST 2011 - Eventhandler run #1=CRITICAL #2=HARD #3=3 #4=192.168.0.9 #5=cups #6=CUPS
Fri May 13 17:39:08 CEST 2011 - restart CUPS (restart-service cups) - HARD
Fri May 13 17:49:15 CEST 2011 - Eventhandler run #1=OK #2=HARD #3=3 #4=192.168.0.9 #5=cups #6=CUPS
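To see at a glance how many restarts the event handler requested, the restart lines can be counted. This is only a sketch: count_restarts is a hypothetical helper, and the sample input is copied from the log excerpt above.

```shell
#!/bin/sh
# Sketch: count restart requests recorded by the event handler.
# count_restarts is a hypothetical helper that reads log lines from stdin.
count_restarts() {
    # Lines written on a restart contain " - restart ";
    # plain "Eventhandler run" lines do not.
    grep -c ' - restart '
}

# Sample lines taken from the log excerpt on this page.
count_restarts <<'EOF'
Fri May 13 17:38:08 CEST 2011 - Eventhandler run #1=CRITICAL #2=SOFT #3=2 #4=192.168.0.9 #5=cups #6=CUPS
Fri May 13 17:38:08 CEST 2011 - restart CUPS (restart-service cups) - SOFT
Fri May 13 17:39:08 CEST 2011 - restart CUPS (restart-service cups) - HARD
EOF
```

On the Nagios server the same function can be fed the real file, for example count_restarts < /var/log/nagios3/eventhandlers.log.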

Check that NRPE runs the command:

sudo tail -f /var/log/syslog | grep nrpe

May 13 17:39:08 galserver nrpe[4307]: Connection from 192.168.0.8 port 5803
May 13 17:39:08 galserver nrpe[4307]: Handling the connection...
May 13 17:39:08 galserver nrpe[4307]: Host is asking for command 'restart-service' to be run...
May 13 17:39:08 galserver nrpe[4307]: Running command: /usr/bin/sudo /usr/sbin/invoke-rc.d 'cups' restart
May 13 17:39:13 galserver nrpe[4307]: Command completed with return code 0 and output: Restarting Common Unix Printing  System: cupsd.
May 13 17:39:13 galserver nrpe[4307]: Return Code: 0, Output: Restarting Common Unix Printing System: cupsd.
May 13 17:39:13 galserver nrpe[4307]: Connection from 192.168.0.8 closed.

Disable NRPE debug logging when done (set debug=0 in /etc/nagios/nrpe.cfg and restart the agent).
