I have a problem with an NRPE check that I wrote.

It's a simple shell script that runs "systemctl is-active [service_name]" and returns the result to our Thruk.

When I run the script directly as the nrpe user, it works ("démarré" means "started"):

-bash-4.2$ /usr/lib64/nagios/plugins/check_service_active.sh --service dynflowd
dynflowd
Service dynflowd démarré

But when I run it through NRPE, locally, it tells me that the service is stopped ("arrêté"):

-bash-4.2$ ./check_nrpe -H 127.0.0.1 -c check_service_active -a 'dynflowd'
dynflowd
Service dynflowd arrêté

After multiple tests, I figured out that it's linked to the systemctl command. When I replace systemctl with another command like "echo", it works.

So I think there is something going on between NRPE and systemctl, but I can't find what, and I haven't found anything about it on Google.

So here I am!

Thank you in advance for your reply, and sorry if I'm not clear enough.

Here's my script:

#!/bin/bash
#
# Query the state of a service via systemctl
# (bash, not sh, because the test below uses [[ ]])

# Nagios return codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
STATE_DEPENDENT=4

print_usage() {
        echo "Usage: $0 --service <service_name>"
}

# Parse the parameters
while test -n "$1"; do
        case "$1" in
                --service)
                        SERV=$2
                        shift
                        ;;

                -u)
                        print_usage
                        exit $STATE_OK
                        ;;
        esac
        shift
done

STAT=$(systemctl is-active "$SERV")

if [[ $STAT == "active" ]]
then
        echo "Service $SERV démarré"    # "started"
        exit $STATE_OK
else
        echo "Service $SERV arrêté"     # "stopped"
        exit $STATE_CRITICAL
fi
  • A couple of questions: 1/ What's printing out the first line of your output (the service name on its own)? There doesn't appear to be anything in the script that does this. 2/ How is the NRPE version, assuming it calls the same script at some point, setting the `SERV` variable when you have no `--service` argument? I may not know how NRPE actually works but the answer to those two questions may guide you. – paxdiablo Apr 22 '20 at 06:46
  • 1/ The variable $STAT contains the output of my command (here, it can be active or inactive) and, depending on that, the script prints the status of the service and exits with the right code: OK or CRITICAL. 2/ I'm not sure I understand your question correctly, but it's NRPE v3.2.1. And it's the '--service' argument that sets the 'SERV' variable. – Grimmj0w Apr 22 '20 at 06:56
  • 1/ Yes, but the only thing being *printed* by the script is the line `Service $SERV démarré/arrêté` while your output has *two* lines, `dynflowd` and `Service dynflowd arrêté`. Where's that first line of output coming from? 2/ I understand that `--service` sets `SERV` when you supply it to the script, I'm just asking where that `--service` comes from with the `check_nrpe` command you execute - all you have there is the `-a`. Is there some other config that sits between `check_nrpe` and the call to your script? – paxdiablo Apr 22 '20 at 07:13
  • 1/ I forgot the first line, excuse me. It was just a test to see if the `SERV` variable is set correctly. 2/ I forgot to add the line in my nrpe.cfg which set the nrpe command : **command[check_service_active]=/usr/lib64/nagios/plugins/check_service_active.sh --service $ARG1$** – Grimmj0w Apr 22 '20 at 07:25
  • So the command `check_nrpe -H 127.0.0.1 -c check_service_active -a 'dynflowd'` can be read with `-c check_service_active` (=execute the command `check_service_active`) with the arguments `-a 'dynflowd'` – Grimmj0w Apr 22 '20 at 07:33
  • Ah, that makes more sense, I think I have an answer which may help, hang on ... – paxdiablo Apr 22 '20 at 07:36


Okay, similar to cron jobs, it may be that NRPE (the server) runs the script with a different environment from your shell, and something in that environment is stopping systemctl from running properly.

An easy way to see this is to modify the:

STAT=$(systemctl is-active $SERV)

line temporarily so you can see what's happening. Change the script so that line now becomes:

(
    echo ==== $(date) ==== ${SERV}
    systemctl is-active $SERV
) >> /tmp/paxdebug.dynflowd 2>&1
STAT=$(systemctl is-active $SERV)

That will, as well as running the script to get the status, write some useful information to the /tmp/paxdebug.dynflowd file, which you can then examine to see exactly what's happening in the NRPE-started instance of the script.

Hopefully, it'll say something simple like Cannot find systemctl (indicating path problems) but, whatever it gives you, it should help toward figuring out exactly what the problem is.


Update 1: based on your comments that attempting to run systemctl resulted in:

systemctl: command not found

That's almost certainly because the path is wrong. You can check the path by adding the following line into that debug code I posted:

echo "PATH is [$PATH]"

To fix it, either modify your path in the script to include /usr/bin (assuming that's where systemctl resides) or just run the absolute path (in both the debug and original areas):

/usr/bin/systemctl is-active ${SERV}
STAT=$(/usr/bin/systemctl is-active ${SERV})
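The question's script maps every result other than active to "arrêté", which is why a systemctl that could not even be executed showed up as a stopped service. Here is a minimal sketch, assuming bash and systemctl at /usr/bin/systemctl (as on CentOS/RHEL 7), that calls systemctl by absolute path and reports UNKNOWN when the shell itself failed to run it: exit status 127 means the command was not found, 126 means it was found but could not be executed (the "Permission denied" case). The function name check_service_active and the SYSTEMCTL override variable are hypothetical, introduced only for this example.

```shell
#!/bin/bash
# Sketch only: a variant of the question's check that separates
# "service not active" from "systemctl could not be run at all".

# Nagios plugin return codes (same values as in the question's script).
STATE_OK=0
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Assumption: systemctl lives in /usr/bin. An absolute path means the
# check no longer depends on the PATH that NRPE hands to the script.
# SYSTEMCTL is a hypothetical override variable (handy for testing).
SYSTEMCTL=${SYSTEMCTL:-/usr/bin/systemctl}

# check_service_active NAME (hypothetical helper): prints one status
# line and returns the matching Nagios code.
check_service_active() {
    # Declared before assignment: 'local out=$(...)' would hide the
    # command's exit status behind local's own (always 0).
    local out rc
    if [ -z "$1" ]; then
        echo "UNKNOWN: no service name given"
        return "$STATE_UNKNOWN"
    fi

    out=$("$SYSTEMCTL" is-active "$1" 2>&1)
    rc=$?

    # 127 = command not found, 126 = found but not executable
    # (e.g. "Permission denied"): the shell failed, not the service.
    if [ "$rc" -eq 127 ] || [ "$rc" -eq 126 ]; then
        echo "UNKNOWN: could not run $SYSTEMCTL ($out)"
        return "$STATE_UNKNOWN"
    fi

    if [ "$out" = "active" ]; then
        echo "OK: service $1 is active"
        return "$STATE_OK"
    fi

    echo "CRITICAL: service $1 is $out"
    return "$STATE_CRITICAL"
}
```

In the plugin, the existing --service parsing can stay as it is, ending with check_service_active "$SERV"; exit $? so NRPE receives the code the function returns. With this split, the "command not found" and "Permission denied" failures would have surfaced in Thruk as UNKNOWN with the error text, rather than as a stopped service.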

Update 2: based on your comments that, with the absolute path being used, you now get:

/usr/lib64/nagios/plugins/check_service_active.sh: line 32:
    /usr/bin/systemctl: Permission denied

This is likely to be NRPE running at a low privilege level, or as a different user to provide security from attacks. Given how central systemd is to the running of a system, it would be unwise to allow unfettered access to it.

So, similar to the previous update, add the following to the debug area:

/bin/ls -al /usr/bin/systemctl # Check "ls" is in this directory first.
/usr/bin/id                    # Ditto for "id".

The first line will get you the permissions, the second will get you your user details. At that point, it becomes an exercise in figuring out how to run systemctl without violating security.

If it turns out this is a permission or user issue, one possibility would be to provide a well-secured setuid script which would be owned by (and hence run as) a user that's allowed to run systemctl. But I really mean well-secured, since you don't want to open up a hole:

#!/bin/bash
# SysCtlIsActive.sh: only allows certain services to be queried.

# Limit to these ones (whitespace-separated).

allowed="dynflowd"

# If not allowed, reject with special status.

result="GoAway"
for service in ${allowed} ; do
    [[ "$1" = "${service}" ]] && result=""
done

# If it IS allowed, get actual status.

[[ -z "${result}" ]] && result="$(/usr/bin/systemctl is-active "$1")"

echo "${result}"

There may be other methods (and they may be better) but that should hopefully be a good start if that is indeed the problem.


Just be aware that Linux ignores the setuid bit on interpreted scripts (anything started through a shebang line such as #!/usr/bin/env bash), so you may have to work around that, possibly by building a real executable file to do this work.

If you do have to build a real executable for it, you can start with the following C code, which is an adaptation of the shell script above:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv) {
    // Check service name provided.

    if (argc < 2) {
        puts("NoServiceProvided");
        return 1;
    }

    // Check service name allowed.

    static const char *allowed[] = { "dynflowd", NULL };
    int isAllowed = 0;
    for (const char **service = allowed; *service != NULL; service++) {
        if (strcmp(*service, argv[1]) == 0) {
            isAllowed = 1;
            break;
        }
    }
    if (! isAllowed) {
        puts("InvalidServiceName");
        return 1;
    }

    // Run systemctl directly rather than via system(): system() goes
    // through /bin/sh, and bash (which is /bin/sh on RHEL/CentOS) drops
    // setuid privileges when the real and effective user IDs differ.
    // execl() also avoids building a command string by hand.
    // On success execl() does not return: systemctl's output and exit
    // status become this program's.

    execl("/usr/bin/systemctl", "systemctl", "is-active", argv[1], (char *)NULL);

    // Only reached if execl() failed.

    perror("execl /usr/bin/systemctl");
    return 1;
}
  • I replaced the line `STAT=$(systemctl is-active $SERV)` with the code you gave me. But when I run the check, nothing happens other than the message "Service dynflowd arrêté", and there is no /tmp/paxdebug.dynflowd file. The file only appears if I run the script directly, without NRPE, and it contains no useful information. – Grimmj0w Apr 22 '20 at 09:19
  • But I agree with you on the idea that maybe NRPE runs with a different environment. – Grimmj0w Apr 22 '20 at 09:21
  • Okay, a few points. What I'd expect to see in the file is the date/service-name line followed by active. You say you got no useful info, but did you at least get that? And, if you get *nothing* when running under NRPE, I'd be thinking that it's running some *other* script, not the one you modified. – paxdiablo Apr 22 '20 at 12:38
  • After your comment, I re-ran the script both directly and via NRPE to make sure I hadn't missed something, and I finally got something in the /tmp/paxdebug.dynflowd file when running the NRPE check. I think the NRPE check can't create the file but can write to it once it exists. Anyway, it says that at line 32 (`systemctl is-active $SERV`): `systemctl: command not found` – Grimmj0w Apr 22 '20 at 13:31
  • To go further, I replaced `systemctl` with `/usr/bin/systemctl`, and the file /tmp/paxdebug.dynflowd gave me this error: `/usr/lib64/nagios/plugins/check_service_active.sh: line 32: /usr/bin/systemctl: Permission denied` – Grimmj0w Apr 22 '20 at 13:52
  • I added the line from your update 1: `PATH is [/usr/bin:/bin:/usr/sbin:/sbin]`. For your second update, `/bin/ls -al /usr/bin/systemctl` returns `/bin/ls: cannot access /usr/bin/systemctl: Permission denied` and `/usr/bin/id` returns `uid=990(nrpe) gid=987(nrpe) groups=987(nrpe),988(nagios) context=system_u:system_r:nrpe_t:s0`. For more detail, I added `ls -l /usr/bin` to the script and found this in the results: `-?????????? ? ? ? ? ? systemctl`. I'll look into your suggestion of building an executable. Do you have any idea where I should start? – Grimmj0w Apr 23 '20 at 06:26

I finally found the problem: the NRPE version!

On my server, the installed NRPE package is nrpe-3.2.1-6.

I ran my script via NRPE on another server and it worked.

That other server runs nrpe-3.2.1-8.

So the solution was to update NRPE.

Thank you for your time and ideas, especially the >> /tmp/paxdebug.dynflowd 2>&1 idea, which helped me figure out the problem.
