
We need to run pt-stalk on a handful of servers to keep an eye on MySQL, and I was sick of manually starting it every time a server rebooted. A little googling turned up an init script for pt-stalk, and it seemed to work just fine. [My slightly modified version is included at the bottom of this post.]

It was taking too long to figure out how to push the script and config out via ssh [long story, please don't ask], so I decided to just log into the 20-odd servers and set everything up manually. Everything worked.

A couple days later my coworker commented that he was getting the emails, but I clearly wasn't, and it looked like I had put the wrong email in the config. This time I had figured out how to push the change via ssh, and finished everything off with:

for server in `cat serverlist.txt`; do
  ssh -t $server sudo -i service pt-stalk restart
done

And this is the point where pt-stalk stopped working on every single server with:

2013_08_23_11_43_20 Caught signal, exiting
2013_08_23_11_43_20 Exiting because OKTORUN is false
2013_08_23_11_43_20 /usr/bin/pt-stalk exit status 1
2013_08_23_11_43_22 Starting /usr/bin/pt-stalk --function=status --variable=Threads_connected --threshold=100 --match= --cycles=5 --interval=1 --iterations= --run-time=30 --sleep=300 --dest=/var/lib/pt-stalk --prefix= --notify-by-email=servers@domain.com --log=/var/log/pt-stalk.log --pid=/var/run/pt-stalk.pid
2013_08_23_11_43_22 Caught signal, exiting

Through yesterday's testing I've deciphered that 'Caught signal, exiting' means it's caught a signal such as HUP or TERM (KILL can't be caught, so it isn't that one). The first one is from service pt-stalk restart, and the second one, immediately after the successful start, is from when the ssh session closes. wat.jpg

If I simply ssh to the server and enter sudo -i service pt-stalk start or restart, I can log out and it continues happily. However, if I just feed a command to ssh, as in the loop above, pt-stalk catches a signal and exits. Sometimes it catches two signals before it exits.

What the hell is going on?


My /etc/init.d/pt-stalk for reference:

#!/usr/bin/env bash
# chkconfig: 2345 20 80
# description: pt-stalk
### BEGIN INIT INFO
# Provides: pt-stalk
# Required-Start: $network $named $remote_fs $syslog
# Required-Stop: $network $named $remote_fs $syslog
# Should-Start: pt-stalk
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
### END INIT INFO

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
DAEMON="/usr/bin/pt-stalk"
DAEMON_OPTS="--config /etc/pt-stalk.conf"
NAME="pt-stalk"
DESC="pt-stalk"
PIDFILE="/var/run/${NAME}.pid"
STALKHOME="/var/lib/pt-stalk"

test -x $DAEMON || exit 1

[ -r /etc/default/pt-stalk ] && . /etc/default/pt-stalk

#. /lib/lsb/init-functions

sig () {
    test -s "$PIDFILE" && kill -$1 `cat $PIDFILE`
}

start() {
  if [[ -z $MYSQL_OPTS ]]; then
    HOME=$STALKHOME $DAEMON $DAEMON_OPTS
  else
    HOME=$STALKHOME $DAEMON $DAEMON_OPTS -- $MYSQL_OPTS
  fi
  return $?
}

stop() {
  if sig TERM; then
    while sig 0 ; do
      echo -n "."
      sleep 1
    done
    return 0
  else
    echo "$DESC is not running."
    return 1
  fi
}

status() {
  if sig 0 ; then
    echo "$DESC (`cat $PIDFILE`) is running."
    return 0
  else
    echo "$DESC is stopped."
    return 1
  fi
}

log_begin_msg() {
        echo $1
}

log_end_msg() {
        if [ $1 -eq 0 ]; then
                echo "Success"
        else
                echo "Failure"
        fi
}

case "$1" in
  start)
   log_begin_msg "Starting $DESC"
   start
   log_end_msg $?
   ;;

  stop)
   log_begin_msg "Stopping $DESC"
   stop
   log_end_msg $?
   ;;
  status)
    status ;;

  restart)
    log_begin_msg "Restarting $DESC"
    stop
    sleep 1
    start
    log_end_msg $?
    ;;

  *)
    echo "Usage: $0 {start|stop|status|restart}" >&2
    exit 1
    ;;
esac
Sammitch

1 Answer

Since your daemon is terminated right away, my guess is that /usr/bin/pt-stalk, when run with the --daemonize option, either does not close one of the file descriptors stdin, stdout, or stderr properly (or early enough), or does not handle the SIGHUP signal correctly, or both.

To test which of my assumptions is correct, modify your init script so that input and output of start are redirected from and to /dev/null. Example:

start </dev/null >/dev/null 2>/dev/null
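To illustrate why unclosed descriptors matter at all, here is a minimal sketch that does not involve pt-stalk. A background process that inherits the caller's stdout keeps that stream open, so anything reading from it (a command substitution here; something like a remote session in the real case is my assumption) has to wait. The helper names launch and elapsed_for are made up for this demo, and sleep 3 is just a stand-in for a daemon.

```shell
#!/usr/bin/env bash
# Sketch: a background process that keeps the caller's stdout open makes
# the caller wait. "launch" and "elapsed_for" are hypothetical helper names.

launch() {
  if [ "$1" = redirect ]; then
    # Stand-in daemon, fully detached from our stdin/stdout/stderr.
    sleep 3 </dev/null >/dev/null 2>/dev/null &
  else
    # Stand-in daemon that inherits our stdout.
    sleep 3 &
  fi
  echo started
}

# Print how many whole seconds the caller spends reading launch's output.
elapsed_for() {
  begin=$(date +%s)
  output=$(launch "$1")   # finishes only once every writer has closed the pipe
  end=$(date +%s)
  echo $((end - begin))
}

echo "inherit:  $(elapsed_for inherit)s"
echo "redirect: $(elapsed_for redirect)s"
```

The inherit case should take about as long as the stand-in daemon runs, while the redirect case should return almost immediately. That is the kind of difference the redirections on start are meant to expose.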

If this removes the early termination problem, narrow it down by removing the redirections again one at a time. It might be that pt-stalk simply forks too early, in which case inserting another sleep 1 after the call to start might also work around the problem. If it turns out to be the handling of the SIGHUP signal, a possible workaround is to modify your init script by adding this:

trap "echo SIGHUP ignored" 1

before the call to start and this:

trap - 1

right after the call to start.
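One caveat about the trap approach: a shell resets signals it has caught with a handler back to their default action in the commands it launches, whereas signals ignored with an empty handler (trap '' HUP) stay ignored in child processes. The sketch below compares the two. The function name hup_outcome and its mode names are made up for illustration, and the sh -c child is only a stand-in for the daemon. Whether either variant helps with pt-stalk itself is an assumption I have not checked, since a program can install its own signal handlers after it starts.

```shell
#!/usr/bin/env bash
# Sketch: does a child started from this shell survive SIGHUP?
# "hup_outcome" and its mode names are hypothetical, for illustration only.

hup_outcome() {
  mode=$1                    # "ignore" or "handler"
  marker=$(mktemp)           # the child recreates this file if it lives
  rm -f "$marker"
  (
    if [ "$mode" = ignore ]; then
      trap '' HUP            # ignored signals stay ignored in children
    else
      # A handler only applies to this shell; children get the default action.
      trap 'echo "SIGHUP caught" >&2' HUP
    fi
    # Stand-in for the daemon: wait 2 s, then leave a marker behind.
    sh -c 'sleep 2; echo alive > "$0"' "$marker" &
    child=$!
    sleep 1
    kill -HUP "$child"       # simulate the hangup from a closing session
    wait "$child"
  )
  if [ -s "$marker" ]; then echo survived; else echo killed; fi
  rm -f "$marker"
}

echo "handler: $(hup_outcome handler)"
echo "ignore:  $(hup_outcome ignore)"
```

If the handler variant reports killed and the ignore variant reports survived, then trap '' 1 before the call to start is the form more likely to protect the child, provided the child does not set up its own SIGHUP handling.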

I have not downloaded pt-stalk, looked at its source, or tested the theory described above. This is all based on my experience with other daemons.

pefu