
I have a Python script that starts a daemon process. I was able to do this by using the code found at: https://gist.github.com/marazmiki/3618191.

The code starts the daemon process exactly as expected. However, sometimes, and only sometimes, when the daemon process is stopped, the running job becomes a zombie.

The stop function of the code is:

    def stop(self):
        """
            Stop the daemon
        """
        # Get the pid from the pidfile
        try:
            pf = file(self.pidfile, 'r')
            pid = int(pf.read().strip())
            pf.close()
        except:
            pid = None

        if not pid:
            message = "pidfile %s does not exist. Daemon not running?\n"
            sys.stderr.write(message % self.pidfile)
            return # not an error in a restart

        # Try killing the daemon process
        try:
            while 1:
                os.kill(pid, SIGTERM)
                time.sleep(1.0)
        except OSError, err:
            err = str(err)
            if err.find("No such process") > 0:
                if os.path.exists(self.pidfile):
                    os.remove(self.pidfile)
            else:
                print str(err)
                sys.exit(1)

When this stop() method is run, the process (pid) appears to hang, and when I Control+C out, I see the script is KeyboardInterrupted on the line time.sleep(1.0), which leads me to believe that the line:

os.kill(pid, SIGTERM)

is the offending code.

Does anyone have any idea why this could be happening? Why would this os.kill() force a process to become a zombie?

I am running this on Ubuntu Linux (if it matters).

UPDATE: I'm including my start() method per @paulus's answer.

    def start(self):
        """
            Start the daemon
        """
        pid = None
        # Check for a pidfile to see if the daemon already runs
        try:
            pf = file(self.pidfile, 'r')
            pid = int(pf.read().strip())
            pf.close()
        except:
            pid = None

        if pid:
            message = "pidfile %s already exist. Daemon already running?\n"
            sys.stderr.write(message % self.pidfile)
            sys.exit(1)

        # Start the daemon
        self.daemonize()
        self.run()

UPDATE 2: And here is the daemonize() method:

    def daemonize(self):
        """
            do the UNIX double-fork magic, see Stevens' "Advanced
            Programming in the UNIX Environment" for details (ISBN 0201563177)
            http://www.erlenstar.demon.co.uk/unix/faq_2.html#SEC16
        """
        try:
            pid = os.fork()
            if pid > 0:
                # exit first parent
                sys.exit(0)
        except OSError, e:
            sys.stderr.write("fork #1 failed: %d (%s)\n" % (e.errno, e.strerror))
            sys.exit(1)

        # decouple from parent environment
        os.chdir("/")
        os.setsid()
        os.umask(0)

        # do second fork
        try:
            pid = os.fork()
            if pid > 0:
                # exit from second parent
                sys.exit(0)
        except OSError, e:
            sys.stderr.write("fork #2 failed: %d (%s)\n" % (e.errno, e.strerror))
            sys.exit(1)

        # redirect standard file descriptors
        sys.stdout.flush()
        sys.stderr.flush()

        sys.stdout = file(self.stdout, 'a+', 0)
        si = file(self.stdin, 'r')
        so = file(self.stdout, 'a+')
        se = file(self.stderr, 'a+', 0)
        os.dup2(si.fileno(), sys.stdin.fileno())
        os.dup2(so.fileno(), sys.stdout.fileno())
        os.dup2(se.fileno(), sys.stderr.fileno())

        # write pidfile
        atexit.register(self.delpid)
        pid = str(os.getpid())
        file(self.pidfile, 'w+').write("%s\n" % pid)
Brett
  • (1) Did you actually check that `pid` was becoming a zombie with `ps`? (2) What happens if you use `SIGKILL` instead of `SIGTERM`? – nneonneo Apr 14 '14 at 18:06

1 Answer


You're looking in the wrong direction. The flawed code is not in the stop routine but in the start routine (if you're using the code from the gist). The double fork is the correct method, but the first parent should wait for its child process rather than simply quit.

The correct sequence of commands (and the reasons to do the double fork) can be found here: http://lubutu.com/code/spawning-in-unix (see the "Double fork" section).

The "sometimes" you mention happens when the first parent dies before getting SIGCHLD, and that signal doesn't get to init.

As far as I remember, init should periodically read exit codes from its children in addition to handling signals, but the upstart version simply relies on the latter (hence the problem; see the comment on a similar bug: https://bugs.launchpad.net/upstart/+bug/406397/comments/2).

So the solution is to rewrite the first fork to actually wait for the child.

Update: Okay, you want some code. Here it goes: pastebin.com/W6LdjMEz I've updated the daemonize, fork and start methods.
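
The pastebin contents are not reproduced here. As a rough illustration of the idea only, a minimal Python 3 sketch might look like the following. The function name `spawn_daemon` and its `run` callback are hypothetical and not taken from the gist or the pastebin; the sketch also leaves out the stdio redirection and pidfile handling that the original `daemonize()` performs.

```python
import os


def spawn_daemon(run):
    """Double-fork and call run() in the detached grandchild.

    Hypothetical helper: unlike the gist, the original process reaps
    the intermediate child with waitpid() before returning.
    """
    pid = os.fork()
    if pid > 0:
        # Original process: wait for the intermediate child so that it
        # is reaped here and cannot linger as a zombie.
        os.waitpid(pid, 0)
        return

    # Intermediate child: start a new session, fork again, then exit.
    try:
        os.setsid()
        if os.fork() > 0:
            os._exit(0)
    except OSError:
        os._exit(1)

    # Grandchild: its parent has exited, so it is re-parented away from
    # the original process and runs as the daemon.
    status = 1
    try:
        os.chdir("/")
        os.umask(0)
        run()
        status = 0
    finally:
        # Use os._exit so the daemon never returns into the caller's code.
        os._exit(status)
```

The caller gets control back once the intermediate child has been reaped and can then exit normally, for example with `sys.exit(0)`, as the gist's first parent does.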

AlexLordThorsen
paulus
  • Many thanks for the answer. I understand most of it. I am still confused about how to have the first fork wait for the child, based on the link you sent. Any pointers? This would likely be in the `daemonize()` method, and not in `start()`, correct? – Brett Apr 14 '14 at 16:53
  • Another comment/question: In doing the first fork, there is currently a `sys.exit(0)`, as you point out. Would removing this do the trick? It doesn't seem like it from testing, but I'm not sure how to force the first fork to wait for the child. – Brett Apr 14 '14 at 17:18
  • Okay, you want some code. Here it goes: http://pastebin.com/W6LdjMEz I've updated the daemonize, fork and start methods. There is still the race between processes (the one creating the pidfile and the one checking) therefore the sleep. – paulus Apr 15 '14 at 01:10
  • By the way you may want to use the native upstart start/stop utilities rather than reinventing the wheel if you're not planning to run your processes on other OSes. You will also get a nice watchdog for your service. – paulus Apr 15 '14 at 01:14
  • What do you mean by the native upstart start/stop utilities? – Brett Apr 16 '14 at 13:38
  • Look into /etc/init/ssh.conf for example. It's pretty straightforward. It will launch your app with a command like initctl start ssh (based on the config file name) and stop it with initctl stop ssh. You don't need to daemonize at all to launch your app in the background. – paulus Apr 16 '14 at 14:48
  • Hello, having some issues here, when the app gets to a point to close exit and stop the daemon ( in case of exception ) calling `daemon.stop()` will not do it. The stop function gets called but seems to be hanging at `os.kill(pid, signal.SIGTERM)`. If however i do start and then for some reason stop it again by myself the process ends fine and the file gets deleted as well. Any ideas what is going on ? – LefterisL Mar 02 '16 at 14:13
  • It largely depends on what your daemon is doing. It looks like it hangs and sending it signals doesn't help. You may try determining what is happening from the console. First check the daemon state (something like `ps ax -o state,pid,command | grep yourdaemonname` should show the state in the first column). You may then try to kill the daemon from the console (`kill `). I can think of two scenarios where a daemon will not die: either it is ignoring signals (`kill -9 pid` will help) or it is in D-state (probably IO on disk). In the latter case it will stop as soon as the IO finishes. – paulus Mar 02 '16 at 22:25
  • D-state would make sense since i'm using threads to fetch some data and write them to disk. When i try my `init.d` status i'm getting that `Process is dead but pid file exists`. You think it could be something to do with the open threads that it has opened ? I will try and join the queue before doing `daemon.stop` to make sure everything is done. – LefterisL Mar 03 '16 at 08:13
  • Actually noticed something, every time i run `ps ax ...` i get a different result `S 18206 grep --color=auto daemon-name` with the `pid` changing each time. That is all before i even stop it or do something. Just start. Could those be from the threads ? – LefterisL Mar 03 '16 at 11:42
  • Hmm... if you see that the process is dead, it probably is. The greps with different pids are actually different processes, but not your daemon. It is the grep programs you spawn to grep the processlist... and it happens that they also contain the daemon-name you're grepping for :) Do you get any other processes? If not, your daemon is dead already. – paulus Mar 03 '16 at 16:17
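
The checks discussed in the comments above (looking at the process state, then escalating from SIGTERM to `kill -9`) can also be done from Python instead of looping on `os.kill()` forever. The following is a minimal sketch, assuming Linux with `/proc` mounted. The helper names `process_state` and `stop_process` and the timeout values are hypothetical and not part of the gist.

```python
import os
import signal
import time


def process_state(pid):
    """Return the one-letter Linux state of pid ('R', 'S', 'D', 'Z', ...),
    or None if no such process exists."""
    try:
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("State:"):
                    return line.split()[1]
    except OSError:
        return None
    return None


def stop_process(pid, timeout=10.0, poll=0.2):
    """Send SIGTERM, then SIGKILL, waiting up to `timeout` seconds after each.

    Returns True once the process is gone or is a zombie, False otherwise.
    """
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return True  # already gone
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            # A zombie has already exited; os.kill() keeps succeeding on it
            # until its parent reaps it, so stop signalling at this point.
            if process_state(pid) in (None, "Z"):
                return True
            time.sleep(poll)
    return False
```

A `False` result means the process survived even SIGKILL for the whole timeout, which would be consistent with the D-state scenario mentioned above. The bounded wait is a design choice: a `while 1` loop around `os.kill()` never ends if the target is a zombie, because the signal is still delivered without error.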