
I have a Python 3 daemon process running on Linux. It is an ordinary single-threaded process that runs in the background, calls select.select() in its main loop, and then handles I/O. Occasionally (about once or twice a month) it stops responding. When that happens, I'd like to debug the problem.

I tried pyrasite, but without success: the daemon's stdin/stdout are redirected to /dev/null, and pyrasite uses the daemon's stdin/stdout rather than the console it was started from.

So I added a SIGUSR1 signal handler that logs the stack trace. It works fine under normal conditions.
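A handler of this kind might look roughly like the sketch below. The logger name `daemon.debug` and the function name `log_stack` are hypothetical; the question does not show the real handler.

```python
import logging
import signal
import traceback

log = logging.getLogger("daemon.debug")  # hypothetical logger name

def log_stack(signum, frame):
    """Log the Python stack that was executing when the signal was handled."""
    stack = "".join(traceback.format_stack(frame))
    log.warning("signal %d received, stack:\n%s", signum, stack)

# Must be called from the main thread.
signal.signal(signal.SIGUSR1, log_stack)
```

Sending `kill -USR1 <pid>` from a shell then produces one log record. Note that a Python-level handler like this runs only when the interpreter regains control. The standard library's `faulthandler.register()` is an alternative that writes the traceback from the C-level handler instead.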

Today I got a freeze. ps shows the daemon in the "S" (interruptible sleep) state, so a busy loop is ruled out.

The daemon does not respond to SIGUSR1 or SIGINT (used for shutdown).

I'd like to have at least some hint what is going on there.

Under what conditions does a sleeping Python 3 process on Linux fail to handle signals that it is supposed to handle?


UPDATE:

I was finally able to reproduce the issue. After adding a lot of debug messages, I found a race condition that I'm going to fix soon.

When the daemon is not responding, it is sleeping in os.read() on p, where p is the read end of a newly created pipe (see os.pipe) that nobody writes to.

However, all my attempts to write a simple demonstration program failed. When I tried to read from an empty pipe, the program blocked as expected, but it could still be interrupted as usual (killed with SIGINT from another terminal). The mystery remains unsolved.
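The "blocks, but can be interrupted" behaviour can be reproduced in a single process. This is only a sketch: it uses an interval timer and SIGALRM in place of a kill from another terminal, and the `Interrupted` exception is a made-up name.

```python
import os
import signal

class Interrupted(Exception):
    """Hypothetical exception raised by the handler to abort the blocking call."""

def on_alarm(signum, frame):
    raise Interrupted

signal.signal(signal.SIGALRM, on_alarm)

rfd, wfd = os.pipe()                        # wfd stays open, so no EOF arrives
signal.setitimer(signal.ITIMER_REAL, 0.2)   # deliver SIGALRM in 0.2 seconds
try:
    os.read(rfd, 4096)                      # blocks: the pipe is empty
    outcome = "read returned"
except Interrupted:
    outcome = "interrupted"
finally:
    signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
    os.close(rfd)
    os.close(wfd)

print(outcome)
```

Since Python 3.5 (PEP 475), a handler that returns normally does not abort `os.read()`; the call is retried after the handler has run. That is why the handler here raises an exception.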


UPDATE2:

Finally some code! I have deliberately chosen low level system calls.

import os
import time
import signal
import sys

def sighandler(*unused):
    print("got signal", file=sys.stderr)

print("==========")
signal.signal(signal.SIGUSR1, sighandler)

pid = os.getpid()  # parent's PID, taken before the fork
rfd, wfd = os.pipe()
if os.fork():
    # parent: close the unused write end, then block until data or EOF
    os.close(wfd)
    print("parent: read() start")
    os.read(rfd, 4096)
    print("parent: read() stop")
else:
    # child: signal the parent, then keep the write end open for 3 seconds
    os.close(rfd)
    os.kill(pid, signal.SIGUSR1)
    print("child: wait start")
    time.sleep(3)
    print("child: wait end")

If you run this many times, you will usually get this:

parent: read() start
got signal
child: wait start
child: wait end
parent: read() stop

which is fine, but sometimes you'll see this:

parent: read() start
child: wait start
child: wait end
got signal
parent: read() stop

What is happening here:

  1. parent starts a read from the pipe
  2. child sends a signal to the parent. The parent must have received this signal, but it seems to be "somehow postponed"
  3. child waits
  4. child exits, pipe is closed automatically
  5. parent's read operation ends with an EOF
  6. the signal is handled now
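One possible reading of these steps (an assumption, not something the question confirms) is that the signal landed in the short window after the interpreter last checked for pending signals and before read() began to block. The system call is then never interrupted, and the Python-level handler has to wait until read() returns. A minimal sketch of one way to close that window is to wait with `select.select()` on both the data pipe and a wakeup pipe registered with `signal.set_wakeup_fd()`. The names `wake_r`, `wake_w`, `data_r` and `data_w` are illustrative only.

```python
import os
import select
import signal

def on_usr1(signum, frame):
    pass  # a real handler would log the stack trace here

signal.signal(signal.SIGUSR1, on_usr1)

# Self-pipe: CPython's C-level handler writes one byte (the signal number)
# to wake_w as soon as the signal arrives.
wake_r, wake_w = os.pipe()
os.set_blocking(wake_r, False)
os.set_blocking(wake_w, False)      # the wakeup fd should be non-blocking
old_fd = signal.set_wakeup_fd(wake_w)

data_r, data_w = os.pipe()          # stands in for the pipe nobody writes to

# Send the signal before waiting: the wakeup byte stays in the pipe,
# so select() still notices it even though the signal came "too early".
os.kill(os.getpid(), signal.SIGUSR1)

readable, _, _ = select.select([data_r, wake_r], [], [], 5.0)
woken_by_signal = wake_r in readable
if woken_by_signal:
    os.read(wake_r, 64)             # drain the wakeup bytes

signal.set_wakeup_fd(old_fd)        # restore the previous setting
for fd in (wake_r, wake_w, data_r, data_w):
    os.close(fd)
```

Because the daemon already runs a select.select() loop, the wakeup pipe would simply be one more descriptor in the read set.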

Now, due to a bug in my program, the signal was received in step 2, but the EOF was not delivered, so the read did not finish and step 6 (signal handling) was never reached.
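The question does not say why the EOF was missing. One common cause, assumed here purely for illustration, is that some copy of the write end is still open: a pipe reports EOF only after every descriptor referring to its write end has been closed. The sketch below uses `os.dup()` to stand in for such a forgotten copy and `select.select()` with a timeout so that it never blocks.

```python
import os
import select

rfd, wfd = os.pipe()
leaked = os.dup(wfd)    # stands in for a forgotten copy of the write end
os.close(wfd)

# One write end is still open: the pipe is neither readable nor at EOF.
ready, _, _ = select.select([rfd], [], [], 0.2)
eof_while_leaked = bool(ready)

# After the last write end is closed, the read end becomes readable
# and read() returns b"" (EOF).
os.close(leaked)
ready, _, _ = select.select([rfd], [], [], 0.2)
data = os.read(rfd, 4096) if ready else None
os.close(rfd)

print(eof_while_leaked, data)
```

With fork(), the same effect appears when a process inherits the write end and never closes it.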

That is all the information I am able to provide.

VPfB
  • Could you show some code? – Christian Dean Aug 30 '16 at 17:16
  • @Mr.goosberry There are about 5000 lines of code. I don't know which part is important. – VPfB Aug 30 '16 at 17:24
  • Well could you provide a little more info? What exactly is `select.select()`? And saying that _sometimes_ your program fails is ambiguous. When exactly does it fail\freeze\stop responding? – Christian Dean Aug 30 '16 at 17:27
  • select.select() would be some kind of async file manager (?) – Loïc Aug 30 '16 at 17:29
  • I'd bet on a bad exception handling that covers the error. Do you have any try/except statement without real exception management? – Loïc Aug 30 '16 at 17:30
  • `select.select` is part of the standard library: https://docs.python.org/3/library/select.html#select.select – VPfB Aug 30 '16 at 17:33
  • @Mr.goosberry It freezes, i.e stops reading input, writing outputs, stops logging. That all can be explained with some bug in the code. But it stops handling interrupts and that is something I cannot understand. – VPfB Aug 30 '16 at 17:36
  • @Loïc All exceptions are logged. – VPfB Aug 30 '16 at 17:45
  • Have you tried `strace`-ing the python process? – Leon Aug 30 '16 at 17:59
  • @Leon No, I did not. I know only basic `strace` usage and don't know how to use it in this situation. – VPfB Aug 30 '16 at 18:16
  • The freeze happens about 1 or 2 times a month; I will add that to the question. – VPfB Aug 30 '16 at 18:16
  • do you use threads? – YOU Aug 31 '16 at 06:34
  • @YOU: No. However `threading` is imported by a library module, but it is not used in the server – VPfB Aug 31 '16 at 06:45
  • Any blocking i/o also blocks signals until it is done. You should try using `select.select()` with timeout (and handle timeout error). – freakish Aug 31 '16 at 08:18
  • @freakish Many programs would be unstoppable if it was true. For example you can interrupt blocking input() with ctrl-C signal. – VPfB Aug 31 '16 at 08:29
  • @VPfB Many programs **are** unstoppable. For example try to stop any database driver on long queries. You can stop `input()` with ctrl-C simply because that's what `input()` does: it waits for keyboard. -.- – freakish Aug 31 '16 at 08:31
  • @freakish But many programs are stoppable with signals even when doing blocking I/O. Could you please quote any source regarding your first comment? – VPfB Aug 31 '16 at 08:42
  • @VPfB No, they don't. They may appear to do blocking I/O, but they actually do, for example, `select.select()` with a timeout. I'll post my answer soon and you'll see how it is done. – freakish Aug 31 '16 at 08:43
