I have a Python 3 daemon process running on Linux. It is an ordinary single-threaded process that runs in the background, calls select.select()
in its main loop, and then handles I/O. Sometimes (roughly once or twice a month) it stops responding. When that happens, I'd like to debug the problem.
I have tried pyrasite, but without success, because the daemon's stdin/stdout are redirected to /dev/null and pyrasite uses that stdin/stdout, not the console it was started from.
So I have added a SIGUSR1 signal handler that logs the stack trace. It works fine under normal conditions.
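A handler of this kind can be sketched as follows. This is a minimal illustration, not the daemon's actual code: the names make_stack_logger, log and busy_section are made up for the example, and an in-memory buffer stands in for the real log file.

```python
import io
import os
import signal
import traceback


def make_stack_logger(out):
    """Return a signal handler that writes the interrupted stack to `out`."""
    def handler(signum, frame):
        out.write(f"signal {signum} received; current stack:\n")
        out.write("".join(traceback.format_stack(frame)))
        out.flush()
    return handler


log = io.StringIO()  # hypothetical stand-in for the daemon's log file
signal.signal(signal.SIGUSR1, make_stack_logger(log))


def busy_section():
    # Simulate `kill -USR1 <pid>` issued while this function is running.
    os.kill(os.getpid(), signal.SIGUSR1)
    return sum(range(1000))


busy_section()
print(log.getvalue())
```

Note that the Python-level handler only runs once the interpreter gets back to executing bytecode in the main thread, so it cannot report anything while the process is stuck inside a blocking call that never returns.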
Today I got a freeze. ps shows that the daemon is in the "S" (interruptible sleep) state, so a busy loop is ruled out.
The server does not respond to SIGUSR1 or SIGINT (used for shutdown).
I'd like to have at least some hint what is going on there.
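One thing that can be inspected from outside the frozen process is its signal state in /proc/<pid>/status (Linux only). The sketch below reads the pending, blocked and caught signal masks; the helper names signal_masks and has_signal are made up for illustration, and the example inspects its own process rather than the daemon.

```python
import os
import signal


def signal_masks(pid):
    """Parse the State and signal bitmask fields from /proc/<pid>/status."""
    wanted = {"State", "SigPnd", "ShdPnd", "SigBlk", "SigIgn", "SigCgt"}
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in wanted:
                fields[key] = value.strip()
    return fields


def has_signal(mask_hex, signum):
    """Signal number n corresponds to bit n-1 of the hexadecimal mask."""
    return bool((int(mask_hex, 16) >> (signum - 1)) & 1)


# Install a handler so SIGUSR1 shows up in the "caught" mask of this process.
signal.signal(signal.SIGUSR1, lambda signum, frame: None)
info = signal_masks(os.getpid())
print(info)
print("SIGUSR1 caught: ", has_signal(info["SigCgt"], signal.SIGUSR1))
print("SIGUSR1 blocked:", has_signal(info["SigBlk"], signal.SIGUSR1))
```

For the daemon, the same function would be called with the daemon's PID from another terminal.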
Under what conditions is a sleeping Python3 Linux process not handling interrupts that it is supposed to handle?
UPDATE:
I was finally able to reproduce the issue. After adding a lot of debug messages, I found a race condition that I'm going to fix soon.
When the daemon is not responding, it is sleeping in os.read(p), where p is the read end of a new pipe (see os.pipe) to which nobody writes.
However, all my attempts to write a simple demonstration program failed. When I tried to read from an empty pipe, the program blocked as expected, but it could be interrupted (killed from another terminal with SIGINT) as usual. The mystery remains unsolved.
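A test of that kind can be sketched like this. It is an illustration under assumptions, not the exact program I ran: it uses a SIGALRM timer instead of a manual kill from another terminal, and the names Interrupted and read_with_timeout are made up for the example. The handler has to raise an exception, because otherwise Python transparently restarts the interrupted read (PEP 475).

```python
import os
import signal


class Interrupted(Exception):
    """Raised from the signal handler to abort the blocking read."""


def on_alarm(signum, frame):
    raise Interrupted()


def read_with_timeout(fd, seconds):
    """Block in os.read(); return None if SIGALRM interrupts it first."""
    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    try:
        signal.setitimer(signal.ITIMER_REAL, seconds)
        return os.read(fd, 4096)
    except Interrupted:
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)


rfd, wfd = os.pipe()  # the write end stays open, but nothing is written
result = read_with_timeout(rfd, 0.2)
print("read was interrupted" if result is None else result)
os.close(rfd)
os.close(wfd)
```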
UPDATE2:
Finally some code! I have deliberately chosen low level system calls.
    import os
    import time
    import signal
    import sys

    def sighandler(*unused):
        print("got signal", file=sys.stderr)

    print("==========")
    signal.signal(signal.SIGUSR1, sighandler)
    pid = os.getpid()
    rfd, wfd = os.pipe()
    if os.fork():
        os.close(wfd)
        print("parent: read() start")
        os.read(rfd, 4096)
        print("parent: read() stop")
    else:
        os.close(rfd)
        os.kill(pid, signal.SIGUSR1)
        print("child: wait start")
        time.sleep(3)
        print("child: wait end")
If you run this many times, you'll usually get this:
parent: read() start
got signal
child: wait start
child: wait end
parent: read() stop
which is fine, but sometimes you'll see this:
parent: read() start
child: wait start
child: wait end
got signal
parent: read() stop
What is happening here:
1. The parent starts a read from the pipe.
2. The child sends a signal to the parent. The parent must have received this signal, but its handling seems to be somehow postponed.
3. The child waits.
4. The child exits, and the pipe is closed automatically.
5. The parent's read operation ends with an EOF.
6. Only now is the signal handled.
Now, due to a bug in my program, the signal was received in step 2, but the EOF was not delivered, so the read did not finish and step 6 (signal handling) was never reached.
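One way to avoid depending on when the Python-level handler gets to run is to wait in select.select() on both the data pipe and a signal wakeup descriptor instead of blocking in os.read(). The following is only a sketch under the assumption that the blocking read can be replaced this way; it is not the daemon's code, and the names make_wakeup_pipe and wait_for_event are made up for illustration. signal.set_wakeup_fd() is a standard library call that makes the interpreter write the signal number as a byte to the given descriptor as soon as the signal arrives.

```python
import os
import select
import signal


def make_wakeup_pipe():
    """Create a non-blocking pipe and register its write end as the wakeup fd."""
    r, w = os.pipe()
    os.set_blocking(r, False)
    os.set_blocking(w, False)  # set_wakeup_fd requires a non-blocking fd
    signal.set_wakeup_fd(w)
    return r, w


def wait_for_event(data_fd, wakeup_fd, timeout):
    """Return ("signal", [signums]), ("data", bytes) or ("timeout", None)."""
    ready, _, _ = select.select([data_fd, wakeup_fd], [], [], timeout)
    if wakeup_fd in ready:
        # One byte per received signal; the byte value is the signal number.
        return "signal", list(os.read(wakeup_fd, 64))
    if data_fd in ready:
        return "data", os.read(data_fd, 4096)  # b"" means EOF
    return "timeout", None


# Without a Python-level handler, SIGUSR1's default action would
# terminate the process and no wakeup byte would be written.
signal.signal(signal.SIGUSR1, lambda signum, frame: None)
wake_r, wake_w = make_wakeup_pipe()
rfd, wfd = os.pipe()                  # nobody writes to wfd
os.kill(os.getpid(), signal.SIGUSR1)  # simulate the child's signal
event = wait_for_event(rfd, wake_r, 1.0)
print(event)
signal.set_wakeup_fd(-1)              # unregister the wakeup fd
```

With this shape, a signal that arrives just before the wait starts still makes the wakeup descriptor readable, and the timeout bounds the wait even if neither data, EOF nor a signal ever arrives.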
That is all information I am able to provide.