
I have a Python function that calls a wrapper around a C function (which I can't change). Most of the time the C function is very fast, but when it fails the call hangs forever. To work around this, I time out the call using multiprocessing:

pool = multiprocessing.Pool(processes=4)
try:
    res = pool.apply_async(my_dangerous_cpp_function, args=(bunch, of, vars))
    return res.get(timeout=1.)  # give up after one second
except multiprocessing.TimeoutError:
    terminate_pool(pool)
    pool = multiprocessing.Pool(processes=4)  # replace the broken pool

How can I terminate the pool when the function being called doesn't respond to any signal?

If I replace terminate_pool(pool) with pool.terminate(), the call to pool.terminate() hangs as well. Instead, I'm currently sending SIGKILL to all sub-processes:

import os

def terminate_pool(pool):
    for p in pool._pool:
        os.kill(p.pid, 9)  # SIGKILL each worker
    pool.close()  # ok, doesn't hang
    #pool.join()  # not ok, hangs forever

This way, hanging sub-processes stop eating 100% CPU. However, I can't call pool.terminate() or pool.join() (they hang), so I just leave the pool object behind and create a new one. Even though they received a SIGKILL, the sub-processes are still around, so my number of Python processes never stops increasing...
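
I suspect the killed children linger as zombies because nothing ever waits on them. A minimal sketch of what I mean (untested, and relying on the private pool._pool attribute):

import os
import signal

def kill_and_reap(pool):
    # Kill every worker outright, then wait on each dead process so the
    # kernel can drop the zombie entry from the process table.
    for p in pool._pool:                 # pool._pool is a private attribute
        os.kill(p.pid, signal.SIGKILL)
    for p in pool._pool:
        p.join()                         # Process.join(), not Pool.join();
                                         # the worker is already dead, so
                                         # this returns immediately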

Is there a way to annihilate the pool and all its sub-processes once and for all?

Tastalian
1 Answer


The standard multiprocessing.Pool is not designed to deal with worker timeouts.

The Pebble process Pool does support timing out tasks.

from pebble import process, TimeoutError

with process.Pool() as pool:
    task = pool.schedule(function, args=[1, 2], timeout=5)

    try:
        result = task.get()
    except TimeoutError:
        print("Task: %s took more than 5 seconds to complete" % task)
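
Note that the Pebble API has changed in more recent releases. Assuming a current version of Pebble, the equivalent would look roughly like this, with schedule returning a concurrent.futures-style future:

from concurrent.futures import TimeoutError
from pebble import ProcessPool

def function(a, b):
    return a + b

with ProcessPool() as pool:
    future = pool.schedule(function, args=[1, 2], timeout=5)
    try:
        result = future.result()  # raises TimeoutError if the 5 s limit is hit
    except TimeoutError:
        print("Task took more than 5 seconds to complete")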
noxdafox
  • Thanks! I tried it out and it works faster when no timeout happens, but Pebble has a major problem in my case: my function calls are done within threads from [rospy](http://answers.ros.org/question/9543/rospy-threading-model/), but Pebble kills these threads when timeouts occur. The same happens when I use ``pebble.process.concurrent``: after one call, the rospy thread that called it is terminated, and thus stops doing the other useful stuff it should be doing... Is there a way to keep these threads alive? – Tastalian Mar 01 '16 at 05:49
  • I don't know rospy threading internals. Assuming it's using regular threads, then this is the expected behaviour. You can look at a process as a sort of "container for threads": if you terminate a process (in your example, due to a hanging thread), the whole container and its threads will be destroyed. This is quite a powerful design, as it allows you to isolate your service from the unstable code (and allows you to scale on multiple CPUs). – noxdafox Mar 02 '16 at 09:12
  • In such scenarios, you need to re-initialize your workers once they are destroyed. This is the case with DB connections, sockets, etc. You can use the [initializer](http://pythonhosted.org/Pebble/#Pool) parameter for doing so (a minimal sketch follows these comments). – noxdafox Mar 02 '16 at 09:14
  • Thank you very much for your comments :) I'm sorry for not being precise enough in my description, but here is what my layout looks like, and why it is problematic that Pebble kills the threads. The main process spins ROS threads and creates a global ``pool = process.Pool()`` object. Then, one ROS thread calls ``task = pool.schedule(function)`` and ``task.get()``. If the call to ``function`` hangs, the ROS thread that **called** ``task.get()`` (not the thread in the separate process that executes it) is killed. – Tastalian Mar 04 '16 at 04:39
  • Pebble kills only the spawned process (and all its threads), not the caller. Therefore there's some issue within your code. Are you capturing the TimeoutError exception as in the example? Try wrapping the thread logic within a `try: except:` and see which exception is stopping your thread. – noxdafox Mar 07 '16 at 19:50
  • OK, I've figured it out; it was actually more complicated than this. The fact is that pebble uses signals (e.g. at ``pebble.process.pool:291``), while rospy also sets up its own signals. It turns out the combination of the two sometimes causes a broken pipe in the ROS client-server communication socket (at ``rospy.impl.tcpros_base:657`` in my case). This problem is solved by setting the ``disable_signals=True`` keyword argument in ``rospy.init_node()``. So it seems the signals used by pebble are not 100% isolated from signals set up by other tools (like rospy here). – Tastalian Mar 09 '16 at 06:19
  • Anyway, now I can use pebble in my project :) Thanks for helping me and being responsive here. – Tastalian Mar 09 '16 at 06:21
  • Signals are never "isolated": they are delivered to a process and can't be delivered to a single thread. On top of that, Python does a pretty poor job of handling signals in multithreaded environments. My guess is that the ROS socket polling mechanism was woken up by the SIGCHLD signal delivered to the parent process when the child ones were killed. This probably puts rospy in an inconsistent state. – noxdafox Mar 09 '16 at 08:57
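
A minimal sketch of the initializer idea mentioned in the comments, assuming a recent Pebble release; init_worker and work are made-up names used only for illustration:

import os
from pebble import ProcessPool

def init_worker():
    # Runs once in every worker process when it starts, including the
    # replacement workers spawned after a timed-out worker is killed.
    print("worker %d initialised" % os.getpid())

def work(x):
    return x * x

with ProcessPool(initializer=init_worker) as pool:
    future = pool.schedule(work, args=[3], timeout=5)
    print(future.result())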