
I have a simple server:

from multiprocessing import Pool, TimeoutError
import time
import os


if __name__ == '__main__':
    # start worker processes
    pool = Pool(processes=1)

    while True:
        # evaluate "os.getpid()" asynchronously
        res = pool.apply_async(os.getpid, ())  # runs in *only* one process
        try:
            print(res.get(timeout=1))             # prints the PID of that process
        except TimeoutError:
            print('worker timed out')

        time.sleep(5)

    pool.close()   # note: never reached, because the while True loop above never exits
    print("Now the pool is closed and no longer available")
    pool.join()
    print("Done")

If I run this I get something like:

47292
47292

Then I kill 47292 while the server is running. A new worker process is started, but the output of the server is:

47292
47292
worker timed out
worker timed out
worker timed out

The pool is still trying to send requests to the old worker process.

I've done some work with catching signals in both the server and the workers, and I can get slightly better behaviour, but the server still seems to wait for dead children on shutdown (i.e. pool.join() never returns) after a worker is killed.

What is the proper way to handle workers dying?

Graceful shutdown of workers from a server process only seems to work if none of the workers has died.

(On Python 3.4.4 but happy to upgrade if that would help.)

UPDATE: Interestingly, this worker timeout problem does NOT happen if the pool is created with processes=2 and you kill one worker process, wait a few seconds and kill the other one. However, if you kill both worker processes in rapid succession then the "worker timed out" problem manifests itself again.

Perhaps related: when the problem occurs, killing the server process leaves the worker processes running.

ivo

1 Answer


This behavior comes from the design of multiprocessing.Pool. When you kill a worker, you might kill the one holding the lock on the call_queue (call_queue.rlock). If that process dies while holding the lock, no other process can ever read from the call_queue again, which breaks the Pool: it can no longer communicate with its workers.
So there is actually no way to kill a worker and be sure the Pool will still be okay afterwards, because you might end up in a deadlock.

multiprocessing.Pool does not handle its workers dying. You can try concurrent.futures.ProcessPoolExecutor instead (with a slightly different API), which handles the failure of a process by default: when a process dies in a ProcessPoolExecutor, the whole executor is shut down and you get back a BrokenProcessPool exception.
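For illustration, here is a minimal sketch of the server loop from your question rewritten on top of ProcessPoolExecutor; recreating the executor after a failure is one possible strategy for a long-running server, not something the stdlib prescribes:

from concurrent.futures import ProcessPoolExecutor, TimeoutError
from concurrent.futures.process import BrokenProcessPool
import os
import time

if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=1)

    while True:
        future = executor.submit(os.getpid)
        try:
            print(future.result(timeout=1))
        except TimeoutError:
            print('worker timed out')
        except BrokenProcessPool:
            # A worker died: the executor is permanently broken,
            # so replace it with a fresh one and keep serving.
            print('worker died, recreating executor')
            executor = ProcessPoolExecutor(max_workers=1)

        time.sleep(5)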

Note that there are other deadlocks in this implementation, that should be fixed in loky. (DISCLAIMER: I am a maintainer of this library). Also, loky let you resize an existing executor using a ReusablePoolExecutor and the method _resize. Let me know if you are interested, I can provide you some help starting with this package. (I realized we still need a bit of work on the documentation... 0_0)
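To give an idea, a minimal sketch using loky's documented get_reusable_executor entry point (this assumes loky is installed, e.g. with pip install loky):

from loky import get_reusable_executor
import os

# get_reusable_executor returns a shared process-pool executor that can
# be reused and resized by calling it again with a different max_workers.
executor = get_reusable_executor(max_workers=1)
print(executor.submit(os.getpid).result())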

Thomas Moreau
  • My use-case is a long-running server where the parent process reads jobs from an external queue and hands each job off to a child process to execute (one child per job). Obviously I want to be able to handle child deaths. Maybe I should use Process directly and roll my own solution, but I wanted to avoid trying to solve problems already solved in some of the higher-level objects. If ProcessPoolExecutor sounds like the best solution for my problem, I'd love to get more info on getting started with it. Thanks! – ivo Aug 03 '17 at 14:16
  • You want to handle the child process dying of what, exactly? Handling the death of a child process killed by external causes is very hard, because you might cause deadlocks. If you want to handle the normal termination of workers, you can use a `Pool` whose processes exit after each task (see the `maxtasksperchild` sketch after this thread); the `Pool` will then spawn a new process for each new task submitted. Please specify your constraints if you want a more precise answer. As your question stands, you want to be able to kill a `Pool` worker with an external `kill`, which is not possible, as stated in my answer. – Thomas Moreau Aug 03 '17 at 14:33
  • Just to wrap up: the broken `Pool` you describe in your question is due to a deadlock, as stated in the answer. If you want to handle the death of a worker, you need to specify the circumstances: murder or accident? :) The design you can use depends on that distinction. The main issue with a non-Python death of a worker is that you cannot be sure the synchronization primitives (`Lock`) won't end up in an unrecoverable state. – Thomas Moreau Aug 03 '17 at 15:30
  • Thanks for your in-depth answer. The death would be by accident -- some workers call into complex third-party C++ code which could potentially segfault. It's possible in that case I could trap the signal and get the worker to shut down cleanly (as far as the Pool is concerned). This is probably a rare occurrence, so I could simply add a monitoring task to verify the pool is still working and not deadlocked. – ivo Aug 04 '17 at 15:12
  • If your task dies while running, there should be no deadlock. Creating a thread to handle worker deaths is a possibility. If you do not want to code it yourself, `concurrent.futures` already has this feature. The drawback is that, because it treats any worker death as fatal, it cannot recover the pool once a worker dies. `loky` is just a somewhat more robust reimplementation of `concurrent.futures`. – Thomas Moreau Aug 05 '17 at 10:25
  • @ThomasMoreau If I use `ProcessPoolExecutor` from `concurrent.futures` instead of loky, what am I missing? I want to be able to get a nice exception when a worker is killed; no need to resize or reuse. Multiprocessing does not work, but by installing `futures` I can use `concurrent.futures` in Python 2 as well. – dashesy Apr 26 '18 at 22:17
  • What are the types of, or examples of, deadlocks when using `concurrent.futures.ProcessPoolExecutor`? – dashesy Apr 26 '18 at 22:19
  • Here are some examples of deadlocks (with pickling errors): http://loky.readthedocs.io/en/stable/auto_examples/index.html . Also, note that the `futures` backport is very different from `concurrent.futures`; for instance, it does not detect workers that have died. You should use `loky.ProcessPoolExecutor` if you want a reliable backport of `concurrent.futures` to Python 2.7. – Thomas Moreau Apr 27 '18 at 08:34
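As mentioned in the comments above, here is a minimal sketch of the maxtasksperchild approach: each worker exits after completing a task and the Pool transparently starts a replacement (the limit of one task per child is an illustrative choice):

from multiprocessing import Pool
import os

if __name__ == '__main__':
    # maxtasksperchild=1 makes each worker process exit after one task;
    # the Pool spawns a fresh worker for the next task automatically.
    with Pool(processes=1, maxtasksperchild=1) as pool:
        for _ in range(3):
            print(pool.apply(os.getpid))  # prints a different PID each time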