I have a simple server:
    from multiprocessing import Pool, TimeoutError
    import time
    import os

    if __name__ == '__main__':
        # start worker processes
        pool = Pool(processes=1)

        while True:
            # evaluate "os.getpid()" asynchronously
            res = pool.apply_async(os.getpid, ())  # runs in *only* one process
            try:
                print(res.get(timeout=1))  # prints the PID of that process
            except TimeoutError:
                print('worker timed out')

            time.sleep(5)

        pool.close()
        print("Now the pool is closed and no longer available")
        pool.join()
        print("Done")
If I run this I get something like:
47292
47292
Then I kill 47292 while the server is running. A new worker process is started, but the output of the server is:
47292
47292
worker timed out
worker timed out
worker timed out
The pool is still trying to send requests to the old worker process.
I've experimented with catching signals in both the server and the workers, and I can get slightly better behaviour, but the server still seems to wait for dead children on shutdown (i.e. pool.join() never returns) after a worker is killed.
What is the proper way to handle workers dying?
Graceful shutdown of workers from a server process only seems to work if none of the workers has died.
(On Python 3.4.4 but happy to upgrade if that would help.)
UPDATE: Interestingly, this worker timeout problem does NOT happen if the pool is created with processes=2 and you kill one worker process, wait a few seconds, and then kill the other one. However, if you kill both worker processes in rapid succession, the "worker timed out" problem shows up again.
Perhaps related: when the problem occurs, killing the server process leaves the worker processes running.
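
For comparison, here is a minimal sketch of one alternative I'm aware of for noticing a dead worker: concurrent.futures.ProcessPoolExecutor raises BrokenProcessPool on pending futures when a worker exits abruptly, instead of leaving the caller to time out. The helper name run_with_restart and its parameters are hypothetical, and os._exit(1) only simulates a killed worker; I'm not claiming this is the proper fix for multiprocessing.Pool itself.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def run_with_restart(tasks, max_workers=1, timeout=5):
    """Run (func, args) tasks in order; rebuild the executor if a worker dies.

    Hypothetical helper. Returns a list of results, with None recorded for
    any task whose worker died before it could return.
    """
    results = []
    executor = ProcessPoolExecutor(max_workers=max_workers)
    try:
        for func, args in tasks:
            try:
                future = executor.submit(func, *args)
                results.append(future.result(timeout=timeout))
            except BrokenProcessPool:
                # A worker exited abruptly, so this executor is unusable.
                # Discard it and start a fresh one for the remaining tasks.
                executor.shutdown(wait=False)
                executor = ProcessPoolExecutor(max_workers=max_workers)
                results.append(None)
    finally:
        executor.shutdown()
    return results


if __name__ == "__main__":
    # os._exit(1) stands in for a worker being killed mid-task.
    print(run_with_restart([(os.getpid, ()), (os._exit, (1,)), (os.getpid, ())]))
```

The trade-off in this sketch is that the task running on the dead worker is recorded as None rather than retried; whether a retry is safe depends on the task.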