I launched a pool of worker processes and submitted a bunch of tasks. The system ran low on memory and the oomkiller killed one of the worker processes. The parent process just hung there waiting for the tasks to finish and never returned.
Here's a runnable example that reproduces the problem. Instead of waiting for oomkiller to kill one of the worker processes, I find the process ids of all the worker processes and tell the first task to kill that process. (The call to ps
won't work on all operating systems.)
import os
import signal
from multiprocessing import Pool
from random import choice
from subprocess import run, PIPE
from time import sleep
def run_task(task):
target_process_id, n = task
print(f'Processing item {n} in process {os.getpid()}.')
delay = n + 1
sleep(delay)
if n == 0:
print(f'Item {n} killing process {target_process_id}.')
os.kill(target_process_id, signal.SIGKILL)
else:
print(f'Item {n} finished.')
return n, delay
def main():
print('Starting.')
pool = Pool()
ps_output = run(['ps', '-opid', '--no-headers', '--ppid', str(os.getpid())],
stdout=PIPE, encoding='utf8')
child_process_ids = [int(line) for line in ps_output.stdout.splitlines()]
target_process_id = choice(child_process_ids[1:-1])
tasks = ((target_process_id, i) for i in range(10))
for n, delay in pool.imap_unordered(run_task, tasks):
print(f'Received {delay} from item {n}.')
print('Closing.')
pool.close()
pool.join()
print('Done.')
if __name__ == '__main__':
main()
When I run that on a system with eight CPU's, I see this output:
Starting.
Processing item 0 in process 303.
Processing item 1 in process 304.
Processing item 2 in process 305.
Processing item 3 in process 306.
Processing item 4 in process 307.
Processing item 5 in process 308.
Processing item 6 in process 309.
Processing item 7 in process 310.
Item 0 killing process 308.
Processing item 8 in process 303.
Received 1 from item 0.
Processing item 9 in process 315.
Item 1 finished.
Received 2 from item 1.
Item 2 finished.
Received 3 from item 2.
Item 3 finished.
Received 4 from item 3.
Item 4 finished.
Received 5 from item 4.
Item 6 finished.
Received 7 from item 6.
Item 7 finished.
Received 8 from item 7.
Item 8 finished.
Received 9 from item 8.
Item 9 finished.
Received 10 from item 9.
You can see that item 5 never returns, and the pool just waits forever.
How can I get the parent process to notice when a child process is killed?