
I have a Python program that starts N subprocesses (clients) which send requests to, and listen for responses from, the main process (server). The interprocess communication goes through multiprocessing.Queue objects (which are built on pipes), with one queue per consumer, so one request queue and N response queues:

                1 req_queue
                              <-- Process-1
MainProcess <-- ============= <-- …
                              <-- Process-N

                N resp_queues
            --> ============= --> Process-1
MainProcess --> ============= --> …
            --> ============= --> Process-N

The (simplified) program:

import multiprocessing


def work(event, req_queue, resp_queue):
    while not event.is_set():
        name = multiprocessing.current_process().name
        x = 3
        req_queue.put((name, x))
        print(name, 'input:', x)
        y = resp_queue.get()
        print(name, 'output:', y)


if __name__ == '__main__':
    event = multiprocessing.Event()
    req_queue = multiprocessing.Queue()
    resp_queues = {}
    processes = {}
    N = 10
    for _ in range(N):  # start N subprocesses
        resp_queue = multiprocessing.Queue()
        process = multiprocessing.Process(
            target=work, args=(event, req_queue, resp_queue))
        resp_queues[process.name] = resp_queue
        processes[process.name] = process
        process.start()
    for _ in range(100):  # handle 100 requests
        (name, x) = req_queue.get()
        y = x ** 2
        resp_queues[name].put(y)
    event.set()  # stop the subprocesses
    for process in processes.values():
        process.join()

The problem that I am facing is that the execution of this program (under Python 3.11.2) sometimes never finishes: it hangs at the line y = resp_queue.get() in some subprocess after the main process notifies the subprocesses to stop at the line event.set(). The problem is the same if I use the threading library instead of the multiprocessing library.

How can I stop the subprocesses?

Géry Ogam

2 Answers


queue.get() is a blocking call: a thread (or process) that reaches it waits until an item is put on the queue, and it won't be woken up by setting the event if it is already blocked inside get().

The way this is usually done (even in the standard modules) is to send None (or another sentinel object) on the queue to wake the processes waiting on it, and have them terminate when there is no more work.

event.set()
for resp_queue in resp_queues.values():  # resp_queues is a dict, so iterate its values
    resp_queue.put(None)

This makes the event only useful for early termination; if early termination is not needed, you can omit the event from the workers altogether.

def work(event, req_queue, resp_queue):
    while True:
        ...
        y = resp_queue.get()
        if y is None:  # sentinel: no more work
            break
        print(name, 'output:', y)
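
Putting the two fragments together, the main process's shutdown could look like this (a sketch using the names from your program):

event.set()  # only needed if workers still check the event for early termination
for resp_queue in resp_queues.values():
    resp_queue.put(None)  # one sentinel per response queue unblocks its worker
for process in processes.values():
    process.join()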

Obviously, a bare queue.get() can lead to a resource leak if the main process fails, so another thing you should do is use a timeout on the queue rather than leaving it waiting forever.

y = resp_queue.get(timeout=0.1)

This makes sure the processes terminate eventually on "unexpected failures", but sending the None is what gives instantaneous termination.
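
Note that get(timeout=...) raises the standard queue.Empty exception when the timeout expires, so the worker has to catch it. A minimal sketch of the waiting part of the loop, assuming the structure of your work function:

import queue  # multiprocessing.Queue.get raises queue.Empty on timeout

def work(event, req_queue, resp_queue):
    while not event.is_set():
        ...
        while True:
            try:
                y = resp_queue.get(timeout=0.1)
                break
            except queue.Empty:      # timed out
                if event.is_set():   # stop was requested while we waited
                    return
        print(name, 'output:', y)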

If you have multiple resp_queue.get() calls embedded throughout your code, so that a simple break on None won't work, then you can call sys.exit() when you receive the None to terminate the worker. This does the necessary cleanup and can only be caught by a bare except: (not by except Exception:). The code that intercepts the None and calls sys.exit can be hidden in a subclass of multiprocessing.queues.Queue.

import multiprocessing
import multiprocessing.queues
import sys
from typing import Optional


class MyQueue(multiprocessing.queues.Queue):

    def get(self, block: bool = True, timeout: Optional[float] = None) -> object:
        # set a default alternative timeout here if you want
        return_value = super().get(block=block, timeout=timeout)
        if return_value is None:  # or another dummy class used as a signal
            sys.exit()  # or raise a known exception
        return return_value
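
Note that, unlike the multiprocessing.Queue() factory function, multiprocessing.queues.Queue must be given a context. A minimal sketch of how the subclass could be instantiated:

ctx = multiprocessing.get_context()
resp_queue = MyQueue(ctx=ctx)  # drop-in replacement for multiprocessing.Queue()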
Ahmed AEK
  • Thank you! What I forgot to mention is that in my code I don’t control the worker code because it can be redefined by the user in a subclass, and typically the user will use multiple request–response pairs in the `work` function. So sending a single `None` sentinel per worker to unblock the `y = resp_queue.get()` call may not be enough, as there can be several such calls. Do you have a solution for an open-ended number of `y = resp_queue.get()` calls? – Géry Ogam Mar 05 '23 at 18:59
  • @Maggyero you can simply call `sys.exit` when you get the `None` signal, which does the necessary resource cleanup but can only be caught by a bare `except:`, and to hide it from the user you can hide it inside a subclass of `multiprocessing.queues.Queue` that you will be using. – Ahmed AEK Mar 05 '23 at 19:20
  • Very interesting, could you add this to your answer? And +1 for the subtle distinction between early termination and late termination in your answer. – Géry Ogam Mar 05 '23 at 19:40
  • I found an alternative solution to yours [here](https://stackoverflow.com/a/75644825/2326961), what are your thoughts on this? I’ll accept your answer anyway. – Géry Ogam Mar 05 '23 at 19:42
  • @Maggyero seems like a hacky way to make sure each `get` is matched by a `None`, but what if the number of requests varies from run to run because of a condition or a loop? You could still run into the deadlock. – Ahmed AEK Mar 05 '23 at 19:55
  • Do you mean what if the number of requests `req_queue.put((name, x))` does not match the number of responses `y = resp_queue.get()` in the `work` function? – Géry Ogam Mar 05 '23 at 20:22
  • @Maggyero exactly – Ahmed AEK Mar 05 '23 at 20:24
  • Yes, that would deadlock. A matching number of requests and responses in the `work` function is indeed an assumption of my alternative solution. – Géry Ogam Mar 05 '23 at 20:26

Here is an alternative solution to @AhmedAEK’s, which works for an open-ended number of request–response pairs in the work function, i.e. an open-ended number of req_queue.put((name, x)) and y = resp_queue.get() call pairs (in my real program I don’t control the worker code, because it can be redefined by the user in a subclass):

import queue  # for the queue.Empty exception

...

if __name__ == '__main__':
    ...
    event.set()
    try:
        while True:
            (name, x) = req_queue.get(timeout=1)
            resp_queues[name].put(None)
    except queue.Empty:
        pass
    ...
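
To illustrate why this drain loop handles an open-ended number of pairs, consider a hypothetical user-defined work function that issues several requests per iteration (the two-request loop below is made up for illustration). Every time a worker blocks on resp_queue.get(), it has just put a matching request on req_queue, so the drain loop above answers each such request with a None until no request arrives for 1 second:

def work(event, req_queue, resp_queue):
    name = multiprocessing.current_process().name
    while not event.is_set():
        for x in (3, 5):  # two request–response pairs per iteration
            req_queue.put((name, x))
            y = resp_queue.get()  # unblocked by a drained None after event.set()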
Géry Ogam