
I have a multiprocessing workflow with many consumers that execute tasks from a managed queue. There are two subclasses of multiprocessing.Process: Consumer, which runs Task objects that account for most of the runtime, and Aggor, which runs continuously and aggregates the output of each Task call. The problem is that the managed queue (which I am using because multiprocessing.Queue() seemed to lose results frequently) appears to hang: the result queue reads as empty until the first Consumer exits, and then all of its results are suddenly dumped onto the queue at once. This is a problem because x and agg are very large, so holding many of them in memory in parallel is intractable. Ideally, Aggor would fold each result into the final output as soon as it is added to the queue.

I would like to understand why the queue appears to block "get" until each Consumer process ends, and how to work around this to achieve the smaller memory profile I am after.

import time
import multiprocessing as mp
import numpy as np

class Consumer(mp.Process):
    def __init__(self, task_queue, result_queue, x, y):
        mp.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue
        self.x = x
        self.y = y
    def run(self):
        proc_name = self.name
        while True:
            next_task = self.task_queue.get()
            if next_task is None:
                # None is the poison pill: acknowledge it and exit.
                self.task_queue.task_done()
                break
            (answer, ind) = next_task(self.x, self.y)
            self.task_queue.task_done()
            self.result_queue.put(answer)
        return

and

class Aggor(mp.Process):
    def __init__(self, result_queue, final_queue, agg):
        mp.Process.__init__(self)
        self.result_queue = result_queue
        self.final_queue = final_queue
        self.agg = agg

    def run(self):
        proc_name = self.name
        while True:
            if not self.result_queue.empty():
                answer = self.result_queue.get()
                if answer is None:
                    break
                else:
                    self.agg = welford(self.agg, answer)
            else:
                # Queue looked empty; poll again after a short sleep.
                time.sleep(1)
                continue
        self.final_queue.put(self.agg)
        return

and a task

class Task(object):
    def __init__(self, rf, ind):
        self.rf = rf
        self.ind = ind
    def __call__(self, x, y):
        self.rf.fit(x, y)
        m = Importance(self.rf, x)  # the very time-consuming, single-threaded step
        return (m, self.ind)
    def __str__(self):
        return f"job {self.ind}"

manager = mp.Manager()
tasks = mp.JoinableQueue()
results = manager.Queue()
final = manager.Queue()

agg = (np.zeros(x.shape), np.zeros(x.shape))
ag = Aggor(results, final, agg)
ag.start()

consumers = [Consumer(tasks, results, x, y) for _ in range(num_consumers)]

for w in consumers:
    w.start()

for ii in range(n_times):
    tasks.put(Task(rf, ii))

for i in range(num_consumers):
    tasks.put(None)

tasks.join()

# The code hangs here for a long time, and Aggor does not process anything until the first Consumer exits.
results.put(None)
ag.join()

final_result = final.get()

I expected the result queue to receive output as soon as the consumer processes produced it, so that the Aggor process could run largely concurrently with the consumers. Instead, the Consumer processes run all the way to completion and exit before the Aggor process is able to get any results from the result queue.

Other variations I have tried: multiprocessing.Queue instead of the manager.Queue, and standalone functions launched with mp.Process instead of Process subclasses. Pool and map would be ideal, except that they want to hold a monolithic list of my giant output matrices, and there is no reduce operation into which I could pass my aggregator as a lambda. I am open to small tweaks, and also to someone telling me these are simply the wrong tools for the job and that I should commit to a different, more functional approach.
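
For concreteness, the Pool variant I mean would look roughly like the sketch below (untested; run_task is a hypothetical module-level wrapper so it can be pickled, and it assumes the same Task, welford, rf, x, y, num_consumers, and n_times as above). Consuming imap_unordered in the parent does fold results one at a time, but it gives up the dedicated Aggor process, which is why I started with the queue-based design.

# Untested sketch of the Pool variant. run_task is a hypothetical helper.
from multiprocessing import Pool

def run_task(ind):
    # x, y, rf are assumed to be inherited by the workers (e.g. via fork).
    return Task(rf, ind)(x, y)

if __name__ == "__main__":
    agg = (np.zeros(x.shape), np.zeros(x.shape))
    with Pool(num_consumers) as pool:
        # imap_unordered yields each result as soon as it finishes, so only
        # one output matrix plus the running aggregate is in memory at once.
        for answer, ind in pool.imap_unordered(run_task, range(n_times)):
            agg = welford(agg, answer)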

  • As an aside, in method `Consumer.run` you should reverse the order of the statements `self.task_queue.task_done()` and `self.result_queue.put(answer)`; otherwise you have a race condition in your main process with the statements `tasks.join(); results.put(None)` (the `None` can be put on the queue *before* the final item put by `Consumer.run`). But you don't need a joinable queue at all. Instead, just do joins on all the consumer processes. Also, in `Aggor.run` why don't you simply do a blocking `get` call instead of testing the queue for empty and sleeping? – Booboo Feb 24 '23 at 15:28
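
Putting that comment's suggestions together, the revised pattern would look something like the following sketch (untested; it assumes the same Task, welford, x, y, rf, num_consumers, and n_times as above). The joinable queue goes away, the main process joins the Consumer processes instead, and Aggor blocks on get rather than polling empty() and sleeping.

# Untested sketch applying the comment's suggestions.
class Consumer(mp.Process):
    def __init__(self, task_queue, result_queue, x, y):
        mp.Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue
        self.x = x
        self.y = y

    def run(self):
        while True:
            next_task = self.task_queue.get()
            if next_task is None:
                break
            answer, ind = next_task(self.x, self.y)
            # No task_done: completion is signalled by process exit instead.
            self.result_queue.put(answer)

class Aggor(mp.Process):
    def __init__(self, result_queue, final_queue, agg):
        mp.Process.__init__(self)
        self.result_queue = result_queue
        self.final_queue = final_queue
        self.agg = agg

    def run(self):
        while True:
            answer = self.result_queue.get()  # blocks until an item arrives
            if answer is None:
                break
            self.agg = welford(self.agg, answer)
        self.final_queue.put(self.agg)

manager = mp.Manager()
tasks = manager.Queue()    # plain queue; no join/task_done bookkeeping
results = manager.Queue()
final = manager.Queue()

ag = Aggor(results, final, (np.zeros(x.shape), np.zeros(x.shape)))
ag.start()

consumers = [Consumer(tasks, results, x, y) for _ in range(num_consumers)]
for w in consumers:
    w.start()

for ii in range(n_times):
    tasks.put(Task(rf, ii))
for _ in range(num_consumers):
    tasks.put(None)

for w in consumers:
    w.join()           # every result has been put by the time these return
results.put(None)      # the sentinel can no longer overtake a result
ag.join()
final_result = final.get()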
