
Here is some pseudocode for what I'm doing:

import multiprocessing as mp
from multiprocessing import Manager

from tqdm import tqdm

def loop(arg):
  # do stuff
  # ...
  results.append(result_of_stuff)

if __name__ == '__main__':
  manager = Manager()
  results = manager.list()

  with mp.get_context('spawn').Pool(4) as pool:
    list(tqdm(pool.imap(loop, ls), total=len(ls)))

  # do stuff with `results`
  # ...

So the issue here is that `loop` doesn't know about `results`. I have one working way to do this, which is to use "fork" instead of "spawn", but I need to use "spawn" for reasons beyond the scope of my question.

So what is the minimal change I need to make for this to work? And I really want to keep tqdm, hence the use of `imap`.

PS: I'm on Linux

Alexander Soare
  • Note: I think my question might be a duplicate of [this](https://stackoverflow.com/questions/11937895/python-multiprocessing-manager-initiates-process-spawn-loop). Waiting for confirmation from someone else. I guess I'd end up asking how using `apply_async` varies from `map`. – Alexander Soare Mar 22 '21 at 16:22
  • On the above note, I tried it and I'm getting an empty results list, which was the issue I was getting with the snippet in my main post – Alexander Soare Mar 22 '21 at 16:27

2 Answers


You can use `functools.partial` to bind the extra parameter:

import multiprocessing as mp
import os

from functools import partial
from multiprocessing import Manager

from tqdm import tqdm


def loop(results, arg):
    results.append(len(arg))


def main():
    ctx = mp.get_context("spawn")
    manager = Manager()
    l = manager.list()
    partial_loop = partial(loop, l)

    ls = os.listdir("/tmp")

    with ctx.Pool() as pool:
        # loop returns None, so there is no point keeping imap's return values;
        # the real results accumulate in the managed list `l`
        list(tqdm(pool.imap(partial_loop, ls), total=len(ls)))

    print(f"Sum: {sum(l)}")


if __name__ == "__main__":
    main()

There is some overhead with this approach, as the Manager runs its server in a separate child process.
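You can see that overhead directly: the Manager's server shows up as a child process of the main process. A minimal sketch using the stdlib's `multiprocessing.active_children`:

```python
import multiprocessing as mp
from multiprocessing import Manager


def main():
    print(len(mp.active_children()))  # no children yet
    manager = Manager()
    # the Manager's server process is now running as a child
    print(len(mp.active_children()))
    manager.shutdown()


if __name__ == "__main__":
    main()
```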

Since you will process the results in the main process anyway, I would do something like this instead (though it depends on your circumstances, of course):

import multiprocessing as mp
import os

from tqdm import tqdm


def loop(arg):
    return len(arg)


def main():
    ctx = mp.get_context("spawn")

    ls = os.listdir("/tmp")

    with ctx.Pool() as pool:
        results = list(tqdm(pool.imap(loop, ls), total=len(ls)))

    print(f"Sum: {sum(results)}")


if __name__ == "__main__":
    main()
HTF

I know you have already accepted an answer, but let me add my "two cents":

The other way of solving your issue is to initialize each process in your pool with the global variable results, as you originally intended. The problem was that when using spawn, newly created processes do not inherit the address space of the main process (which includes the definition of results); instead, execution starts from the top of the program. But the code that creates results never gets executed in the workers because of the if __name__ == '__main__' check. That is a good thing, though, because you do not want each worker to have a separate instance of this list anyway.
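You can verify this with a small sketch (the names here are illustrative): a global created inside the if __name__ == '__main__' block simply does not exist in spawned workers.

```python
import multiprocessing as mp


def check(_):
    # spawned workers re-import this module but skip the __main__ guard,
    # so the name "results" was never created in their address space
    return "results" in globals()


if __name__ == "__main__":
    results = []  # exists only in the main process under "spawn"
    with mp.get_context("spawn").Pool(2) as pool:
        print(pool.map(check, range(4)))  # [False, False, False, False]
```

Under "fork" the same check would print all True, because forked workers inherit a copy of the parent's address space, results included.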

So how do we share the same instance of the global variable results across all processes? By using a pool initializer, as shown below. Also, if you want an accurate progress bar, you should really use imap_unordered instead of imap, so that the bar is updated in task-completion order rather than in the order in which tasks were submitted. For example, if the first task submitted happens to be the last to complete, then with imap the progress bar would not move at all until all the tasks had completed, and then it would jump to 100% all at once.

Note: the documentation for imap_unordered only states that the results will be returned in arbitrary order, not completion order. It does, however, seem that when a chunksize argument of 1 is used (the default if not explicitly specified), the results are returned in completion order. If you do not want to rely on this, use apply_async with a callback function that updates the progress bar instead. See the last code example.

import multiprocessing as mp
from multiprocessing import Manager

from tqdm import tqdm

def init_pool(the_results):
  global results
  results = the_results


def loop(arg):
  import time
  # do stuff
  # ...
  time.sleep(1)
  results.append(arg ** 2)

if __name__ == '__main__':
  manager = Manager()
  results = manager.list()

  ls = list(range(1, 10))
  with mp.get_context('spawn').Pool(4, initializer=init_pool, initargs=(results,)) as pool:
    list(tqdm(pool.imap_unordered(loop, ls), total=len(ls)))
  print(results)
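To make the ordering difference concrete, here is a small sketch (the timings are illustrative) where the first-submitted task is the slowest:

```python
import multiprocessing as mp
import time


def slow_first(i):
    # make the first-submitted task the last one to finish
    time.sleep(0.4 if i == 0 else 0.1)
    return i


if __name__ == "__main__":
    with mp.get_context("spawn").Pool(2) as pool:
        # imap yields in submission order: nothing appears until task 0 is done
        print(list(pool.imap(slow_first, range(4))))  # [0, 1, 2, 3]
        # imap_unordered (chunksize=1) yields as tasks finish,
        # typically [1, 2, 3, 0] here since task 0 sleeps the longest
        print(list(pool.imap_unordered(slow_first, range(4))))
```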

Update: Another (Better) Way

import multiprocessing as mp

from tqdm import tqdm

def loop(arg):
  import time
  # do stuff
  # ...
  time.sleep(1)
  return arg ** 2

if __name__ == '__main__':

  results = []
  ls = list(range(1, 10))
  with mp.get_context('spawn').Pool(4) as pool:
    with tqdm(total=len(ls)) as pbar:
      for v in pool.imap_unordered(loop, ls):
        results.append(v)
        pbar.update(1)
  print(results)

Update: The Safest Way

import multiprocessing as mp

from tqdm import tqdm

def loop(arg):
  import time
  # do stuff
  # ...
  time.sleep(1)
  return arg ** 2

def my_callback(v):
    results.append(v)
    pbar.update(1)

if __name__ == '__main__':

  results = []
  ls = list(range(1, 10))
  with mp.get_context('spawn').Pool(4) as pool:
    with tqdm(total=len(ls)) as pbar:
      for arg in ls:
        pool.apply_async(loop, args=(arg,), callback=my_callback)
      pool.close()
      pool.join()
  print(results)
Booboo
  • Ah great thank you. I wish I could accept two answers. On another note, is my obsession with tqdm unjustified? I ask because I wonder why it's not standard, and if not, why I don't see it more often in other people's code. If I'm multiprocessing I definitely want to see the progress bar to make sure the last tweak I did had some impact. Even if it's not perfect, it's still probably good enough to see a 50% improvement/degradation. I actually will refuse to do anything without it lol – Alexander Soare Mar 23 '21 at 14:17
  • I was about to update the answer to recommend that what you really should be doing is to have `loop` just return its results back to the main process, which can then append the results to an ordinary list and then update the progress bar. – Booboo Mar 23 '21 at 14:27
  • I've updated the answer with the "better" approach. By the way, if you believe this answer is the "better" answer, you *can* un-accept one answer and accept another. I am not saying you *should*; I am saying it is *possible*; I have done it. – Booboo Mar 23 '21 at 14:33
  • I feel like that better version is the equivalent of `results = list(tqdm(pool.imap_unordered(loop, ls), total=len(ls)))`, no (referencing the snippet in the accepted answer)? Unless something else is happening under the hood? – Alexander Soare Mar 23 '21 at 17:49