This is not a complete answer, but the source can help guide us. When you pass maxtasksperchild to Pool, it saves the value as self._maxtasksperchild and only uses it when creating a worker object:
def _repopulate_pool(self):
    """Bring the number of pool processes up to the specified number,
    for use after reaping workers which have exited.
    """
    for i in range(self._processes - len(self._pool)):
        w = self.Process(target=worker,
                         args=(self._inqueue, self._outqueue,
                               self._initializer,
                               self._initargs, self._maxtasksperchild)
                         )
        ...
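For reference, that value comes straight from the Pool constructor, so supplying it looks like this; a minimal sketch, where the 4 and 10 are arbitrary numbers chosen only for illustration:
import multiprocessing

# four workers at a time, each retired and replaced after ten tasks
pool = multiprocessing.Pool(processes=4, maxtasksperchild=10)
pool.close()
pool.join()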
This worker object uses maxtasksperchild like so:
assert maxtasks is None or (type(maxtasks) == int and maxtasks > 0)
which only validates the value rather than imposing any limit on the number of tasks, and
while maxtasks is None or (maxtasks and completed < maxtasks):
    try:
        task = get()
    except (EOFError, IOError):
        debug('worker got EOFError or IOError -- exiting')
        break
    ...
    put((job, i, result))
    completed += 1
essentially handing the result of each task back on the output queue and counting how many tasks the worker has completed. While you could run into memory issues by accumulating too many results, you can hit the same error just by building an overly large list in the first place. In short, the source does not suggest any limit on the number of tasks possible, as long as the results fit in memory once they are returned.
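To see what the counter actually bounds, a quick sketch that records which process id handles each task is enough; with two workers and maxtasksperchild=3, more than two distinct pids show up, yet all twelve tasks complete (the numbers here are arbitrary, and chunksize=1 just keeps one item per task):
import multiprocessing, os

def which_pid(x):
    # each task simply reports the pid of the worker that ran it
    return os.getpid()

if __name__ == '__main__':
    # two workers at a time, each retired after three tasks
    pool = multiprocessing.Pool(processes=2, maxtasksperchild=3)
    pids = pool.map(which_pid, range(12), chunksize=1)
    pool.close()
    pool.join()
    # at least four distinct pids: workers were replaced along the way,
    # yet all twelve tasks ran -- maxtasksperchild bounds tasks per
    # worker, not tasks per pool
    print sorted(set(pids))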
Does this answer the question? Not entirely. However, on Ubuntu 12.04 with Python 2.7.5 the following code, while inadvisable, seems to run just fine for any large max_tasks value. Be warned that it seems to take dramatically longer to run as that value grows:
import multiprocessing, time

max_tasks = 10**3

def f(x):
    print x**2
    time.sleep(5)
    return x**2

# note: this creates a pool with max_tasks worker processes,
# one per task submitted below
P = multiprocessing.Pool(max_tasks)
for x in xrange(max_tasks):
    P.apply_async(f, args=(x,))
P.close()
P.join()
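For comparison, a pool sized to the machine will happily chew through a far larger number of tasks without spawning one worker per task; a rough sketch, where the 10**5 figure is arbitrary:
import multiprocessing

def square(x):
    return x * x

if __name__ == '__main__':
    # a fixed-size pool fed many more tasks than it has workers
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    results = pool.map(square, xrange(10**5))
    pool.close()
    pool.join()
    print len(results)   # 100000 -- no per-pool task limit was hit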