How to set chunk size when using pathos ProcessingPool's map?

Question

I'm running into inefficient parallelisation with Pathos' ProcessingPool.map() function: Towards the end of the processing, a single slow running worker processes the last tasks in the list sequentially while other workers are idle. I think this is due to "chunking" of the task list.

When using Python's own multiprocessing.Pool I can resolve this by forcing chunksize=1 when calling map. However, this argument is not supported by Pathos, and the source code suggests this may be an oversight or a todo on the developers' side:

return _pool.map(star(f), zip(*args)) # chunksize

(from Pathos' multiprocessing.py, line 137)

I'd like to keep Pathos because of it's ability to work with lamdbas.

Is there any way to get chunk size running in Pathos? Is there a workaround using one of Patho's other poorly documented pool implementations?

score 7 · Accepted Answer · answered Apr 10 '19 at 14:40

I'm the pathos developer. It's not an oversight... you can't use chunksize when using pathos.pools.ProcessingPool. The reason this was done, was that I wanted to have the map functions have the same interface as python's map... and to do that, based on the multiprocessing implementation, I either had to choose to make chunksize a keyword, or to allow *args and **kwds. So I choose the latter.

If you want to use chunksize, there is _ProcessPool, which retains the original multiprocessing.Pool interface, but has augmented serialization.

>>> import pathos
>>> p = pathos.pools._ProcessPool() 
>>> p.map(lambda x:x*x, range(4), chunksize=10)
[0, 1, 4, 9]
>>>

I'm sorry you feel the documentation is lacking. The code is primarily composed of a fork of multiprocessing from the python standard library... and I didn't change the documentation where the functionality has been reproduced. For example, here I am recycling the STL docs, as the functionality is the same:

>>> p = pathos.pools._ProcessPool()
>>> print(p.map.__doc__)

        Equivalent of `map()` builtin

>>> p = multiprocessing.Pool()
>>> print(p.map.__doc__)

        Equivalent of `map()` builtin
>>>

... and in the cases where I have modified functionality, I did write new docs:

>>> p = pathos.pools.ProcessPool()
>>> print(p.map.__doc__)
run a batch of jobs with a blocking and ordered map

Returns a list of results of applying the function f to the items of
the argument sequence(s). If more than one sequence is given, the
function is called with an argument list consisting of the corresponding
item of each sequence.

>>>

Admittedly, the docs could be better. Especially the docs coming from the STL could be improved upon. Please feel free to add a ticket on GitHub, or even better, a PR to extend the docs.

Thanks for this detailed and incredibly quick answer, it works! I think a hint on `chunksize` would greatly improve the docs. There's no reference to it on pathos.readthedocs.io (at least not one retuned by the search). Or even a `map_with_chunksize` method in `Pool`, or a `chunk_size` param in the constructor? I understand you design choice here, but for my use case the parameter turns out to be quite essential. — DCS, Apr 10 '19 at 18:19
Point taken about the docs. In reality, my decision about the documentation was to flatly reuse what was in the STL unless I wrote new functionality (which honestly wasn't too much). I wrote those docs probably a decade ago, but they could use revisiting. I'll add a GitHub ticket to improve the docs. — Mike McKerns, Apr 11 '19 at 12:59

How to set chunk size when using pathos ProcessingPool's map?

1 Answers1