
I have to apologise in advance because this question is quite general and may not be clear enough. The question is: how would you run in parallel a Python function that itself uses a pool of processes for some subtasks and does lots of heavy I/O operations? Is it even a valid task?

I will try to provide some more information. I've got a procedure, say test_reduce(), that I need to run in parallel. I tried several ways to do that (see below), and I seem to lack some knowledge to understand why all of them fail.

This test_reduce() procedure does lots of things. Some of those are more relevant to the question than others (and I list them below):

  • It uses the multiprocessing module (sic!), namely a pool.Pool instance,
  • It uses a MongoDB connection,
  • It relies heavily on numpy and scikit-learn libs,
  • It uses callbacks and lambdas,
  • It uses the dill lib to pickle some stuff.

First I tried to use a multiprocessing.dummy.Pool (which seems to be a thread pool). I don't know what is specific about this pool and why it is, eh, "dummy"; the whole thing worked, and I got my results. The problem is CPU load. For parallelized sections of test_reduce() it was 100% for all cores; for synchronous sections it was around 40-50% most of the time. I can't say there was any increase in overall speed for this type of "parallel" execution.
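
To make the comparison concrete, here is a minimal sketch of the two APIs (the names are placeholders, not my actual code): `multiprocessing.dummy` exposes the same `Pool` interface backed by threads, so CPU-bound sections stay limited by the GIL, whereas `multiprocessing.Pool` uses real processes but has to pickle everything it sends to the workers.

```
from multiprocessing.dummy import Pool as ThreadPool   # threads, shared memory
from multiprocessing import Pool as ProcessPool        # processes, needs pickling

def work(x):            # placeholder for test_reduce()
    return x * x

if __name__ == '__main__':
    print(ThreadPool(4).map(work, range(8)))    # GIL-bound for pure-Python work
    print(ProcessPool(4).map(work, range(8)))   # true parallelism across cores
```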

Then I tried to use a multiprocessing.pool.Pool instance to map this procedure to my data. It failed with the following:

File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
cPickle.PicklingError: Can't pickle <type 'thread.lock'>: attribute lookup thread.lock failed
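
As far as I can tell, this is the usual symptom of handing the pool something that drags an unpicklable handle along with it. A minimal sketch of the pattern (purely illustrative, not my actual code):

```
import multiprocessing
import threading

class Task(object):
    def __init__(self):
        self.lock = threading.Lock()   # stand-in for a pool.Pool or a Mongo connection

def run(task):
    return 42

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    # Sending Task() to a worker means pickling it, lock included, which is
    # the kind of thing that fails with "Can't pickle <type 'thread.lock'>".
    pool.map(run, [Task(), Task()])
```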

I made a guess that cPickle is to blame and found the pathos lib, which uses a far more advanced pickler, dill. However, it also fails:

File "/local/lib/python2.7/site-packages/dill/dill.py", line 199, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
    obj = cls.__new__(cls, *args)
TypeError: object.__new__(generator) is not safe, use generator.__new__()
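
The error can be reproduced outside my code with a few lines (illustrative only): dill copes with lambdas and closures, but a live generator cannot be round-tripped into another process; depending on the dill version this blows up on dumps() or, as in the traceback above, on loads().

```
import dill

dill.loads(dill.dumps(lambda x: x + 1))   # lambdas round-trip fine
gen = (i for i in range(10))
dill.loads(dill.dumps(gen))               # a live generator does not
```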

Now, this error is something I don't understand at all. I get no output to stdout from my procedure when it runs in a pool, so it's hard to guess what's going on. The only thing I know is that test_reduce() runs successfully when no multiprocessing is used.

So, how would you run in parallel something that heavy and complicated?

oopcode
  • Do you have to run it in parallel because you don't want to freeze a GUI? I was in this situation, and to run something heavy I used Qt's `QProcess`, which is similar to the [subprocess](https://docs.python.org/2/library/subprocess.html) library. It's usually less complicated than using threads. – Mel Jul 07 '15 at 15:25
  • I would guess that multiple threads can't access the same file with pickle (or other file access methods) at the same time. As a possible solution, you could use a different name for the output file you pickle to on each thread (with the filename derived from the current thread number). At the end, you can run a script to read and combine all the separate pickle files. – Ed Smith Jul 07 '15 at 15:26
  • @EdSmith I'm afraid this fails long before I do any pickling. I would say it's `multiprocessing` itself (it uses pickling extensively). – oopcode Jul 07 '15 at 15:34
  • @tmoreau No, unfortunately not. I'm training some complex classification models. – oopcode Jul 07 '15 at 15:34
  • Using all your cores at ~50% when there is a lot of synchronisation sounds pretty good to me. – mdurant Jul 07 '15 at 15:44
  • Your setup can be as complicated as it needs to be and still work fine as long as the data you pass to and from separate processes is pickle-able. This is how multiprocessing internally sends data (args/results) back and forth. If you are on linux and have a lot of read-only data, you can take advantage of the fact that it uses fork and creates a copy of global data in each child process (but changes are not shared). – bj0 Jul 07 '15 at 20:05
  • How can you what? It's hard to be specific without code that can be run. – bj0 Jul 07 '15 at 23:03
  • I'm the `dill` and `pathos` author. Yes, it's possible to nest one map within another; there are some examples on SO (search for "hierarchical parallel"). `map(f1, map(f2, x, y))` absolutely should work unless you run into a serialization issue (a sketch of such a nested map follows these comments). It looks like you are trying to pickle a generator `(i for i in x)`, which `dill` can't handle. – Mike McKerns Jul 08 '15 at 07:06
  • @MikeMcKerns Thank you for your answer. So it can be the case that a Mongo cursor, which is a generator, causes the problem, right? – oopcode Jul 08 '15 at 12:36
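
To make the hierarchical-map idea from the last comment concrete, here is a rough sketch of nesting one pathos map inside another (pool sizes and function names are illustrative, and the exact pool classes may differ between pathos versions):

```
from pathos.multiprocessing import ProcessingPool
from pathos.threading import ThreadPool

def inner(x):
    return x * x                                  # stand-in for a sub-task

def outer(chunk):
    # each process-pool worker drives its own inner thread pool
    return sum(ThreadPool(4).map(inner, chunk))

if __name__ == '__main__':
    chunks = [range(10), range(10, 20), range(20, 30)]
    print(ProcessingPool(3).map(outer, chunks))
```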

1 Answer


So, thanks to @MikeMcKerns' answer, I found how to get the job done with the pathos lib. I needed to get rid of all pymongo cursors, which (being generators) could not be pickled by dill; doing that solved the problem and I managed to run my code in parallel.
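
Roughly, the change had this shape (the collection and names below are made up for illustration, not my real schema): materialise every cursor into a plain list before the parallel map, since a list pickles fine while a cursor does not.

```
from pathos.multiprocessing import ProcessingPool
from pymongo import MongoClient

def test_reduce(doc):
    return len(doc)                 # stand-in for the real heavy procedure

if __name__ == '__main__':
    cursor = MongoClient().mydb.mycollection.find()
    docs = list(cursor)             # a list pickles; the cursor (a generator) doesn't
    results = ProcessingPool(4).map(test_reduce, docs)
```

Materialising everything like this only makes sense while the result set fits in memory; for larger collections, passing plain document ids and re-querying inside each worker achieves the same thing.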

oopcode