
I have to apologise in advance because this question is quite general and may not be clear enough. The question is: how would you run in parallel a Python function that itself uses a pool of processes for some subtasks and does lots of heavy I/O operations? Is it even a valid task?

I will try to provide some more information. I've got a procedure, say test_reduce(), that I need to run in parallel. I tried several ways to do that (see below), and I seem to lack some knowledge to understand why all of them fail.

This test_reduce() procedure does lots of things. Some of those are more relevant to the question than others (and I list them below):

  • It uses the multiprocessing module (sic!), namely a pool.Pool instance,
  • It uses a MongoDB connection,
  • It relies heavily on numpy and scikit-learn libs,
  • It uses callbacks and lambdas,
  • It uses the dill lib to pickle some stuff.

First I tried to use a multiprocessing.dummy.Pool (which seems to be a thread pool). I don't know what is specific about this pool and why it is, eh, "dummy"; the whole thing worked, and I got my results. The problem is CPU load. For parallelized sections of test_reduce() it was 100% for all cores; for synchronous sections it was around 40-50% most of the time. I can't say there was any increase in overall speed for this type of "parallel" execution.
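
To make the comparison concrete, here is a minimal sketch of the two APIs (the names are placeholders, not my actual code): `multiprocessing.dummy` exposes the same `Pool` interface backed by threads, so CPU-bound sections stay limited by the GIL, whereas `multiprocessing.Pool` uses real processes but has to pickle everything it sends to the workers.

```
from multiprocessing.dummy import Pool as ThreadPool   # threads, shared memory
from multiprocessing import Pool as ProcessPool        # processes, needs pickling

def work(x):            # placeholder for test_reduce()
    return x * x

if __name__ == '__main__':
    print(ThreadPool(4).map(work, range(8)))    # GIL-bound for pure-Python work
    print(ProcessPool(4).map(work, range(8)))   # true parallelism across cores
```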

Then I tried to use a multiprocessing.pool.Pool instance to map this procedure to my data. It failed with the following:

File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
cPickle.PicklingError: Can't pickle <type 'thread.lock'>: attribute lookup thread.lock failed
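
As far as I can tell, this is the usual symptom of handing the pool something that drags an unpicklable handle along with it. A minimal sketch of the pattern (purely illustrative, not my actual code):

```
import multiprocessing
import threading

class Task(object):
    def __init__(self):
        self.lock = threading.Lock()   # stand-in for a pool.Pool or a Mongo connection

def run(task):
    return 42

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    # Sending Task() to a worker means pickling it, lock included, which is
    # the kind of thing that fails with "Can't pickle <type 'thread.lock'>".
    pool.map(run, [Task(), Task()])
```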

I made a guess that cPickle is to blame and found the pathos lib, which uses a far more advanced pickler, dill. However, it also fails:

File "/local/lib/python2.7/site-packages/dill/dill.py", line 199, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
    obj = cls.__new__(cls, *args)
TypeError: object.__new__(generator) is not safe, use generator.__new__()
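
The error can be reproduced outside my code with a few lines (illustrative only): dill copes with lambdas and closures, but a live generator cannot be round-tripped into another process; depending on the dill version this blows up on dumps() or, as in the traceback above, on loads().

```
import dill

dill.loads(dill.dumps(lambda x: x + 1))   # lambdas round-trip fine
gen = (i for i in range(10))
dill.loads(dill.dumps(gen))               # a live generator does not
```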

Now, this error is something I don't understand at all. I get no output to stdout from my procedure when it runs in a pool, so it's hard to guess what's going on. The only thing I know is that test_reduce() runs successfully when no multiprocessing is used.

So, how would you run in parallel something that heavy and complicated?

oopcode
  • Do you have to run it in parallel because you don't want to freeze a GUI? I was in this situation, and to run something heavy I used Qt's `QProcess`, which is similar to the [subprocess](https://docs.python.org/2/library/subprocess.html) library. It's usually less complicated than using threads. – Mel Jul 07 '15 at 15:25
  • I would guess that multiple threads can't access the same file with pickle (or other file access methods) at the same time. As a possible solution, you could use a different name for the output file you pickle to on each thread (with the filename derived from the current thread number). At the end, you can run a script to read and combine all the separate pickle files. – Ed Smith Jul 07 '15 at 15:26
  • @EdSmith I'm afraid this fails long before I do any pickling. I would say it's `multiprocessing` itself (it uses pickling extensively). – oopcode Jul 07 '15 at 15:34
  • @tmoreau No, unfortunately not. I'm training some complex classification models. – oopcode Jul 07 '15 at 15:34
  • Using all your cores at ~50% when there is a lot of synchronisation sounds pretty good to me. – mdurant Jul 07 '15 at 15:44
  • Your setup can be as complicated as it needs to be and still work fine as long as the data you pass to and from separate processes is pickle-able. This is how multiprocessing internally sends data (args/results) back and forth. If you are on linux and have a lot of read-only data, you can take advantage of the fact that it uses fork and creates a copy of global data in each child process (but changes are not shared). – bj0 Jul 07 '15 at 20:05
  • How can you what? It's hard to be specific without code that can be run. – bj0 Jul 07 '15 at 23:03
  • I'm the `dill` and `pathos` author. Yes, it's possible to nest one map within another; there are some examples on SO (search for "hierarchical parallel"). `map(f1, map(f2, x, y))` absolutely should work unless you run into a serialization issue (a sketch of such a nested map follows these comments). It looks like you are trying to pickle a generator `(i for i in x)`, which `dill` can't handle. – Mike McKerns Jul 08 '15 at 07:06
  • @MikeMcKerns Thank you for your answer. So it can be the case that a Mongo cursor, which is a generator, causes the problem, right? – oopcode Jul 08 '15 at 12:36
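
To make the hierarchical-map idea from the last comment concrete, here is a rough sketch of nesting one pathos map inside another (pool sizes and function names are illustrative, and the exact pool classes may differ between pathos versions):

```
from pathos.multiprocessing import ProcessingPool
from pathos.threading import ThreadPool

def inner(x):
    return x * x                                  # stand-in for a sub-task

def outer(chunk):
    # each process-pool worker drives its own inner thread pool
    return sum(ThreadPool(4).map(inner, chunk))

if __name__ == '__main__':
    chunks = [range(10), range(10, 20), range(20, 30)]
    print(ProcessingPool(3).map(outer, chunks))
```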

1 Answer


So, thanks to @MikeMcKerns' answer, I found how to get the job done with the pathos lib. I needed to get rid of all pymongo cursors, which (being generators) could not be pickled by dill; doing that solved the problem and I managed to run my code in parallel.
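
Roughly, the change had this shape (the collection and names below are made up for illustration, not my real schema): materialise every cursor into a plain list before the parallel map, since a list pickles fine while a cursor does not.

```
from pathos.multiprocessing import ProcessingPool
from pymongo import MongoClient

def test_reduce(doc):
    return len(doc)                 # stand-in for the real heavy procedure

if __name__ == '__main__':
    cursor = MongoClient().mydb.mycollection.find()
    docs = list(cursor)             # a list pickles; the cursor (a generator) doesn't
    results = ProcessingPool(4).map(test_reduce, docs)
```

Materialising everything like this only makes sense while the result set fits in memory; for larger collections, passing plain document ids and re-querying inside each worker achieves the same thing.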

oopcode