
I wrote a program which parses ~2600 text documents into Python objects. Those objects have a lot of references to other objects and as a whole they describe the structure of a document.

Serializing those objects with pickle is no problem and super fast.

After parsing those documents into Python objects, I have to do some heavy computation on them, which I would like to parallelize. My current solution passes one of those document objects to a worker function, which then performs the heavy computations.

The results of those computations get written into objects that are attributes of the document object. The worker function then returns those changed objects (only the attribute objects, not the original document object).

All of this works with the following simplified code:

import multiprocessing

def worker(document_object):
    # Do calculations on the information in document_object and alter
    # objects which are attributes of document_object
    return document_object.attribute_objects

def get_results(attribute_objects):
    # Save the results into the memory of the main process
    # (map_async passes the whole list of worker results here)
    pass

# Parsing documents
document_objects = parse_documents_into_python_objects()

# Dividing the objects into smaller pieces (8 and smaller worked for me;
# chunker() just splits the list into pieces of that size)
for chunk in chunker(document_objects, 8):
    pool = multiprocessing.Pool()
    pool.map_async(worker, chunk, callback=get_results)
    pool.close()
    pool.join()

However, there are several problems:

  • It only works when I pass small chunks of document_objects to map_async(). Otherwise I get memory errors, even with 15 GB of RAM.
  • htop tells me that only 2-3 of the 8 cores are being used.
  • I have the feeling that it is not much faster than the single-process version (I could be wrong about this).

I understand that every document_object has to be pickled and copied into a worker process, and that map_async() keeps all of this data in memory until pool.join() happens.

What I don't understand, though, is why this takes so much memory (up to ~12 GB). When I pickle a single document_object into a file, the file turns out to be around 500 KB at most.
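
For reference, this is roughly how I checked that (the file name and pickle protocol are just what I happened to use; document_objects[0] stands in for any single parsed document):

import os
import pickle

# Rough check of how big one task payload is once pickled
with open("single_document.pkl", "wb") as f:
    pickle.dump(document_objects[0], f, pickle.HIGHEST_PROTOCOL)
print("pickled size: {} bytes".format(os.path.getsize("single_document.pkl")))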

  • Why is this using so much memory?
  • Why are only 2-3 cores being used?
  • Is there a better way of doing this? For example, is there a way to save the results to the main process directly after a single worker function finishes, so I don't have to wait for join() before the memory is freed and the results become available through the callback? (Something like the sketch below is what I have in mind.)
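
To illustrate that last point, this is the kind of result streaming I have in mind. It uses imap_unordered, so get_results() would receive one worker's result at a time instead of a whole chunk; I haven't verified that this actually fixes the memory behaviour:

import multiprocessing

# Same worker as above: returns only the attribute objects of one document
def worker(document_object):
    return document_object.attribute_objects

def get_results(attribute_objects):
    # Save one worker's results into the memory of the main process
    pass

if __name__ == "__main__":
    document_objects = parse_documents_into_python_objects()

    pool = multiprocessing.Pool()
    try:
        # imap_unordered yields each result as soon as its worker finishes,
        # so it can be stored right away instead of waiting for join()
        for attribute_objects in pool.imap_unordered(worker, document_objects,
                                                     chunksize=8):
            get_results(attribute_objects)
    finally:
        pool.close()
        pool.join()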

Edit: I'm using Python 2.7.6 on Ubuntu 14.04 and Debian Wheezy.

Edit: When I print out the start and end of the worker function, as dano suggested in the comments, I get something like the following, which doesn't look parallel at all. Also, there are ~5 seconds between each end and the next start.

start <Process(PoolWorker-161, started daemon)>
end <Process(PoolWorker-161, started daemon)>
(~5 seconds delay)
start <Process(PoolWorker-162, started daemon)>
end <Process(PoolWorker-162, started daemon)>
(~5 seconds delay)
start <Process(PoolWorker-163, started daemon)>
end <Process(PoolWorker-163, started daemon)>
(~5 seconds delay)
start <Process(PoolWorker-164, started daemon)>
end <Process(PoolWorker-164, started daemon)>
(~5 seconds delay)
start <Process(PoolWorker-165, started daemon)>
end <Process(PoolWorker-165, started daemon)>

Solution

First of all, the problem was not in the simplified version of my code that I posted here.

The problem was that I wanted to use an instance method as my worker function. This does not work out of the box in Python 2, since instance methods can't be pickled. However, there is a workaround by Steven Bethard which solves this (and which I used).
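
A minimal illustration of the limitation (the Parser class is just a hypothetical stand-in for my real class):

import pickle

class Parser(object):
    def worker(self, document_object):
        return document_object

p = Parser()
# Fails under Python 2: bound methods are not picklable by default,
# which is exactly what multiprocessing.Pool needs to do with its worker
pickle.dumps(p.worker)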

The problem with this workaround, however, is that it needs to pickle the instance of the class that contains the worker method. In my case that instance has attributes which are references to huge data structures, so every time the instance method got pickled, all of those huge data structures were copied along with it, which resulted in the problems described above.
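
For context, the workaround boils down to registering a reducer for bound methods via copy_reg, roughly like this (shortened from memory, so treat it as a sketch of the recipe rather than the exact code):

import copy_reg
import types

def _pickle_method(method):
    # A bound method is reduced to (function name, instance, class).
    # method.im_self is the instance itself, so pickling the method
    # drags the whole instance (and everything it references) along.
    return _unpickle_method, (method.im_func.__name__,
                              method.im_self, method.im_class)

def _unpickle_method(func_name, obj, cls):
    # Walk the MRO, find the underlying function and re-bind it to the instance
    for klass in cls.mro():
        if func_name in klass.__dict__:
            return klass.__dict__[func_name].__get__(obj, cls)
    raise AttributeError(func_name)

copy_reg.pickle(types.MethodType, _pickle_method, _unpickle_method)

That im_self argument in the reducer is exactly where my huge data structures were sneaking into every pickled task.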

    Try putting a `print("start {}".format(multiprocessing.current_process()))` at the beginning of `worker`, and `print("end {}".format(multiprocessing.current_process()))` at the end. Do you notice long delays between a worker finishing one task and starting the other? Or a long delay between the first and second workers beginning their first task? Also, what platform and version of Python are you using? – dano Oct 05 '14 at 01:44
  • Thank you for your answer. I added the platform information to my original post. There are indeed quite long delays of 5 seconds between one worker finishing and a new one starting. – tymm Oct 05 '14 at 10:06
