
I have a complex data structure (a user-defined type) on which a large number of independent calculations are performed. The data structure is basically immutable. I say basically because, although the interface looks immutable, internally some lazy evaluation is going on. Some of the lazily calculated attributes are stored in dictionaries (return values of costly functions, keyed by input parameter); a rough sketch follows the two questions below. I would like to use Python's multiprocessing module to parallelize these calculations. There are two questions on my mind.

  1. How do I best share the data-structure between processes?
  2. Is there a way to handle the lazy-evaluation problem without using locks (multiple processes writing the same value)?
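
To make the lazy-evaluation part concrete, the attributes look roughly like this (a minimal sketch; `Sample`, `costly_attribute` and `expensive_computation` are made-up names, not my actual code):

class Sample:
    """Sketch of the basically-immutable structure with lazily cached attributes."""

    def __init__(self, raw_data):
        self._raw_data = raw_data
        self._cache = {}    # input parameter -> result of the costly function

    def costly_attribute(self, parameter):
        # Lazily computed and memoized; this is the write that multiple
        # processes would race on without a lock.
        if parameter not in self._cache:
            self._cache[parameter] = expensive_computation(self._raw_data, parameter)
        return self._cache[parameter]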

Thanks in advance for any answers, comments or enlightening questions!

Björn Pollex
  • How large / complex are you talking? When an `independent calculation` is submitted, do you know before the start which lazy attributes are needed? – MattH Aug 10 '10 at 10:24
  • The problem is basically a leave-one-out cross-validation on a large set of data-samples. It takes about two hours on my machine on a single core, but I have access to a machine with 24 cores and would like to leverage that power. I do not know in advance which of the attributes will be needed by a single calculation, but I know that eventually (over all calculations) all attributes will be needed, so I could just load them all up front (would have to test that though). – Björn Pollex Aug 10 '10 at 10:30

1 Answer


How do I best share the data-structure between processes?

Pipelines.

origin.py | process1.py | process2.py | process3.py

Break your program up so that each calculation is a separate process of the following form.

def transform1( piece ):
    """Some transformation or calculation on one piece; return the newly derived data."""

For testing, you can use it like this.

def t1( iterable ):
    for piece in iterable:
        more_data = transform1( piece )
        yield NewNamedTuple( piece, more_data )
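
`NewNamedTuple` here stands for whatever record type carries the original piece together with the newly derived data; a minimal sketch using `collections.namedtuple` (the field names are assumptions):

from collections import namedtuple

# Bundles the original piece with the data derived from it by one stage.
NewNamedTuple = namedtuple( 'NewNamedTuple', ['piece', 'more_data'] )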

For reproducing the whole calculation in a single process, you can do this.

for x in t1( t2( t3( the_whole_structure ) ) ):
    print( x )

You can wrap each transformation with a little bit of file I/O. Pickle works well for this, but other representations (like JSON or YAML) work well, too.

import pickle, sys

while True:
    try:
        a_piece = pickle.load( sys.stdin.buffer )    # binary stdin (Python 3)
    except EOFError:
        break                                        # previous stage closed the pipe
    more_data = transform1( a_piece )
    pickle.dump( NewNamedTuple( a_piece, more_data ), sys.stdout.buffer )

Each processing step becomes an independent OS-level process. The steps run concurrently and immediately put all of the available OS-level resources to work.
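
For completeness, the first stage of the shell pipeline above (`origin.py`) only has to serialize the pieces; a minimal sketch, assuming `the_whole_structure` can simply be iterated over:

import pickle, sys

# origin.py: seed the pipeline by writing one pickled piece per record.
for piece in the_whole_structure:
    pickle.dump( piece, sys.stdout.buffer )
sys.stdout.buffer.flush()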

Is there a way to handle the lazy-evaluation problem without using locks (multiple processes write the same value)?

Pipelines.
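
Because each stage is a separate process with its own memory, no two processes ever write the same lazily computed value, so no locks are needed. A minimal sketch of the per-process memoization this allows (assuming the pieces are hashable; `expensive_computation` is a made-up placeholder):

_cache = {}   # lives only in this stage's process; no other process ever sees it

def lazy_value( piece, parameter ):
    # Plain-dict memoization is safe here because only this process writes to it.
    key = ( piece, parameter )          # assumes piece is hashable, e.g. a namedtuple
    if key not in _cache:
        _cache[key] = expensive_computation( piece, parameter )
    return _cache[key]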

S.Lott
  • Wow, that answer solves two problems that were not even in my question (how to send a complex object to another process, and how to do this in Python when the multiprocessing module is not available)! – Björn Pollex Aug 10 '10 at 10:22
  • The point is that OS-level (shared buffer) process management is (a) simpler and (b) can be as fast as more complex multi-threaded, shared-everything techniques. – S.Lott Aug 10 '10 at 11:08
  • @S.Lott I want to share numpy random state of a parent process with a child process. I've tried using `Manager` but still no luck. Could you please take a look at my question [here](https://stackoverflow.com/questions/49372619/how-to-share-numpy-random-state-of-a-parent-process-with-child-processes) and see if you can offer a solution? I can still get different random numbers if I do `np.random.seed(None)` every time that I generate a random number, but this does not allow me to use the random state of the parent process, which is not what I want. Any help is greatly appreciated. – Amir Mar 20 '18 at 02:41