
I have a complex data structure (a user-defined type) on which a large number of independent calculations are performed. The data structure is basically immutable. I say basically because, although the interface looks immutable, internally some lazy evaluation is going on. Some of the lazily calculated attributes are stored in dictionaries (return values of costly functions, keyed by input parameter); a rough sketch follows the two questions below. I would like to use Python's multiprocessing module to parallelize these calculations. There are two questions on my mind.

  1. How do I best share the data-structure between processes?
  2. Is there a way to handle the lazy-evaluation problem without using locks (multiple processes writing the same value)?
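
To make the lazy-evaluation part concrete, the attributes look roughly like this (a minimal sketch; `Sample`, `costly_attribute` and `expensive_computation` are made-up names, not my actual code):

class Sample:
    """Sketch of the basically-immutable structure with lazily cached attributes."""

    def __init__(self, raw_data):
        self._raw_data = raw_data
        self._cache = {}    # input parameter -> result of the costly function

    def costly_attribute(self, parameter):
        # Lazily computed and memoized; this is the write that multiple
        # processes would race on without a lock.
        if parameter not in self._cache:
            self._cache[parameter] = expensive_computation(self._raw_data, parameter)
        return self._cache[parameter]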

Thanks in advance for any answers, comments or enlightening questions!

Björn Pollex
  • How large / complex are you talking? When an `independent calculation` is submitted, do you know before the start which lazy attributes are needed? – MattH Aug 10 '10 at 10:24
  • The problem is basically a leave-one-out cross-validation on a large set of data-samples. It takes about two hours on my machine on a single core, but I have access to a machine with 24 cores and would like to leverage that power. I do not know in advance which of the attributes will be needed by a single calculation, but I know that eventually (over all calculations) all attributes will be needed, so I could just load them all up front (would have to test that though). – Björn Pollex Aug 10 '10 at 10:30

1 Answer


How do I best share the data-structure between processes?

Pipelines.

origin.py | process1.py | process2.py | process3.py

Break your program up so that each calculation is a separate process of the following form.

def transform1( piece ):
    """Some transformation or calculation on one piece; return the newly derived data."""

For testing, you can use it like this.

def t1( iterable ):
    for piece in iterable:
        more_data = transform1( piece )
        yield NewNamedTuple( piece, more_data )
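
`NewNamedTuple` here stands for whatever record type carries the original piece together with the newly derived data; a minimal sketch using `collections.namedtuple` (the field names are assumptions):

from collections import namedtuple

# Bundles the original piece with the data derived from it by one stage.
NewNamedTuple = namedtuple( 'NewNamedTuple', ['piece', 'more_data'] )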

For reproducing the whole calculation in a single process, you can do this.

for x in t1( t2( t3( the_whole_structure ) ) ):
    print( x )

You can wrap each transformation with a little bit of file I/O. Pickle works well for this, but other representations (like JSON or YAML) work well, too.

import pickle, sys

while True:
    try:
        a_piece = pickle.load( sys.stdin.buffer )    # binary stdin (Python 3)
    except EOFError:
        break                                        # previous stage closed the pipe
    more_data = transform1( a_piece )
    pickle.dump( NewNamedTuple( a_piece, more_data ), sys.stdout.buffer )

Each processing step becomes an independent OS-level process. The steps run concurrently and immediately put all of the available OS-level resources to work.
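
For completeness, the first stage of the shell pipeline above (`origin.py`) only has to serialize the pieces; a minimal sketch, assuming `the_whole_structure` can simply be iterated over:

import pickle, sys

# origin.py: seed the pipeline by writing one pickled piece per record.
for piece in the_whole_structure:
    pickle.dump( piece, sys.stdout.buffer )
sys.stdout.buffer.flush()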

Is there a way to handle the lazy-evaluation problem without using locks (multiple processes write the same value)?

Pipelines.
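
Because each stage is a separate process with its own memory, no two processes ever write the same lazily computed value, so no locks are needed. A minimal sketch of the per-process memoization this allows (assuming the pieces are hashable; `expensive_computation` is a made-up placeholder):

_cache = {}   # lives only in this stage's process; no other process ever sees it

def lazy_value( piece, parameter ):
    # Plain-dict memoization is safe here because only this process writes to it.
    key = ( piece, parameter )          # assumes piece is hashable, e.g. a namedtuple
    if key not in _cache:
        _cache[key] = expensive_computation( piece, parameter )
    return _cache[key]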

S.Lott
  • Wow, that answer solves two problems that were not even in my question (how to send a complex object to another process, and how to do this in Python when the multiprocessing module is not available)! – Björn Pollex Aug 10 '10 at 10:22
  • The point is that OS-level (shared buffer) process management is (a) simpler and (b) can be as fast as more complex multi-threaded, shared-everything techniques. – S.Lott Aug 10 '10 at 11:08
  • @S.Lott I want to share numpy random state of a parent process with a child process. I've tried using `Manager` but still no luck. Could you please take a look at my question [here](https://stackoverflow.com/questions/49372619/how-to-share-numpy-random-state-of-a-parent-process-with-child-processes) and see if you can offer a solution? I can still get different random numbers if I do `np.random.seed(None)` every time that I generate a random number, but this does not allow me to use the random state of the parent process, which is not what I want. Any help is greatly appreciated. – Amir Mar 20 '18 at 02:41