
I have the following code:

import numpy as np
from multiprocess import Pool

data = np.zeros((50,50))

def foo():
    # data = np.zeros((50,50)) # This slows the code.
    
    def bar():
        data.shape
        
    with Pool() as pool:
        async_results = [pool.apply_async(bar) for x in range(20000)]
        out = [async_result.get() for async_result in async_results]
  
foo()

As written, it takes 3 seconds to run. But when I uncomment the first line of foo(), the code takes 10 seconds.

Commenting out the initial definition of data doesn't fix the issue. So I think the bottleneck isn't when data is initialized. I suspect the problem is passing data to each of the processes, but I can't confirm this. And I don't know why defining data outside of foo would help.

Why is there a discrepancy in speeds?

Danny
  • @DarkKnight: It's almost certainly [the third-party module that reimplements the built-in `multiprocessing` on top of `dill` instead of `pickle`](https://pypi.org/project/multiprocess/), allowing tasks to contain unpicklable objects. Neither version of this code would work with normal `multiprocessing`, because nested functions are unpicklable (regular functions are pickled by a simple marker indicating it's a function, followed by their qualified name; a nested function has no legal qualified name, so it can't be pickled). – ShadowRanger May 18 '23 at 14:41
  • @ShadowRanger I deleted my comment after discovering this non-standard module. In which case I suspect OP's problem is due to the fact that the array (*data*) local to foo() has to be serialised – DarkKnight May 18 '23 at 14:43

1 Answer


The discrepancy exists because globals get copied to the workers "for free": either essentially for free due to forking, or for free from the parent process's perspective because spawned child processes recreate them on launch. Closure-scoped variables get no such treatment. They're only copied in the fork scenario, and even when they are copied, there's no meaningful way to look them up in the child process, so they get copied again for each task.
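
You can see the structural difference directly: a nested function that only reads a module-level global has no closure cells, while one that reads an enclosing function's local carries the array in its `__closure__`. A minimal check (the helper names here are illustrative, not from the question):

import numpy as np

data = np.zeros((50, 50))

def make_global_bar():
    def bar():
        return data.shape  # plain global lookup; nothing is captured
    return bar

def make_closure_bar():
    local_data = np.zeros((50, 50))
    def bar():
        return local_data.shape  # captured in a closure cell
    return bar

print(make_global_bar().__closure__)   # None: no cells, the array isn't attached to the function
print(make_closure_bar().__closure__)  # a tuple with one cell holding the ndarray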

To support serializing closures, `dill` (the extended version of `pickle` underlying `multiprocess` that allows it to dispatch closure functions at all) has to serialize the array, send it across the IPC mechanism with the rest of the data for that task, and deserialize it in the worker, repeating this once for every task. `dill` may also need to use a more complex format for the function itself (there are optimizations that can keep nested, but non-closure, functions cheap to serialize, which break down for true closures).
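
A rough way to see the per-task cost, assuming you have `dill` installed and import it directly (it's what `multiprocess` pulls in anyway): dumping the closure produces a payload that includes the whole array, and every `apply_async` call has to do the equivalent of this dump/load round trip.

import numpy as np
import dill  # the serializer multiprocess builds on

def foo():
    data = np.zeros((50, 50))       # 2500 float64s, ~20 KB of raw data

    def bar():
        return data.shape

    payload = dill.dumps(bar)       # roughly what dispatching one task costs
    print(len(payload))             # > 20000 bytes: the whole array rides along with the function
    restored = dill.loads(payload)  # roughly what the worker does for every task
    print(restored())               # (50, 50)

foo()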

It's essentially the same problem described in *Python multiprocessing - Why is using functools.partial slower than default arguments?*, caused by `dill` solving the same problem that `functools.partial` had to solve to make partials picklable. While regular `multiprocessing` doesn't support pickling nested functions at all, `dill`'s support for closures effectively performs the same work as pickling a `partial`, and the same costs get paid.
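
To make the comparison explicit, the closure version is cost-wise roughly equivalent to shipping a `functools.partial` that carries the array, something like the sketch below (this variant works even with plain `multiprocessing`, since `bar` is a top-level function and `partial` objects are picklable):

import numpy as np
from functools import partial
from multiprocessing import Pool  # plain multiprocessing is enough for this variant

def bar(arr):
    return arr.shape

def foo():
    data = np.zeros((50, 50))
    with Pool() as pool:
        # Every task's payload includes the partial, and therefore the whole
        # array, paying the same per-task serialization cost as the closure.
        async_results = [pool.apply_async(partial(bar, data)) for _ in range(20000)]
        return [r.get() for r in async_results]

if __name__ == "__main__":
    foo()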

TL;DR: At global scope, you don't have to package the array data with each task. At closure scope, you do, dramatically increasing the work done to dispatch each task.
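
If you need to create the array inside `foo` but still want cheap dispatch, one workaround is to ship the array once per worker instead of once per task. A sketch, assuming `multiprocess` mirrors `multiprocessing`'s `Pool(initializer=..., initargs=...)` API (it's a fork, so it should):

import numpy as np
from multiprocess import Pool

def init_worker(arr):
    # Runs once in each worker process: the array is sent per worker, not per task.
    global data
    data = arr

def bar():
    return data.shape  # plain global lookup inside the worker; nothing big per task

def foo():
    local_data = np.zeros((50, 50))
    with Pool(initializer=init_worker, initargs=(local_data,)) as pool:
        async_results = [pool.apply_async(bar) for _ in range(20000)]
        return [r.get() for r in async_results]

if __name__ == "__main__":
    foo()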

ShadowRanger
  • Very nice analysis. Does this mean that *multiprocess* always forks (rather than spawns)? – DarkKnight May 18 '23 at 14:48
  • @DarkKnight: No, it wouldn't matter which one it did. Even in the spawn scenario, if it followed the design of `multiprocessing`, `spawn`ed workers are initialized by running the main script entry point from scratch (with the name `__mp_main__` instead of `__main__` so import-guarded code isn't run), so globals get recreated once, in the background, when the child processes are launched (a number of times equal to the pool size, usually the number of cores), rather than serializing them, piping them, and deserializing them for every task (20,000 times in this case). – ShadowRanger May 18 '23 at 14:54
  • Even when `fork`ing, it can't use the fact that the closure scope variable is created once and would be CoW-ed, because `dill` doesn't know the workers were spawned after the closure was formed. If `foo` returned `bar`, and the `Pool` was created outside it first, then `foo` called, then the resulting `bar` used to pass the tasks, the closure wouldn't exist in the children even under `fork`; `dill` can't tell which scenario it's in in any event (it's designed for generalized serializing, possibly to disk for a subsequent run to pull in), so it can't do anything to optimize for `fork`ed stuff. – ShadowRanger May 18 '23 at 14:56