
I am trying to parallelize a function that takes in an object in Python:

When using pathos, the map function automatically dills the object before distributing it among the processors.

However, it takes ~1 min to dill the object each time, and I need to run this function up to 100 times. All in all, it is taking nearly 2 hours just to serialize the object before even running it.

Is there a way to just serialize it once, and use it multiple times?

Thanks very much

user154510

1 Answer


The easiest thing is to do this manually.

Without an example of your code, I have to make a lot of assumptions and write something pretty vague, so let's take the simplest case.

Assume you're using dill manually, so your existing code looks like this:

obj = function_that_creates_giant_object()
for i in range(zillions):
    results.append(pool.apply(func, (dill.dumps(obj),)))
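
On the worker side, func presumably un-dills its argument before doing the real work; a minimal sketch, with do_something_with standing in for whatever your actual computation is:

def func(pickled_obj):
    # reconstruct the object from the dill bytes sent by the parent
    obj = dill.loads(pickled_obj)
    # do the real work on the reconstructed object
    return do_something_with(obj)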

All you have to do is move the dumps out of the loop:

obj = function_that_creates_giant_object()
objpickle = dill.dumps(obj)
for i in range(zillions):
    results.append(pool.apply(func, (objpickle,)))

But depending on your actual use, it may be better to just stick a cache in front of dill:

cachedpickle = functools.lru_cache(maxsize=10)(dill.dumps)

obj = function_that_creates_giant_object()
for i in range(zillions):
    results.append(pool.apply(func, (cachedpickle(obj),)))
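
As a quick sanity check of the caching idea (this assumes obj is hashable, which instances of ordinary classes are by default), the second call is a cache hit and returns the same bytes without re-serializing:

first = cachedpickle(obj)
second = cachedpickle(obj)  # cache hit: dill.dumps is not called again
assert first is second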

Of course, if you're monkeypatching multiprocessing to use dill in place of pickle, you can just as easily patch it to use this cachedpickle function.
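
One rough sketch of what that patch might look like, assuming Python 3's multiprocessing.reduction layout (an internal detail that varies across versions, so treat this as an illustration rather than a recipe), with cachedpickle being the lru_cache-wrapped dill.dumps from above:

import dill
from multiprocessing import reduction

# redirect the pickler multiprocessing uses for queue/pipe traffic;
# note that lru_cache only caches hashable arguments, so this mainly
# pays off when the same big hashable object is sent repeatedly
reduction.ForkingPickler.dumps = staticmethod(cachedpickle)
reduction.ForkingPickler.loads = staticmethod(dill.loads)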


If you're using multiprocess, which is a forked version of multiprocessing that pre-substitutes dill for pickle, it's less obvious how to patch that; you'll need to go through the source, see where it's using dill, and get it to use your wrapper. But IIRC, it just does an import dill as pickle somewhere and then uses the same code as (a slightly out-of-date version of) multiprocessing, so it isn't all that different.

In fact, you can even write a module that exposes the same interface as pickle and dill:

import functools
import dill

def loads(s):
    return dill.loads(s)

@functools.lru_cache(maxsize=10)
def dumps(o):
    return dill.dumps(o)

… and just replace the import dill as pickle with import mycachingmodule as pickle.

… or even monkeypatch it after loading with multiprocess.helpers.pickle = mycachingmodule (or whatever the appropriate name is—you're still going to have to find where that relevant import happens in the source of whatever you're using).
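
Either way, a quick round-trip with the hypothetical mycachingmodule above shows the drop-in behaving like dill, minus the repeated serialization cost:

import mycachingmodule as pickle  # stands in for "import dill as pickle"

payload = pickle.dumps(obj)       # first call actually serializes (and caches)
restored = pickle.loads(payload)  # same as dill.loads
assert pickle.dumps(obj) is payload  # later calls just return the cached bytes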


And that's about as complicated as it's likely to get.

abarnert
  • Thanks very much for your response! I've tried the above implementation, but I am still running into the error: `PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed`. It seems like the apply function is still trying to pickle the object after serialization. Is there a way around this? – user154510 Jun 13 '18 at 16:35
  • @user154510 That doesn't make sense. The result of calling `pickle.dumps` (or a cached wrapper around it) is a `bytes`, and pickling a `bytes` can't raise that exception. My guess is that either you got some of the details wrong, or you didn't anticipate that, e.g., return values from pool tasks also get pickled. But I can't debug it without seeing some code. – abarnert Jun 13 '18 at 18:02