
PySwig is a C++ wrapper that creates objects which, for unavoidable reasons, cannot be pickled by Python. I want to call a method on such an object across a large dataset (around 1M entries) in parallel.

I can run a function on multiple cores when I create the object at the top level of a script, so in principle there is no issue doing this:

```python
from multiprocessing import Pool

pyswig_obj = make_object()

def call(x):
    return pyswig_obj.method(x)

p = Pool(8)
results = p.map(call, xs)
```

However, if I wrap this in a function, the default Python pickle cannot serialize the nested `call()` function (only top-level functions can be pickled), which means I can't package this in a library. I've tried using dill (via pathos) to get around this, but that ends up trying to pickle the PySwig object itself, which doesn't work.
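For illustration, the failure mode can be reproduced without SWIG at all; any function defined inside another function hits the same restriction (toy names below, no real library assumed):

```python
import pickle

def make_caller():
    # call() is defined inside a function, so it is not a top-level
    # name and the default pickle cannot serialize it by reference.
    def call(x):
        return x
    return call

try:
    pickle.dumps(make_caller())
except (AttributeError, pickle.PicklingError) as e:
    # e.g. "Can't pickle local object 'make_caller.<locals>.call'"
    print("pickling failed:", e)
```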

In principle, one workaround would be to create the object once in each process rather than sharing it, since it's fairly lightweight, but I'm not sure how to do that in Python.

  • To implement the workaround mentioned at the end: when you create a `Pool` you can specify an [*`initializer`*](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool) argument to have each worker process call a function when it starts, and that function will be passed *`*initargs`*. This provides a relatively easy way to have an object created once in each process. – martineau Dec 22 '21 at 12:35
  • If you follow @martineau's suggestion, then you'll want to use a `pathos.pools._ProcessPool`, which takes an initializer. Also, you can attempt to use different pickling variants with `dill.settings`, but I expect that it won't help with a `PySwig` object. – Mike McKerns Dec 22 '21 at 13:58
  • Thanks both, I have this working now with base multiprocessing; I think this avoids the need for pathos for now. – Tarquinnn Dec 22 '21 at 15:34
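Putting the `initializer` suggestion together, a minimal sketch might look like the following. `make_object` here is a hypothetical stand-in for the real (unpicklable) PySwig factory; the point is that each worker builds its own instance locally, so the object itself is never pickled, and all functions handed to `Pool` are top-level names:

```python
from multiprocessing import Pool

def make_object():
    # Hypothetical stand-in for the unpicklable SWIG-generated
    # factory; replace with the real constructor from the wrapper.
    class Obj:
        def method(self, x):
            return x * 2
    return Obj()

_worker_obj = None  # one instance per worker process

def _init_worker():
    # Runs once in each worker when the Pool starts; the object is
    # created inside the worker, so it never crosses a pickle boundary.
    global _worker_obj
    _worker_obj = make_object()

def _call(x):
    # Top-level function, so the default pickle can serialize it
    # by reference when Pool.map dispatches work.
    return _worker_obj.method(x)

def run_parallel(xs, processes=8):
    with Pool(processes, initializer=_init_worker) as p:
        return p.map(_call, xs)

if __name__ == "__main__":
    print(run_parallel(range(5)))
```

Only the input items and the results are pickled between processes; the object creation cost is paid once per worker rather than once per item.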
