Context
I am trying to use multiprocessing, specifically Pool().starmap(), within a specific method of a class. Let's call the file containing the Quant class model.py. I'd like to be able to import Quant, declare an instance of it, myobj, and call myobj.calculate() from another file called tester.py. My overall goal is to get tester.py to run as fast as possible, with the cleanest possible syntax.
model.py
import multiprocessing
from optcode import optimizer


def f(arg1, arg2, arg3, arg4):
    return arg1.run(arg_a=arg2, arg_b=arg3, arg_c=arg4)


class Quant:
    def __init__(self, name):
        self.name = name

    def calculate(self):
        optimization = optimizer()
        # placeholder arguments; in my real application these are DataFrames and floats
        args = [(optimization, x, y, z), (optimization, q, r, z), (optimization, l, m, n)]
        cpus = multiprocessing.cpu_count()
        with multiprocessing.Pool(processes=cpus) as pool:
            tasks = pool.starmap(f, args)
        self.results = tasks
tester.py
from model import Quant

if __name__ == '__main__':
    myobj = Quant('My Instance')
    myobj.calculate()
    print(myobj.results)
Questions
- From what I can tell, the myobj.calculate() line needs to be within if __name__ == '__main__': to prevent everything from freezing (a fork bomb?). It appears that when I move the if __name__ == '__main__': line up to include everything within tester.py (as in the example above), it prevents Python from re-importing (and executing) tester.py once per CPU when I execute tester.py once. Is my understanding correct, and is there an alternative to needing these conditions? I'm trying to set things up for less savvy users, and my actual application has more computationally intensive code within both Quant.__init__() and Quant.calculate(). (One alternative I came across is sketched after this list.)
- How else can I speed this up (more)? I have read that pathos.multiprocessing has superior serialization. Would that help in this context? In my actual application, args[0] is a tuple of pandas DataFrames and floats. I have also read that there may be a better type of mapping to use with the processing pool. I do need the results of the pool to be ordered the same way as args, but I don't need intermediate results from each worker process; each process is totally independent of the others. Would something like imap() or map_async() make for a more efficient setup? I haven't been able to get the syntax for pathos to work, and I've read all the examples I can find. Something about args[0] being a tuple of arguments, and args itself being an iterable (list), seems to be the issue. I know pathos.multiprocessing.ProcessPool().map() can handle multi-arg functions, but I can't figure out how to use it given the structure of my inputs (my best guess at the syntax is sketched after this list).
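Regarding the first question, one alternative I came across is forcing the fork start method inside calculate(). As I understand it, forked workers inherit the parent process instead of re-importing __main__, so tester.py might not need the guard, but fork is only available on Linux/macOS and I'm not sure it's advisable. This is just a sketch of what I mean, reusing f, optimizer, and the placeholder arguments from my model.py above:

    def calculate(self):
        optimization = optimizer()
        args = [(optimization, x, y, z), (optimization, q, r, z), (optimization, l, m, n)]
        # 'fork' workers inherit the parent's memory rather than re-importing
        # the main module, so (on Linux) tester.py may not need the
        # if __name__ == '__main__': guard. Not available on Windows.
        ctx = multiprocessing.get_context('fork')
        with ctx.Pool(processes=multiprocessing.cpu_count()) as pool:
            self.results = pool.starmap(f, args)

Would something like this be a reasonable way to hide the guard from less savvy users, or is it asking for trouble?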
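Regarding the second question, here is roughly where I am stuck with pathos. From the examples I've seen, I believe ProcessPool().map() takes one iterable per positional argument (like the built-in map), so I think I would have to unzip my list of tuples. This is an untested sketch using the same f and args as in calculate() above:

    import multiprocessing
    from pathos.multiprocessing import ProcessPool

    # args is the same list of tuples as above:
    #   [(optimization, x, y, z), (optimization, q, r, z), (optimization, l, m, n)]
    # zip(*args) unzips it into one iterable per positional argument of f,
    # which is how I believe pathos' map() wants its inputs.
    pool = ProcessPool(nodes=multiprocessing.cpu_count())
    results = pool.map(f, *zip(*args))  # results should come back in the same order as args
    pool.close()
    pool.join()

Is that the intended pattern, or would sticking with the standard library and just passing a chunksize to starmap()/imap() be the simpler win, given that I only need the final results in order?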