Context

I am trying to use multiprocessing, specifically Pool().starmap(), within a method of a class. Say the file containing the Quant class is model.py. I'd like to import Quant from another file, tester.py, declare an instance myobj, and call myobj.calculate(). My overall goal is for tester.py to run as fast as possible with the cleanest syntax I can manage.

model.py

import multiprocessing
from optcode import optimizer

# f lives at module level so the pool's default pickler can send it to the workers
def f(arg1, arg2, arg3, arg4):
    return arg1.run(arg_a=arg2, arg_b=arg3, arg_c=arg4)

class Quant:
    def __init__(self, name):
        self.name = name

    def calculate(self):
        optimization = optimizer()
        # x, y, z, q, r, l, m, n stand in for the real per-task inputs
        args = [(optimization, x, y, z), (optimization, q, r, z), (optimization, l, m, n)]

        cpus = multiprocessing.cpu_count()
        with multiprocessing.Pool(processes=cpus) as pool:
            tasks = pool.starmap(f, args)

        self.results = tasks

tester.py

from model import Quant

if __name__ == '__main__':
   
   myobj = Quant('My Instance')
   myobj.calculate()
   print(myobj.results)

Questions

  1. From what I can tell, the myobj.calculate() call needs to sit under if __name__ == '__main__': to keep everything from freezing (a fork bomb?). It also appears that moving the if __name__ == '__main__': line up to cover everything in tester.py (as in the example above) prevents Python from re-importing (and re-executing) tester.py once per CPU when I run tester.py a single time. Is my understanding correct, and is there an alternative to needing these guards (see the first sketch after this list)? I'm trying to set things up for less savvy users, and my actual application has much more computationally intensive code in both Quant.__init__() and Quant.calculate().
  2. How else can I speed this up further? I have read that pathos.multiprocessing has superior serialization; would that help in this context? In my actual application, args[0] is a tuple of pandas DataFrames and floats. I have also read that a different kind of mapping over the processing pool may be more efficient. I do need the pool's results ordered the same way as args, but I don't need intermediate results from each worker process; every task is completely independent of the others. Would something like imap() or map_async() make for a more efficient setup? I haven't been able to get the pathos syntax to work, even after reading every example I can find. The sticking point seems to be that each element of args is a tuple of arguments while args itself is a list. I know pathos.multiprocessing.ProcessPool().map() can handle multi-argument functions, but I can't figure out how to call it given the structure of my inputs (see the second sketch below this list).
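For question 1, one alternative I have been looking at is a minimal sketch that assumes a Unix-like OS where the 'fork' start method is available (fork does not exist on Windows, and macOS has defaulted to 'spawn' since Python 3.8); the calculate(self, args) signature here is just for the sketch. Forked workers inherit the parent's memory instead of re-importing the __main__ module, so tester.py would not be re-executed by the children.

import multiprocessing

def f(arg1, arg2, arg3, arg4):
    return arg1.run(arg_a=arg2, arg_b=arg3, arg_c=arg4)

class Quant:
    def __init__(self, name):
        self.name = name

    def calculate(self, args):
        # args: a list of per-task argument tuples, e.g. [(optimizer_obj, x, y, z), ...]
        ctx = multiprocessing.get_context('fork')   # explicit start method; Unix-only
        with ctx.Pool(processes=ctx.cpu_count()) as pool:
            self.results = pool.starmap(f, args)
        return self.results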
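For question 2, this is my best guess at the pathos call given the structure of my inputs (a minimal sketch, assuming pathos is installed; the helper name run_with_pathos is made up). As far as I can tell, pathos' ProcessPool.map takes one iterable per function argument, unlike the stdlib Pool.map, so the list of tuples has to be transposed with zip(*args) first; results come back in the same order as the inputs.

from pathos.multiprocessing import ProcessPool

def f(arg1, arg2, arg3, arg4):
    return arg1.run(arg_a=arg2, arg_b=arg3, arg_c=arg4)

def run_with_pathos(args):
    # transpose [(a1, b1, c1, d1), (a2, b2, c2, d2), ...]
    # into (a1, a2, ...), (b1, b2, ...), (c1, c2, ...), (d1, d2, ...)
    pool = ProcessPool()                  # defaults to one worker per CPU
    try:
        return pool.map(f, *zip(*args))   # ordered, starmap-like behaviour
    finally:
        pool.close()
        pool.join()

For ordering, my understanding is that starmap already returns results in input order; starmap_async(...).get() and a chunksize argument would mainly help when there are many small tasks, while imap (ordered but single-argument) would need a wrapper that unpacks each tuple.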
Seabird86
  • It seems you want an ensemble of optimizers running in parallel. You might try `mystic`, which can utilize `pathos` for just such a thing. If you prefer the `multiprocessing` interface over that in `pathos`, you can use `multiprocess` to get the same serialization benefits with the same interface as `multiprocessing`. – Mike McKerns Jun 27 '20 at 10:59
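Below is a minimal sketch of the drop-in swap suggested in the comment above, assuming the multiprocess package is installed (pip install multiprocess); it keeps the multiprocessing interface but serializes with dill, so only the import line changes relative to the stdlib example.

import multiprocess as multiprocessing   # dill-based fork of the stdlib module

def f(x, y):
    return x + y

if __name__ == '__main__':
    args = [(1, 2), (3, 4), (5, 6)]
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        print(pool.starmap(f, args))      # -> [3, 7, 11]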

0 Answers