I have an embarrassingly parallel problem consisting of a bunch of tasks that are solved independently of each other. Solving each task is quite lengthy, so this is a prime candidate for multiprocessing.
The problem is that solving the tasks requires a specific object that is very time-consuming to create but can be reused for all of them (think of an external binary program that needs to be launched), so in the serial version I do something like this:
def costly_function(task, my_object):
    # Solve a single task, reusing the object that was created once up front.
    solution = solve_task_using_my_object(task, my_object)
    return solution

def solve_problem():
    my_object = create_costly_object()  # expensive, but created only once
    tasks = get_list_of_tasks()
    all_solutions = [costly_function(task, my_object) for task in tasks]
    return all_solutions
When I try to parallelize this program using multiprocessing, my_object cannot be passed as a parameter for a number of reasons (it cannot be pickled, and it should not run more than one task at the same time), so I have to resort to creating a separate instance of the object for each task:
import multiprocessing

def costly_function(task):
    # The object cannot be shared with the workers, so every task
    # pays the price of creating its own instance.
    my_object = create_costly_object()
    solution = solve_task_using_my_object(task, my_object)
    return solution

def psolve_problem():
    pool = multiprocessing.Pool()
    tasks = get_list_of_tasks()
    all_solutions = pool.map_async(costly_function, tasks)
    return all_solutions.get()
but the added cost of creating multiple instances of my_object makes this code only marginally faster than the serial one.
If I could create a separate instance of my_object in each process and then reuse it for all the tasks that run in that process, my timings would improve significantly. Any pointers on how to do that?
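Something along these lines is what I have in mind, in case it helps to show what I mean (just a sketch: _worker_object and _init_worker are names I made up, and create_costly_object / solve_task_using_my_object are the same placeholders as above). The idea would be to build the object once per worker process, e.g. through the initializer argument of multiprocessing.Pool, and have costly_function reuse it:

import multiprocessing

# Per-process global: each worker process gets its own copy,
# created once by the initializer below.
_worker_object = None

def _init_worker():
    global _worker_object
    _worker_object = create_costly_object()

def costly_function(task):
    # Reuse the object that this worker process created at startup.
    return solve_task_using_my_object(task, _worker_object)

def psolve_problem():
    tasks = get_list_of_tasks()
    pool = multiprocessing.Pool(initializer=_init_worker)
    all_solutions = pool.map_async(costly_function, tasks)
    return all_solutions.get()

Is an approach like this the right way to do it, or is there a better mechanism I am missing?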