I have an embarrassingly parallel problem consisting of a bunch of tasks that are solved independently of each other. Solving each task is quite lengthy, so this is a prime candidate for multiprocessing.
The problem is that solving the tasks requires a specific object that is very time-consuming to create but can be reused for all of them (think of an external binary program that needs to be launched), so in the serial version I do something like this:
def costly_function(task, my_object):
    # Solve a single task, reusing the object that was created once up front.
    solution = solve_task_using_my_object(task, my_object)
    return solution

def solve_problem():
    my_object = create_costly_object()  # expensive, but created only once
    tasks = get_list_of_tasks()
    all_solutions = [costly_function(task, my_object) for task in tasks]
    return all_solutions
When I try to parallelize this program using multiprocessing, my_object cannot be passed as a parameter for a number of reasons (it cannot be pickled, and it should not run more than one task at the same time), so I have to resort to creating a separate instance of the object for each task:
import multiprocessing

def costly_function(task):
    # The object cannot be shared with the workers, so every task
    # pays the price of creating its own instance.
    my_object = create_costly_object()
    solution = solve_task_using_my_object(task, my_object)
    return solution

def psolve_problem():
    pool = multiprocessing.Pool()
    tasks = get_list_of_tasks()
    all_solutions = pool.map_async(costly_function, tasks)
    return all_solutions.get()
but the added cost of creating multiple instances of my_object makes this code only marginally faster than the serial one.
If I could create a separate instance of my_object in each process and then reuse it for all the tasks that run in that process, my timings would improve significantly. Any pointers on how to do that?
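Something along these lines is what I have in mind, in case it helps to show what I mean (just a sketch: _worker_object and _init_worker are names I made up, and create_costly_object / solve_task_using_my_object are the same placeholders as above). The idea would be to build the object once per worker process, e.g. through the initializer argument of multiprocessing.Pool, and have costly_function reuse it:

import multiprocessing

# Per-process global: each worker process gets its own copy,
# created once by the initializer below.
_worker_object = None

def _init_worker():
    global _worker_object
    _worker_object = create_costly_object()

def costly_function(task):
    # Reuse the object that this worker process created at startup.
    return solve_task_using_my_object(task, _worker_object)

def psolve_problem():
    tasks = get_list_of_tasks()
    pool = multiprocessing.Pool(initializer=_init_worker)
    all_solutions = pool.map_async(costly_function, tasks)
    return all_solutions.get()

Is an approach like this the right way to do it, or is there a better mechanism I am missing?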