First of all:
"I am running this code on a 4-core machine, so in principle performing the calculations independently for the two populations should give me roughly a 2x speedup."
No. In general, a speed-up that scales linearly with the number of workers is the absolute best-case scenario, and it assumes that:
- The limiting factor determining how long it takes to execute your code is the number of CPU clock cycles required (i.e. not disk I/O, memory allocation etc.)
- There is no part of your code that cannot be parallelized
- There is no additional overhead incurred due to parallelization
In practice these criteria are never fully met, so you shouldn't automatically assume that a linear speed-up is possible.
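To put a rough number on that, Amdahl's law gives the theoretical upper bound on the speedup when only a fraction p of the runtime can be parallelized across n workers (a quick back-of-the-envelope sketch, not specific to your code):

def amdahl_speedup(p, n):
    """Upper bound on speedup when a fraction p of the runtime is
    spread perfectly across n workers and the rest stays serial."""
    return 1.0 / ((1.0 - p) + p / n)

# Even if 90% of the work parallelizes perfectly, 4 cores buy you
# only about a 3.1x speedup, and 2 workers only about 1.8x:
print(amdahl_speedup(0.9, 4))  # ~3.08
print(amdahl_speedup(0.9, 2))  # ~1.82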
Having said that, your parallel example can be sped up quite a lot by getting rid of the intermediate functools.partial object:
def parallel_1(some_group_property, occupation_list):
    """Original version, using an intermediate functools.partial."""
    exp_partial = functools.partial(expand_data, some_group_property,
                                    occupation_list)
    return Parallel(n_jobs=2)(
        delayed(exp_partial)(i)
        for i in range(len(occupation_list))
    )
def parallel_2(some_group_property, occupation_list):
    """Without functools.partial: pass the function and its
    arguments to delayed() directly."""
    return Parallel(n_jobs=2)(
        delayed(expand_data)(some_group_property, occupation_list, i)
        for i in range(len(occupation_list))
    )
In [40]: %timeit parallel_1(some_group_property, occupation_list)
1 loops, best of 3: 7.24 s per loop
In [41]: %timeit parallel_2(some_group_property, occupation_list)
1 loops, best of 3: 375 ms per loop
In Python's multiprocessing, the function and its arguments are pickled before being sent to the worker processes, where they are unpickled and the function is executed. I suspect the slow-down has something to do with functools.partial objects being more expensive to pickle/unpickle, although I'm really not sure exactly what the cause is.
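If you want to poke at that hypothesis yourself, you can compare pickling a partial object against pickling the bare function and its argument (a rough probe with made-up data; note that this uses the standard pickle module on Python 3, whereas joblib uses its own pickler, so treat the results as indicative only):

import pickle
import functools
import numpy as np

# Hypothetical test data standing in for the real inputs
arr = np.random.rand(10**5)

# What parallel_1 ships per task: a partial object wrapping the
# function and the array together...
partial_payload = pickle.dumps(functools.partial(np.repeat, arr))

# ...versus the bare function and the array pickled side by side,
# roughly what happens when you skip the partial
plain_payload = pickle.dumps((np.repeat, arr))

print(len(partial_payload), len(plain_payload))

You could also %timeit a pickle.loads(pickle.dumps(...)) round trip for each variant to compare serialization time rather than payload size.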
You can also do a tiny bit better by passing each worker only the single "occupation" array it needs, rather than the list containing all of them:
def parallel_3(some_group_property, occupation_list):
    """Pass each worker only the single occupation array it needs."""
    return Parallel(n_jobs=2)(
        delayed(np.repeat)(some_group_property, oo)
        for oo in occupation_list
    )
In [44]: %timeit parallel_3(some_group_property, occupation_list)
1 loops, best of 3: 353 ms per loop
However, this still isn't even close to matching the performance of the plain serial version:
def serial_version(some_group_property, occupation_list):
    """Single-process reference implementation."""
    return [np.repeat(some_group_property, oo)
            for oo in occupation_list]
In [46]: %timeit serial_version(some_group_property, occupation_list)
10 loops, best of 3: 46.1 ms per loop
This probably just means that the extra overhead involved in parallelization (spawning the two worker processes, pickling/unpickling the functions and their arguments, etc.) greatly outweighs the performance gain from computing the two arrays in parallel.
You might see some benefit from parallelization for much larger arrays, where the parallel version spends proportionally more time performing the useful computation and less time just setting up and tearing down the worker processes.
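If you want to find out where (or whether) that crossover happens for your problem, you could time the serial and parallel versions on synthetic inputs of increasing size, along these lines (make_problem and the sizes here are made up, and it reuses serial_version and parallel_3 from above):

import timeit
import numpy as np

def make_problem(n):
    """Hypothetical test data: one property array of length n and
    two occupation arrays of matching length."""
    some_group_property = np.random.rand(n)
    occupation_list = [np.random.randint(0, 5, n) for _ in range(2)]
    return some_group_property, occupation_list

for n in (10**4, 10**5, 10**6, 10**7):
    prop, occ = make_problem(n)
    t_serial = timeit.timeit(lambda: serial_version(prop, occ), number=3)
    t_parallel = timeit.timeit(lambda: parallel_3(prop, occ), number=3)
    print(n, t_serial, t_parallel)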