First of all:
"I am running this code on a 4-core machine, so in principle performing the calculations independently for the two populations should give me roughly a 2x speedup."
No. In general, a speed-up that scales linearly with the number of workers is the absolute best-case scenario, and it assumes that:
- The limiting factor determining how long it takes to execute your code is the number of CPU clock cycles required (i.e. not disk I/O, memory allocation etc.)
- There is no part of your code that cannot be parallelized
- There is no additional overhead incurred due to parallelization
In practice these criteria are never fully met, so you shouldn't automatically assume that a linear speed-up is possible.
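To put a rough number on that, Amdahl's law gives the theoretical upper bound on the speedup when only a fraction p of the runtime can be parallelized across n workers (a quick back-of-the-envelope sketch, not specific to your code):

def amdahl_speedup(p, n):
    """Upper bound on speedup when a fraction p of the runtime is
    spread perfectly across n workers and the rest stays serial."""
    return 1.0 / ((1.0 - p) + p / n)

# Even if 90% of the work parallelizes perfectly, 4 cores buy you
# only about a 3.1x speedup, and 2 workers only about 1.8x:
print(amdahl_speedup(0.9, 4))  # ~3.08
print(amdahl_speedup(0.9, 2))  # ~1.82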
Having said that, your parallel example can be sped up quite a lot by getting rid of the intermediate functools.partial object:
def parallel_1(some_group_property, occupation_list):
    """Original version, using an intermediate functools.partial."""
    exp_partial = functools.partial(expand_data, some_group_property,
                                    occupation_list)
    return Parallel(n_jobs=2)(
        delayed(exp_partial)(i)
        for i in range(len(occupation_list))
    )
def parallel_2(some_group_property, occupation_list):
    """Without functools.partial: pass the function and its
    arguments to delayed() directly."""
    return Parallel(n_jobs=2)(
        delayed(expand_data)(some_group_property, occupation_list, i)
        for i in range(len(occupation_list))
    )
In [40]: %timeit parallel_1(some_group_property, occupation_list)
1 loops, best of 3: 7.24 s per loop
In [41]: %timeit parallel_2(some_group_property, occupation_list)
1 loops, best of 3: 375 ms per loop
In Python's multiprocessing, the function and its arguments are pickled before being sent to the worker processes, where they are unpickled and the function is executed. I suspect the slow-down has something to do with functools.partial objects being more expensive to pickle/unpickle, although I'm really not sure exactly what the cause is.
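If you want to poke at that hypothesis yourself, you can compare pickling a partial object against pickling the bare function and its argument (a rough probe with made-up data; note that this uses the standard pickle module on Python 3, whereas joblib uses its own pickler, so treat the results as indicative only):

import pickle
import functools
import numpy as np

# Hypothetical test data standing in for the real inputs
arr = np.random.rand(10**5)

# What parallel_1 ships per task: a partial object wrapping the
# function and the array together...
partial_payload = pickle.dumps(functools.partial(np.repeat, arr))

# ...versus the bare function and the array pickled side by side,
# roughly what happens when you skip the partial
plain_payload = pickle.dumps((np.repeat, arr))

print(len(partial_payload), len(plain_payload))

You could also %timeit a pickle.loads(pickle.dumps(...)) round trip for each variant to compare serialization time rather than payload size.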
You can also do a tiny bit better by passing each worker only the single "occupation" array it needs, rather than the list containing all of them:
def parallel_3(some_group_property, occupation_list):
    """Pass each worker only the single occupation array it needs."""
    return Parallel(n_jobs=2)(
        delayed(np.repeat)(some_group_property, oo)
        for oo in occupation_list
    )
In [44]: %timeit parallel_3(some_group_property, occupation_list)
1 loops, best of 3: 353 ms per loop
However, this still isn't even close to matching the performance of the plain serial version:
def serial_version(some_group_property, occupation_list):
    """Single-process reference implementation."""
    return [np.repeat(some_group_property, oo)
            for oo in occupation_list]
In [46]: %timeit serial_version(some_group_property, occupation_list)
10 loops, best of 3: 46.1 ms per loop
This probably just means that the extra overhead involved in parallelization (spawning the two worker processes, pickling/unpickling the functions and their arguments, etc.) greatly outweighs the performance gain from computing the two arrays in parallel.
You might see some benefit from parallelization for much larger arrays, where the parallel version spends proportionally more time performing the useful computation and less time just setting up and tearing down the worker processes.
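If you want to find out where (or whether) that crossover happens for your problem, you could time the serial and parallel versions on synthetic inputs of increasing size, along these lines (make_problem and the sizes here are made up, and it reuses serial_version and parallel_3 from above):

import timeit
import numpy as np

def make_problem(n):
    """Hypothetical test data: one property array of length n and
    two occupation arrays of matching length."""
    some_group_property = np.random.rand(n)
    occupation_list = [np.random.randint(0, 5, n) for _ in range(2)]
    return some_group_property, occupation_list

for n in (10**4, 10**5, 10**6, 10**7):
    prop, occ = make_problem(n)
    t_serial = timeit.timeit(lambda: serial_version(prop, occ), number=3)
    t_parallel = timeit.timeit(lambda: parallel_3(prop, occ), number=3)
    print(n, t_serial, t_parallel)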