
The issue

I am trying to optimise some calculations which lend themselves to so-called embarrassingly parallel execution, but I am finding that using Python's multiprocessing package actually slows things down.

My question is: am I doing something wrong, or is there an intrinsic reason why parallelisation actually slows things down? Is it because I am using numba? Would other packages like joblib or dask make much of a difference?

There are loads of similar questions, in which the answer is always that the overhead costs more than the time savings, but all those questions tend to revolve around very simple functions, whereas I would have expected something with nested loops to lend itself better to parallelisation. I have also not found comparisons among joblib, multiprocessing and dask.
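For reference, this is roughly how I would expect the same chunked map to look with joblib (I have not benchmarked it; the function name my_func_parallel_joblib is just for illustration):

import numpy as np
from joblib import Parallel, delayed

def my_func_parallel_joblib(inp, func, cpus=6):
    # same idea as the multiprocessing version further below: one chunk per
    # worker, map func over the chunks, then stack the partial results
    chunks = np.array_split(inp, cpus)
    out = Parallel(n_jobs=cpus)(delayed(func)(chunk) for chunk in chunks)
    return np.vstack(out)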

My function

I have a function which takes a one-dimensional numpy array of shape (n,) as argument, and outputs a numpy array of shape (n, t), where each row of the output is independent, i.e. row 0 of the output depends only on item 0 of the input, row 1 only on item 1, and so on. Something like this:

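Roughly (f and t here are just placeholders for the real per-element calculation and the number of output columns, not my actual code):

import numpy as np

t = 10  # placeholder number of output columns

def f(xi):
    # placeholder for the real per-element calculation, returning t values
    return np.full(t, xi)

def my_fun(x):
    # x has shape (n,); the output has shape (n, t),
    # and row i depends only on x[i]
    out = np.empty((len(x), t))
    for i in range(len(x)):
        out[i, :] = f(x[i])
    return out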

The underlying calculation is optimised with numba, which speeds things up by several orders of magnitude.

Toy example - results

I cannot share the exact code, so I have come up with a toy example. The calculation defined in my_fun_numba is actually irrelevant; it's just some very banal number crunching to keep the CPU busy.

With the toy example, the results on my PC are shown below, and they are very similar to what I get with my actual code.

[Screenshot: timing results table]

As you can see, splitting the input array into different chunks and sending each of them to multiprocessing.Pool actually slows things down vs just using numba on a single core.

What I have tried

I have tried various combinations of the cache and nogil options in the numba.jit decorator, but the difference is minimal.
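For reference, the combinations look like this when applied to the toy function below; calling numba.jit with keyword arguments and then applying it to a function is equivalent to using it as a decorator, and the variable names are just for illustration:

fast_plain = numba.jit(nopython=True)(my_fun_non_numba)
fast_cached = numba.jit(nopython=True, cache=True)(my_fun_non_numba)              # cache the compiled code
fast_nogil = numba.jit(nopython=True, nogil=True, cache=True)(my_fun_non_numba)   # also release the GIL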

I have profiled the code (not the timeit.Timer part, just a single run) with PyCharm and, if I understand the output correctly, it seems most of the time is spent waiting for the pool.

Sorted by time:

[Screenshot: profiling output, sorted by time]

Sorted by own time:

[Screenshot: profiling output, sorted by own time]

Toy example - the code

import numpy as np
import pandas as pd
import multiprocessing
from multiprocessing import Pool
import numba
import timeit


@numba.jit(nopython=True, nogil=True, cache=True)
def my_fun_numba(x):
    # banal per-element number crunching, just to keep the CPU busy;
    # row r of the output depends only on x[r]
    dim2 = 10
    out = np.empty((len(x), dim2))
    n = len(x)
    for r in range(n):
        for c in range(dim2):
            out[r, c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
    return out

def my_fun_non_numba(x):
    # identical body to my_fun_numba, but without the numba decorator,
    # to serve as the single-core pure-Python baseline
    dim2 = 10
    out = np.empty((len(x), dim2))
    n = len(x)
    for r in range(n):
        for c in range(dim2):
            out[r, c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
    return out


def my_func_parallel(inp, func, cpus=None):
    # split the input into one chunk per process, map func over the chunks,
    # then stack the partial results back into a single array
    if cpus is None:
        cpus = max(1, multiprocessing.cpu_count() - 1)

    inp_split = np.array_split(inp, cpus)
    pool = Pool(cpus)
    out = np.vstack(pool.map(func, inp_split))
    pool.close()
    pool.join()
    return out

if __name__ == "__main__":
    inputs = np.array([100, 10e3, 1e6]).astype(int)   # input sizes to test
    res = pd.DataFrame(index=inputs,
                       columns=['no paral, no numba', 'no paral, numba', 'numba 6 cores', 'numba 12 cores'])

    r = 3  # timeit repeats
    n = 1  # executions per repeat

    
    for i in inputs:
        my_arg = np.arange(0,i)

        
        res.loc[i, 'no paral, no numba'] = min(
            timeit.Timer("my_fun_non_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
            )
        
        res.loc[i, 'no paral, numba'] = min(
            timeit.Timer("my_fun_numba(my_arg)", globals=globals()).repeat(repeat=r, number=n)
            )
        
        res.loc[i, 'numba 6 cores'] = min(
            timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 6)", globals=globals()).repeat(repeat=r, number=n)
            )
        
        res.loc[i, 'numba 12 cores'] = min(
            timeit.Timer("my_func_parallel(my_arg, my_fun_numba, cpus = 12)", globals=globals()).repeat(repeat=r, number=n)
            )
  
Pythonista anonymous
  • The fact that your parallel running times are pretty much constant suggests that you simply haven't tried a large enough input to make up for the overhead of distributing the data to, and collecting the results from, multiple parallel instances. – chepner Feb 22 '21 at 22:30
  • @chepner But the overheads seem to be ca. 2 seconds on 6 cores and ca. 4 seconds on 12 cores. Isn't that too much? What exactly must happen every time you distribute to one more core and why does it take so long? is that because of some numba initialiation? – Pythonista anonymous Feb 22 '21 at 22:33
  • Most likely, the master process sends data to each other core as it starts, rather than shared memory being used. Also, with so few cores, a single core takes care of starting all the others, rather than there being a sort of fan where an exponentially larger number of cores gets initialized at each step. – chepner Feb 22 '21 at 22:41
  • It seems that your my_fun_numba is so fast that there is no need to parallelize it. If your my_fun_numba took 5 minutes to compute, it would be useful to parallelize, as the gain would overcome the overhead of distributing data and collecting results. This is normal. – Malo Feb 22 '21 at 22:42
  • "Most of the time is spent waiting for the pool" -- if you're only benchmarking the parent process and not the children, you'll _only_ see time waiting on RPC, and not the time spent actually doing the work. (That's not to say that the time spent serializing/deserializing/copying data around _isn't_ more than the time actually doing the work; we'd need to know more about your data and its shape to say, and in general, all that overhead is slow -- remember too that anything that's CPU-bound and implemented in native Python code holds the GIL, so the serialization/deserialization can be pricey). – Charles Duffy Feb 22 '21 at 23:51
  • It is not clear to me why you want to parallelize a numba function using multiprocessing. `for r in numba.prange(n):` and setting `parallel = True` is enough to get a multithreaded solution which has far less overhead. But on such a small problem even that may not be beneficial. – max9111 Feb 23 '21 at 08:25
  • @max9111 I have managed to optimise it a bit more with `prange` (see the sketch after these comments). Thank you for the tip. Please feel free to publish it as an answer. – Pythonista anonymous Feb 23 '21 at 21:32
  • Looks like your workflow is bound by serialization cost. Using `ThreadPool` instead of `Pool` might help here. – pavithraes Oct 06 '21 at 09:39

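Following max9111's suggestion, the prange version of the toy function would look roughly like this (a sketch of the approach, not my exact code):

@numba.jit(nopython=True, parallel=True)
def my_fun_numba_prange(x):
    # same banal calculation, but the outer loop over independent rows is
    # compiled to run across threads, so there is no pickling or process
    # start-up overhead
    dim2 = 10
    n = len(x)
    out = np.empty((n, dim2))
    for r in numba.prange(n):
        for c in range(dim2):
            out[r, c] = np.cos(x[r]) ** 2 + np.sin(x[r]) ** 2
    return out

pavithraes' ThreadPool idea could be tried by importing ThreadPool from multiprocessing.pool and using it in place of Pool inside my_func_parallel; since the function is compiled with nogil=True, the threads should be able to run the numba kernel concurrently without serialising the chunks.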
0 Answers