
I want to make use of Numba's automatic parallelization with @jit to speed up my function, since I am using very large inputs. I have no prior experience with parallelization. I tried several solutions using multiprocessing and ran into many issues; I posted a question about it on Stack Overflow but there was almost no response, so I moved to something that can help me automatically and looks much simpler.

I saw a simple example of using Numba's @jit:

from numba import jit

@jit(nopython=True, parallel=True)
def f(x, y):
    return x + y

So I tried to use it in my code:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def defineMatrices(C0x, C0y, C0xy, C0yx, Cxx_err, Cyy_err, Cxy_err, Cyx_err, dCx, dCy, dCxy,dCyx):
    Nk = len(dCx)  # number of free parameters
    Nm = len(C0x)  # number of measurements
    #print('NK:', Nk)
    #print('Nm:', Nm)
    Ax = np.zeros([Nk, Nk])
    Ay = np.zeros([Nk, Nk])
    Axy = np.zeros([Nk, Nk])
    Ayx = np.zeros([Nk, Nk])
    A = np.zeros([4 * Nk, Nk])
    ##
    Bx = np.zeros([Nk, 1])
    By = np.zeros([Nk, 1])
    Bxy = np.zeros([Nk, 1])
    Byx = np.zeros([Nk, 1])
    B = np.zeros([4 * Nk, 1])
    ##
    Dx = (Cxx_err[0:Nm, :] - C0x[0:Nm, :])  ### dk ?
    Dy = (Cyy_err[0:Nm, :] - C0y[0:Nm, :])
    Dxy = (Cxy_err[0:Nm, :] - C0xy[0:Nm, :])
    Dyx = (Cyx_err[0:Nm, :] - C0yx[0:Nm, :])
    for i in range(Nk):  ## i represents each quad
        # print('done A:', 100.* i ,'%')
        for j in range(Nk):
            Ax[i, j] = np.sum(np.dot(dCx[i], dCx[j].T))
            Ay[i, j] = np.sum(np.dot(dCy[i], dCy[j].T))
            Axy[i, j] = np.sum(np.dot(dCxy[i], dCxy[j].T))
            Ayx[i, j] = np.sum(np.dot(dCyx[i], dCyx[j].T))
        A[i, :] = Ax[i, :]
        A[i + Nk, :] = Ay[i, :]
        A[i + 2 * Nk, :] = Axy[i, :]
        A[i + 3 * Nk, :] = Ayx[i, :]
    ##
    for i in range(Nk):
        Bx[i] = np.sum(np.dot(dCx[i], Dx.T))
        By[i] = np.sum(np.dot(dCy[i], Dy.T))
        Bxy[i] = np.sum(np.dot(dCxy[i], Dxy.T))
        Byx[i] = np.sum(np.dot(dCyx[i], Dyx.T))
        B[i] = Bx[i]
        B[i + Nk] = By[i]
        B[i + 2 * Nk] = Bxy[i]
        B[i + 3 * Nk] = Byx[i]
    
    return A, B

My inputs are:

all C: two-dimensional matrices; all dC: three-dimensional matrices

Outputs:

A: two-dimensional; B: one-dimensional

But I get this error:

TypingError                               Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_17908/1125053646.py in <module>
      1 start_time = time.time()
----> 2 A, B = defineMatrices(C0x, C0y, C0xy, C0yx, Cxx1, Cyy1, Cxy1, Cyx1, dCx, dCy, dCxy,dCyx)
      3 end_time = time.time()
      4 
      5 # Calculate the time taken and print it

~\Anaconda3\lib\site-packages\numba\core\dispatcher.py in _compile_for_args(self, *args, **kws)
    418                 e.patch_message(msg)
    419 
--> 420             error_rewrite(e, 'typing')
    421         except errors.UnsupportedError as e:
    422             # Something unsupported is present in the user code, add help info

~\Anaconda3\lib\site-packages\numba\core\dispatcher.py in error_rewrite(e, issue_type)
    359                 raise e
    360             else:
--> 361                 raise e.with_traceback(None)
    362 
    363         argtypes = []

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function fill_diagonal at 0x0000019FF09F3940>) found for signature:
 
 >>> fill_diagonal(array(float64, 1d, C), float64)
 
There are 2 candidate implementations:
     - Of which 2 did not match due to:
     Overload in function 'np_fill_diagonal': File: numba\np\arraymath.py: Line 2928.
       With argument(s): '(array(float64, 1d, C), float64)':
      Rejected as the implementation raised a specific error:
        TypingError: The first argument must be at least 2-D (found 1-D)
  raised from C:\Users\musa\Anaconda3\lib\site-packages\numba\np\arraymath.py:2958

During: resolving callee type: Function(<function fill_diagonal at 0x0000019FF09F3940>)
During: typing of call at C:\Users\musa\AppData\Local\Temp/ipykernel_17908/2593834973.py (30)


File "..\..\..\AppData\Local\Temp\ipykernel_17908\2593834973.py", line 30:
<source missing, REPL/exec in use?>

A minimal reproducible example:

C0x = [[2.969129, 3.065011, 3.294439],
       [3.087682, 1.674345, 2.930011],
       [3.355765, 2.863887, 2.588477]]

C0y = [[6.748940, 9.673919, 7.662236],
       [9.527495, 11.328892, 10.147356],
       [7.738452, 10.280770, 7.684719]]

C0xy = [[0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]

C0yx = [[0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]

Cxx1 = [[2.970383, 3.065395, 3.295505],
        [3.088111, 1.674282, 2.930299],
        [3.356823, 2.864131, 2.589346]]

Cyy1 = [[6.733786, 9.656677, 7.646381],
        [9.510379, 11.309615, 10.129516],
        [7.722598, 10.262803, 7.668157]]

Cxy1 = [[0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]

Cyx1 = [[0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]

dCx = [np.array([[-0.88391347, -0.9165879 , -1.00445679],
        [-0.9102315 , -0.94390882, -1.03437531],
        [-0.99567547, -1.03249243, -1.13146452]]),
 np.array([[-0.93911752, -0.49819029, -0.88115751],
        [-0.51945798, -0.2755809 , -0.48739065],
        [-0.87413297, -0.46370905, -0.82018765]])]
dCy = [np.array([[4.54319981, 6.52256593, 5.20680836],
        [6.38565572, 9.16779592, 7.31840445],
        [5.19995576, 7.46547485, 5.95950073]]),
 np.array([[ 9.36071422, 11.09190807,  9.94111537],
        [10.98180048, 13.0128158 , 11.66270877],
        [ 9.93855597, 11.7766108 , 10.55478873]])]
dCxy = [np.array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 np.array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])]
dCyx = [np.array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 np.array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])]

A, B = defineMatrices(C0x, C0y, C0xy, C0yx, Cxx1, Cyy1, Cxy1, Cyx1, dCx, dCy, dCxy,dCyx)

Outputs:

A = [[26.203265, 17.026796],
     [17.026796, 11.763451],
     [1138.015961, 1913.702747],
     [1913.702747, 3238.589801],
     [0.0, 0.0],
     [0.0, 0.0],
     [0.0, 0.0],
     [0.0, 0.0]]

B = [-0.016327, -0.011957, -2.966823, -5.028449, 0.0, 0.0, 0.0, 0.0]
Mark Setchell
  • The code should already be parallelized since NumPy uses a BLAS implementation for `np.dot` (the default implementation is OpenBLAS, which is parallel). That being said, the parallelization might not be efficient. It is impossible to say without the sizes of all matrices. Besides, the Numba code seems to fail because of a function call which is not present in the provided code. In fact, the provided code is not correctly indented so it cannot run. Please provide a [complete minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). This one is not. – Jérôme Richard Mar 23 '23 at 20:44
  • I thought this code looked familiar and that I'd commented on it. Turns out you posted something similar a couple of days ago, and I commented on it - but you deleted it. https://stackoverflow.com/questions/75792840/numpy-and-parallel-processing. I too asked for a [mcve]. – hpaulj Mar 23 '23 at 22:11
  • There's nothing "automatic" about using `numba` or "parallelization". Without trying to understand your code, I suspect the `i,j` loops with lines like `np.sum(np.dot(dCx[i], dCx[j].T))` can be written with faster `numpy` code. `np.matmul` and `np.einsum` can be used to perform multidimensional "dot" without slow Python level iteration. – hpaulj Mar 23 '23 at 22:16
  • @hpaulj I have edited the code to include a minimal reproducible example. I had edited my previous post with an example too, but I thought it was not clear enough so I deleted it. – Accelerator Mar 24 '23 at 08:52
  • @JérômeRichard I have edited my post with an example. Do you mean that my code is already running at its maximum ability? Is there any other way to optimize the code so that it finishes fast with large input? – Accelerator Mar 24 '23 at 08:54
  • @hpaulj I thought that the for loops themselves cause the code to be slow, so you suggest that I keep the for loops and only modify the np.sum(np.dot(dCx[i], dCx[j].T)) using np.einsum? – Accelerator Mar 24 '23 at 08:56
  • @Accelerator No, being parallel does not mean being fast. In fact, BLAS functions can also be clearly sub-optimal for small matrices. Are the provided input sizes realistic ones? If so, using np.dot will be very inefficient. I am not sure Numba uses BLAS functions in this case (if so, this will also be quite slow -- but certainly significantly faster than NumPy). – Jérôme Richard Mar 24 '23 at 12:57
  • @Accelerator If you use large matrices, the BLAS matrix multiplication will be very good and I doubt Numba could help unless you try to write a faster implementation than the BLAS one (which is possible but pretty hard in Numba -- I did it once). Anyway, for large matrices the current code should use the hardware rather efficiently (including multiple cores). That being said, I am not even sure the matrix multiplications are actually needed... – Jérôme Richard Mar 24 '23 at 13:09
  • @Accelerator The current code does not run in Numba because you use lists of arrays and not NumPy arrays. Numba does not support reflected lists. You need to convert the inputs to NumPy arrays (in fact, not converting them is inefficient in the pure-Python NumPy version too). Overall, the goal of this question is to parallelise the code, but it is already parallel, and the Numba issues seem to have nothing to do with parallelism. In fact, you certainly want a faster code, not just a parallel one. Maybe you should adapt the focus of the question, or write a more-focused new one. – Jérôme Richard Mar 24 '23 at 13:10

2 Answers


In your example dCx is a list of 2 (3,3) arrays. Let's make such a list:

In [113]: dcx = [np.arange(9).reshape(3,3), np.arange(10,19).reshape(3,3)]; dcx
Out[113]: 
[array([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]),
 array([[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]])]

And the Ax part of your double loop:

In [114]: Ax = np.zeros((2,2))
     ...: for i in range(2):
     ...:         for j in range(2):
     ...:             Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))

In [115]: Ax
Out[115]: 
array([[ 450., 1530.],
       [1530., 5310.]])

What I was trying to suggest was that you could replace the double loop with a "vectorized" operation. einsum, matmul, and even dot can work with higher dimensions.

Your list can be cast as a 3d array:

In [116]: x =np.array(dcx); x
Out[116]: 
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]]])

Here's an equivalent to Ax using einsum:

In [117]: np.einsum('ikl,jml->ij', x,x)
Out[117]: 
array([[ 450, 1530],
       [1530, 5310]])

On further thought I can also use np.dot:

In [118]: np.dot(x,x.transpose(0,2,1)).sum(axis=(1,3))
Out[118]: 
array([[ 450, 1530],
       [1530, 5310]])

The matmul version is a bit messier (it has different rules for the "batch" dimensions):

In [121]: (x[:,None,:,:]@x.transpose(0,2,1)[None,:,:,:]).sum(axis=(2,3))
Out[121]: 
array([[ 450, 1530],
       [1530, 5310]])

Of these, the einsum is fastest, and noticeably better than the double loop nest:

In [124]: timeit np.einsum('ikl,jml->ij', x,x)
13 µs ± 22.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [125]: %%timeit
     ...: Ax = np.zeros((2,2))
     ...: for i in range(2):
     ...:         for j in range(2):
     ...:             Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))
91.2 µs ± 696 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

And using your sample dCx:

In [127]: dCx = [np.array([[-0.88391347, -0.9165879 , -1.00445679],
     ...:         [-0.9102315 , -0.94390882, -1.03437531],
     ...:         [-0.99567547, -1.03249243, -1.13146452]]),
     ...:  np.array([[-0.93911752, -0.49819029, -0.88115751],
     ...:         [-0.51945798, -0.2755809 , -0.48739065],
     ...:         [-0.87413297, -0.46370905, -0.82018765]])]

In [128]: x=np.array(dCx)

In [129]: np.einsum('ikl,jml->ij', x,x)
Out[129]: 
array([[26.20326497, 17.02679642],
       [17.02679642, 11.7634506 ]])
hpaulj
  • Thank you for the clear explanation. I have tested your suggestion with a larger example where dCx has shape (40, 40, 40) and A has shape (160, 40). I ran my old code and compared the time with the new code, but somehow I noticed that the time taken by the second code was more. However, I am talking about seconds; for my real test I need to make sure that the time will not increase. I am keeping the B loop as it is. @hpaulj – Accelerator Mar 25 '23 at 18:24

I have implemented the np.einsum solution. Here dCx has shape (40, 40, 40) and A has shape (160, 40). First, my original code with the two for loops:

lt = time.time()

for i in range(Nk):  ## i represents each quad
    # print('done A:', 100.* i ,'%')
    for j in range(Nk):
        Ax[i, j] = np.sum(np.dot(dCx[i], dCx[j].T))
        Ay[i, j] = np.sum(np.dot(dCy[i], dCy[j].T))
        Axy[i, j] = np.sum(np.dot(dCxy[i], dCxy[j].T))
        Ayx[i, j] = np.sum(np.dot(dCyx[i], dCyx[j].T))
    A[i, :] = Ax[i, :]
    A[i + Nk, :] = Ay[i, :]
    A[i + 2 * Nk, :] = Axy[i, :]
    A[i + 3 * Nk, :] = Ayx[i, :]
le = time.time()
elapsed_time = le - lt
print('Execution time:', elapsed_time, 'seconds')
Execution time: 0.43982982635498047 seconds

Then the np.einsum version:

lt = time.time()

x = np.array(dCx)
Ax = np.einsum('ikl,jml->ij', x, x)
x = np.array(dCy)
Ay = np.einsum('ikl,jml->ij', x, x)
x = np.array(dCxy)
Axy = np.einsum('ikl,jml->ij', x, x)
x = np.array(dCyx)
Ayx = np.einsum('ikl,jml->ij', x, x)
A = np.vstack([Ax, Ay, Axy, Ayx])

le = time.time()
elapsed_time = le - lt
print('Execution time:', elapsed_time, 'seconds')
Execution time: 1.488039493560791 seconds

I did not expect the time taken by the einsum solution to be longer than the for loops.
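One possible reason, as a guess: the einsum timing above includes the cost of the four `np.array(...)` conversions inside the timed region, and `np.einsum` without the `optimize` flag can pick a slow contraction order. A minimal sketch that pre-converts the data outside any timing and uses `optimize=True` (random data standing in for the question's arrays; timings will vary by machine):

```python
import numpy as np

rng = np.random.default_rng(0)
Nk, Nm = 40, 40
dCx = rng.standard_normal((Nk, Nm, Nm))   # pre-converted 3-D array, built once

# Double-loop version, as in the question
Ax_loop = np.zeros((Nk, Nk))
for i in range(Nk):
    for j in range(Nk):
        Ax_loop[i, j] = np.sum(np.dot(dCx[i], dCx[j].T))

# einsum with optimize=True lets NumPy choose a faster contraction order
Ax_ein = np.einsum('ikl,jml->ij', dCx, dCx, optimize=True)

print(np.allclose(Ax_loop, Ax_ein))  # True
```

Timing only the two contraction steps (not the list-to-array conversion) would give a fairer comparison of the loop and einsum approaches.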