I need to perform a calculation on two multidimensional arrays whose components are convolved with each other. I have written a plain Python function that does this with nested loops (which is expected to be slow) and an equivalent one using Numba, which is supposed to be efficient for exactly this kind of nested-loop code. I must be doing something very wrong, because the Numba implementation takes much longer than the plain Python function. Here is the code example:

import numpy as np
from numba import jit

@jit( nopython=True )
def convolve_arrays_jit( A, B, T, dx, md, sd, M, mid_indices ):
    
    a_dtype = A.dtype      
    ms      = 2*M
    mt      = 4*M
    N       = B.shape[1]
    mid_N   = int( N/2 )
    
    res     = np.zeros( N , a_dtype )
        
    for m1 in range( md ):        
        m1_index = m1 + md        
        for m2 in range( md ):
            m2_index = m2 + md
            for gs in range( 9 ):
                
                aindx = mid_indices[ gs, 0 ]
                bindx = mid_indices[ gs, 1 ]
                
                As    = A[ aindx ]
                Bs    = B[ bindx ]
                
                for i in range( sd ):
                    for j in range( sd ):
                        for k in range( sd ):
                            for l in range( sd ):                                
                                for n in range( 0, mt ):       
                                    
                                    a_sel  =  As[ :, - m1_index - n + ms, -m1_index, i , j  ]
                                    b_sel  =  Bs[ :, m2_index + n - ms  ,  m2_index, k , l  ]                          
                                    res    = dx*np.convolve( a_sel , b_sel )[ mid_N:mid_N + N ]                                    
                                    T[ m1,m2,gs,i,j,k,l, : ] += res
    return T        


def convolve_arrays( A, B, T, dx, N, md, sd, M, mid_indices ):
    
    a_dtype = A.dtype
        
    mid_N   = int( N/2 )
    ms      = 2*M
    mt      = 4*M
    res     = np.zeros( N , a_dtype )
    
    for m1 in range( md ):        
        m1_index = m1 + md        
        for m2 in range( md ):
            m2_index = m2 + md
            for gs in range( 9 ):
                
                aindx = mid_indices[ gs, 0 ]
                bindx = mid_indices[ gs, 1 ]
                
                As    = A[ aindx, :, :, :, :, : ]
                Bs    = B[ bindx, :, :, :, :, : ]
                
                for i in range( sd ):
                    for j in range( sd ):
                        for k in range( sd ):
                            for l in range( sd ):                                
                                for n in range( 0, mt ):                                    
                                    res = ( dx*np.convolve( As[ :, - m1_index - n + ms, -m1_index, i , j  ]  , Bs[ :, m2_index + n - ms, m2_index, k, l ] ) )[ mid_N:mid_N + N ]
                                    T[ m1,m2,gs,i,j,k,l, : ] += res
    return T   

N  = 3**7
T1 = np.zeros( ( 3, 3, 9, 3, 3, 3, 3, N ), dtype=np.complex128  ) 
T2 = np.zeros( ( 3, 3, 9, 3, 3, 3, 3, N ), dtype=np.complex128  ) 


mid_indices = np.array( [  np.array( [ np.array( [ i,j ] ) for j in range(3)] ) for i in range(3) ] )
mid_indices = np.reshape( mid_indices, (9,2) )

A = np.random.rand( 3, N, 7,7,3,3  ) + 1j*np.random.rand( 3, N, 7,7,3,3  )
B = np.random.rand( 3, N, 7,7,3,3  ) + 1j*np.random.rand( 3, N, 7,7,3,3  )

dx = 0.01

from time import time

ts = time()

T1 = convolve_arrays( A, B, T1, dx, N, 3, 3, 1, mid_indices )

print( time() - ts )

# first call: triggers JIT compilation, so it is not timed
T2 = convolve_arrays_jit( A, B, T2, dx, 3, 3, 1, mid_indices )


# reset T2 and time the compiled version
T2 = np.zeros( ( 3, 3, 9, 3, 3, 3, 3, N ), dtype=np.complex128 )

ts = time()
T2 = convolve_arrays_jit( A, B, T2, dx, 3, 3, 1, mid_indices )

print( time() - ts )
print( np.allclose( T1, T2 ) )

Why is this Numba implementation so slow? Any help would be appreciated, as I am very new to Numba; suggestions for other, more efficient ways to perform the calculation are also welcome.
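
For context, the [ mid_N:mid_N + N ] slice inside the loops simply keeps the centred, length-N part of the full convolution (the "same"-mode result, since N is odd here). A tiny stand-alone check with plain NumPy, using an arbitrary odd length just for illustration:

import numpy as np

N     = 27                          # arbitrary odd length, standing in for 3**7 in the real code
x     = np.random.rand( N )
y     = np.random.rand( N )

full  = np.convolve( x, y )         # "full" mode: length 2*N - 1
mid_N = int( N/2 )
same  = full[ mid_N:mid_N + N ]     # centred length-N segment, as sliced in the loops above

print( full.shape, same.shape )                                  # (53,) (27,)
print( np.allclose( same, np.convolve( x, y, mode="same" ) ) )   # True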

COMPARING SCIPY FFT-CONVOLVE WITH NUMBA

According to the Numba documentation, np.convolve is supported with only its first two arguments, which means the convolution is always done in "full" mode. So I tried the following minimal code example, and Numba is still much slower than the SciPy implementation, fftconvolve:

import numpy as np
from numba import jit
from scipy.signal import fftconvolve
from time import time

@jit( nopython=True )
def conv_jitted( a, b, d, N ):

    # row-wise full convolution; N is the full output length ( 2*len(a[k]) - 1 )
    res = np.empty( ( d, N ), dtype=np.complex128 )

    for k in range( d ):
        res[ k ] = np.convolve( a[k], b[k] )
    return res
    

nd = 1000
N  = 3**7
Nf = 2*N - 1

a = np.random.rand( nd, N ) + 1j*np.random.rand( nd, N  )
b = np.random.rand( nd, N ) + 1j*np.random.rand( nd, N  )


t1 = time()

res  = fftconvolve( a, b, mode="full", axes=-1)

print( time()-t1 )

res2 = conv_jitted( a, b, nd, Nf )   # first call: triggers JIT compilation, not timed

t2 = time()

res2 = conv_jitted( a, b, nd, Nf )   # second call: times the compiled function

print( time()-t2 )

print( np.allclose( res, res2 ) )

Both results end up being the same, but something still seems wrong with the Numba call in this code. Can anyone see what is going on?
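
For what it's worth, since the decorator is nopython=True, an object-mode fallback would raise an error instead of silently running slowly; after the warm-up call the compiled signatures can also be printed as a small sanity check (using the conv_jitted defined above):

# run after conv_jitted has been called at least once (and therefore compiled)
print( conv_jitted.signatures )   # list of compiled signatures
conv_jitted.inspect_types()       # annotated, typed source; no pyobject types should appear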

  • You might be interested in the Q&A posted [here](https://stackoverflow.com/questions/56992398/numba-on-nested-numpy-arrays). Note that not all NumPy functions are supported by Numba; make sure your code is compatible (otherwise you won't see any advantage). – Markus Mar 30 '21 at 20:13
  • Thanks for the link. The only NumPy functions being called are the zeros initialization and convolve, and according to the latest Numba release documentation both are supported as used in the code example above. I also thought this could be related to the slicing of the ndarrays, but apparently both basic slicing and indexing are fully supported for versions >0.51, so I am a bit puzzled by this. – Zarathustra Mar 30 '21 at 20:49
  • One thing that also comes to mind: how many test runs did you use, and did you try `%timeit`? On the first run the C code for Numba has to be built, which takes time; you should time the compiled code. Another thing I can imagine is the problem I ran into [here](https://stackoverflow.com/questions/57055257/performance-decreases-with-increasing-nesting-of-array-elements): I found that higher matrix dimensions led to higher runtimes. Your shape is pretty intensive, so it might also be your problem (unfortunately, I never found a solution; if you ever find one, I'm interested :) ). – Markus Mar 30 '21 at 21:27
  • Sure, `%timeit` would be a better way to measure times over more runs; however, here I call the jitted function once and then call it a second time, which is where I measure the time. That second call should already (if I understand correctly how the @jit decorator works) use the compiled version, and therefore should be much faster. So, `%timeit` or not, I still think the problem is that the execution is running in object mode rather than nopython mode, or something like that. – Zarathustra Mar 30 '21 at 22:03
  • The two functions are quite different; they even have different signatures. Are you certain this is an apples-to-apples comparison? – senderle Mar 31 '21 at 09:33
  • Apparently there is an open issue for this on the Numba GitHub: https://github.com/numba/numba/issues/4119 – Jens Renders Apr 03 '21 at 18:41
  • I think you are comparing a fast FFT-based algorithm with O(n log n) complexity against one with quadratic complexity (O(n^2)). Complexity matters more than using a JIT for inputs as big as the ones used here. The FFT implementation of NumPy should already be optimized, so it should not be faster with Numba (which obviously does not use an FFT to perform the convolution). – Jérôme Richard Aug 08 '21 at 11:26
  • Here is a related post with some news about `np.convolve`: https://stackoverflow.com/questions/70311592/numba-np-convolve-really-slow/70312564 – Jérôme Richard Dec 11 '21 at 04:56
