
I am implementing a simple matrix multiplication function with Numba and am finding it to be significantly slower than NumPy. In the example below, Numba is about 40x slower. Is there a way to speed the Numba version up further? Thanks in advance for your feedback.

import time
import numpy as np
import numba
from numba import njit, prange

@numba.jit('void(float64[:,:], float64[:,:], float64[:,:])', nopython=True, fastmath=True, parallel=True)
def matmul(matrix1, matrix2, rmatrix):
    a = matrix1.shape[0]  # rows of the result
    b = matrix2.shape[1]  # columns of the result
    c = matrix2.shape[0]  # shared inner dimension
    for i in prange(a):  # parallelize only the outermost loop
        for j in range(b):  # inner loops stay serial; Numba would
            for k in range(c):  # serialize nested prange anyway
                rmatrix[i, j] += matrix1[i, k] * matrix2[k, j]

M = np.random.normal(0, 10, (10, 10))**2
N = np.random.normal(0, 10, (10, 10))**2
A = np.zeros((10, 10))  # output buffer starts at zero
matmul(M, N, A)  # warm-up call so compilation is not timed below

n = 3000
M = np.random.normal(0, 10, (n, 1000))**2
N = np.random.normal(0, 10, (1000, n))**2
A = np.zeros((n, n))

t = time.time()
matmul(M,N,A)
print("Numba:", time.time()-t)

t = time.time()
np.dot(M, N)  # time the same product that matmul computes
print("NumPy:", time.time()-t)
user8768787
  • @max9111's answer [here](https://stackoverflow.com/questions/59347796/minimizing-overhead-due-to-the-large-number-of-numpy-dot-calls/59356461#59356461) explains why: BLAS (hand-tuned machine code, which `numpy.dot` calls) is significantly faster than the compiled code Numba generates once the matrices are larger than about 20x20, and yours are much larger. It's not really a dupe, but any answer here would be the same as or inferior to that one, so I'll point you there. – Daniel F Feb 11 '20 at 11:53
  • Does this answer your question? [Minimizing overhead due to the large number of Numpy dot calls](https://stackoverflow.com/questions/59347796/minimizing-overhead-due-to-the-large-number-of-numpy-dot-calls) – Daniel F Feb 11 '20 at 11:55
  • If you are interested in implementing a matrix-matrix product yourself, https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0 would be a good start. But Numba lacks a few features (such as explicit SIMD vectorization) that make it easier to implement an efficient matrix-matrix multiplication. If you want to use a BLAS implementation (only float32 and float64 are supported), you can call np.dot in Numba or in NumPy (see the sketch below). – max9111 Feb 25 '20 at 09:35
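To illustrate max9111's last point, below is a minimal sketch (not from the original post) of calling np.dot from inside a compiled function. In nopython mode, Numba dispatches np.dot on contiguous float32/float64 arrays to the same BLAS routine NumPy uses, so this keeps BLAS speed while staying inside Numba; note that Numba's np.dot support requires SciPy to be installed, and the function name matmul_blas here is just illustrative.

import time
import numpy as np
from numba import njit

@njit
def matmul_blas(matrix1, matrix2):
    # In nopython mode, np.dot on 2-D float64 arrays is dispatched
    # to BLAS (dgemm), the same routine NumPy itself calls.
    return np.dot(matrix1, matrix2)

M = np.random.normal(0, 10, (3000, 1000))**2
N = np.random.normal(0, 10, (1000, 3000))**2
matmul_blas(M, N)  # warm-up call so compilation is not timed

t = time.time()
matmul_blas(M, N)
print("Numba + BLAS:", time.time() - t)

t = time.time()
np.dot(M, N)
print("NumPy:", time.time() - t)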

0 Answers