
Why are dgemm and sgemm much slower (~200x) than numpy's dot? Is this expected and normal?

The following is the code I use to test:

from scipy.linalg import blas
import numpy as np
import time


x2 = np.zeros((1000000, 512))
x1 = np.zeros((1, 512))

t1 = time.time()
for i in range(10):
    np.dot(x1, x2.T)
t2 = time.time()
print("np.dot: ", t2-t1)
t1 = time.time()
for i in range(10):
    blas.dgemm(alpha=1.0, a=x1, b=x2, trans_b=True)
t2 = time.time()
print("dgemm: ", t2-t1)
t1 = time.time()
for i in range(10):
    blas.sgemm(alpha=1.0, a=x1, b=x2, trans_b=True)
t2 = time.time()
print("sgemm: ", t2-t1)

The result I got is:

np.dot:  0.1820526123046875
dgemm:  34.11782765388489
sgemm:  25.33052659034729
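For reference, the arrays in the benchmark above are created in NumPy's default C (row-major) order, which can be checked via `.flags` (a quick sketch with a smaller stand-in array, not part of the timed benchmark):

```python
import numpy as np

x2 = np.zeros((1000, 512))  # smaller stand-in for the (1000000, 512) array
print(x2.flags['C_CONTIGUOUS'])  # True: NumPy defaults to row-major (C) order
print(x2.flags['F_CONTIGUOUS'])  # False: not column-major (Fortran) order
```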

The following is my scipy's config which shows that it is compiled with OpenBLAS:

>>> import scipy
>>> scipy.__config__.show()
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_mkl_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

The following is my numpy config which is largely the same as scipy's:

>>> import numpy
>>> numpy.__config__.show()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

Am I using it wrong?

  • The overhead of calling a BLAS function through f2py is the culprit, I think. Make your arrays Fortran-contiguous to start with, otherwise a copy of the array is created to pass to GEMM. Also make your data type float for dgemm, not int – percusse Jun 04 '18 at 08:19
  • @percusse Why make it int? I use np.zeros just as a dummy variable. I am dealing with floating-point matrices. –  Jun 04 '18 at 08:45
  • I mean **not** int. – percusse Jun 04 '18 at 08:49
  • If I add `order='F'` to the zeros calls, dgemm starts to win. sgemm would be slower anyway, since the data has to be converted to single precision internally. – percusse Jun 04 '18 at 14:09
  • @percusse I can't find the source code for numpy.dot. Is it using the same scipy.blas.gemm functions behind the scenes? The speeds are very close, which makes me suspect that the difference is in the function-call overhead. –  Jun 05 '18 at 04:01
  • Could you please check what numpy is linked against? I assume you use the packaged OpenBLAS? Also, have you looked at how sgemv and dgemv perform in comparison? – Kaveh Vahedipour Jun 18 '18 at 11:10
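Following percusse's suggestion in the comments, a minimal sketch of the dgemm timing with Fortran-ordered inputs (the row count is reduced here so the sketch runs quickly; the label string is mine):

```python
import time
import numpy as np
from scipy.linalg import blas

# Fortran-contiguous arrays, as suggested in the comments: the f2py-wrapped
# GEMM can use them directly instead of copying them on every call.
x2 = np.zeros((100000, 512), order='F')  # reduced from 1000000 rows
x1 = np.zeros((1, 512), order='F')

t1 = time.time()
for _ in range(10):
    blas.dgemm(alpha=1.0, a=x1, b=x2, trans_b=True)  # computes x1 @ x2.T
t2 = time.time()
print("dgemm with order='F':", t2 - t1)
```

With both inputs Fortran-contiguous, percusse reports that dgemm "starts to win", i.e. it is no longer orders of magnitude slower than np.dot.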

0 Answers