Long story short: I built a simple matrix-multiplication function in Cython that invokes `scipy.linalg.cython_blas.dgemm`, compiled it, and benchmarked it against `numpy.dot`. I had heard about the 50% to 100x performance gains you supposedly get from tricks like static typing, preallocating array dimensions, memory views, turning off checks, etc. But my own `my_dot` function (after compilation) turned out to be about 4 times slower than the default `numpy.dot`. I don't really know the reason, so I can only offer a few guesses:
1. The BLAS library is not linked.
2. There is some memory overhead that I didn't catch.
3. `dot` is using some hidden magic.
4. `setup.py` is badly written and the C code is not optimally compiled.
5. My `my_dot` function is not efficiently written.
Below are my code snippets and all the relevant information I can think of that might help solve this puzzle. I would appreciate it if anyone could point out what I did wrong, or how to boost the performance to at least on par with the default `numpy.dot`.
File 1: `model_cython/multi.pyx`. You will also need a `model_cython/__init__.py` in the folder as well.
```cython
#cython: language_level=3
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False
#cython: infertypes=True
#cython: initializedcheck=False
#cython: cdivision=True
#distutils: extra_compile_args = -Wno-unused-function -Wno-unneeded-internal-declaration
from scipy.linalg.cython_blas cimport dgemm
import numpy as np
from numpy cimport ndarray, float64_t
from numpy cimport PyArray_ZEROS
cimport numpy as np
cimport cython

np.import_array()

ctypedef float64_t DOUBLE

def my_dot(double[::1, :] a, double[::1, :] b, int ashape0, int ashape1,
           int bshape0, int bshape1):
    cdef np.npy_intp cshape[2]
    cshape[0] = <np.npy_intp> ashape0
    cshape[1] = <np.npy_intp> bshape1
    cdef:
        int FORTRAN = 1
        ndarray[DOUBLE, ndim=2] c = PyArray_ZEROS(2, cshape, np.NPY_DOUBLE, FORTRAN)
    cdef double alpha = 1.0
    cdef double beta = 0.0
    dgemm("N", "N", &ashape0, &bshape1, &ashape1, &alpha, &a[0,0], &ashape0,
          &b[0,0], &bshape0, &beta, &c[0,0], &ashape0)
    return c
```
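As a sanity check on guesses 1) and 3), I know the same `dgemm` routine is also reachable from plain Python via `scipy.linalg.blas`; comparing its result against `numpy.dot` separates "wrong BLAS call" from "per-call overhead" (a sketch, not part of my project files):

```python
import numpy as np
from scipy.linalg import blas

# Same inputs as in the benchmark: Fortran-ordered double arrays.
a = np.ones((2, 3), order='F')
b = np.ones((3, 2), order='F')

# Python-level wrapper around the very dgemm that cython_blas exposes.
c = blas.dgemm(alpha=1.0, a=a, b=b)

# If this holds, the dgemm call itself is correct; any slowdown is overhead.
assert np.allclose(c, a.dot(b))
```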
File 2: `model_cython/example.py`. The script that runs the benchmark:
```python
setup_str = """
import numpy as np
from numpy import float64
from multi import my_dot

a = np.ones((2,3), dtype=float64, order='F')
b = np.ones((3,2), dtype=float64, order='F')
print(a.flags)
ashape0, ashape1 = a.shape
bshape0, bshape1 = b.shape
"""

import timeit
print(timeit.timeit(stmt='c=my_dot(a, b, ashape0, ashape1, bshape0, bshape1)',
                    setup=setup_str, number=100000))
print(timeit.timeit(stmt='c=a.dot(b)', setup=setup_str, number=100000))
```
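One thing I also tried to quantify (a sketch; the sizes are arbitrary): at 2x3, almost all of each call is fixed per-call overhead rather than floating-point work, so the benchmark above may mostly be measuring that overhead:

```python
import timeit
import numpy as np

# Tiny matrices, as in the benchmark, vs. a size where the flops dominate.
small_a = np.ones((2, 3), order='F')
small_b = np.ones((3, 2), order='F')
big_a = np.ones((500, 500), order='F')
big_b = np.ones((500, 500), order='F')

# Per-call time for the 2x3 product is almost pure Python-call overhead;
# the arithmetic itself is a dozen multiply-adds.
t_small = timeit.timeit(lambda: small_a.dot(small_b), number=10000)
t_big = timeit.timeit(lambda: big_a.dot(big_b), number=10)

print(f"2x3 dot:     {t_small / 10000:.2e} s/call")
print(f"500x500 dot: {t_big / 10:.2e} s/call")
```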
File 3: `setup.py`. Compiles the `.so` file:
```python
from distutils.core import setup, Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy
import os

basepath = os.path.dirname(os.path.realpath(__file__))
numpy_path = numpy.get_include()
package_name = 'multi'

setup(
    name='multi',
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension(package_name,
                           [os.path.join(basepath, 'model_cython', 'multi.pyx')],
                           include_dirs=[numpy_path],
                           )],
)
```
File 4: `run.sh`. Shell script that runs `setup.py` and moves things around:
```bash
python3 setup.py build_ext --inplace
path=$(pwd)
rm -r build
mv $path/multi.cpython-37m-darwin.so $path/model_cython/
rm $path/model_cython/multi.c
```
Below is a screenshot of the compilation message:
And regarding BLAS: my NumPy is properly linked to it at `/usr/local/lib`, and the `clang -bundle` step also seems to add `-L/usr/local/lib` when compiling. But maybe that's not enough?
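For completeness, this is how I checked which BLAS my NumPy links against (it only reports NumPy's own build configuration, not what my extension actually picked up at link time):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries and library_dirs NumPy was built with;
# the paths listed here are where I see /usr/local/lib.
np.show_config()
```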