Long story short: I built a simple matrix-multiplication function in Cython that calls scipy.linalg.cython_blas.dgemm, compiled it, and benchmarked it against numpy.dot. I have heard the myth of 50% to 100x performance gains from tricks like static typing, preallocating array dimensions, memoryviews, turning off checks, etc. But my own my_dot function (after compilation) is about 4 times slower than the default numpy.dot. I don't really know the reason, so I can only offer a few guesses:

1) The BLAS library is not linked

2) There might be some memory overhead that I didn't catch

3) dot is using some hidden magic

4) A badly written setup.py, so the C code is not optimally compiled

5) My my_dot function is not efficiently written

Below are my code snippets and all the relevant information I can think of that may help solve this puzzle. I would appreciate it if anyone could provide some insight into what I did wrong, or how to boost the performance to at least be on par with the default numpy.dot.

File 1: model_cython/multi.pyx. You will also need an (empty) model_cython/__init__.py in the same folder.

#cython: language_level=3 
#cython: boundscheck=False
#cython: nonecheck=False
#cython: wraparound=False
#cython: infer_types=True
#cython: initializedcheck=False
#cython: cdivision=True
#distutils: extra_compile_args = -Wno-unused-function -Wno-unneeded-internal-declaration


from scipy.linalg.cython_blas cimport dgemm
import numpy as np
from numpy cimport ndarray, float64_t
from numpy cimport PyArray_ZEROS
cimport numpy as np
cimport cython

np.import_array()
ctypedef float64_t DOUBLE

def my_dot(double [::1, :] a, double [::1, :] b, int ashape0, int ashape1, 
        int bshape0, int bshape1):
    cdef np.npy_intp cshape[2]
    cshape[0] = <np.npy_intp> ashape0
    cshape[1] = <np.npy_intp> bshape1

    cdef:
        int FORTRAN = 1
        ndarray[DOUBLE, ndim=2] c = PyArray_ZEROS(2, cshape, np.NPY_DOUBLE, FORTRAN)

    cdef double alpha = 1.0
    cdef double beta = 0.0
    dgemm("N", "N", &ashape0, &bshape1, &ashape1, &alpha, &a[0,0], &ashape0, &b[0,0], &bshape0, &beta, &c[0,0], &ashape0)
    return c

File 2: model_cython/example.py. The script that does the benchmark test

setup_str = """
import numpy as np
from numpy import float64
from multi import my_dot

a = np.ones((2,3), dtype=float64, order='F')
b = np.ones((3,2), dtype=float64, order='F')
print(a.flags)
ashape0, ashape1 = a.shape
bshape0, bshape1 = b.shape
"""
import timeit
print(timeit.timeit(stmt='c=my_dot(a,b, ashape0, ashape1, bshape0, bshape1)', setup=setup_str, number=100000))
print(timeit.timeit(stmt='c=a.dot(b)', setup=setup_str, number=100000))
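As a quick sanity check before timing (a minimal sketch, assuming the module builds as above), you can verify that my_dot agrees with numpy:

import numpy as np
from multi import my_dot

# Illustrative check only, not part of the benchmark: compare against numpy's result.
a = np.ones((2, 3), dtype=np.float64, order='F')
b = np.ones((3, 2), dtype=np.float64, order='F')
assert np.allclose(my_dot(a, b, 2, 3, 3, 2), a.dot(b))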

File 3: setup.py. Compiles the .so file.

from distutils.core import setup, Extension
from Cython.Build import cythonize
from Cython.Distutils import build_ext
import numpy 
import os
basepath = os.path.dirname(os.path.realpath(__file__))
numpy_path = numpy.get_include()
package_name = 'multi'
setup(
        name='multi',
        cmdclass={'build_ext': build_ext},
        ext_modules=[Extension(package_name, 
            [os.path.join(basepath, 'model_cython', 'multi.pyx')], 
            include_dirs=[numpy_path],
            )],
        )
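If guess 4 (a suboptimally compiled extension) is a concern, one possible variant is to pass optimization flags via extra_compile_args on the Extension. This is only a sketch, not the setup actually used above, and the flags are generic examples:

from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy

# Sketch: same extension, but with explicit optimization flags.
ext = Extension('multi',
                ['model_cython/multi.pyx'],
                include_dirs=[numpy.get_include()],
                extra_compile_args=['-O3'])   # '-march=native' is another common candidate

setup(name='multi', ext_modules=cythonize(ext))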

File 4: run.sh. Shell script that executes setup.py and moves things around

python3 setup.py build_ext --inplace
path=$(pwd)
rm -r build
mv $path/multi.cpython-37m-darwin.so $path/model_cython/
rm $path/model_cython/multi.c

Below is a screenshot of the compilation message:

[screenshot: clang compilation output]

And regarding BLAS: my NumPy is properly linked to it at /usr/local/lib, and clang -bundle also seems to add -L/usr/local/lib when compiling. But maybe that's not enough?
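For what it's worth, one quick way to confirm which BLAS NumPy itself is linked against (a generic check, nothing specific to this setup) is:

import numpy as np

# Prints the BLAS/LAPACK libraries and library_dirs NumPy was built against.
np.show_config()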

  • You're dealing with tiny tiny arrays. I suspect the majority of the runtime is just going to be taken up by the Python call and validating your types, not the computation you're doing. (You could also get the sizes inside your function if you wanted rather than passing them in, although I doubt that gives a speed change) – DavidW Dec 22 '19 at 13:12
  • That was my original thought, but then `numpy.dot` is supposed to do the same thing. When I pass `a` and `b` to `dot`, I didn't tell it what to expect; for `my_dot`, at least I provided the dimension information. I don't know how to simplify my function further without making the comparison unfair. The actual program I am writing is a loop over a tensor of shape (small n * small m * big S), so I am trying to optimise my calculation for small matrices. Another thing: this is Cython with static types, so I don't think validating types is that much of an issue here. – fnosdy Dec 22 '19 at 14:47
  • But on each call to `my_dot` Cython has to check the types are what you'd expect since it's receiving them from Python and can't trust that they're correct. You'd be better off using a `cdef` function (so you can call it quickly _from Cython_) and doing the loop over the big S within Cython – DavidW Dec 22 '19 at 15:35
  • But that is the point, right? `numpy.dot` also has to do the same check under the hood (I think). But overall, what is the best way to compare apples to apples? I am not trying to get the crazy 100x gain, given the simplicity of the example, but I do hope I am at least not making matrix multiplication worse, so that when I am writing a `.pyx` replacement for a `.py`, I can replace `dot` with `dgemm` without fear of underperformance in the matrix multiplication part. For reference, many high-performance packages use `dgemm` instead of `dot` in their `.pyx`, so it has to perform better. – fnosdy Dec 22 '19 at 15:48
  • My point is: essentially _all_ you're measuring is the checks/overhead, and if Numpy is doing it in a more efficient way then that's what you see. Like I said before a better test would be to do the tensor with the full loop in Cython. In that case you might well see a speed-up since the checks/overhead would form a tiny part of the test. Python is slow when you're doing lots of small operations from Python, however you do it (which is what you're seeing here) – DavidW Dec 22 '19 at 16:46
  • Thanks. I don't know the workings of Python/Cython/C well enough to agree or disagree with your assessment, but I totally agree that I can spread out the overhead by doing the full tensor. I posted the question because I wanted to avoid potential pitfalls in doing matrix operations in `cython`, so that the performance gain from doing things in `c` is not compromised by bad practice in how I handle matrix multiplication. Thanks again for your input, I will try that. – fnosdy Dec 22 '19 at 19:15
  • 1) With this tiny function you are measuring basically function-call overhead. Large speedups are only possible when doing this on larger 3D arrays. 2) A BLAS call for (2,3)x(3,2) is usually slower than an inlined computation. 3) Think of extra compile args like -march=native and -ffast-math. Have a look at https://stackoverflow.com/a/59356461/4045774 – max9111 Dec 23 '19 at 09:19
  • Thanks for the comment and the attached link! My original intent was to make an apples-to-apples comparison, so I figured it's only fair to do the type validation in my own function (since `dot` does it). I am not sure what you mean by inlined computation, but I guess my code is not an apples-to-apples comparison after all. – fnosdy Dec 23 '19 at 10:47
  • For many small arrays you can do a lot faster than the answer from @DavidW, einsum, or the @ operator. E.g. with `A=np.random.rand(10_000,3,2)` and `B=np.random.rand(10_000,2,3)` you get `dot323(A,B)` (generated with my `gen_dot_nm(3,2,3)`) -> 18.3 µs, instead of `np.einsum("xik,xkj->xij",A,B)` -> 3.99 ms, `A@B` -> 3.92 ms, or `np.dot(A,B)` -> 24.8 s. – max9111 Dec 23 '19 at 11:11
  • @max9111 thanks for the input. I read through your answer in that link and have a question regarding Numba. My project operates on the tensor sequentially, i.e. the input of X[:,:,k] depends on the output of X[:,:,k-1], and there are many operations in between. If I use your function, I assume I will have something like this: `def proj(): for loop: xxx; c=my_prod22(); xxx`, and call it many times in Python. Does it still maintain the same performance you mentioned? In Cython, I can just wrap everything up in `cdef` and compile it. – fnosdy Dec 23 '19 at 13:11

1 Answer

Cython is good at optimizing loops (which are often slow in Python), and is also a convenient way of calling C (which is what you want to do). However, calling a Cython function from Python can be relatively slow - especially since all the types you've specified need to be checked for consistency. Therefore you'd typically try to hide a big chunk of work behind one Cython call so the overhead is small.

You've picked almost the worst case: a tiny chunk of work behind a lot of calls. It's fairly arbitrary whether Cython or np.dot will have more overhead, but either way that overhead is what you're measuring, not np.dot vs the BLAS dgemm.
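As a rough illustration of the overhead effect (a sketch with arbitrarily chosen sizes, purely for demonstration):

import numpy as np
import timeit

# Time np.dot on a tiny product vs. a much larger one. The tiny case is almost
# entirely call overhead, so the gap in runtime is far smaller than the gap in
# floating-point work (12 multiply-adds vs. a million).
small = (np.ones((2, 3), order='F'), np.ones((3, 2), order='F'))
large = (np.ones((100, 100), order='F'), np.ones((100, 100), order='F'))

for label, (a, b) in (("2x3 . 3x2", small), ("100x100 . 100x100", large)):
    t = timeit.timeit(lambda: a.dot(b), number=10000)
    print(label, t, "s")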

From your comments it looks like you actually want to do a dot product of the first two dimensions of two 3D arrays. A more useful test is therefore to try to reproduce that. Here are three versions:

# These live in a .pyx file; imports needed for the snippets below:
import numpy as np
from scipy.linalg.cython_blas cimport dgemm

def einsum_mult(a,b):
    # use np.einsum, won't benefit from Cython
    return np.einsum("ijh,jkh->ikh",a,b)

def manual_mult(a,b):
    # multiply one matrix at a time with numpy dot
    # (could probably be optimized a bit with Cython)
    c = np.empty((a.shape[0],b.shape[1],a.shape[2]),
                 dtype=np.float64, order='F')
    for n in range(a.shape[2]):
        c[:,:,n] = a[:,:,n].dot(b[:,:,n])
    return c

def blas_version(double[::1,:,:] a,double[::1,:,:] b):
    # uses dgemm
    cdef double[::1,:,:] c = np.empty((a.shape[0], b.shape[1], a.shape[2]),
                                      dtype=np.float64, order='F')
    cdef double[::1,:] c_part
    cdef int n
    cdef double alpha = 1.0
    cdef double beta = 0.0
    cdef int ashape0 = a.shape[0], ashape1 = a.shape[1], bshape0 = b.shape[0], bshape1 = b.shape[1]

    assert a.shape[2]==b.shape[2]
    assert a.shape[1]==b.shape[0]

    for n in range(a.shape[2]):
        c_part = c[:,:,n]
        dgemm("N", "N", &ashape0, &bshape1, &ashape1, &alpha, &a[0,0,n], &ashape0, 
              &b[0,0,n], &bshape0, &beta, &c_part[0,0], &ashape0)
    return c

With arrays of size (2,3,10000) and (3,2,10000), 100 repetitions, I get:

manual_mult 1.6531286190001993 s    (i.e. quite bad)
einsum 0.3215398370011826 s         (pretty good)
blas_version 0.15762194800481666 s  (best, pretty close to the "myth" performance gain you mention)

The BLAS version is fast if you use Cython to your advantage and keep the loops within the compiled code. (I've spent no effort on optimizing this so you could probably beat it if you tried, but it's just supposed to illustrate the point)
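For reference, a driver along these lines reproduces the timing setup described above (a sketch: the module name mult3d and the random inputs are assumptions, not part of the original test):

import numpy as np
import timeit

# Assumes the three functions above were compiled into a module named mult3d.
from mult3d import einsum_mult, manual_mult, blas_version

a = np.asfortranarray(np.random.rand(2, 3, 10000))
b = np.asfortranarray(np.random.rand(3, 2, 10000))

for f in (manual_mult, einsum_mult, blas_version):
    print(f.__name__, timeit.timeit(lambda: f(a, b), number=100), "s")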

DavidW
  • Thanks for the excellent answer! Yeah, this is exactly what I wanted to do in the end. The reason I built a small `my_dot` first is that I've never done Cython before, so I just wanted to make sure I don't do stupid things in the most computationally intensive operations: do a modular test on components, and check things one at a time. But I guess I did stupid things after all, lol. Thanks again for your answer. – fnosdy Dec 23 '19 at 10:52
  • And do you mind sharing your `setup.py` and the preamble of your `.pyx` code (e.g. those `#cython: xxxx` directives)? I've heard that how the code is compiled also makes a difference in performance. Just want to learn from someone experienced. – fnosdy Dec 23 '19 at 10:55
  • You'll be disappointed. I just ran at the command line: `cythonize -i filename.pyx` (to build in place). I suspect that's wrong - I should be linking with BLAS, but because I import Numpy first I got away with it. It was only written as a quick demo... – DavidW Dec 23 '19 at 11:03
  • I used absolutely no `#cython` directives. I think turning off `boundscheck` and `wraparound` might help here; I don't think any other directives would make much, if any, difference. My view is that you should make the compile directives as local and as minimal as possible, and use them only when you need them (i.e. only apply `boundscheck` when you know it does something and when you've put in things like my `assert` statements to ensure it can't lead to a crash). – DavidW Dec 23 '19 at 11:04