
I wrote a Cython program that calls Intel MKL for matrix multiplication, with the goal of making it parallel. It was based on an old SO post linking to BLAS and used a number of Cython techniques I'd never seen before, but I got it working, and it was much slower than NumPy (which is also linked against MKL). To speed it up, I switched to the typical memoryview format (it had been using the ndarray np.float64_t datatype for a couple of operations). But now it no longer compiles with double[::1] memoryviews. Here's the error generated:

'type cast': cannot convert from '__Pyx_memviewslice' to 'double *'

Because the cast fails, the MKL function then sees only 3 of the 5 arguments:

error C2660: 'cblas_ddot': function does not take 3 arguments

Here is the .PYX code:

import numpy as np
cimport numpy as np
cimport cython
from cython cimport view
from cython.parallel cimport prange     #this is your OpenMP portion
from openmp cimport omp_get_max_threads #only used for getting the max # of threads on the machine 

cdef extern from "mkl_cblas.h" nogil: #import a function from Intel's MKL library
    double ddot "cblas_ddot"(int N,
                             double *X, 
                             int incX,
                             double *Y, 
                             int incY)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef matmult(double[:,::1] A, double[:,::1] B):
    cdef int Ashape0=A.shape[0], Ashape1=A.shape[1], Bshape0=B.shape[0], Bshape1=B.shape[1], Arowshape0=A[0,:].shape[0] #these are defined here as they aren't allowed in a prange loop

    if Ashape1 != Bshape0:
        raise TypeError('Inner dimensions are not consistent!')

    cdef int i, j
    cdef double[:,::1] out = np.zeros((Ashape0, Bshape1))
    cdef double[::1] A_row = np.zeros(Ashape1) #a row of A has Ashape1 elements
    cdef double[:] B_col = np.zeros(Bshape0) #a column of B has Bshape0 elements; no idea why this is not allowed to be [::1]
    cdef int Arowstrides = A_row.strides[0] // sizeof(double)
    cdef int Bcolstrides = B_col.strides[0] // sizeof(double)
    cdef int maxthreads = omp_get_max_threads()

    for i in prange(Ashape0, nogil=True, num_threads=maxthreads, schedule='static'): # to use all cores

        A_row = A[i,:]
        for j in range(Bshape1):
            B_col = B[:,j]
            out[i,j] = ddot(Arowshape0, #call the imported Intel MKL library
                            <double*>A_row,
                            Arowstrides, 
                            <double*>B_col,
                            Bcolstrides) 

    return np.asarray(out)

I'm sure this is easy for someone on SO to point out. Please also advise if you see where improvements can be made; this was hacked and chopped together, and I don't think the i / j loops are even needed. The cleanest example around, https://gist.github.com/JonathanRaiman/f2ce5331750da7b2d4e9, which I finally compiled, is actually much faster (2x) but gives no results, so I'll put that in another post (here: Calling BLAS / LAPACK directly using the SciPy interface and Cython - also how to add MKL).
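
For what it's worth, here is a minimal sketch of that SciPy route (my own illustration, not the gist's code; it assumes SciPy is built against MKL, the helper name dot_1d is made up, and it uses the Fortran-style scipy.linalg.cython_blas signature, where every argument is passed by pointer):

from scipy.linalg.cython_blas cimport ddot

cpdef double dot_1d(double[::1] x, double[::1] y):
    cdef int n = x.shape[0]
    cdef int inc = 1
    # &x[0] takes the address of the first element, yielding a double* from the memoryview
    return ddot(&n, &x[0], &inc, &y[0], &inc)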

Much appreciated.

Matt
  • MKL's dgemm function does matrix multiplication and is multi-threaded (see the sketch just after these comments) – user7138814 Jun 23 '17 at 20:42
  • @user7138814 I saw that as well. I saw your comment on my other Cython post - apparently I could use the same pointer based approach instead of the above??? – Matt Jun 23 '17 at 21:29
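
Following up on the dgemm suggestion above, here is a hedged sketch of letting MKL do the whole multiplication in one call (my own illustration, not code from the question; it assumes the LP64 interface where MKL_INT is a plain int, and a recent mkl_cblas.h where the layout enum is named CBLAS_LAYOUT; older headers call it CBLAS_ORDER):

import numpy as np

cdef extern from "mkl_cblas.h" nogil:
    ctypedef enum CBLAS_LAYOUT:
        CblasRowMajor
        CblasColMajor
    ctypedef enum CBLAS_TRANSPOSE:
        CblasNoTrans
        CblasTrans
        CblasConjTrans
    void cblas_dgemm(CBLAS_LAYOUT Layout, CBLAS_TRANSPOSE TransA, CBLAS_TRANSPOSE TransB,
                     int M, int N, int K,
                     double alpha, double *A, int lda,
                     double *B, int ldb,
                     double beta, double *C, int ldc)

cpdef matmult_dgemm(double[:,::1] A, double[:,::1] B):
    cdef int M = A.shape[0], K = A.shape[1], N = B.shape[1]
    if K != B.shape[0]:
        raise ValueError('Inner dimensions are not consistent!')
    cdef double[:,::1] C = np.zeros((M, N))
    # one dgemm call replaces the i / j loops; MKL threads it internally
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0, &A[0,0], K,
                &B[0,0], N,
                0.0, &C[0,0], N)
    return np.asarray(C)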

1 Answer


To get a pointer from a memoryview, you need to take the address of the first element:

ddot(Arowshape0, #call the imported Intel MKL library
     &A_row[0],
     Arowstrides,
     &B_col[0],
     Bcolstrides)
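
The same pattern works for any memoryview dimensionality; a minimal illustration (not from the original answer, assuming numpy has been imported as np and this sits inside a cdef/cpdef function):

cdef double[::1] v = np.zeros(10)
cdef double[:,::1] m = np.zeros((3, 4))
cdef double *pv = &v[0]     # pointer to the first element of a 1-D memoryview
cdef double *pm = &m[0,0]   # pointer to the first element of a 2-D memoryview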
DavidW
  • Are you saying then as well by passing the pointer I can get rid of all the loops above? That would be perfect. – Matt Jun 24 '17 at 14:10
  • I don't think so - you need the loops to make `A_row` and `B_col`. It's possible that some function other than `ddot` can do the whole thing though. – DavidW Jun 24 '17 at 16:11
  • Thanks I'm really just building out my Cython toolkit. – Matt Jun 24 '17 at 17:44
  • So the memoryview pointer on a `double[:,::1] myarray` is likewise `&myarray[0,0]` I seem to have figured out... if you have a chance to confirm. – Matt Jul 01 '17 at 21:47
  • Thanks @DavidW for confirming, I got more code working with `memset` instead of `np.zeros` (a minimal sketch follows below). – Matt Jul 02 '17 at 20:34
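
For reference, a minimal sketch of that memset idea (my own illustration; it zeroes an existing contiguous buffer in place instead of allocating a new array with np.zeros):

from libc.string cimport memset

cpdef zero_out(double[:,::1] out):
    # overwrite the whole contiguous buffer with zero bytes
    memset(&out[0,0], 0, out.shape[0] * out.shape[1] * sizeof(double))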