
I have compiled numpy 1.6.2 and scipy with MKL hoping to get better performance. Currently I have code that relies heavily on np.einsum(), and I was told that einsum does not benefit from MKL, because there is almost no vectorization. =( So I was thinking to rewrite some of my code with np.dot() and slicing, just to be able to get some multi-core speed-up. I really like the simplicity of np.einsum() and the readability is good too. Anyway, for example, I have a multi-dimensional matrix multiplication of the form:

np.einsum('mi,mnijqk->njqk',A,B)

So how do I transform something like this, or other 3-, 4- and 5-dimensional array multiplications, into efficient np.dot() operations that MKL can accelerate?
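For concreteness, the kind of reshape-plus-np.dot rewrite I mean for the first example would look roughly like this (just a sketch with made-up shapes; the sizes M, N, I, J, Q, K are placeholders, not from my real code):

import numpy as np

# Placeholder sizes, only to illustrate the pattern
M, N, I, J, Q, K = 4, 5, 6, 3, 2, 3
A = np.random.rand(M, I)
B = np.random.rand(M, N, I, J, Q, K)

# Bring the contracted axes (m, i) of B to the front and flatten them,
# so the contraction becomes a single BLAS call via np.dot
B2 = B.transpose(0, 2, 1, 3, 4, 5).reshape(M * I, N * J * Q * K)  # this reshape copies
out = np.dot(A.reshape(M * I), B2).reshape(N, J, Q, K)

print(np.allclose(out, np.einsum('mi,mnijqk->njqk', A, B)))  # True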

I will add more info: I am computing this equation:

$$\mathrm{DX\_h}_{njk} = \sum_{m}\sum_{i} a_{mn}\, e^{\,b_{mn} U_{ni}}\, P_{nij}\, P_{nik}\, X_{mi}$$

For doing this, I am using the code:

np.einsum('mn,mni,nij,nik,mi->njk',a,np.exp(b[:,:,np.newaxis]*U[np.newaxis,:,:]),P,P,X)

That is not that fast; the same thing coded in Cython is 5x faster:

# STACKOVERFLOW QUESTION:
from __future__ import division
import numpy as np
cimport numpy as np
cimport cython

cdef extern from "math.h":
    double exp(double x)

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)  # turn off bounds-checking for entire function
def cython_DX_h(np.ndarray[DTYPE_t, ndim=3] P, np.ndarray[DTYPE_t, ndim=2] a,
                np.ndarray[DTYPE_t, ndim=2] b, np.ndarray[DTYPE_t, ndim=2] U,
                np.ndarray[DTYPE_t, ndim=2] X, int I, int M, int N):
    assert P.dtype == DTYPE and a.dtype == DTYPE and b.dtype == DTYPE and U.dtype == DTYPE and X.dtype == DTYPE

    cdef np.ndarray[DTYPE_t, ndim=3] DX_h = np.zeros((N, I, I), dtype=DTYPE)
    cdef unsigned int j, n, k, m, i
    cdef double aux
    for n in range(N):
        for j in range(I):
            for k in range(I):
                aux = 0
                for m in range(M):
                    for i in range(I):
                        aux += a[m, n] * exp(b[m, n] * U[n, i]) * P[n, i, j] * P[n, i, k] * X[m, i]
                DX_h[n, j, k] = aux
    return DX_h

Is there a way to do this in pure Python with the performance of Cython? (I haven't been able to figure out how to tensordot this equation.) I also haven't been able to use prange in this Cython code; I get lots of gil and nogil errors.
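One pure-NumPy approach I can imagine for the full equation (a sketch under the assumption that a and b are (M, N), U is (N, I), P is (N, I, I) and X is (M, I), as in the loops above): contract over m first, and the remaining sum over i becomes a per-n matrix product that the threaded BLAS can handle:

import numpy as np

def DX_h_numpy(a, b, U, P, X):
    # Assumed shapes: a, b -> (M, N); U -> (N, I); P -> (N, I, I); X -> (M, I)
    N, I = U.shape
    E = np.exp(b[:, :, None] * U[None, :, :])      # E[m, n, i] = exp(b[m, n] * U[n, i])
    # Contract over m first: T[n, i] = sum_m a[m, n] * E[m, n, i] * X[m, i]
    T = np.einsum('mn,mni,mi->ni', a, E, X)
    # Remaining sum over i: DX_h[n, j, k] = sum_i T[n, i] * P[n, i, j] * P[n, i, k]
    DX_h = np.empty((N, I, I))
    for n in range(N):
        DX_h[n] = np.dot(P[n].T, T[n][:, None] * P[n])   # plain dot, goes through BLAS
    return DX_h

Whether this actually reaches the Cython speed is something I would have to benchmark.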

tcapelle
  • I was not aware that np.dot supports multiple processors. Could you tell me how you got this to work? – Magellan88 May 14 '14 at 09:26
  • @Magellan88 it depends on the BLAS library that you're linking. Some of them support multiple cores. – Midnighter May 14 '14 at 09:58
  • I have intel MKL compiled numpy – tcapelle May 14 '14 at 12:59
  • In case it helps, https://github.com/hpaulj/numpy-einsum/blob/master/einsum_py.py is a pure python rendition of `einsum`. The focus is on how `einsum` translates the 'ij' string into an `nditer` object. https://github.com/hpaulj/numpy-einsum/blob/master/sop.pyx is a Cython version of the sum-of-products calculation. – hpaulj May 26 '14 at 20:47
  • http://docs.scipy.org/doc/numpy-dev/reference/arrays.nditer.html is a nice tutorial in using `nditer` to collect all the iterations into one loop. – hpaulj May 27 '14 at 00:25
  • Should maybe mention that tensordot eats ram for breakfast (and stays hungry), while einsum does not – usethedeathstar Jun 17 '14 at 13:19
  • 1
    This is late, but in case anyone is still looking for a parallel einsum. I wrote this package [einsum2](https://github.com/jackkamm/einsum2) to do parallel implementation of some `einsum` operations. While your first example can be rewritten as `tensordot`, not all `einsum` operations can be, and it can be less readable to use `tensordot`. – jackkamm Oct 21 '16 at 19:56

1 Answer


Alternatively, you can use numpy.tensordot():

np.tensordot(A, B, axes=[[0, 1], [0, 2]])

which will also use multiple cores, like numpy.dot().
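A quick sanity check (with tiny made-up shapes) that the tensordot spelling matches the original einsum, contracting A's axes 0 and 1 against B's axes 0 and 2:

import numpy as np

# Tiny placeholder shapes, just to verify the two spellings agree
A = np.random.rand(3, 4)                # indices (m, i)
B = np.random.rand(3, 5, 4, 2, 3, 2)    # indices (m, n, i, j, q, k)

res_einsum = np.einsum('mi,mnijqk->njqk', A, B)
res_tensordot = np.tensordot(A, B, axes=[[0, 1], [0, 2]])
print(np.allclose(res_einsum, res_tensordot))  # True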

ali_m
Saullo G. P. Castro
  • is tensordot as flexible as einsum? – tcapelle May 14 '14 at 12:59
  • @tcapelle. I believe `einsum()` is more flexible ([like in this case](http://stackoverflow.com/a/23592642/832621)), but for most of the cases that I've seen `tensordot()` is applicable. If you need to multiply more than two arrays at once, you will probably need `einsum()`... – Saullo G. P. Castro May 14 '14 at 13:09
  • been trying to use cython prange for doing this in just 5 for loops, but haven't got it working yet =P – tcapelle May 14 '14 at 13:49
  • @Saulo Castro, could you give me a hand to tensordot this equation? – tcapelle May 15 '14 at 09:08
  • @tcapelle, I believe you could ask a new question with this edit. In that case, you should roll back your question to the previous revision. The change you've made is surely too big for an edit... – Saullo G. P. Castro May 15 '14 at 09:15
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/52731/discussion-between-saullo-castro-and-tcapelle) – Saullo G. P. Castro May 15 '14 at 09:16