
I have written a NumPy program which is very time-consuming. After profiling it, I found that most of the time is spent in `numpy.einsum`.

Although NumPy wraps LAPACK/BLAS, I don't know whether `numpy.einsum`'s performance is comparable to that of its counterparts in LAPACK or BLAS.

So, will I get a significant performance improvement if I switch to Fortran or C?

atbug
  • I am not an expert on this, but I was once told the following: you lose some performance by using a wrapper instead of the original code (C or Fortran or whatever). On the other hand, the wrapped code is usually highly optimized, so you would have to do a lot of optimization of your own to beat the wrapper's performance. – Glostas Feb 19 '16 at 09:07
  • [`einsum` *is* implemented in C](https://github.com/numpy/numpy/blob/e2805398f9a63b825f4a2aab22e9f169ff65aae9/numpy/core/src/multiarray/multiarraymodule.c#L2708). That said, `einsum` offers flexibility at the cost of some performance. If what you're doing can be expressed as a call to a BLAS routine instead then I would generally expect the BLAS call to be faster. – ali_m Feb 19 '16 at 12:32
  • @ali_m So `einsum` is not calling BLAS? What else would you suggest? – atbug Feb 19 '16 at 13:12
  • BLAS only supports certain primitive operations between two vectors or matrices in a specific order, while `einsum` is very flexible. `einsum` does use BLAS whenever possible (you can review the code linked by @ali_m), but it won't automatically rewrite an expression to use only BLAS even when that's possible; that's up to the coder to determine. If switching to Fortran/C means creating a ufunc for it, then you may get a bit of improvement from fewer internal function calls, but note that `einsum` is already written in C, and already uses BLAS for some operations. – syockit Feb 19 '16 at 13:50
  • Sorry, [this is probably the more relevant piece of code](https://github.com/numpy/numpy/blob/e2805398f9a63b825f4a2aab22e9f169ff65aae9/numpy/core/src/multiarray/einsum.c.src). @syockit Are you sure about that? I can't see any evidence of `einsum` using BLAS calls, even in very simple cases such as straight matrix-matrix or matrix-vector multiplication. – ali_m Feb 19 '16 at 14:16
  • @ali_m Oh, was I actually looking at different code? I stand corrected then. No BLAS, though `einsum` does have lots of SSE optimization, which the coder will at least have to consider when writing a specialized ufunc. – syockit Feb 19 '16 at 14:26
  • @ali_m You're right, I was looking at a different part of the code and made a hasty assumption. – syockit Feb 19 '16 at 14:43
  • @ali_m How can I check whether a function uses BLAS or not? Can you show me how to confirm from the source code that `numpy.dot` uses BLAS? – atbug Feb 20 '16 at 05:53

1 Answer


NumPy wraps BLAS only for the primitive operations that BLAS specifies. That includes `dot`, `inner`, `vdot`, `matmul` (new in 1.10), and the functions that depend on them (`tensordot` etc.). `einsum`, on the other hand, only calls BLAS for operations that can fall back to it (as of NumPy 1.14.0).
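As a rough illustration (the array shapes here are arbitrary), the BLAS-backed primitives above all compute the same matrix product, while `einsum` goes through its own C machinery:

```python
import numpy as np

a = np.random.rand(200, 300)
b = np.random.rand(300, 100)

# These dispatch to BLAS (e.g. dgemm) for float arrays:
c1 = np.dot(a, b)
c2 = a @ b                       # matmul, NumPy >= 1.10
c3 = np.tensordot(a, b, axes=1)  # reshapes + dot under the hood

# einsum computes the same product with its own C iterator;
# whether it can fall back to BLAS depends on the NumPy version
# and the optimize= argument:
c4 = np.einsum('ij,jk->ik', a, b)

assert np.allclose(c1, c4)
```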

If your problem can be decomposed into several BLAS operations, I suggest you first try that in NumPy itself. It may require some temporary arrays in between (that would still be the case even if you wrote C/Fortran code that uses BLAS). You can eliminate some of the array-creation overhead by using the `out=` parameter of those functions.
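A minimal sketch of that pattern, assuming your computation decomposes into a chain of two matrix products (the shapes are made up for illustration):

```python
import numpy as np

a = np.random.rand(400, 400)
b = np.random.rand(400, 400)
c = np.random.rand(400, 400)

# Preallocate the intermediate and the result once, so each BLAS call
# writes into an existing buffer instead of allocating a fresh array:
tmp = np.empty((400, 400))
out = np.empty((400, 400))
np.dot(a, b, out=tmp)
np.dot(tmp, c, out=out)
```

When such a pipeline runs many times (e.g. inside a loop), reusing `tmp` and `out` avoids repeated allocation.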

Most of the time, though, you're using `einsum` precisely because the operation isn't expressible in BLAS. Here's a simple example:

import numpy as np

a = np.arange(60.).reshape(3, 4, 5)
b = np.arange(24.).reshape(4, 3, 2)
c = np.einsum('ijk,jil->kl', a, b)

To express the above in primitive operations, you need to swap the first two axes of `b`, do an element-wise multiplication over the first two dimensions, and then sum the result, for each pair of indices k and l:

c2 = np.empty((5, 2))
b2 = np.swapaxes(b, 0, 1)  # b2[i, j, l] == b[j, i, l]

def manualeinsum(c2, a, b2):
    nk, nl = c2.shape
    for k in range(nk):
        for l in range(nl):
            # element-wise product over the (i, j) axes, then a full sum
            c2[k, l] = np.sum(a[..., k] * b2[..., l])

manualeinsum(c2, a, b2)

You can't BLAS that. UPDATE: Actually, the above problem *can* be expressed as a matrix multiplication, which BLAS can accelerate; see the comment by @ali_m below. For large enough arrays, the BLAS approach is faster.
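As a sanity check, ali_m's reformulation from the comments can be run against the `einsum` result for the example above:

```python
import numpy as np

a = np.arange(60.).reshape(3, 4, 5)
b = np.arange(24.).reshape(4, 3, 2)

c_einsum = np.einsum('ijk,jil->kl', a, b)

# Flatten the reduced (i, j) axes into a single axis on each operand;
# the contraction then becomes a plain matrix product that BLAS handles:
c_blas = a.T.reshape(a.shape[-1], -1).dot(b.reshape(-1, b.shape[-1]))

assert np.allclose(c_einsum, c_blas)
```

This works because `a.T.reshape(...)` and `b.reshape(...)` flatten the (j, i) pairs in the same order on both operands, so the summed axis lines up.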

Meanwhile, note that `einsum` itself is written in C: it creates a dimension-specific iterator for the given indices and is also optimized for SSE.

Nico Schlömer
syockit
  • *"You can't BLAS that"* - sure you can! Just flatten out the dimensions you want to reduce over: `a.T.reshape(a.shape[-1], -1).dot(b.reshape(-1, b.shape[-1]))` – ali_m Feb 19 '16 at 16:22
  • @ali_m Right, why didn't I think of that? Maybe I should review the problem again. – syockit Feb 19 '16 at 16:27
  • @ali_m Hmm, the examples in the documentation for `einsum` are all expressible in BLAS. Are there any counter-examples for `einsum` of 2 arrays? Things sure get more complicated with 3 arrays or more, but I haven't established if they really are not decomposable. – syockit Feb 19 '16 at 18:04
  • Well there are lots of cases that can't be expressed using BLAS calls without generating a copy of one or more of the inputs. For example you might want to reduce over non-adjacent dimensions, e.g. `np.einsum('ijk,ilj->kl', ...)`. If memory is limiting then the copy might be sufficiently expensive to ensure that `einsum` beats BLAS. `einsum` is also vastly superior to `dot` for integer or boolean matrices (since `dot` can't leverage BLAS for this). – ali_m Feb 19 '16 at 19:15
  • @syockit, I found my codes actually slower when using `tensordot`. Is this really possible? – atbug Feb 20 '16 at 06:06
  • @ChongWang Compared to `einsum`? That could happen; see @ali_m's comment above yours. What `tensordot` does is transpose the arrays according to the given axes, flatten them, and do a `dot`. There is a slight overhead for checking the validity of the input and figuring out the shapes of the arrays, but it should be minimal. What are the subscripts involved in your summation? – syockit Feb 21 '16 at 09:09
  • @syockit, very simple. It's 'k,ijk->ij'. I would have used `tensordot` if I knew this function at that time. – atbug Feb 21 '16 at 11:57
  • @ChongWang Looks simple enough. Can you `dot` the second array with the first one and see if it gets what you want? – syockit Feb 21 '16 at 14:57
  • @syockit, `dot` needs 12.4s and `einsum` needs 12.5s for a very small case. `einsum`'s performance is good then. – atbug Feb 22 '16 at 09:14