I have an odd case where I can see numpy.einsum speeding up a computation, but I can't see the same speed-up in the output of numpy.einsum_path. I'd like to quantify/explain this speed-up, but I'm missing something somewhere...
In short, I have a matrix multiplication where only the diagonal of the final product is needed.
import numpy as np

a = np.arange(9).reshape(3, 3)
print('input array')
print(a)
print('normal method')
print(np.diag(a.dot(a)))
print('einsum method')
print(np.einsum('ij,ji->i', a, a))
which produces the output:
input array
[[0 1 2]
 [3 4 5]
 [6 7 8]]
normal method
[ 15  54 111]
einsum method
[ 15  54 111]
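As a sanity check on what 'ij,ji->i' actually computes (this is my own reformulation, not anything from the einsum internals), the same result comes from an elementwise product followed by a row sum:
# Element i of the result is sum_j a[i, j] * a[j, i], i.e. the dot product
# of row i of the first operand with column i of the second. Only these
# 3*3 products are ever needed; the full matrix product is not formed.
print((a * a.T).sum(axis=1))  # [ 15  54 111]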
When run on large matrices, numpy.einsum is substantially faster.
A = np.random.randn(2000, 300)
B = np.random.randn(300, 2000)
print('normal method')
%timeit np.diag(A.dot(B))
print('einsum method')
%timeit np.einsum('ij,ji->i', A, B)
which produces:
normal method
17.2 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
einsum method
1.02 ms ± 7.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
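For comparison, the elementwise reformulation above can be timed the same way; since it also avoids forming the full 2000x2000 product, I'd expect it to land much closer to the einsum timing than to the diag(A.dot(B)) one:
print('elementwise method')
%timeit (A * B.T).sum(axis=1)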
My intuition is that this speed-up is possible because numpy.einsum can skip computing the off-diagonal elements that np.diag would throw away anyway - but, if I'm reading it correctly, the output of numpy.einsum_path shows no speed-up at all.
print(np.einsum_path('ij,ji->i', A, B, optimize=True)[1])
  Complete contraction:  ij,ji->i
         Naive scaling:  2
     Optimized scaling:  2
      Naive FLOP count:  1.200e+06
  Optimized FLOP count:  1.200e+06
   Theoretical speedup:  1.000
  Largest intermediate:  2.000e+03 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   2                     ji,ij->i                                    i->i
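One thing I notice (my own back-of-the-envelope arithmetic, not anything numpy reports): the 1.200e+06 naive FLOP count above already corresponds to computing just the diagonal, whereas the full A.dot(B) followed by np.diag would cost far more:
n, k = 2000, 300

# Full product: n*n output elements, each a length-k dot product
# (k multiplies + k adds per element), before np.diag discards all but n.
full_product_flops = 2 * n * n * k     # 2.4e9

# einsum('ij,ji->i', A, B): only the n diagonal elements are computed.
diagonal_only_flops = 2 * n * k        # 1.2e6, matching einsum_path's count

print(full_product_flops / diagonal_only_flops)  # 2000.0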
Questions:
- Why can I see a practical speed-up that isn't reflected in the computational path?
- Is there a way to quantify the speed-up of the ij,ji->i path in numpy.einsum?