
Related question: BLAS with symmetry in higher order tensor in Fortran

I tried to use Python code to exploit the symmetry in the tensor contraction A[a,b] B[b,c,d] = C[a,c,d], where B[b,c,d] = B[b,d,c] and hence C[a,c,d] = C[a,d,c]. (Einstein summation convention assumed, i.e., the repeated index b is summed over.)

Here is the code:

import numpy as np
import time

# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = 100

A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C = np.zeros((na,nc,nd))
C2= np.zeros((na,nc,nd))
C3= np.zeros((na,nc,nd))

# symmetrize B
for c in range(nc):
    for d in range(c):
        B[:,c,d] = B[:,d,c]


start_time = time.time()
C2 = np.einsum('ab,bcd->acd', A, B)
finish_time = time.time()
print('time einsum', finish_time - start_time )


start_time = time.time()
for c in range(nc):
    # c+1 is needed, since range(0) would be skipped
    for d in range(c+1):
        #C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
        C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
       
for c in range(nc):
    for d in range(c+1,nd):
        C3[:,c,d] = C3[:,d,c] 

finish_time = time.time()
print( 'time partial einsum', finish_time - start_time )


for a in range(na // 10):
    for c in range(nc // 10):
        for d in range(nd // 10):
            if abs(C3[a,c,d] - C2[a,c,d]) > 1.0e-12:
                print('warning', a, c, d, C3[a,c,d] - C2[a,c,d])

It seems to me that np.matmul is faster than np.einsum. For example, using np.matmul in the inner loop, I got

time einsum 0.07406115531921387
time partial einsum 0.0553278923034668

Using np.einsum instead, I got

time einsum 0.0751657485961914
time partial einsum 0.11624622344970703

Is the above performance difference general? I have often taken einsum for granted.
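(For reference, the timings above are single time.time() intervals; a less noisy measurement with the standard timeit module could look like the following sketch, using the same array shapes as above:)

```python
import timeit
import numpy as np

na = nb = nc = nd = 100
A = np.random.random((na, nb))
B = np.random.random((nb, nc, nd))

# repeat the measurement and keep the best run, which is less
# sensitive to other processes than a single time.time() interval
t = min(timeit.repeat(lambda: np.einsum('ab,bcd->acd', A, B),
                      number=10, repeat=3)) / 10
print(f'einsum: {t * 1e3:.2f} ms per call')
```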

AlphaF20

1 Answer


As a general rule I expect matmul to be faster, though in simpler cases it appears that einsum actually uses matmul.

But here are my timings:

In [20]: C2 = np.einsum('ab,bcd->acd', A, B)
In [21]: timeit C2 = np.einsum('ab,bcd->acd', A, B)
126 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Your symmetry attempt with einsum:

In [22]: %%timeit
    ...: for c in range(nc):
    ...: # c+1 is needed, since range(0) will be skipped
    ...:     for d in range(c+1):
    ...:        C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
    ...:        #C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
    ...: 
    ...: for c in range(nc):
    ...:     for d in range(c+1,nd):
    ...:         C3[:,c,d] = C3[:,d,c]
    ...: 
128 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Same with matmul:

In [23]: %%timeit
    ...: for c in range(nc):
    ...: # c+1 is needed, since range(0) will be skipped
    ...:     for d in range(c+1):
    ...:        #C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
    ...:        C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
    ...: 
    ...: for c in range(nc):
    ...:     for d in range(c+1,nd):
    ...:         C3[:,c,d] = C3[:,d,c]
    ...: 
81.3 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And direct matmul:

In [24]: C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
In [25]: np.allclose(C2,C4)
Out[25]: True
In [26]: timeit C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
14.9 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

einsum also has an optimize flag. I thought that only mattered when there are 3 or more arguments, but it seems to help here:

In [27]: timeit C2 = np.einsum('ab,bcd->acd', A, B, optimize=True)
20.3 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
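(You can ask einsum what optimize actually decided to do. np.einsum_path returns the chosen contraction order together with a human-readable report; a minimal sketch, using the same shapes as above:)

```python
import numpy as np

A = np.random.random((100, 100))
B = np.random.random((100, 100, 100))

# einsum_path reports the contraction order chosen by the optimizer,
# including the expected scaling and theoretical speedup
path, info = np.einsum_path('ab,bcd->acd', A, B, optimize='optimal')
print(info)
```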

Sometimes when the arrays are very big, some iteration is faster because it reduces memory management complexities. But I don't think it's worth it when trying to exploit symmetry. Other SO answers have shown that in some cases matmul can detect symmetry and use a custom BLAS call, but I don't think that's the case here (it can't detect the symmetry in B without an expensive comparison).
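(If you do want to exploit the symmetry without a Python-level loop, you can gather only the unique (c, d) columns with np.tril_indices, make a single matmul call over them, and mirror the result. A sketch, not from the answer above, with a seeded setup for reproducibility:)

```python
import numpy as np

na = nb = nc = nd = 100
rng = np.random.default_rng(0)
A = rng.random((na, nb))
B = rng.random((nb, nc, nd))

# make B symmetric in its last two axes: B[:, c, d] == B[:, d, c]
B = np.tril(B) + np.tril(B, -1).transpose(0, 2, 1)

# contract only the nc*(nc+1)/2 unique (c, d) pairs in one BLAS call
c_idx, d_idx = np.tril_indices(nc)
C = np.empty((na, nc, nd))
C[:, c_idx, d_idx] = A @ B[:, c_idx, d_idx]

# mirror the result to the upper triangle
C[:, d_idx, c_idx] = C[:, c_idx, d_idx]
```

This still moves roughly half as much data through the GEMM as the full contraction, though the fancy-indexing gather/scatter copies eat into that saving.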

hpaulj
  • `optimize=True` allows np.einsum to dispatch to BLAS functionality at the cost of some overhead. The goal of `optimize=True` was to always do the right thing, and it will use matmul/dot where appropriate. It doesn't perform looped GEMM operations, however (even more logic overhead), and it is typically possible to hand-tune operations to beat it. – Daniel Apr 23 '21 at 13:17