I have the following tensor contraction:
np.einsum('k,pjqk,yzjqk,yzk,ipqt->it', A, B, C, D, E)
I noticed that the execution time suffered badly whenever the 'z' or 'q' dimensions grew, although my intuition was that it shouldn't be that bad; I suspected it was my input form, and that I could improve things by doing some of the contractions manually.
After a little digging I saw that the optimize argument offers two path-search modes, 'optimal' and 'greedy'. Evaluating the contraction path with each mode gives, respectively:
(['einsum_path', (0, 3), (1, 3), (0, 2), (0, 1)],
' Complete contraction: k,pjqk,yzjqk,yzk,ipqt->it\n'
' Naive scaling: 8\n'
' Optimized scaling: 5\n'
' Naive FLOP count: 5.530e+04\n'
' Optimized FLOP count: 7.930e+02\n'
' Theoretical speedup: 69.730\n'
' Largest intermediate: 2.400e+01 elements\n'
'--------------------------------------------------------------------------\n'
'scaling current remaining\n'
'--------------------------------------------------------------------------\n'
' 3 yzk,k->yzk pjqk,yzjqk,ipqt,yzk->it\n'
' 5 yzk,yzjqk->jqk pjqk,ipqt,jqk->it\n'
' 4 jqk,pjqk->qp ipqt,qp->it\n'
' 4 qp,ipqt->it it->it')
and
(['einsum_path', (2, 3), (1, 3), (1, 2), (0, 1)],
' Complete contraction: k,pjqk,yzjqk,yzk,ipqt->it\n'
' Naive scaling: 8\n'
' Optimized scaling: 5\n'
' Naive FLOP count: 5.530e+04\n'
' Optimized FLOP count: 1.729e+03\n'
' Theoretical speedup: 31.981\n'
' Largest intermediate: 4.800e+01 elements\n'
'--------------------------------------------------------------------------\n'
'scaling current remaining\n'
'--------------------------------------------------------------------------\n'
' 5 yzk,yzjqk->jqk k,pjqk,ipqt,jqk->it\n'
' 4 jqk,pjqk->qkp k,ipqt,qkp->it\n'
' 5 qkp,ipqt->tik k,tik->it\n'
' 3 tik,k->it it->it')
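For reference, the two reports above come from np.einsum_path; below is a minimal, runnable sketch of the calls that produce them. The array sizes are just placeholders, not my real dimensions.

import numpy as np

# Placeholder sizes -- my real dimensions differ; these just make the sketch runnable.
k, p, j, q, y, z, i, t = 2, 3, 4, 5, 6, 7, 8, 9
A = np.random.rand(k)
B = np.random.rand(p, j, q, k)
C = np.random.rand(y, z, j, q, k)
D = np.random.rand(y, z, k)
E = np.random.rand(i, p, q, t)

# np.einsum_path returns (path, human-readable report) for a given strategy.
path_opt, report_opt = np.einsum_path('k,pjqk,yzjqk,yzk,ipqt->it',
                                      A, B, C, D, E, optimize='optimal')
path_grd, report_grd = np.einsum_path('k,pjqk,yzjqk,yzk,ipqt->it',
                                      A, B, C, D, E, optimize='greedy')
print(report_opt)
print(report_grd)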
In my tests 'optimal' is indeed far faster for me, as the reports above suggest.
Question
Can anyone explain in simple terms what the difference is, and why 'greedy' is the default?
What is the downside to always using 'optimal'?
If my einsum calculation is going to be run thousands of times (which it is, as part of an optimisation loop), should I restructure the execution so that it automatically benefits from the 'optimal' path, without having to re-calculate it (or a 'greedy' path) every time?
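What I have in mind is roughly the sketch below, assuming the path returned by np.einsum_path can simply be passed back through the optimize argument and stays valid as long as the array shapes don't change:

# Compute the contraction path once, up front, while the array shapes are fixed.
path, _ = np.einsum_path('k,pjqk,yzjqk,yzk,ipqt->it',
                         A, B, C, D, E, optimize='optimal')

for step in range(10000):  # stand-in for my optimisation iterations
    # ... A, B, C, D, E get updated here, keeping the same shapes ...
    result = np.einsum('k,pjqk,yzjqk,yzk,ipqt->it',
                       A, B, C, D, E, optimize=path)

Is that the right approach, or is there a better way to structure this?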