
I want to make use of einsum to speed up the following code.

As a simple example, take a list of two (3,3) arrays called dcx, which I created like this:

In [113]: dcx = [np.arange(9).reshape(3,3), np.arange(10,19).reshape(3,3)]; dcx
Out[113]: 
[array([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]),
 array([[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]])]

I wanted to perform the following operation:

In [114]: Ax = np.zeros((2,2))
     ...: for i in range(2):
     ...:         for j in range(2):
     ...:             Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))

In [115]: Ax
Out[115]: 
array([[ 450., 1530.],
       [1530., 5310.]])

The time taken is ~0.001009225845336914 sec.

Then I tried using einsum as follows:

x = np.array(dcx)
In [117]: np.einsum('ikl,jml->ij', x,x)
Out[117]: 
array([[ 450, 1530],
       [1530, 5310]])

And the time taken is almost zero seconds.

So I thought that einsum should be faster, and I extended my example to a more complex case as follows:

dCx is now a list of 40 (40,40) arrays.

I noticed that the time taken by:

In [114]: Ax = np.zeros((40,40))
     ...: for i in range(40):
     ...:         for j in range(40):
     ...:             Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))

is 0.02168726921081543 sec,

while the time taken by the einsum:

import time
import numpy as np

array_list = []
for _ in range(40):
    array = np.random.rand(40, 40)
    array_list.append(array)

t0 = time.time()
x = np.array(array_list)
np.einsum('ikl,jml->ij', x, x)
t1 = time.time()
print(f"Execution time : {t1-t0} sec")

0.15755462646484375 sec
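
For a like-for-like comparison, here is a sketch that times both approaches on the same array (absolute timings of course vary from machine to machine):

import time
import numpy as np

x = np.array([np.random.rand(40, 40) for _ in range(40)])   # same shape as dCx

# loop version
t0 = time.time()
Ax = np.zeros((40, 40))
for i in range(40):
    for j in range(40):
        Ax[i, j] = np.sum(np.dot(x[i], x[j].T))
t1 = time.time()
print(f"loop  : {t1-t0} sec")

# einsum version
t0 = time.time()
Ax2 = np.einsum('ikl,jml->ij', x, x)
t1 = time.time()
print(f"einsum: {t1-t0} sec")

print(np.allclose(Ax, Ax2))   # sanity check: both give the same result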

I am confused why einsum now takes longer. I wanted to make use of it to speed up more complex code with dCx of dimensions (3000,3000,3000), where my loops take a very long time.

  • You compare codes that apparently do not work on the same arrays. The first code uses `dcx` while the second uses `array_list`. What is `array_list` supposed to be in the second code? Can you provide fully reproducible code? (e.g. something we can copy-paste without missing variables) – Jérôme Richard May 31 '23 at 12:43
  • With a (40,40,40) array, `timeit` gives me `148ms` for the iterative version and `260ms` for the `einsum`. That's consistent with my experience that a modest number of iterations over a complex task is often faster than the equivalent whole-array operation. Memory management (paging, caching, etc.) complicates the running of very large array operations. – hpaulj May 31 '23 at 15:59
  • @hpaulj I see, do you suggest any alternative methods to speed up the loop code for huge matrices? `Ax = np.zeros((3000,3000)); for i in range(3000): for j in range(3000): Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))` – Accelerator Jun 02 '23 at 08:03
  • @JérômeRichard array_list is the dCx with dimensions (40,40,40). I edited the post now to include the array_list generation. – Accelerator Jun 02 '23 at 08:06

1 Answer


Both methods are inefficient. einsum surprisingly fails to find the optimal contraction path in this case (even with optimize='optimal'). The dot approach is faster here (possibly because it explicitly benefits from the highly-optimized BLAS matrix-multiplication implementation), but it is still clearly inefficient algorithmically, since many of the operations are not actually needed.

One can drastically reduce the number of operations by factorizing them. This kind of optimisation requires a bit of maths. Let us describe mathematically what np.sum(np.dot(x, y.T)) (with x = dcx[i] and y = dcx[j]) does and factorize the resulting expression:

$$\sum_{a,b}\bigl(x\,y^\top\bigr)_{ab} \;=\; \sum_{a,b}\sum_{k} x_{ak}\,y_{bk} \;=\; \sum_{k}\Bigl(\sum_{a} x_{ak}\Bigr)\Bigl(\sum_{b} y_{bk}\Bigr)$$

For each (i, j), we can pre-compute the inner terms of the expression (inside the same double loop) and reuse them:

tmp1 = np.sum(dcx[i], axis=0)      # column sums of dcx[i] (sum over a)
tmp2 = np.sum(dcx[j], axis=0)      # column sums of dcx[j] (sum over b)
Ax[i, j] = np.dot(tmp1, tmp2)      # final sum over k

This is a huge improvement over the initial code: this approach runs in O(n**4) while the initial one runs in O(n**5) (due to the expensive matrix multiplications), where n is the side length of the dcx matrices (40 in your example).
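
Put together, this intermediate version could look like the following (a minimal sketch, assuming dcx is a random (n, n, n) array, and checked against the original loop with np.allclose):

import numpy as np

n = 40
rng = np.random.default_rng(0)
dcx = rng.random((n, n, n))            # n slices of shape (n, n)

# original O(n**5) version
Ax_ref = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        Ax_ref[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))

# factorized O(n**4) version: column sums are still recomputed for every pair (i, j)
Ax = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        tmp1 = np.sum(dcx[i], axis=0)  # column sums of dcx[i]
        tmp2 = np.sum(dcx[j], axis=0)  # column sums of dcx[j]
        Ax[i, j] = np.dot(tmp1, tmp2)

print(np.allclose(Ax, Ax_ref))         # True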

We can do better: we can pre-compute the column sums of all the dcx slices once, so as not to recompute them for every pair. Here is the final code:

tmp = np.sum(dcx, axis=1)          # column sums of every slice at once
Ax = tmp @ tmp.T                   # final sum over k for all pairs (i, j)

This approach runs in O(n**3). It is optimal because the size of the input is O(n**3) too. It should be >1_000_000 times faster than the initial approach for n=3000, not to mention the code is also shorter!

Here are the performance results for n=40:

einsum:     131_885 µs
loop:        17_142 µs
proposed:        47 µs   <----
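
A sketch of how such a comparison can be reproduced (not the exact benchmark used above; absolute timings vary by machine and BLAS build):

import numpy as np
from timeit import timeit

n = 40
rng = np.random.default_rng(0)
dcx = rng.random((n, n, n))

def with_loop():
    Ax = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))
    return Ax

def with_einsum():
    return np.einsum('ikl,jml->ij', dcx, dcx)

def proposed():
    tmp = np.sum(dcx, axis=1)
    return tmp @ tmp.T

# all three compute the same result
assert np.allclose(with_loop(), with_einsum())
assert np.allclose(with_loop(), proposed())

for f in (with_einsum, with_loop, proposed):
    print(f.__name__, timeit(f, number=10) / 10, "s per call")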

Related post: How is numpy.einsum implemented?

Jérôme Richard