You can do that pretty easily with Numba:
import numba
import numpy as np

@numba.njit('float64[:,:,:,::1](float64[:,:,:,::1], float64[:,:,::1])', fastmath=True, parallel=True)
def compute(x, y):
    na, nb, nd, ne = x.shape
    nc = y.shape[2]
    assert y.shape == (na, nb, nc)
    out = np.zeros((nb, nc, nd, ne))
    # Parallelize over b: each thread writes to a disjoint out[b, ...]
    # slice, so no synchronization is needed.
    for b in numba.prange(nb):
        for a in range(na):
            for c in range(nc):
                yVal = y[a, b, c]
                # Skip the whole inner 2D accumulation when y is zero
                # (y is assumed to be mostly zeros).
                if np.abs(yVal) != 0:
                    for d in range(nd):
                        for e in range(ne):
                            out[b, c, d, e] += x[a, b, d, e] * yVal
    return out
Note that it is faster to iterate over a and then b in sequential code. That being said, for the code to be parallel, the loops have been swapped and the parallelization is performed over b (which is a small axis). A parallel reduction over the a axis would be more efficient, but this is unfortunately not easy to do with Numba: one needs to split the matrices into multiple chunks, since there is no simple way to create thread-local matrices.
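For reference, here is a minimal sketch of such a chunked reduction, assuming one private accumulator per chunk that is summed at the end (compute_reduce_a is a hypothetical name, and the partial array makes this approach memory-hungry):

@numba.njit('float64[:,:,:,::1](float64[:,:,:,::1], float64[:,:,::1])', fastmath=True, parallel=True)
def compute_reduce_a(x, y):
    na, nb, nd, ne = x.shape
    nc = y.shape[2]
    nchunks = numba.get_num_threads()
    chunkSize = (na + nchunks - 1) // nchunks
    # One private accumulator per chunk: this emulates thread-local
    # matrices at the cost of extra memory.
    partial = np.zeros((nchunks, nb, nc, nd, ne))
    for i in numba.prange(nchunks):
        for a in range(i * chunkSize, min((i + 1) * chunkSize, na)):
            for b in range(nb):
                for c in range(nc):
                    yVal = y[a, b, c]
                    if yVal != 0.0:
                        for d in range(nd):
                            for e in range(ne):
                                partial[i, b, c, d, e] += x[a, b, d, e] * yVal
    # Sequential final reduction of the per-chunk partial sums.
    out = np.zeros((nb, nc, nd, ne))
    for i in range(nchunks):
        out += partial[i]
    return out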
Note that you can replace values like nd and ne with the actual value (i.e. 30) so that the compiler generates faster code specialized for this matrix size.
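For instance, here is a sketch of that specialization for the 30x30 trailing dimensions used in the test below (compute_30x30 is a hypothetical name):

# Same kernel as above, but with nd and ne hard-coded to 30, so the
# compiler knows the inner loop bounds and can unroll/vectorize better.
@numba.njit('float64[:,:,:,::1](float64[:,:,:,::1], float64[:,:,::1])', fastmath=True, parallel=True)
def compute_30x30(x, y):
    na, nb = x.shape[0], x.shape[1]
    nc = y.shape[2]
    assert x.shape == (na, nb, 30, 30) and y.shape == (na, nb, nc)
    out = np.zeros((nb, nc, 30, 30))
    for b in numba.prange(nb):
        for a in range(na):
            for c in range(nc):
                yVal = y[a, b, c]
                if yVal != 0.0:
                    for d in range(30):      # compile-time-known bounds
                        for e in range(30):
                            out[b, c, d, e] += x[a, b, d, e] * yVal
    return out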
Here is the testing code:
np.random.seed(0)
x = np.random.rand(1000, 5, 30, 30)
y = np.random.rand(1000, 5, 300)
y[np.random.rand(*y.shape) > 0.1] = 0.0 # Make it sparse (90% zeros)
%time res = np.einsum('abde,abc->bcde', x, y) # 2.350 s
%time res2 = compute(x, y) # 0.074 s (0.061 s with hand-written sizes)
print(np.allclose(res, res2))
This is about 32 times faster on a 10-core Intel Skylake Xeon processor. It reaches a 38x speed-up with hand-written sizes. It does not scale very well due to the parallelization over the b axis, but using other axes would cause less efficient memory accesses.
If this is not enough, it may be a good idea to transpose x and y first so as to improve data locality (thanks to a more contiguous access pattern along the a axis) and get better scaling (by parallelizing over both the b and c axes). That being said, transpositions are generally expensive, so one certainly needs to optimize them to get an even better speed-up.
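Here is a rough, unbenchmarked sketch of that idea (compute_transposed and _kernel are hypothetical names; the transpositions below are done with plain NumPy copies, which is precisely the part one would need to optimize):

@numba.njit('float64[:,:,:,::1](float64[:,:,:,::1], float64[:,:,::1])', fastmath=True, parallel=True)
def _kernel(xT, yT):
    # xT has shape (nb, na, nd, ne) and yT has shape (nb, nc, na), so the
    # reduction over a reads contiguous memory in both arrays.
    nb, na, nd, ne = xT.shape
    nc = yT.shape[1]
    out = np.zeros((nb, nc, nd, ne))
    # Parallelize over the combined (b, c) space for better scaling.
    for bc in numba.prange(nb * nc):
        b = bc // nc
        c = bc % nc
        for a in range(na):
            yVal = yT[b, c, a]
            if yVal != 0.0:
                for d in range(nd):
                    for e in range(ne):
                        out[b, c, d, e] += xT[b, a, d, e] * yVal
    return out

def compute_transposed(x, y):
    # Naive NumPy transpositions (these copies are the expensive part).
    xT = np.ascontiguousarray(x.transpose(1, 0, 2, 3))  # (nb, na, nd, ne)
    yT = np.ascontiguousarray(y.transpose(1, 2, 0))     # (nb, nc, na)
    return _kernel(xT, yT)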