
Could someone explain why numpy.einsum('ij,ji', A, B) is much slower than numpy.einsum('ij,ij', A, B), as shown below?

In [1]: import numpy as np
In [2]: a = np.random.rand(1000,1000)                                                                                                                                
In [3]: b = np.random.rand(1000,1000)                                                                                                                                
In [4]: timeit np.einsum('ij,ij', a, b)                                                                                                                               
532 µs ± 5.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: timeit np.einsum('ij,ji', a, b)                                                                                                                               
1.28 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Regards, Marek

mrkwjc

1 Answer


First suspect: https://en.wikipedia.org/wiki/Locality_of_reference#Spatial_and_temporal_locality_usage

In the first case, both a's and b's memory is accessed in a contiguous, sequential pattern.

In the second case, b's memory is accessed in strides of 1000 elements (8000 bytes) in the inner loop.

Most modern x86 processors have a 32 KiB L1 data cache with 64-byte cache lines, i.e. 512 cache lines in total. Because the 8000-byte stride exceeds the line size, each inner-loop pass over b touches 1000 distinct cache lines (one per element), so the entire L1 cache is evicted roughly twice between outer-loop iterations and b's lines are never reused.
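You can see why the access pattern differs by inspecting the strides: transposing a NumPy array swaps its strides without moving any data, so iterating b.T "row-wise" actually walks the original buffer in 8000-byte jumps (a quick sketch):

```python
import numpy as np

b = np.random.rand(1000, 1000)

# A C-contiguous float64 array steps 8 bytes along a row
# but 8000 bytes down a column.
print(b.strides)    # (8000, 8)

# b.T is a view with the strides swapped -- no data is copied.
print(b.T.strides)  # (8, 8000)
```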

Note that if you run

timeit np.einsum('ij,ji', a, b.T)

you should get roughly the same time as your first example.

Similarly

timeit np.einsum('ij,ij', a, b.T)

should give the same time as your second example.
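Since b.T is only a stride-swapped view, swapping both the subscripts and the transpose leaves the computed value unchanged while flipping the memory access pattern; a quick check confirms the equivalence (timings aside):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1000, 1000))
b = rng.random((1000, 1000))

# Same value as the fast case, but 'ij,ji' over b.T reads the
# underlying buffer of b contiguously.
assert np.isclose(np.einsum('ij,ij', a, b),
                  np.einsum('ij,ji', a, b.T))

# Same value as the slow case: 'ij,ij' over b.T forces strided reads.
assert np.isclose(np.einsum('ij,ji', a, b),
                  np.einsum('ij,ij', a, b.T))
```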

Aleš Erjavec
  • Also preprocessing 'b' array as b1 = np.asfortranarray(b) affects timings: np.einsum('ij,ji', a, b1) is then faster than np.einsum('ij,ij', a, b1). I think this confirms the answer. Thanks! – mrkwjc Nov 14 '19 at 08:17
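The preprocessing mentioned in the comment fits the same explanation: np.asfortranarray copies b into column-major order, so stepping its first index becomes the contiguous direction and 'ij,ji' reads both operands sequentially (a sketch):

```python
import numpy as np

b = np.random.rand(1000, 1000)
b1 = np.asfortranarray(b)  # copy into column-major (Fortran) order

# Strides are now swapped relative to the C-ordered original, so
# np.einsum('ij,ji', a, b1) walks b1's buffer contiguously.
print(b1.strides)               # (8, 8000)
print(b1.flags['F_CONTIGUOUS'])  # True
```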