
I have some code that uses the following einsum:

y = np.einsum('wxyijk,ijkd->wxyd', x, f)

where (for example) the shape of x is (64, 26, 26, 3, 3, 3) and the shape of f is (3, 3, 3, 1), both having dtype=float

%timeit np.einsum('wxyijk,ijkd->wxyd', x, f)
# 2.01 ms ± 55.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is too slow for my application, which is time critical. Neither using the GPU (via CuPy) nor path speedups (via opt-einsum) seems to make this any faster. Is there any way to make it faster natively in NumPy, or is this just about as fast as it's going to get?

Nihar Karve
  • You could reshape the arrays to (64,26,26,27) and (27,1), and use `@`. But in my testing that doesn't improve the speed. – hpaulj May 17 '20 at 17:11
  • not sure about this but 100 loops may not suffice to amortise the initial parsing of the string into a set of operations – Eelco Hoogendoorn May 17 '20 at 18:38
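
Following up on the parsing-overhead point in the last comment, one option (a sketch of my own, not from the thread, assuming x and f as defined in the question) is to precompute the contraction path with np.einsum_path and pass it back via the optimize argument, so the path search at least only happens once outside the hot loop:

# compute the contraction path once, outside the timed loop;
# einsum_path returns (path, human-readable report)
path, report = np.einsum_path('wxyijk,ijkd->wxyd', x, f, optimize='optimal')

# reuse the precomputed path so einsum skips the path search on every call
y = np.einsum('wxyijk,ijkd->wxyd', x, f, optimize=path)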

1 Answer


In this case you could use the optimize keyword, implement the contraction yourself, or use tensordot. All but the first should end up doing the same thing under the hood (reshape -> dot -> reshape).

Your implementation

import numpy as np

x = np.random.rand(64, 26, 26, 3, 3, 3)
f = np.random.rand(3, 3, 3, 1)
%timeit y = np.einsum('wxyijk,ijkd->wxyd', x, f)
#886 µs ± 3.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With optimize="optimal"

%timeit y = np.einsum('wxyijk,ijkd->wxyd', x, f, optimize="optimal")
#275 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Reshaping and BLAS-call

This normally gives performance quite comparable to optimize="optimal"; here an unnecessary array copy in the einsum path may explain the slowdown.

def contract(x, f):
    s1 = x.shape
    # collapse the three leading (spatial) axes and the three trailing
    # (contraction) axes of x into a 2D matrix
    x_ = x.reshape(s1[0]*s1[1]*s1[2], s1[3]*s1[4]*s1[5])

    s2 = f.shape
    # collapse the three contraction axes of f into the rows of a 2D matrix
    f_ = f.reshape(s2[0]*s2[1]*s2[2], s2[3])
    # single BLAS matmul, then restore the spatial axes
    return np.dot(x_, f_).reshape(s1[0], s1[1], s1[2], s2[3])

%timeit contract(x,f)
#144 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
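
As a quick sanity check (my addition, not part of the original answer), the reshaped dot product can be verified against the einsum result:

y_ref = np.einsum('wxyijk,ijkd->wxyd', x, f)
# both compute the same contraction, so this should print True
print(np.allclose(y_ref, contract(x, f)))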

Tensordot

%timeit np.tensordot(x,f,axes=3)
#176 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
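
For reference (an addition of mine, not from the original answer): axes=3 contracts the last three axes of x with the first three axes of f, so the same call can be spelled out with explicit axis lists:

# pair x's axes (3, 4, 5) with f's axes (0, 1, 2) explicitly
y_td = np.tensordot(x, f, axes=([3, 4, 5], [0, 1, 2]))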
max9111
  • `tensordot` does reshape (and transpose if needed) reducing the calculation to the `np.dot` that you do in `contract`. – hpaulj May 17 '20 at 18:31
  • The effect of `optimize` in `einsum` has shifted over numpy versions, as devs have tried different strategies. On my version (1.18) it doesn't help. – hpaulj May 17 '20 at 18:32
  • @hpaulj I also have 1.18.1 Does using tensordot make a difference on your system? Maybe it is the BLAS backend for this very special case (use of dot-> gemm for a matrix-vector multiplication). I have MKL – max9111 May 17 '20 at 18:38
  • Strangely, on my system, the naive version takes 2 ms, contract and tensordot both take 4.92 ms, while 'optimal' takes a whopping _5.42 ms!_ Is there some setting I need to change? – Nihar Karve May 18 '20 at 03:03
  • @DiehardTheTryhard Which BLAS back end are you using? `numpy.show_config()` Which CPU? All of your timings are really quite slow.... – max9111 May 18 '20 at 09:43
  • @max9111 CPU - i7-7700 @ 2.80 GHz; BLAS backend appears to be 'mkl_rt'. I thought MKL was optimised for Intel processors? – Nihar Karve May 18 '20 at 11:15
  • @DiehardTheTryhard I did my timings on a Core i5-8500 (dual channel RAM). I also tried it on a low-end device (Tablet with core-m3 and single channel RAM). There every method was in between 0.88 and 0.95ms. Your timings are far too slow for that hardware. I guess it would be the best to make a clean Anaconda installation and test it there. – max9111 May 20 '20 at 15:10
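
As suggested in the comments above, the BLAS backend can be checked directly (a minimal sketch, my addition):

import numpy as np
np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against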