
I'm trying to do some matrix computations as follows:

import numpy as np

def get_P(X, Z):
    n_sample, n_m, n_t, n_f = X.shape
    res = np.zeros((n_sample, n_m, n_t, n_t))
    for i in range(n_sample):
        # (n_m, n_t, n_f) dot (n_f, n_t) -> (n_m, n_t, n_t)
        res[i, :, :, :] = np.dot(X[i, :, :, :], Z[i, :, :])
    return res

Because X and Z are large, a single np.dot call takes more than 600 ms, and I have 10k samples in X.

Is there any way we can speed it up?

Michael Szczesny
Linyun He
    I could very easily be misunderstanding it, but from reading the docs, I get the impression that `res = np.dot(X,Z)` gives the same result. I don't know if that's any faster, though. – Acccumulation Sep 02 '21 at 00:56
    There may be some ways, but without knowing the approx. shape of the arrays it is hard to give a general answer. Always provide a full working example including realistic shape and size of input arrays. This is an example on quite small (2x2) matrices https://stackoverflow.com/a/59356461/4045774 – max9111 Sep 02 '21 at 08:55
  • `res = np.dot(X,Z)` won't give the same result and raises errors. However, `np.matmul(Xtmp, Z[:,None,:,:])` gives the same result and is about three times faster, but that's still not fast enough. – Linyun He Sep 08 '21 at 18:12
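The broadcasting trick from the comment above can be sketched as follows. This is a minimal illustration with made-up small shapes; it assumes Z has shape (n_sample, n_f, n_t), which is what the question's result shape implies:

```python
import numpy as np

# Hypothetical small shapes for illustration; the real arrays are much larger.
n_sample, n_m, n_t, n_f = 8, 3, 4, 5
X = np.random.rand(n_sample, n_m, n_t, n_f)
Z = np.random.rand(n_sample, n_f, n_t)

# Insert a broadcast axis into Z so matmul batches over both n_sample and n_m:
# (n_sample, n_m, n_t, n_f) @ (n_sample, 1, n_f, n_t) -> (n_sample, n_m, n_t, n_t)
res = np.matmul(X, Z[:, None, :, :])
```

This replaces the Python-level loop over samples with a single batched matmul call.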

2 Answers


Well, there might be some avoidable overhead from your zero initialization (which gets overwritten right away): just use np.ndarray instead.

Other than that: numpy is fairly well-optimized. You can probably speed things up by using dtype=numpy.float32 instead of the default 64-bit floating-point numbers for X, Z and res – but that's about it. Dot products mostly spend their time streaming linearly through RAM while multiplying and summing numbers – things that numpy, compilers and CPUs are remarkably good at these days.
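The single-precision suggestion above can be sketched like this (hypothetical small shapes; the point is that float32 halves memory traffic and lets BLAS use its single-precision kernels):

```python
import numpy as np

# Hypothetical shapes for illustration; cast inputs once, up front.
X = np.random.rand(8, 3, 4, 5).astype(np.float32)
Z = np.random.rand(8, 5, 4).astype(np.float32)

res = np.empty((8, 3, 4, 4), dtype=np.float32)
for i in range(8):
    res[i] = np.dot(X[i], Z[i])  # stays in single precision throughout
```

Whether the precision loss is acceptable depends on your application, of course.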

Note that numpy will only use one CPU core at a time in its default configuration – it might make sense to parallelize: for example, with 16 CPU cores you'd split res into 16 partitions and compute a subset of your range(n_sample) dot products on each core. Python brings the multithreading / async facilities to do so – you'll find plenty of examples, and explaining them here would lead too far.

If you can spend the development time, and you process massive enough amounts of data that it pays off: you can use e.g. GPUs to multiply matrices; they are really good at that, and cuBLASLt (GEMM) is an excellent implementation. But honestly, you'd mostly be abandoning numpy and would need to work things out yourself – in C/C++.

Marcus Müller
    `np.ndarray` is considered a low-level constructor, and `np.zeros` or `np.empty` is preferred. See its docs. – hpaulj Sep 01 '21 at 21:21

You can use numpy's einsum to do this multiplication in one vectorized step. It will be much faster than the loop-based dot product. For examples, check this link: https://rockt.github.io/2018/04/30/einsum
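The einsum call for this particular computation could look like the following sketch (small made-up shapes; it assumes Z has shape (n_sample, n_f, n_t), as the question's result shape implies):

```python
import numpy as np

# Hypothetical small shapes for illustration.
X = np.random.rand(8, 3, 4, 5)   # (n_sample, n_m, n_t, n_f)
Z = np.random.rand(8, 5, 4)      # (n_sample, n_f, n_t)

# s: sample, m, t, f as in the question; u is the output n_t axis.
# Contract over f for each sample: -> (n_sample, n_m, n_t, n_t)
res = np.einsum('smtf,sfu->smtu', X, Z)
```

The subscript string makes the contraction explicit: f is summed over, s is batched, and m, t, u survive into the output.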

MSS