
I'm trying to do some matrix computations as follows:

import numpy as np

def get_P(X, Z):
    n_sample, n_m, n_t, n_f = X.shape
    res = np.zeros((n_sample, n_m, n_t, n_t))
    for i in range(n_sample):
        # (n_m, n_t, n_f) dot (n_f, n_t) -> (n_m, n_t, n_t)
        res[i, :, :, :] = np.dot(X[i, :, :, :], Z[i, :, :])
    return res

Because X and Z are large, a single np.dot call takes more than 600 ms, and I have 10k samples in X.

Is there any way we can speed it up?

Michael Szczesny
Linyun He
    I could very easily be misunderstanding it, but from reading the docs, I get the impression that `res = np.dot(X,Z)` gives the same result. I don't know if that's any faster, though. – Acccumulation Sep 02 '21 at 00:56
    There may be some ways, but without knowing the approx. shape of the arrays it is hard to give a general answer. Always provide a full working example including realistic shape and size of input arrays. This is an example on quite small (2x2) matrices https://stackoverflow.com/a/59356461/4045774 – max9111 Sep 02 '21 at 08:55
  • `res = np.dot(X,Z)` won't give the same result and raises errors. However, `np.matmul(Xtmp, Z[:,None,:,:])` gives the same result and is about three times faster, but that's still not fast enough. – Linyun He Sep 08 '21 at 18:12
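The broadcasting trick from the comment above can be sketched as follows. This is a minimal illustration with made-up small shapes; it assumes Z has shape (n_sample, n_f, n_t), which is what the question's result shape implies:

```python
import numpy as np

# Hypothetical small shapes for illustration; the real arrays are much larger.
n_sample, n_m, n_t, n_f = 8, 3, 4, 5
X = np.random.rand(n_sample, n_m, n_t, n_f)
Z = np.random.rand(n_sample, n_f, n_t)

# Insert a broadcast axis into Z so matmul batches over both n_sample and n_m:
# (n_sample, n_m, n_t, n_f) @ (n_sample, 1, n_f, n_t) -> (n_sample, n_m, n_t, n_t)
res = np.matmul(X, Z[:, None, :, :])
```

This replaces the Python-level loop over samples with a single batched matmul call.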

2 Answers


Well, there might be some avoidable overhead from your zero initialization (which gets overwritten right away): just use np.ndarray instead.

Other than that: numpy is fairly well-optimized. You can probably speed things up by using dtype=numpy.float32 instead of the default 64-bit floating-point numbers for X, Z and res – but that's about it. Dot products mostly spend their time streaming linearly through RAM while multiplying and summing numbers – things that numpy, compilers and CPUs are remarkably good at these days.
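The single-precision suggestion above can be sketched like this (hypothetical small shapes; the point is that float32 halves memory traffic and lets BLAS use its single-precision kernels):

```python
import numpy as np

# Hypothetical shapes for illustration; cast inputs once, up front.
X = np.random.rand(8, 3, 4, 5).astype(np.float32)
Z = np.random.rand(8, 5, 4).astype(np.float32)

res = np.empty((8, 3, 4, 4), dtype=np.float32)
for i in range(8):
    res[i] = np.dot(X[i], Z[i])  # stays in single precision throughout
```

Whether the precision loss is acceptable depends on your application, of course.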

Note that numpy will only use one CPU core at a time in its default configuration – it might make sense to parallelize: for example, with 16 CPU cores you'd split res into 16 partitions and compute a subset of your range(n_sample) dot products on each core. Python brings the multithreading / async facilities to do so – you'll find plenty of examples, and explaining them here would lead too far.

If you can spend the development time, and you process massive enough amounts of data that it pays off: you can use e.g. GPUs to multiply matrices; they are really good at that, and cuBLASLt (GEMM) is an excellent implementation. But honestly, you'd mostly be abandoning numpy and would need to work things out yourself – in C/C++.

Marcus Müller
    `np.ndarray` is considered a low-level constructor, and `np.zeros` or `np.empty` is preferred. See its docs. – hpaulj Sep 01 '21 at 21:21

You can use numpy's einsum to do this multiplication in one vectorized step. It will be much faster than the loop-based dot product. For examples, check this link: https://rockt.github.io/2018/04/30/einsum
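The einsum call for this particular computation could look like the following sketch (small made-up shapes; it assumes Z has shape (n_sample, n_f, n_t), as the question's result shape implies):

```python
import numpy as np

# Hypothetical small shapes for illustration.
X = np.random.rand(8, 3, 4, 5)   # (n_sample, n_m, n_t, n_f)
Z = np.random.rand(8, 5, 4)      # (n_sample, n_f, n_t)

# s: sample, m, t, f as in the question; u is the output n_t axis.
# Contract over f for each sample: -> (n_sample, n_m, n_t, n_t)
res = np.einsum('smtf,sfu->smtu', X, Z)
```

The subscript string makes the contraction explicit: f is summed over, s is batched, and m, t, u survive into the output.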

MSS