
For non-standard computations in optical wavefront propagation I need to deal with very large matrices (roughly 10^6 × 10^6 entries) which are neither sparse nor in any way "reducible".

The code roughly reads as follows:

import numpy as np

# coordinate grids; Nx, Nq are large, deltax, deltaq are scaling factors
x = np.linspace(-Nx / 2, Nx / 2 - 1, Nx) * deltax
q = np.linspace(-Nq / 2, Nq / 2 - 1, Nq) * deltaq
xg, qg = np.meshgrid(x, q)

# full (Nq x Nx) kernel -- far too large for RAM at these sizes
kernel = np.exp(1j * 2 * np.pi * qg * xg)

# matrix-vector product giving the propagated field
u = np.dot(kernel, u0)

Here, Nx and Nq are large numbers, deltax and deltaq are scaling factors, and u0 is a suitable vector. In fact, the kernel is not of the above Fourier type but more complicated.

Of course, the matrix is too large for RAM, so the whole procedure has to be divided into smaller blocks. In fact, I compute blocks of rows of the kernel and then do the corresponding dot product, which gives blocks of u. I didn't include this procedure in the code above in order to keep the focus on the main problem, but a sketch follows below.
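To give an idea, the blocked procedure looks roughly like this (just a sketch; block_rows is a hypothetical tuning parameter chosen so that one block of the kernel fits into RAM):

u = np.empty(Nq, dtype=complex)
for start in range(0, Nq, block_rows):
    stop = min(start + block_rows, Nq)
    # build only a block of rows of the kernel ...
    kernel_block = np.exp(1j * 2 * np.pi * np.outer(q[start:stop], x))
    # ... and compute the corresponding block of u
    u[start:stop] = np.dot(kernel_block, u0)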

There are two time-consuming operations here:

  1. Computation of the kernel
  2. Matrix vector multiplication

So far, I use numexpr to speed up the kernel computation. Hence, this part makes use of the multi-core architecture (current machine: AMD FX 8320, 8 cores, 16 GB RAM). However, the matrix-vector multiplication does not. Also, I haven't compiled numpy to tune it to my machine. This way, one run (involving several hundred such operations) takes several days!
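The numexpr part of the kernel computation looks roughly like this (a sketch; xb, qb and two_pi_j are illustrative names, with xb and qb being the meshgrid blocks for the current block of rows from the sketch above):

import numexpr as ne

two_pi_j = 2j * np.pi  # precomputed constant passed into the expression
xb, qb = np.meshgrid(x, q[start:stop])
# numexpr evaluates the expression in parallel across cores
kernel_block = ne.evaluate("exp(two_pi_j * qb * xb)")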

I guess further parallelisation is a possibility to reduce the computation time. However, I'm a complete beginner at parallelisation. Googling this topic leads to a whole zoo of possibilities, and (as a beginner) it's hard to judge which route to follow.

Could you please recommend some speedup procedures (in particular, involving parallelisation) that are accessible to a BEGINNER?

phlegmax

1 Answer


Can you do something like this?

x = np.linspace(-Nx / 2, Nx / 2 - 1, Nx) * deltax
q = np.linspace(-Nq / 2, Nq / 2 - 1, Nq) * deltaq
u = np.empty(Nq, dtype=complex)  # one entry per row of the kernel
for i in range(Nq):
    # build row i of the kernel on the fly and dot it with u0
    u[i] = np.dot(np.exp(1j * 2 * np.pi * q[i] * x), u0)

It will be slower than a vectorized version (since you're computing your kernel line by line), but it shouldn't overflow your memory.

Daniel F
  • In fact, this is essentially what I'm doing to prevent running out of RAM. However, I'm not doing this line by line, but in blocks of lines. These blocks of lines are chosen as large as possible to benefit from numpy vector operations. – phlegmax Mar 30 '17 at 13:39
  • That's probably your best option. – Daniel F Mar 30 '17 at 13:40
  • I edited my original post to make this point clear from the beginning. – phlegmax Mar 30 '17 at 13:45