To solve a PDE (the Schrödinger equation), I need to apply the Laplace operator in three dimensions. My current solution is the following (the part of the code that takes by far the most time):
for n in range(Ntstep):                      # time-step loop
    for i in range(self.Nixyz[0]):           # internal levels of the wavefunction
        wf.psi[i,:,:,:] = self.expu * wf.psi[i,:,:,:]   # potential step
        if n < Ntstep - 1:                   # kinetic step: Laplacian in 3D via 1D FFTs
            wf.psi[i,:,:,:] = \
                sf.ifft(self.expkx * sf.fft(wf.psi[i,:,:,:],
                        axis=0, **fft_args), axis=0, **fft_args)
            wf.psi[i,:,:,:] = \
                sf.ifft(self.expky * sf.fft(wf.psi[i,:,:,:],
                        axis=1, **fft_args), axis=1, **fft_args)
            wf.psi[i,:,:,:] = \
                sf.ifft(self.expkz * sf.fft(wf.psi[i,:,:,:],
                        axis=2, **fft_args), axis=2, **fft_args)
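As a cross-check on the separability argument: the three 1D passes above, each with its own phase factor, are mathematically equivalent to a single 3D transform using the product of the phase factors. The sketch below demonstrates this with stand-in arrays (the names expkx/expky/expkz mirror the attributes above but are random here); note that scipy.fft (SciPy >= 1.4, a newer API than scipy.fftpack) accepts a workers argument for multi-threaded transforms:

```python
import numpy as np
from scipy import fft  # scipy.fft (SciPy >= 1.4) supports workers=

rng = np.random.default_rng(0)
N = 8
psi = rng.standard_normal((N, N, N)) + 1j * rng.standard_normal((N, N, N))

# stand-in per-axis phase factors, shaped for broadcasting
expkx = np.exp(-1j * rng.standard_normal(N)).reshape(N, 1, 1)
expky = np.exp(-1j * rng.standard_normal(N)).reshape(1, N, 1)
expkz = np.exp(-1j * rng.standard_normal(N)).reshape(1, 1, N)

# three 1D passes, as in the loop above
out_1d = fft.ifft(expkx * fft.fft(psi, axis=0), axis=0)
out_1d = fft.ifft(expky * fft.fft(out_1d, axis=1), axis=1)
out_1d = fft.ifft(expkz * fft.fft(out_1d, axis=2), axis=2)

# equivalent single 3D transform; workers= parallelizes across threads
out_3d = fft.ifftn(expkx * expky * expkz * fft.fftn(psi, workers=4),
                   workers=4)
```

Whether the single 3D call or the three 1D passes is faster is worth benchmarking on your machine; the results agree to machine precision.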
To get better performance, I tried or considered the following:
- Don't do the 3D FFT directly. The Laplacian is separable and can therefore be split into three 1D FFTs, which should bring the complexity down from n^3 to 3n. (Done in the code above.)
- I compiled numpy and scipy against the MKL, hoping to gain some performance and especially to enable multi-threaded calculations. Multiple threads are used for some operations (e.g. matrix-vector multiplication), but neither numpy.fft nor scipy.fftpack utilizes multiple cores.
- I compiled libfftw and pyfftw and used them as a drop-in replacement for np/sp. I have an Intel Core i7-3770K, i.e. four cores and eight threads. With FFTW I get about twice the performance of np/sp when using two or four threads. One thread, or more than four threads, is slower for some reason.
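For reference, pyfftw's numpy-like interface accepts a threads keyword per call. A minimal sketch, with a fallback to plain numpy when pyfftw is not installed (the array shape and thread count here are arbitrary choices of mine):

```python
import numpy as np

try:
    import pyfftw.interfaces.numpy_fft as fftlib
    # pyfftw-specific keywords; ignored in the numpy fallback below
    fft_kwargs = {"threads": 4, "planner_effort": "FFTW_MEASURE"}
except ImportError:
    fftlib = np.fft
    fft_kwargs = {}

a = np.random.default_rng(1).random((32, 32, 32))
out = fftlib.fft(a, axis=0, **fft_kwargs)  # 1D FFT along axis 0
```

Both backends return the same result; only the speed (and the planning overhead on the first call) differs.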
So, my main questions now are:
Is the FFT(W) parallelizable in a way that the performance scales with the number of available cores/threads? If so, what do I need to consider? Currently, two to four threads seem to be the sweet spot for me; more (or fewer) threads are slower, even though my CPU offers eight threads.
Should I try to parallelize my Python code, e.g. put the three 1D FFTs on three different cores? Of course I have to make sure I don't read from and write to the same variable in different threads at the same time, so I need some additional "temp" variables in the code above, e.g.:
- Thread 1: TempA = FFT(psi..., axis=0)
- Thread 2: TempB = FFT(psi..., axis=1)
- Thread 3: TempC = FFT(psi..., axis=2)
- Final step: psi = TempA + TempB + TempC
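Whether this kind of threading actually scales depends on the FFT implementation releasing the GIL, but the pattern itself can be sketched with concurrent.futures. Here the per-axis spectral second derivatives are computed in parallel and summed, verified on a field with a known Laplacian (the grid size and the test function sin(x)sin(y)sin(z), whose Laplacian is -3 times the field, are my own choices):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

N, L = 32, 2 * np.pi
x = np.linspace(0, L, N, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
psi = np.sin(X) * np.sin(Y) * np.sin(Z)        # test field with known Laplacian
k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)     # angular wavenumbers

def d2_along(axis):
    """Spectral second derivative of psi along one axis."""
    shape = [1, 1, 1]
    shape[axis] = N
    k2 = (k ** 2).reshape(shape)
    return np.fft.ifft(-k2 * np.fft.fft(psi, axis=axis), axis=axis).real

# one worker per axis, results combined in the final step
with ThreadPoolExecutor(max_workers=3) as ex:
    tempA, tempB, tempC = ex.map(d2_along, range(3))
lap = tempA + tempB + tempC                    # Laplacian of psi
```

Each thread reads psi but writes only its own output array, so there is no shared mutable state beyond the read-only input.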
The FFT along axis=0 takes twice(!) as long as along the other axes. Is it possible to get rid of this difference and make all FFTs equally fast?

(New) Is the FFT approach the best choice after all, or is the finite-differences approach by user Rory always better, at least in terms of performance?
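For comparison, a second-order finite-difference Laplacian with periodic boundaries is only a few lines; this is a generic sketch (not user Rory's actual code), tested on the same kind of field as above, where the Laplacian of sin(x)sin(y)sin(z) is -3 times the field up to O(h^2) discretization error:

```python
import numpy as np

N, L = 64, 2 * np.pi
h = L / N
x = np.linspace(0, L, N, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
psi = np.sin(X) * np.sin(Y) * np.sin(Z)

def laplacian_fd(f, h):
    """Second-order 7-point stencil, periodic boundaries via np.roll."""
    lap = -6.0 * f
    for ax in range(3):
        lap += np.roll(f, +1, axis=ax) + np.roll(f, -1, axis=ax)
    return lap / h ** 2

lap = laplacian_fd(psi, h)   # approximates -3 * psi with O(h^2) error
```

Cost-wise this is O(n) per grid point versus O(n log n) for the FFT, but the FFT is spectrally accurate while the stencil carries an O(h^2) error, so the comparison depends on the accuracy you need.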
I think efficiently computing the Laplacian is a topic that has been extensively researched, so even links or pointers to papers, books, etc. would be helpful.