To solve a PDE (the Schrödinger equation), I need to apply the Laplace operator in three dimensions. My current solution is the following (the part of the code that takes by far the most time):
for n in range(Ntstep):                      # time-step loop
    for i in range(self.Nixyz[0]):           # internal levels of the wavefunction
        wf.psi[i,:,:,:] = self.expu * wf.psi[i,:,:,:]   # potential step
        if n < Ntstep - 1:                   # kinetic step: Laplacian in 3D via 1D FFTs
            wf.psi[i,:,:,:] = \
                sf.ifft(self.expkx * sf.fft(wf.psi[i,:,:,:],
                        axis=0, **fft_args), axis=0, **fft_args)
            wf.psi[i,:,:,:] = \
                sf.ifft(self.expky * sf.fft(wf.psi[i,:,:,:],
                        axis=1, **fft_args), axis=1, **fft_args)
            wf.psi[i,:,:,:] = \
                sf.ifft(self.expkz * sf.fft(wf.psi[i,:,:,:],
                        axis=2, **fft_args), axis=2, **fft_args)
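As a cross-check on the separability argument: the three 1D passes above, each with its own phase factor, are mathematically equivalent to a single 3D transform using the product of the phase factors. The sketch below demonstrates this with stand-in arrays (the names expkx/expky/expkz mirror the attributes above but are random here); note that scipy.fft (SciPy >= 1.4, a newer API than scipy.fftpack) accepts a workers argument for multi-threaded transforms:

```python
import numpy as np
from scipy import fft  # scipy.fft (SciPy >= 1.4) supports workers=

rng = np.random.default_rng(0)
N = 8
psi = rng.standard_normal((N, N, N)) + 1j * rng.standard_normal((N, N, N))

# stand-in per-axis phase factors, shaped for broadcasting
expkx = np.exp(-1j * rng.standard_normal(N)).reshape(N, 1, 1)
expky = np.exp(-1j * rng.standard_normal(N)).reshape(1, N, 1)
expkz = np.exp(-1j * rng.standard_normal(N)).reshape(1, 1, N)

# three 1D passes, as in the loop above
out_1d = fft.ifft(expkx * fft.fft(psi, axis=0), axis=0)
out_1d = fft.ifft(expky * fft.fft(out_1d, axis=1), axis=1)
out_1d = fft.ifft(expkz * fft.fft(out_1d, axis=2), axis=2)

# equivalent single 3D transform; workers= parallelizes across threads
out_3d = fft.ifftn(expkx * expky * expkz * fft.fftn(psi, workers=4),
                   workers=4)
```

Whether the single 3D call or the three 1D passes is faster is worth benchmarking on your machine; the results agree to machine precision.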
To get better performance, I tried or considered the following:
- Don't do the 3D FFT directly. The Laplacian is separable and can therefore be split into three 1D FFTs, which should bring the complexity down from n^3 to 3n. (Done in the code above.)
- I compiled numpy and scipy against the MKL, hoping to gain some performance and especially to enable multi-threaded calculations. Multiple threads are used for some operations (e.g. matrix-vector multiplication), but neither numpy.fft nor scipy.fftpack utilizes multiple cores.
- I compiled libfftw and pyfftw and used them as a drop-in replacement for np/sp. I have an Intel Core i7-3770K, i.e. four cores and eight threads. With FFTW I get about twice the performance of np/sp when using two or four threads. One thread, or more than four threads, is slower for some reason.
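For reference, pyfftw's numpy-like interface accepts a threads keyword per call. A minimal sketch, with a fallback to plain numpy when pyfftw is not installed (the array shape and thread count here are arbitrary choices of mine):

```python
import numpy as np

try:
    import pyfftw.interfaces.numpy_fft as fftlib
    # pyfftw-specific keywords; ignored in the numpy fallback below
    fft_kwargs = {"threads": 4, "planner_effort": "FFTW_MEASURE"}
except ImportError:
    fftlib = np.fft
    fft_kwargs = {}

a = np.random.default_rng(1).random((32, 32, 32))
out = fftlib.fft(a, axis=0, **fft_kwargs)  # 1D FFT along axis 0
```

Both backends return the same result; only the speed (and the planning overhead on the first call) differs.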
So, my main questions now are:
Is the FFT(W) parallelizable in a way that the performance scales with the number of available cores/threads? If so, what do I need to consider? Currently, two to four threads seem to be the sweet spot for me; more (or fewer) threads are slower, even though my CPU offers eight threads.
Should I try to parallelize my Python code, e.g. put the three 1D FFTs on three different cores? Of course I have to make sure I don't read from and write to the same variable in different threads at the same time, so I need some additional "temp" variables in the code above, e.g.:
- Thread 1: TempA = FFT(psi..., axis=0)
- Thread 2: TempB = FFT(psi..., axis=1)
- Thread 3: TempC = FFT(psi..., axis=2)
- Final step: psi = TempA + TempB + TempC
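Whether this kind of threading actually scales depends on the FFT implementation releasing the GIL, but the pattern itself can be sketched with concurrent.futures. Here the per-axis spectral second derivatives are computed in parallel and summed, verified on a field with a known Laplacian (the grid size and the test function sin(x)sin(y)sin(z), whose Laplacian is -3 times the field, are my own choices):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

N, L = 32, 2 * np.pi
x = np.linspace(0, L, N, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
psi = np.sin(X) * np.sin(Y) * np.sin(Z)        # test field with known Laplacian
k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)     # angular wavenumbers

def d2_along(axis):
    """Spectral second derivative of psi along one axis."""
    shape = [1, 1, 1]
    shape[axis] = N
    k2 = (k ** 2).reshape(shape)
    return np.fft.ifft(-k2 * np.fft.fft(psi, axis=axis), axis=axis).real

# one worker per axis, results combined in the final step
with ThreadPoolExecutor(max_workers=3) as ex:
    tempA, tempB, tempC = ex.map(d2_along, range(3))
lap = tempA + tempB + tempC                    # Laplacian of psi
```

Each thread reads psi but writes only its own output array, so there is no shared mutable state beyond the read-only input.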
The FFT along axis=0 takes twice(!) as long as along the other axes. Is it possible to get rid of this difference and make all FFTs equally fast?

(New) Is the FFT approach the best choice after all, or is the finite-differences approach by user Rory always better, at least in terms of performance?
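For comparison, a second-order finite-difference Laplacian with periodic boundaries is only a few lines; this is a generic sketch (not user Rory's actual code), tested on the same kind of field as above, where the Laplacian of sin(x)sin(y)sin(z) is -3 times the field up to O(h^2) discretization error:

```python
import numpy as np

N, L = 64, 2 * np.pi
h = L / N
x = np.linspace(0, L, N, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
psi = np.sin(X) * np.sin(Y) * np.sin(Z)

def laplacian_fd(f, h):
    """Second-order 7-point stencil, periodic boundaries via np.roll."""
    lap = -6.0 * f
    for ax in range(3):
        lap += np.roll(f, +1, axis=ax) + np.roll(f, -1, axis=ax)
    return lap / h ** 2

lap = laplacian_fd(psi, h)   # approximates -3 * psi with O(h^2) error
```

Cost-wise this is O(n) per grid point versus O(n log n) for the FFT, but the FFT is spectrally accurate while the stencil carries an O(h^2) error, so the comparison depends on the accuracy you need.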
I think efficiently computing the Laplacian is a topic that has been extensively researched, so even links or pointers to papers, books, etc. would be helpful.