Parallel derivatives of multidimensional real data with FFTW

Question

I would like to build a 2D MPI-parallel spectral differentiation code. The following piece of code seems to work fine for the x-derivative, both in serial and in parallel:

alloc_local = fftw_mpi_local_size_2d(N0,N1,MPI_COMM_WORLD,&local_n0, &local_0_start);
fplan_u = fftw_mpi_plan_dft_2d(N0,N1,ptr_u,uhat,MPI_COMM_WORLD,FFTW_FORWARD,FFTW_ESTIMATE);
bplan_x = fftw_mpi_plan_dft_2d(N0,N1,uhat_x,ptr_ux,MPI_COMM_WORLD,FFTW_BACKWARD,FFTW_ESTIMATE);

ptr_u   = fftw_alloc_real(alloc_local);
ptr_ux  = fftw_alloc_real(alloc_local);
uhat    = fftw_alloc_complex(alloc_local);
uhat_x  = fftw_alloc_complex(alloc_local);


fftw_execute(fplan_u);

// first renormalize the transform...
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1;i++)
    uhat[j*N1+i] /= (double)(N1*local_n0);

// then compute the x-derivative
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1/2;i++)
    uhat_x[j*N1+i] = I*pow(i,1)*uhat[j*N1+i]/(double)N0;
for (int j=0;j<local_n0;j++)
for (int i=N1/2;i<N1;i++)
    uhat_x[j*N1+i] = -I*pow(N1-i,1)*uhat[j*N1+i]/(double)N0;

fftw_execute(bplan_x);

However, the following code for the y-derivative works well only in serial, and not in parallel.

bplan_y = fftw_mpi_plan_dft_2d(N0,N1,uhat_y,ptr_uy,MPI_COMM_WORLD,FFTW_BACKWARD,FFTW_ESTIMATE);

fftw_execute(fplan_u);
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1;i++)
    uhat[j*N1+i] /= (double)(N1*local_n0);

if (size == 1){ // in the serial case do this
    for (int j=0;j<local_n0/2;j++)
    for (int i=0;i<N1;i++)
        uhat_y[j*N1+i] = I*pow(j,1)*uhat[j*N1+i]/(double)N0;
    for (int j=local_n0/2;j<local_n0;j++)
    for (unsigned int i=0;i<N1;i++)
        uhat_y[j*N1+i] = -I*pow(N1-j,1)*uhat[j*N1+i]/(double)N0;
} else {        // in the parallel case instead do this
    for (int j=0;j<local_n0;j++)
    for (int i=0;i<N1;i++)
    if (rank <(size/2))
        uhat_y[j*N1+i] = I*pow(j+local_n0*rank,1)*uhat[j*N1+i]/(double)N0;
    else
        uhat_y[j*N1+i] = -I*pow(N1-(j+local_n0*rank),1)*uhat[j*N1+i]/(double)N0;
}
fftw_execute(bplan_y);

where the conditional expression if(rank<(size/2)) is used for dealiasing purposes (because I am doing FFTs of real data). The dealiasing appears to be correctly coded in serial for the y derivative, and also in parallel for the x derivative. I know that FFTW provides a series of functions specific for this topic (fftw_mpi_plan_dft_r2c and ..._c2r) but I would like to use anyway the complex to complex routines, because in the future I might consider complex data.

Of course, using fftw_mpi_plan_transpose to transpose the data, take a x derivative and then transpose back works in parallel, but I would like to avoid this because in the future I plan solving diffusive-like problems implicitly, e.g.

uhat[i*N0+j] = fhat[i*N0+j]/(pow((double)i,2)+pow((double)j,2)))

and this would not be possible with the transpose plans.

Edit: strangely enough, if I run the above code with 3 MPI processes, e.g.:

my_rank: 0 alloc_local: 87552 local_n0: 171 local_0_start: 0
my_rank: 1 alloc_local: 87552 local_n0: 171 local_0_start: 171
my_rank: 2 alloc_local: 87040 local_n0: 170 local_0_start: 342

I get results that are still wrong, but look much closer to the correct solution than what I obtain with with 2,4, or 8 processes.

For the moment, the only issue I see is `pow(N1-(j+local_n0*rank),1)` which should be `pow(N0-(j+local_n0*rank),1)`. But since the same issue is also present in the sequential part of your code, I assume it does not solve your problem. Try to print `local_n0` : it may not be the same on all processes, thus making `(j+local_n0*rank)` a wrong frequency. Does the issue exist if `N0` is a multiple of the number of processes `size` ? — francis, Nov 19 '15 at 21:35
Indeed, the `N1` should be an `N0`. This is not really an issue since I'm using `N0=N1` at the moment. Regarding the second part of your comment, here's a first test: my_rank: 1 alloc_local: 65536 local_n0: 128 local_0_start: 128 my_rank: 0 alloc_local: 65536 local_n0: 128 local_0_start: 0 my_rank: 2 alloc_local: 65536 local_n0: 128 local_0_start: 256 my_rank: 3 alloc_local: 65536 local_n0: 128 local_0_start: 384 So `local_n0` is the same on all processes, and the issue exists in this case for `N0=512` and `size=2,4,8,...`. — JoeP, Nov 20 '15 at 07:23

Parallel derivatives of multidimensional real data with FFTW

0 Answers0