I would like to build a 2D MPI-parallel spectral differentiation code. The following piece of code seems to work fine for the x-derivative, both in serial and in parallel:
alloc_local = fftw_mpi_local_size_2d(N0,N1,MPI_COMM_WORLD,&local_n0, &local_0_start);
fplan_u = fftw_mpi_plan_dft_2d(N0,N1,ptr_u,uhat,MPI_COMM_WORLD,FFTW_FORWARD,FFTW_ESTIMATE);
bplan_x = fftw_mpi_plan_dft_2d(N0,N1,uhat_x,ptr_ux,MPI_COMM_WORLD,FFTW_BACKWARD,FFTW_ESTIMATE);
ptr_u = fftw_alloc_real(alloc_local);
ptr_ux = fftw_alloc_real(alloc_local);
uhat = fftw_alloc_complex(alloc_local);
uhat_x = fftw_alloc_complex(alloc_local);
fftw_execute(fplan_u);
// first renormalize the transform...
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1;i++)
uhat[j*N1+i] /= (double)(N1*local_n0);
// then compute the x-derivative
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1/2;i++)
uhat_x[j*N1+i] = I*pow(i,1)*uhat[j*N1+i]/(double)N0;
for (int j=0;j<local_n0;j++)
for (int i=N1/2;i<N1;i++)
uhat_x[j*N1+i] = -I*pow(N1-i,1)*uhat[j*N1+i]/(double)N0;
fftw_execute(bplan_x);
However, the following code for the y-derivative works well only in serial, and not in parallel.
bplan_y = fftw_mpi_plan_dft_2d(N0,N1,uhat_y,ptr_uy,MPI_COMM_WORLD,FFTW_BACKWARD,FFTW_ESTIMATE);
fftw_execute(fplan_u);
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1;i++)
uhat[j*N1+i] /= (double)(N1*local_n0);
if (size == 1){ // in the serial case do this
for (int j=0;j<local_n0/2;j++)
for (int i=0;i<N1;i++)
uhat_y[j*N1+i] = I*pow(j,1)*uhat[j*N1+i]/(double)N0;
for (int j=local_n0/2;j<local_n0;j++)
for (unsigned int i=0;i<N1;i++)
uhat_y[j*N1+i] = -I*pow(N1-j,1)*uhat[j*N1+i]/(double)N0;
} else { // in the parallel case instead do this
for (int j=0;j<local_n0;j++)
for (int i=0;i<N1;i++)
if (rank <(size/2))
uhat_y[j*N1+i] = I*pow(j+local_n0*rank,1)*uhat[j*N1+i]/(double)N0;
else
uhat_y[j*N1+i] = -I*pow(N1-(j+local_n0*rank),1)*uhat[j*N1+i]/(double)N0;
}
fftw_execute(bplan_y);
where the conditional expression if(rank<(size/2)) is used for dealiasing purposes (because I am doing FFTs of real data). The dealiasing appears to be correctly coded in serial for the y derivative, and also in parallel for the x derivative. I know that FFTW provides a series of functions specific for this topic (fftw_mpi_plan_dft_r2c and ..._c2r) but I would like to use anyway the complex to complex routines, because in the future I might consider complex data.
Of course, using fftw_mpi_plan_transpose to transpose the data, take a x derivative and then transpose back works in parallel, but I would like to avoid this because in the future I plan solving diffusive-like problems implicitly, e.g.
uhat[i*N0+j] = fhat[i*N0+j]/(pow((double)i,2)+pow((double)j,2)))
and this would not be possible with the transpose plans.
Edit: strangely enough, if I run the above code with 3 MPI processes, e.g.:
my_rank: 0 alloc_local: 87552 local_n0: 171 local_0_start: 0
my_rank: 1 alloc_local: 87552 local_n0: 171 local_0_start: 171
my_rank: 2 alloc_local: 87040 local_n0: 170 local_0_start: 342
I get results that are still wrong, but look much closer to the correct solution than what I obtain with with 2,4, or 8 processes.