cuda fortran cufftPlanMany

Question

I am having troubles using cufftPlanMany. After creating the plans and taking the forward and inverse FFTs, I could not get the original data back. Please, find attached a minimum version of the code.

program test_cufft
  use cudafor
  use cufft

  integer :: plan_r2c
  integer :: plan_c2r
  real,allocatable,dimension(:,:,:,:), device :: eta_d
  complex,allocatable,dimension(:,:,:,:), device :: etak_d

  nv = 4
  nx = 256
  ny = 512
  nz = 512
  nx21 = nx/2+1

  allocate( eta_d(nv,nx,ny,nz) )
  allocate( etak_d(nv,nx21,ny,nz) )

  batch = nv;
  rank = 3;
  n = (/ nx, ny, nz /);
  idist = nx*ny*nz;
  odist = nx21*ny*nz;
  inembed = (/ nx, ny, nz /);
  onembed = (/ nx21, ny, nz /);
  istride = 1;
  ostride = 1;

  istat = cufftPlanMany( plan_r2c, rank, n, inembed, istride, idist, &
                         onembed, ostride, odist, CUFFT_R2C, batch )
  istat = cufftPlanMany( plan_c2r, rank, n, onembed, ostride, odist, &
                         inembed, istride, idist, CUFFT_C2R, batch )

  ! Initialize eta_d

  istat = cufftExecR2C( plan_r2c, eta_d, etak_d )
  istat = cufftExecC2R( plan_c2r, etak_d, eta_d )
  eta_d = eta_d/idist

end program test_cufft

The problem is after I did the forward and inverse FFTs, I could not get the original data back. Please, what am I doing wrong? Should the ordering of data be eta_d(batch,nx,ny,nz) or eta_d(nx,ny,nz,batch)?

This kind of forward/inverse transform will not yield the original data back unless you scale (divide) the final results by the size of the transform, i.e. `idist`. see [here](https://docs.nvidia.com/cuda/cufft/index.html#cufft-transform-directions). — Robert Crovella, Dec 10 '20 at 01:24
That is correct Robert. I did divide by ```idist```. I have edited the question to show that. What I'm not sure of is the ordering of ```n, inembed, inembed```. — Caleb Yenusah, Dec 10 '20 at 02:00

Robert Crovella · Accepted Answer · 2020-12-11T00:46:20.887

I would say the correct ordering is (nz, ny, nx, batch)

But it's important to relate these to your array indexing and storage order as well.

In CUFFT terminology, for a 3D transform(*) the nz direction is the fastest changing index, with typical usage (stride=1) being adjacent data in memory, corresponding to adjacent elements in a transform.

This direction (I think of it as the elements along a row, i.e. the "z" index being the column index) is also the direction of a multidimensional transform that gets "reduced" in the complex domain, for the R2C/C2R transform types.

With that in mind, I would rewrite your code this way:

$ cat t4.cuf
program test_cufft
  use cudafor
  use cufft

  integer :: plan_r2c
  integer :: plan_c2r
  real,allocatable,dimension(:,:,:,:), managed :: eta_d
  complex,allocatable,dimension(:,:,:,:), managed :: etak_d
  integer :: n(3), inembed(3), onembed(3),rank,istride,idist,ostride,odist,batch

  nv = 4
  nx = 8
  ny = 8
  nz = 4
  nz21 = nz/2+1

  allocate( eta_d(nz,ny,nx,nv) )
  allocate( etak_d(nz21,ny,nx,nv) )

  batch = nv;
  rank = 3;
  n = (/ nx, ny, nz /);
  idist = nx*ny*nz;
  odist = nx*ny*nz21;
  inembed = (/ nx, ny, nz /);
  onembed = (/ nx, ny, nz21 /);
  istride = 1;
  ostride = 1;

  istat = cufftPlanMany( plan_r2c, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, batch )
  istat = cufftPlanMany( plan_c2r, rank, n, onembed, ostride, odist, inembed, istride, idist, CUFFT_C2R, batch )

  ! Initialize eta_d
  eta_d(:,:,:,:) = 1.0
  eta_d(1,1,1,2) = 2.0
  istat = cufftExecR2C( plan_r2c, eta_d, etak_d )
  istat = cudaDeviceSynchronize()
  eta_d(:,:,:,:) = 0.0
  istat = cufftExecC2R( plan_c2r, etak_d, eta_d )
  istat = cudaDeviceSynchronize()
  eta_d = eta_d/idist
  print *,eta_d(1,1,1,1)
  print *,eta_d(1,1,1,2)
end program test_cufft
$ nvfortran t4.cuf -lcufft
$ ./a.out
    1.000000
    2.000000
$

(NVIDIA HPC SDK 20.9, Tesla V100 GPU)

and it seems to give expected results for my simple test case.

(*) For a 2D transform, the ny dimension would be the most rapidly varying, and for a 1D transform the nx dimension (of course) is the most rapidly varying.

The multi-dimensional transforms and advanced data layout sections of the CUFFT manual may also be useful reading.

Thanks. That works fine. But the CUFFT [terminology](https://docs.nvidia.com/cuda/cufft/index.html#function-cufftplan3d) is for C. Since fortran array is [column-major order](http://astroa.physics.metu.edu.tr/MANUALS/intel_ifc/mergedProjects/optaps_for/fortran/optaps_prg_arrs_f.htm) the leftmost subscript varies most rapidly with a stride of one, for eta_d(batch, nx, ny, nz) that would be batch. Could you please comment on this? — Caleb Yenusah, Dec 10 '20 at 21:22
Also, I found that the following definitions work fine as well. ```batch = nv; rank = 3; n = (/ nz, ny, nx /); idist = nz*ny*nx; odist = nz*ny*nx21; inembed = (/ nz, ny, nx /); onembed = (/ nz, ny, nx21 /); istride = 1; ostride = 1;```. With the arrays allocated as ```allocate( eta_d(nv,nx,ny,nz) ); allocate( etak_d(nv,nx21,ny,nz) )```. And everything else the same as in your solution. Can you comment on this also? — Caleb Yenusah, Dec 10 '20 at 21:27
Yes, you are correct, I had the allocation dimension orders backwards, I have reversed them. otherwise the code is unchanged. Regarding your second comment, I wouldn't specify `n` as `n = (/ nz, ny, nx /);`, it simply doesn't make sense to me. Also, just to be clear, CUFFT doesn't interpret anything in row major or column major order. The dimension interpretation will be as is described in the CUFFT manual. The sections I already linked explicitly specify how elements will be chosen for each aspect of the multidimensional transform, based on the way those elements are ordered in memory. — Robert Crovella, Dec 10 '20 at 23:21

cuda fortran cufftPlanMany

1 Answers1