1

I've been struggling the whole day, trying to make a basic CUFFT example work properly. However i run into a little problem which I cannot identify. Basically I have a linear 2D array vx with x and y coordinates. Then I just calculate a forward then backward CUFFT (in-place), that simple. Then I copy back the array vx, normalize it by NX*NY , then display.

#define NX 32
#define NY 32
#define LX (2*M_PI)
#define LY (2*M_PI)
float *x = new float[NX*NY];
float *y = new float[NX*NY];
float *vx = new float[NX*NY];
for(int j = 0; j < NY; j++){
    for(int i = 0; i < NX; i++){
        x[j*NX + i] = i * LX/NX;
        y[j*NX + i] = j * LY/NY;
        vx[j*NX + i] = cos(x[j*NX + i]);
    }
}
float *d_vx;
CUDA_CHECK(cudaMalloc(&d_vx, NX*NY*sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_vx, vx, NX*NY*sizeof(float), cudaMemcpyHostToDevice));
cufftHandle planr2c;
cufftHandle planc2r;
CUFFT_CHECK(cufftPlan2d(&planr2c, NY, NX, CUFFT_R2C));
CUFFT_CHECK(cufftPlan2d(&planc2r, NY, NX, CUFFT_C2R));
CUFFT_CHECK(cufftSetCompatibilityMode(planr2c, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftSetCompatibilityMode(planc2r, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftExecR2C(planr2c, (cufftReal *)d_vx, (cufftComplex *)d_vx));
CUFFT_CHECK(cufftExecC2R(planc2r, (cufftComplex *)d_vx, (cufftReal *)d_vx));
CUDA_CHECK(cudaMemcpy(vx, d_vx, NX*NY*sizeof(cufftReal), cudaMemcpyDeviceToHost));
for (int j = 0; j < NY; j++){
    for (int i = 0; i < NX; i++){
        printf("%.3f ", vx[j*NX + i]/(NX*NY));
    }
    printf("\n");
}

When vx is defined as cos(x) or sin(x), it works fine, but when using sin(y) or cos(y), it gives me back the correct function (sin or cos) but with half amplitude (that is, oscillating between 0.5 and -0.5 instead of 1 and -1) ! Note that using sin(2*y) or cos(2*y) (or sin(4*y), cos(4*y), ...) works fine. Any idea?

Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
François Laenen
  • 171
  • 4
  • 14
  • Why does your first cudaMemcpy operation specify cudaMemcpyDeviceToHost? That makes no sense and doesn't match the order of the pointers. If your CUDA_CHECK macro is not throwing an error there then something is wrong with your macro. – Robert Crovella Oct 05 '13 at 19:22
  • Sorry I didn't copy/paste the correct line, in my code the sense of the copy is correct.I update – François Laenen Oct 05 '13 at 20:59
  • 1
    That line of code you had previously could not have been correct anywhere. If that line of code shows up elsewhere in your code, it is wrong. By the way, SO expects you to provide an SSCCE.org code for questions like these. – Robert Crovella Oct 05 '13 at 21:44
  • I've just recomputed it in double precision and it works. But I can't explain why, probably something wrong in my memory layout. – François Laenen Oct 05 '13 at 21:45

1 Answers1

4

The problem here is that input and output of an in-place real to complex transform is a complex type whose size isn't the same as the input real data (it is twice as large). You haven't allocated enough memory to hold the intermediate complex results of the real to complex transform. Quoting from the documentation:

cufftExecR2C() (cufftExecD2Z()) executes a single-precision (double-precision) real-to-complex, implicitly forward, CUFFT transform plan. CUFFT uses as input data the GPU memory pointed to by the idata parameter. This function stores the nonredundant Fourier coefficients in the odata array. Pointers to idata and odata are both required to be aligned to cufftComplex data type in single-precision transforms and cufftDoubleComplex data type in double-precision transforms.

The solution is either to allocate a second device buffer to hold the intermediate result or enlarge the in place allocation so it is large enough to hold the complex data. So the core transform code changes to something like:

float *d_vx;
CUDA_CHECK(cudaMalloc(&d_vx, NX*NY*sizeof(cufftComplex)));
CUDA_CHECK(cudaMemcpy(d_vx, vx, NX*NY*sizeof(cufftComplex), cudaMemcpyHostToDevice));
cufftHandle planr2c;
cufftHandle planc2r;
CUFFT_CHECK(cufftPlan2d(&planr2c, NY, NX, CUFFT_R2C));
CUFFT_CHECK(cufftPlan2d(&planc2r, NY, NX, CUFFT_C2R));
CUFFT_CHECK(cufftSetCompatibilityMode(planr2c, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftSetCompatibilityMode(planc2r, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftExecR2C(planr2c, (cufftReal *)d_vx, d_vx));
CUFFT_CHECK(cufftExecC2R(planc2r, d_vx, (cufftReal *)d_vx));
CUDA_CHECK(cudaMemcpy(vx, d_vx, NX*NY*sizeof(cufftComplex), cudaMemcpyDeviceToHost));

[disclaimer: written in browser, never compiled or tested, use at own risk]

Note you will need to adjust the host code to match the size and type of the input and data.

As a final comment, would it have been that hard to add the additional 8 or 10 lines required to turn what you posted into a compilable, runnable example that someone trying to help you could work with?

talonmies
  • 70,661
  • 34
  • 192
  • 269
  • That's it! And as the unitary mode was at the end of the array, it was lost, hence the lose of half the energy. I will improve the presentation next time – François Laenen Oct 07 '13 at 17:56
  • For future reference: This is not true. For R2C/C2R transfroms you only need around half the memory of the complex type. You have (NX/2+1)*NY complex elements that is (NX+2)*NY real elements for even NX and (NX+1)*NY real elements for odd NX – Flamefire Jul 17 '15 at 12:34
  • note that in current versions of CUFFT (after 9.0) the `cufftSetCompatibilityMode` function has been removed, so CUFFT expects [padding](https://stackoverflow.com/questions/58821964/in-place-real-to-complex-fft-with-cufft/58828871#58828871) on 2D in-place R2C transforms – Robert Crovella Nov 13 '19 at 02:32