
I am using CUFFT for 2D FFTs on 128 images. Each image is of size 128 x 128. In MATLAB, one 2D FFT takes 0.3 ms, and doing the FFT on all 128 images takes roughly 128 times that. Using CUFFT, the following code computes the FFT for one image:

cudaMalloc( (void**)&idata, sizeof(cufftDoubleReal) * 128 * 128 );
cudaMalloc( (void**)&odata, sizeof(cufftDoubleComplex) * 128 * 128 );
// D2Z is a real-to-complex transform, so only the real input is copied
cudaMemcpy( idata, in_real, 128 * 128 * sizeof(cufftDoubleReal),
                                  cudaMemcpyHostToDevice );

cufftExecD2Z( plan, idata, odata );
cudaMemcpy( out_complex, odata, 128 * 128 * sizeof(cufftDoubleComplex),
                                  cudaMemcpyDeviceToHost );

which takes around 0.4 ms on my machine.

I tried executing the same code for more than one image, and the total execution time is basically the number of images times 0.4 ms. I did this by copying and pasting the code above once per image, changing the variable names for each image, which means:

// For image 1
cudaMalloc( (void**)&idata, sizeof(cufftDoubleReal) * 128 * 128 );
cudaMalloc( (void**)&odata, sizeof(cufftDoubleComplex) * 128 * 128 );
cudaMemcpy( idata, in_real, 128 * 128 * sizeof(cufftDoubleReal),
                                  cudaMemcpyHostToDevice );
cufftExecD2Z( plan, idata, odata );
cudaMemcpy( out_complex, odata, 128 * 128 * sizeof(cufftDoubleComplex),
                                  cudaMemcpyDeviceToHost );

// For image 2
cudaMalloc( (void**)&idata2, sizeof(cufftDoubleReal) * 128 * 128 );
cudaMalloc( (void**)&odata2, sizeof(cufftDoubleComplex) * 128 * 128 );
cudaMemcpy( idata2, in_real2, 128 * 128 * sizeof(cufftDoubleReal),
                                  cudaMemcpyHostToDevice );
cufftExecD2Z( plan, idata2, odata2 );
cudaMemcpy( out_complex2, odata2, 128 * 128 * sizeof(cufftDoubleComplex),
                                  cudaMemcpyDeviceToHost );
...
// For image N
...

So I can expect that if I apply the 2D FFT to all 128 images, the execution time will be on the same order as MATLAB's.

So my questions: is the way I execute the transforms correct? Am I fully utilizing the parallel computing power of the GPU? Should I modify the way I execute the code, for example, do the cudaMemcpy for all 128 images first and then execute the transforms, in order to overlap some CPU and GPU execution?

Da Teng

1 Answer


First, I would recommend profiling your code. You don't have to profile all 128 images; 2-5 images is enough.

Based on the profile data, compare the time spent transferring the data against the time spent in the CUFFT operation(s). If they are approximately equal (or if you can visually see that overlap would be beneficial), then try overlapping copy and (CUFFT) compute, using CUDA streams to accomplish this. There are plenty of tutorials on CUDA stream usage, as well as example questions here under the cuda and cufft tags, that discuss using streams with CUFFT.
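A double-buffered pipeline might look roughly like the following. This is a sketch only: it assumes two single-image plans (`plan[2]`), two sets of device buffers (`d_in`, `d_out`), and host arrays `h_in`/`h_out` that were allocated with cudaMallocHost (pinned memory is required for cudaMemcpyAsync to actually overlap) — none of these names are in your code.

```cuda
// Two streams, two buffer sets: while stream 0's FFT runs,
// stream 1's copy can proceed concurrently.
cudaStream_t stream[2];
for (int s = 0; s < 2; s++)
    cudaStreamCreate( &stream[s] );

for (int i = 0; i < NUM_IMAGES; i++) {
    int s = i % 2;
    cufftSetStream( plan[s], stream[s] );   // plan[s] issues its work into stream[s]
    cudaMemcpyAsync( d_in[s], h_in[i], sizeof(cufftDoubleReal) * 128 * 128,
                     cudaMemcpyHostToDevice, stream[s] );
    cufftExecD2Z( plan[s], d_in[s], d_out[s] );
    cudaMemcpyAsync( h_out[i], d_out[s], sizeof(cufftDoubleComplex) * 128 * 128,
                     cudaMemcpyDeviceToHost, stream[s] );
}
cudaDeviceSynchronize();                    // wait for all streams to drain
```

Within a stream, the copy and the transform for one image still run in order; the overlap comes from the two streams running against each other.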

Separately, but related to the above, I would suggest trying the CUFFT batch parameter to batch together maybe 2-5 image transforms, to see if it results in a net reduction in overall processing time for the 128 images.
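A batched plan for your case could be set up roughly like this (a sketch; it assumes `idata` and `odata` are device buffers holding all the images of the batch contiguously, unlike the per-image buffers in your code):

```cuda
#include <cufft.h>

cufftHandle plan;
int n[2]  = { 128, 128 };    // size of each 2D transform
int batch = 5;               // e.g. 2-5 images per batch

// NULL inembed/onembed means tightly packed data. Note that a D2Z
// transform of a 128x128 image produces 128 * (128/2 + 1) complex
// outputs per image, not 128 * 128.
cufftPlanMany( &plan, 2, n,
               NULL, 1, 128 * 128,            // input: real, image-to-image distance
               NULL, 1, 128 * (128/2 + 1),    // output: complex, image-to-image distance
               CUFFT_D2Z, batch );

cufftExecD2Z( plan, idata, odata );           // transforms `batch` images in one call
```

One exec call then launches the transforms for the whole batch, which also amortizes the per-call launch overhead you are currently paying once per image.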

You could actually combine the two ideas: perform the transforms in batches, then use copy/compute overlap with CUDA streams to overlap the copy operations for one batch of images with the compute operations from the previous batch.

Separately from all that, cudaMalloc operations are expensive. It's best not to have them in the performance (compute) loop, and this means, if possible, running them once, up-front, in your code. It's better to allocate all the space you need (say for 2-3 batches of images), and then re-use the space, rather than allocating new space for every image.
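In code, that could look roughly like the following (a sketch; `NUM_IMAGES` and the host arrays `h_in`/`h_out` are illustrative names):

```cuda
// Allocate the device buffers once, up front, then reuse them for every image.
cufftDoubleReal    *d_in;
cufftDoubleComplex *d_out;
cudaMalloc( (void**)&d_in,  sizeof(cufftDoubleReal)    * 128 * 128 );
cudaMalloc( (void**)&d_out, sizeof(cufftDoubleComplex) * 128 * 128 );

for (int i = 0; i < NUM_IMAGES; i++) {
    cudaMemcpy( d_in, h_in[i], sizeof(cufftDoubleReal) * 128 * 128,
                cudaMemcpyHostToDevice );
    cufftExecD2Z( plan, d_in, d_out );
    cudaMemcpy( h_out[i], d_out, sizeof(cufftDoubleComplex) * 128 * 128,
                cudaMemcpyDeviceToHost );
}

cudaFree( d_in );            // free once, after the loop
cudaFree( d_out );
```

The same applies to the plan: create it once before the loop and reuse it, since plan creation is also expensive.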

Robert Crovella