I am using CUFFT to compute 2D FFTs of 128 images. Each image is 128 x 128. In MATLAB, one 2D FFT takes 0.3 ms, and doing the FFT on all 128 images takes pretty much 128 times that. Using CUFFT, the following code computes the FFT for one image:
cudaMalloc( (void**)idata, sizeof(cufftDoubleReal) * 128 * 128 );
cudaMalloc( (void**)odata, sizeof(cufftDoubleComplex) * 128 * 128 );
cudaMemcpy( *idata, in_real, 128 * 128 * sizeof(cufftDoubleReal), cudaMemcpyHostToDevice );
cudaMemcpy( *idata, in_complex, 128 * 128 * sizeof(cufftDoubleComplex), cudaMemcpyHostToDevice );
cufftExecD2Z( plan, *idata, *odata );
cudaMemcpy( out_complex, *odata, 128 * 128 * sizeof(cufftDoubleComplex), cudaMemcpyDeviceToHost );
which takes around 0.4 ms on my machine.
I tried running the same code for more than one image, and the execution time is basically the number of images multiplied by 0.4 ms. I did that by copying and pasting the above code once per image, with the variables changed for the corresponding image, which means
// For image 1
cudaMalloc( (void**)idata, sizeof(cufftDoubleReal) * 128 * 128 );
cudaMalloc( (void**)odata, sizeof(cufftDoubleComplex) * 128 * 128 );
cudaMemcpy( *idata, in_real, 128 * 128 * sizeof(cufftDoubleReal), cudaMemcpyHostToDevice );
cudaMemcpy( *idata, in_complex, 128 * 128 * sizeof(cufftDoubleComplex), cudaMemcpyHostToDevice );
cufftExecD2Z( plan, *idata, *odata );
cudaMemcpy( out_complex, *odata, 128 * 128 * sizeof(cufftDoubleComplex), cudaMemcpyDeviceToHost );
// For image 2
cudaMalloc( (void**)idata2, sizeof(cufftDoubleReal) * 128 * 128 );
cudaMalloc( (void**)odata2, sizeof(cufftDoubleComplex) * 128 * 128 );
cudaMemcpy( *idata2, in_real2, 128 * 128 * sizeof(cufftDoubleReal), cudaMemcpyHostToDevice );
cudaMemcpy( *idata2, in_complex2, 128 * 128 * sizeof(cufftDoubleComplex), cudaMemcpyHostToDevice );
cufftExecD2Z( plan, *idata2, *odata2 );
cudaMemcpy( out_complex2, *odata2, 128 * 128 * sizeof(cufftDoubleComplex), cudaMemcpyDeviceToHost );
...
// For image N
...
So if I apply the 2D FFT to all 128 images this way, I expect the execution time to be on the same order as MATLAB's (roughly 128 x 0.4 ms = 51 ms versus 128 x 0.3 ms = 38 ms).
So my question: is the way I run the FFTs correct? Am I fully utilizing the parallel computing power of the GPU? Should I change how I execute the code, for example by doing the cudaMemcpy for all 128 images first and then executing the FFTs, in order to overlap some CPU and GPU execution?