As part of my research, I need to measure the planning time of the CUFFT library in different CUDA versions (in particular CUDA 4 and CUDA 5.5). Here is one set of results for a one-dimensional FFT of size 4096:
CUDA 5.5 and GeForce GTX 770
FFT size: 4096
Planning time: 96322.7 us (micro seconds)
Loading Data : 36.6 us (micro seconds)
Execution time: 135.9 us (micro seconds)
Fetching Data: 42.5 us (micro seconds)
CUDA 4 and GeForce GTX 560
FFT size: 4096
Planning time: 102.7 us (micro seconds)
Loading Data : 26.4 us (micro seconds)
Execution time: 72.0 us (micro seconds)
Fetching Data: 27.3 us (micro seconds)
I was really shocked to see that the planning time of CUFFT on CUDA 5.5 with the GeForce GTX 770 is roughly 940 times longer than on CUDA 4 with the GeForce GTX 560 (96322.7 us versus 102.7 us).
I expected it to be faster on CUDA 5.5 with the GeForce GTX 770, for two reasons: 1) CUDA 5.5 is the later version, and later versions are usually faster; and 2) the GTX 770 has better specifications than the GTX 560.
My question is: why is there such a large difference in the planning time?
For more details, see the code below, which shows how I measured the times using CUDA events:
/* creates 1D FFT plan */
cudaEventRecord(start0, 0);
cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);
cudaEventRecord(stop0, 0);
cudaEventSynchronize(stop0);
/* transfer to GPU memory */
cudaEventRecord(start1, 0);
cudaMemcpy(devPtr, data, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);
cudaEventRecord(stop1, 0);
cudaEventSynchronize(stop1);
/* executes FFT processes */
cudaEventRecord(start2, 0);
cufftExecC2C(plan, devPtr, devPtr, CUFFT_FORWARD);
cudaEventRecord(stop2, 0);
cudaEventSynchronize(stop2);
/* transfer results from GPU memory */
cudaEventRecord(start3, 0);
cudaMemcpy(data, devPtr, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);
cudaEventRecord(stop3, 0);
cudaEventSynchronize(stop3);
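The snippet above leaves out how the events are created and how the elapsed times are read back. A minimal sketch of those missing pieces follows. The helper names `ms_to_us`, `time_plan_us`, and `compare_first_and_second_plan` are hypothetical and are not part of the original program. The sketch also times a second plan creation, on the assumption (not verified here) that the first CUDA or CUFFT call in a process may include one-time initialization cost that a later call does not pay.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

/* cudaEventElapsedTime reports milliseconds; the figures in this post are in microseconds. */
static float ms_to_us(float ms)
{
    return ms * 1000.0f;
}

/* Hypothetical helper: times a single cufftPlan1d call with CUDA events.
   Returns the elapsed time in microseconds, or -1 if anything fails. */
static float time_plan_us(int nx, int batch)
{
    cudaEvent_t start, stop;
    cufftHandle plan;
    cufftResult result;
    float ms = 0.0f;

    if (cudaEventCreate(&start) != cudaSuccess)
        return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);
    result = cufftPlan1d(&plan, nx, CUFFT_C2C, batch);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);

    if (result == CUFFT_SUCCESS)
        cufftDestroy(plan);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return (result == CUFFT_SUCCESS) ? ms_to_us(ms) : -1.0f;
}

/* Hypothetical driver: prints the first and second planning times so that any
   one-time startup cost (an assumption, see above) shows up as a gap between them. */
static void compare_first_and_second_plan(int nx, int batch)
{
    float first_us, second_us;

    cudaFree(0); /* touches the runtime so the CUDA context exists before timing */
    first_us = time_plan_us(nx, batch);
    second_us = time_plan_us(nx, batch);

    printf("First plan : %.1f us\n", first_us);
    printf("Second plan: %.1f us\n", second_us);
}
```

Calling `compare_first_and_second_plan(4096, 1)` from `main` and linking with `-lcufft` would print both planning times. If the second number is far smaller than the first, most of the measured planning time is startup cost rather than the planning work itself.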
I would appreciate your comments. Thanks in advance.