
As part of my research, I need to measure the planning time of the CUFFT library under different CUDA versions (especially CUDA 4 and CUDA 5.5). Here is one set of results, for a one-dimensional FFT of size 4096:

CUDA 5.5 and GeForce GTX 770

FFT size: 4096

Planning time: 96322.7 us (microseconds)

Loading data: 36.6 us (microseconds)

Execution time: 135.9 us (microseconds)

Fetching data: 42.5 us (microseconds)

CUDA 4 and GeForce GTX 560

FFT size: 4096

Planning time: 102.7 us (microseconds)

Loading data: 26.4 us (microseconds)

Execution time: 72.0 us (microseconds)

Fetching data: 27.3 us (microseconds)

I was really shocked to see that the CUFFT planning time on CUDA 5.5 with the GeForce GTX 770 is more than 900 times longer than on CUDA 4 with the GeForce GTX 560.

If anything, it should be faster on CUDA 5.5 and the GeForce GTX 770, for two reasons: 1) CUDA 5.5 is the newer version, and newer versions are usually faster, and 2) the GTX 770 has better specifications than the GTX 560.

My question is: why is there such a difference in the planning time?

For more details, please see the code below, which shows how I measured these times using CUDA events:

    /* create the 1D FFT plan */
    cudaEventRecord(start0, 0);
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);
    cudaEventRecord(stop0, 0);
    cudaEventSynchronize(stop0);

    /* transfer data to GPU memory */
    cudaEventRecord(start1, 0);
    cudaMemcpy(devPtr, data, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);
    cudaEventRecord(stop1, 0);
    cudaEventSynchronize(stop1);

    /* execute the FFT */
    cudaEventRecord(start2, 0);
    cufftExecC2C(plan, devPtr, devPtr, CUFFT_FORWARD);
    cudaEventRecord(stop2, 0);
    cudaEventSynchronize(stop2);

    /* transfer results back from GPU memory */
    cudaEventRecord(start3, 0);
    cudaMemcpy(data, devPtr, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop3, 0);
    cudaEventSynchronize(stop3);
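
The events are created up front, and after each cudaEventSynchronize() the elapsed time is read back with cudaEventElapsedTime() (which reports milliseconds), roughly like this (trimmed sketch; only the first event pair is declared and error checking is omitted):

    /* sketch: event setup and readout around the timed sections above (trimmed) */
    cudaEvent_t start0, stop0;                  /* likewise start1..3 / stop1..3 */
    cudaEventCreate(&start0);
    cudaEventCreate(&stop0);

    /* ... timed sections shown above ... */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start0, stop0);   /* elapsed time in milliseconds */
    printf("Planning time: %.1f us\n", ms * 1000.0f);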

I would appreciate your comments. Thanks in advance.

    Can you provide a complete example? It shouldn't be that much larger than what you have shown already. What are the specifications (cpu, memory, OS, etc.) of the machines you are running these two test cases on? – Robert Crovella Feb 27 '14 at 18:44
    To get a reliable comparison you would want to perform a controlled experiment where only one variable changes, e.g. either the GPU or the CUDA version, while the rest of the system (HW and SW) stays exactly the same. For example, as the call to the plan generator is presumably the first call into CUFFT your measurements may simply reflect the load time for the CUFFT DLL. One system may load from SSD and the other from HD, or the CUFFT DLL is already loaded on one system but not the other. You may also want to try the static CUFFT library available in recent CUDA versions. – njuffa Feb 27 '14 at 22:31
  • Like njuffa said. I think that you're a victim of http://en.wikipedia.org/wiki/Lazy_loading – llukas Feb 28 '14 at 00:44
  • njuffa, thanks for your response :) Your explanation raises a couple of new questions for me. 1) What does CUFFT planning actually do? Does it simply load the CUFFT library functions from the CPU to the GPU, or something else? 2) If the library might be loading from the HD, how can I make it load from an SSD? Once again, thanks in advance. – Sreehari Feb 28 '14 at 10:36
  • One thing you could do to avoid any loading effects is to run your entire sequence *twice*, and throw out the timing results from the first pass. You would want to do this *in the same executable*, i.e. write a single program which goes through the whole sequence *twice* and only consider the timing from the second pass. That is a variation on a common benchmarking technique. – Robert Crovella Mar 02 '14 at 21:40
  • Robert, thank you so much. The planning time on CUDA 5.5 is now down to 117.5 us, which is almost the same as on CUDA 4 :) – Sreehari Mar 05 '14 at 16:49
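
For reference, a minimal sketch of the run-twice approach suggested in the comments above (hypothetical code, assuming the same plan, event, and size variables as in the question; only the second, warm pass is reported so that one-time costs such as loading the CUFFT library are excluded):

    /* run the whole timed sequence twice in the same executable and keep only the
       second pass, so that one-time costs (CUFFT library load, context setup) drop out */
    for (int pass = 0; pass < 2; ++pass) {
        cudaEventRecord(start0, 0);
        cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);
        cudaEventRecord(stop0, 0);
        cudaEventSynchronize(stop0);

        float plan_ms = 0.0f;
        cudaEventElapsedTime(&plan_ms, start0, stop0);

        /* ... load data, execute the FFT, fetch results exactly as in the question ... */

        cufftDestroy(plan);                     /* release the plan before the next pass */

        if (pass == 1)                          /* report only the warm (second) pass */
            printf("Planning time: %.1f us\n", plan_ms * 1000.0f);
    }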

0 Answers