
I have CUDA code which performs a calculation on the GPU. I am using clock() to find out timings.

My code structure is

__global__ static void sum(){

// calculates sum 
}

extern "C"
int run_kernel(int array[], int nelements){
    clock_t start, end;
    start = clock();
    //perform operation on gpu - call sum
    end = clock();
    double elapsed_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("time required : %lf\n", elapsed_time);
}

But the time is always 0.0000. I checked by printing the start and end times: start has some value, but end is always zero.

Any idea what might be the cause? Are there any alternatives for measuring time?

Any help would be appreciated.

Thanks

Coder
  • Is `elapsed_time` zero or `end` zero? I suppose you are using the GPU not the CPU so measuring CPU time will not give you GPU time. – devil Apr 30 '12 at 05:15
  • 2
    Perhaps the time you are measuring is smaller than the granularity of the clock? Try temporarily slapping a sleep(1) in between the start=clock() and end=clock() lines and see if you get a non-zero result then. – Jeremy Friesner Apr 30 '12 at 05:16
  • 2
    Might be the same reason as this question: http://stackoverflow.com/q/2134363/10077 – Fred Larson Apr 30 '12 at 05:17
  • Perhaps `clock` is measuring CPU time, (like `times` or `getrusage`) but CUDA run on the GPU card which is, from the kernel point of view, a peripheral device (like a printer) so most of the time the kernel is waiting for it. – Basile Starynkevitch Apr 30 '12 at 06:47

4 Answers

7

There are two problems here:

  1. The clock() function has too low a resolution to measure the duration of the event you are trying to time.
  2. A CUDA kernel launch is an asynchronous operation, so it consumes almost no host time (typically 10-20 microseconds on a sane platform). Unless you use a synchronous CUDA API call to force the host CPU to block until the kernel finishes running, you will not be measuring the kernel's execution time.

CUDA has its own high precision timing API, and it is the recommended way to time operations which run on the GPU. The code to use it would look something like this:

int run_kernel(int array[], int nelements){

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);

    //
    // perform operation on gpu - call sum
    //

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // block until the stop event is recorded
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);   // result is in milliseconds
    printf("time required : %f ms\n", elapsedTime);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 0;
}
talonmies
5

Don't use clock to time CUDA kernel launches. Use cudaEventElapsedTime. Even if clock had high enough precision to time your kernel (it doesn't), kernel launches are asynchronous, which means that control flow returns to your calling function before the kernel is complete.

Here's how:

void run_kernel(...)
{
  // create "events" which record the start & finish of the kernel of interest
  cudaEvent_t start, end;
  cudaEventCreate(&start);
  cudaEventCreate(&end);

  // record the start of the kernel
  cudaEventRecord(start);

  // perform operation on gpu - call sum
  sum<<<...>>>(...);

  // record the end of the kernel
  cudaEventRecord(end);

  // block until the end event has actually been recorded,
  // i.e. until the kernel is complete
  cudaEventSynchronize(end);

  // get elapsed time in milliseconds
  float ms;
  cudaEventElapsedTime(&ms, start, end);

  printf("time required : %f milliseconds\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(end);
}
Jared Hoberock
0

I believe that nowadays you should be using clock_gettime() with CLOCK_MONOTONIC to measure elapsed time at high resolution. On my computer the resolution is 1 ns, which is decent enough.

You can use it like

#include <time.h>
...

struct timespec start, end, res;

clock_getres(CLOCK_MONOTONIC, &res);
/* exact format string depends on your system, on mine time_t is long */
printf("Resolution is %ld s, %ld ns\n", res.tv_sec, res.tv_nsec);

clock_gettime(CLOCK_MONOTONIC, &start);
/* whatever */
clock_gettime(CLOCK_MONOTONIC, &end);

Link with -lrt (clock_gettime lives in librt on older glibc versions).

Edit: I see that I took the wrong approach on this; obviously you should use CUDA event timing if that is what you need. I was following the lines of your question, where you timed things on the host.

r_ahlskog
0

A CUDA kernel launch is asynchronous, so you must add cudaDeviceSynchronize() (or the older, now-deprecated cudaThreadSynchronize()) after the kernel launch and before taking the end time.

yyfn