
A normal CUDA program:

  1. Allocate memory on the CUDA device
  2. Copy memory from host to device
  3. Launch the kernel
  4. Copy memory from device to host
  5. ...etc

So if I measure the host-to-device copy time (step 2):

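    clock_t start = clock();
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // step 2; buffer names are placeholders
    cudaDeviceSynchronize();
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;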

I get a value of 0.1 s in my case. But my PCI bus speed is actually 24 GB/s, which is supposed to give a time about 1000 times smaller, so I assume that 0.1 s is the time needed to activate the PCI bus.

So I tried running the host-to-device copy in a loop 1000 times. The first iteration shows 0.1 s and every later one shows just 0.000 s (my timer can't resolve below a millisecond), and the total time for the 1000 iterations is just 0.12 s.

So it seems I have to keep the device's PCI bus activated in order to reduce the host-to-device time. I tried calling cudaDeviceSynchronize beforehand, as shown below:

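    cudaDeviceSynchronize();  // to keep the PCI bus "activated"
    clock_t start = clock();
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // step 2; buffer names are placeholders
    cudaDeviceSynchronize();
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;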

Now the time I get is 0.000 s, so the time spent on the host-to-device copy is minimized. Is that correct? Is the 0.1 s the time needed to "activate" the PCI bus?

talonmies
CHONG

1 Answer

As Robert Crovella suggests, the time you are measuring on the first call to a CUDA function is related to CUDA initialization, not to the copy itself.
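
One way to see the initialization cost in isolation is to time the same trivial runtime call twice. This is only a sketch: `seconds_for_cuda_call` is a name I made up, and using `cudaFree(0)` to force initialization is a common idiom rather than something stated in the question. The first call would typically be far slower than the second.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock seconds spent in one trivial CUDA call.
// The first CUDA runtime call in a process also pays for initialization,
// so the first measurement is usually much larger than later ones.
double seconds_for_cuda_call() {
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);               // frees nothing; commonly used to force initialization
    cudaDeviceSynchronize();   // wait until the device is idle
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling `seconds_for_cuda_call()` once before the timed region works as a warm-up, so the copy measurement that follows no longer includes the one-time setup.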

Furthermore, if you are measuring such short intervals, you are most probably just measuring the overhead of the function call. You should try increasing the size of the memory you are copying in order to obtain more meaningful numbers.
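
As a rough sketch of that, the copy can be timed with CUDA events, which have a finer resolution than `clock()`, and the result converted to a bandwidth. The helper names `time_h2d_ms` and `gb_per_s` and the buffer size in the usage note are my own assumptions, not taken from the question.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: times one host-to-device copy of `bytes` bytes from
// pageable host memory. Returns milliseconds, or -1.0f if any step fails.
float time_h2d_ms(size_t bytes) {
    void *h_buf = calloc(1, bytes);   // zero-filled pageable host buffer
    void *d_buf = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float ms = -1.0f;

    bool ok = h_buf != nullptr
        && cudaMalloc(&d_buf, bytes) == cudaSuccess
        && cudaEventCreate(&start) == cudaSuccess
        && cudaEventCreate(&stop) == cudaSuccess
        && cudaEventRecord(start) == cudaSuccess
        && cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice) == cudaSuccess
        && cudaEventRecord(stop) == cudaSuccess
        && cudaEventSynchronize(stop) == cudaSuccess
        && cudaEventElapsedTime(&ms, start, stop) == cudaSuccess;
    if (!ok) ms = -1.0f;

    // Release whatever was created before a possible failure.
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return ms;
}

// Converts a byte count and a time in milliseconds to GB/s (1 GB = 1e9 bytes).
double gb_per_s(size_t bytes, float ms) {
    return (bytes / 1e9) / (ms / 1e3);
}
```

For example, `time_h2d_ms(256u << 20)` would time a 256 MiB copy. Call it once as a warm-up and discard that result, then pass later results to `gb_per_s` to compare against the bus speed you expect.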

If you are interested in measuring copy times between the CPU and the GPU, you should definitely experiment with pinned memory, as explained in the documentation.
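
A minimal sketch of the pinned variant, assuming the only change is that the host buffer comes from `cudaMallocHost` instead of `malloc`. `time_pinned_h2d_ms` is a hypothetical name, and how much faster it is (if at all) depends on your system.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: times one host-to-device copy of `bytes` bytes from
// pinned (page-locked) host memory. Returns milliseconds, or -1.0f on failure.
float time_pinned_h2d_ms(size_t bytes) {
    void *h_pinned = nullptr;
    void *d_buf = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float ms = -1.0f;

    bool ok = cudaMallocHost(&h_pinned, bytes) == cudaSuccess   // page-locked host buffer
        && cudaMalloc(&d_buf, bytes) == cudaSuccess
        && cudaEventCreate(&start) == cudaSuccess
        && cudaEventCreate(&stop) == cudaSuccess
        && cudaEventRecord(start) == cudaSuccess
        && cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice) == cudaSuccess
        && cudaEventRecord(stop) == cudaSuccess
        && cudaEventSynchronize(stop) == cudaSuccess
        && cudaEventElapsedTime(&ms, start, stop) == cudaSuccess;
    if (!ok) ms = -1.0f;

    // Release whatever was created before a possible failure.
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(d_buf);
    if (h_pinned) cudaFreeHost(h_pinned);   // pinned memory needs cudaFreeHost, not free
    return ms;
}
```

Running this with the same size as a pageable-memory measurement, after a warm-up call, gives a like-for-like comparison of the two host allocation types.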

mompes