A typical CUDA program:
- allocates memory on the CUDA device
- copies memory from host to device
- calls the kernel
- copies memory from device to host
- ...etc
So if I measure the host-to-device copy time like this:
time = clock();
/* step 2: copy memory from host to device */
cudaDeviceSynchronize();
time = clock() - time;
I get a value of 0.1 s in my case. But my PCI bus bandwidth is actually 24 GB/s, which should give a time about 1000 times smaller, so I assumed that the 0.1 s is the time needed to activate the PCI bus.
So I tried repeating the host-to-device copy 1000 times in a loop. The first iteration shows 0.1 s and every later iteration shows 0.000 s (my timer cannot resolve below a millisecond), and the total time for the 1000 iterations is just 0.12 s.
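The loop experiment above can be sketched roughly as follows. This is only an illustration, not the original code: the buffer size (1 MiB), the variable names, and the use of std::chrono::steady_clock (chosen because it usually resolves well below a millisecond) are all assumptions.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;      // assumed transfer size: 1 MiB
    const int iterations = 1000;       // same loop count as in the post

    std::vector<char> host(bytes, 1);  // pageable host buffer
    char* device = nullptr;
    if (cudaMalloc((void**)&device, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    double first_ms = 0.0;  // time of the first copy
    double rest_ms = 0.0;   // summed time of all later copies
    for (int i = 0; i < iterations; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpy(device, host.data(), bytes, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (i == 0) first_ms = ms; else rest_ms += ms;
    }

    printf("first copy: %.3f ms\n", first_ms);
    printf("average of remaining copies: %.3f ms\n", rest_ms / (iterations - 1));

    cudaFree(device);
    return 0;
}
```

Printing the first iteration separately from the average of the rest makes it easier to see whether the extra cost really is paid only once.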
So it seems I have to keep the device's PCI bus activated in order to reduce the host-to-device time. I tried using cudaDeviceSynchronize as shown below:
cudaDeviceSynchronize(); /* to keep the PCI bus active */
time = clock();
/* step 2: copy memory from host to device */
cudaDeviceSynchronize();
time = clock() - time;
The time I get is then 0.000 s, so the time spent on the host-to-device copy seems to be minimized. Is that correct? Is the 0.1 s the time needed to "activate" the PCI bus?
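One thing that may be worth checking, offered as a possibility rather than a confirmed diagnosis: the CUDA runtime initializes lazily, so the first runtime calls in a process can carry a one-time setup cost that has nothing to do with the transfer itself. The sketch below forces that setup with cudaFree(0), a commonly used warm-up idiom, and then times a single copy with CUDA events, which resolve far below a millisecond. The buffer size (64 MiB), the names, and the CHECK macro are hypothetical and not from the original program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = 64u << 20;    // assumed transfer size: 64 MiB

    CHECK(cudaFree(0));                // warm-up: forces runtime initialization

    std::vector<char> host(bytes, 1);  // pageable host buffer
    char* device = nullptr;
    CHECK(cudaMalloc((void**)&device, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Time only the host-to-device copy, after initialization is done.
    CHECK(cudaEventRecord(start));
    CHECK(cudaMemcpy(device, host.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("copy time: %.3f ms, about %.2f GB/s\n", ms, bytes / (ms * 1.0e6));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(device));
    return 0;
}
```

If the copy time measured this way is close to what the bus bandwidth predicts, then the 0.1 s seen on the first call is more likely a one-time startup cost than something that has to be "kept active".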