Can cudaDeviceSynchronize reduce the memcopy time?

Question

A normal CUDA program:

1. Allocate memory on the CUDA device
2. Copy memory from host to device
3. Call the kernel
4. Copy memory from device to host
5. ...etc

So if I measure the host-to-device time:

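    clock_t start = clock();
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // step 2 (d_buf, h_buf, bytes are placeholders)
    cudaDeviceSynchronize();
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;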

I get a value of 0.1 s in my case. But my PCI bus speed is actually 24 GB/s, which is supposed to yield a time about 1000 times smaller, so I assumed that the 0.1 s is the time used to activate the PCI bus.

So I tried looping the host-to-device copy 1000 times. The first iteration shows 0.1 s and the rest show just 0.000 s (the timer can't resolve below a millisecond), and the total time of the 1000 iterations is just 0.12 s.

So it seems I have to keep my device's PCI bus activated in order to reduce the host-to-device time. I tried using cudaDeviceSynchronize as shown below:

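    cudaDeviceSynchronize();  // intended to keep the PCI bus active
    clock_t start = clock();
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // step 2 (d_buf, h_buf, bytes are placeholders)
    cudaDeviceSynchronize();
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;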

The time I get is 0.000 s, so the time spent on the host-to-device copy appears to be minimized. Is that correct? Is the 0.1 s the time needed to "activate" the PCI bus?

The 0.1s is probably CUDA initialization time. – Robert Crovella, Jun 21 '16 at 02:41

score 1 · Answer 1 · edited May 23 '17 at 12:22

As Robert Crovella suggests, the time you are measuring with the first call to a CUDA function is related to CUDA initialization.
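One way to check this is to pay the initialization cost explicitly before starting the timer. The following is only a minimal sketch: it assumes that calling `cudaFree(0)` is enough to force initialization (a common idiom, not something stated in the question), the 1 MiB buffer size and the names `h_buf`, `d_buf` and `msSince` are made up for illustration, and it uses a wall-clock timer because `clock()` reports processor time.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Milliseconds of wall-clock time elapsed since t0.
static double msSince(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, an arbitrary test size (assumption)

    // The first runtime call triggers CUDA initialization (context creation).
    Clock::time_point t0 = Clock::now();
    cudaError_t err = cudaFree(0);
    double initMs = msSince(t0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    char* h_buf = (char*)malloc(bytes);
    char* d_buf = nullptr;
    if (h_buf == nullptr || cudaMalloc((void**)&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(h_buf, 0, bytes);  // touch the host pages before timing

    // With initialization already paid for, time the copy on its own.
    t0 = Clock::now();
    err = cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    double copyMs = msSince(t0);
    if (err != cudaSuccess) {
        fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("first CUDA call (init): %.3f ms\n", initMs);
    printf("host-to-device copy:    %.3f ms\n", copyMs);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}
```

If the first figure is large and the second is small, the 0.1 s belongs to start-up work rather than to the transfer, and the extra `cudaDeviceSynchronize` in the question only moved that cost outside the timed region.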

Furthermore, if you are measuring such short intervals, you are most probably just measuring the overhead of the function call. You should try increasing the size of the memory that you are copying in order to obtain more meaningful numbers.
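A rough sketch of such a measurement follows. It times copies of several sizes with CUDA events and averages over repeated copies, so that small transfers are not lost below the timer resolution. The sizes, the repetition count and the `CHECK` macro are illustrative assumptions, not values from the question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t e_ = (call);                                  \
        if (e_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(e_));                      \
            exit(1);                                              \
        }                                                         \
    } while (0)

int main() {
    const size_t maxBytes = (size_t)256 << 20;  // 256 MiB upper bound (assumption)
    const int reps = 10;                        // copies averaged per size (assumption)

    char* h_buf = (char*)malloc(maxBytes);
    if (h_buf == nullptr) return 1;
    memset(h_buf, 0, maxBytes);  // touch the host pages before timing

    char* d_buf = nullptr;
    CHECK(cudaMalloc((void**)&d_buf, maxBytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Sizes: 1 KiB, 64 KiB, 4 MiB, 256 MiB.
    for (size_t bytes = (size_t)1 << 10; bytes <= maxBytes; bytes <<= 6) {
        // Untimed warm-up copy so one-time costs are excluded.
        CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));

        CHECK(cudaEventRecord(start, 0));
        for (int i = 0; i < reps; ++i) {
            CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
        }
        CHECK(cudaEventRecord(stop, 0));
        CHECK(cudaEventSynchronize(stop));

        float totalMs = 0.0f;
        CHECK(cudaEventElapsedTime(&totalMs, start, stop));
        double perCopyMs = totalMs / reps;
        // bytes / (ms * 1e-3 s) / 1e9 = bytes / (ms * 1e6) gives GB/s.
        printf("%10zu bytes: %9.4f ms per copy, %6.2f GB/s\n",
               bytes, perCopyMs, bytes / (perCopyMs * 1e6));
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_buf));
    free(h_buf);
    return 0;
}
```

For the smallest sizes the per-copy time is usually dominated by fixed call overhead, so the computed bandwidth only becomes comparable to the bus speed once the transfers are large.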

If you are interested in measuring copy times between the CPU and the GPU, you should definitely experiment with pinned memory, as explained in the documentation.
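As a starting point, the sketch below compares a pageable buffer from `malloc` with a pinned buffer from `cudaMallocHost`, timing the same host-to-device copy for each. The 64 MiB size, the repetition count and the helper names `CHECK` and `timeCopyMs` are assumptions chosen for illustration; the actual difference depends on the system.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t e_ = (call);                                  \
        if (e_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(e_));                      \
            exit(1);                                              \
        }                                                         \
    } while (0)

// Average time in milliseconds of one host-to-device copy, measured with events.
static float timeCopyMs(void* d_dst, const void* h_src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Untimed warm-up copy so one-time costs are excluded.
    CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float totalMs = 0.0f;
    CHECK(cudaEventElapsedTime(&totalMs, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return totalMs / reps;
}

int main() {
    const size_t bytes = (size_t)64 << 20;  // 64 MiB (assumption)
    const int reps = 10;                    // copies averaged (assumption)

    char* d_buf = nullptr;
    CHECK(cudaMalloc((void**)&d_buf, bytes));

    // Pageable host memory from the ordinary allocator.
    char* h_pageable = (char*)malloc(bytes);
    if (h_pageable == nullptr) return 1;
    memset(h_pageable, 0, bytes);

    // Pinned (page-locked) host memory allocated by the CUDA runtime.
    char* h_pinned = nullptr;
    CHECK(cudaMallocHost((void**)&h_pinned, bytes));
    memset(h_pinned, 0, bytes);

    float pageableMs = timeCopyMs(d_buf, h_pageable, bytes, reps);
    float pinnedMs = timeCopyMs(d_buf, h_pinned, bytes, reps);

    printf("pageable: %.3f ms per copy (%.2f GB/s)\n",
           pageableMs, bytes / (pageableMs * 1e6));
    printf("pinned:   %.3f ms per copy (%.2f GB/s)\n",
           pinnedMs, bytes / (pinnedMs * 1e6));

    CHECK(cudaFreeHost(h_pinned));
    free(h_pageable);
    CHECK(cudaFree(d_buf));
    return 0;
}
```

Pinned memory must be released with `cudaFreeHost`, not `free`, and allocating a lot of it reduces the memory available to the rest of the system.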
