5

If I transfer a single byte from a CUDA kernel over PCI-E to the host (zero-copy memory), how much slower is it compared to transferring something like 200 megabytes?

What I would like to know, since I know that transferring over PCI-E is slow for a CUDA kernel, is this: does it make any difference whether I transfer a single byte or a huge amount of data? Or, since memory transfers are performed in bulk, is transferring a single byte extremely expensive and wasteful compared to transferring 200 MB?

talonmies
Marco A.
  • The bandwidth test example which has shipped with CUDA forever is specifically designed to answer this question. – talonmies Jul 18 '13 at 17:07
  • I don't have a CUDA GPU right now; can you give me a hint on the results? – Marco A. Jul 18 '13 at 17:31
  • This has to do with the overhead of launching a transfer request. For example, 200 requests of 1 MB each will be slower than a single 200 MB transfer. – Pavan Yalamanchili Jul 18 '13 at 18:03
  • If you have a large amount of data to transfer to the GPU for processing, it is best to look into the following concepts: 1) streams and 2) async copy. [Here](https://developer.nvidia.com/content/how-optimize-data-transfers-cuda-cc) is code for checking the bandwidth; you might want to look into it. – Sagar Masuti Jul 19 '13 at 02:28

2 Answers

8

I hope this figure explains everything. The data was generated by bandwidthTest from the CUDA samples. The hardware environment is PCI-E v2.0, a Tesla M2090 and 2x Xeon E5-2609. Please note that both axes are in log scale.

From this figure, we can see that launching a transfer request has a constant overhead, independent of the transfer size. Regression analysis on the data gives an estimated overhead of 4.9 µs for H2D, 3.3 µs for D2H and 3.0 µs for D2D.

[Figure: bandwidthTest results for H2D, D2H and D2D transfers, both axes in log scale]
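
To make the constant-overhead reading concrete, here is a minimal sketch of the implied model: time = overhead + size / bandwidth. The 4.9 µs overhead is the H2D estimate from the regression above. The bandwidth of 6 GB/s is only an assumed, hypothetical value (it is not read from the figure), and `transfer_time_us` is a name made up for this illustration.

```python
# Fixed per-transfer overhead for H2D, from the regression above (microseconds).
OVERHEAD_US = 4.9
# ASSUMPTION: sustained bandwidth of 6 GB/s, i.e. 6000 bytes per microsecond.
# This value is hypothetical and is not taken from the figure.
BYTES_PER_US = 6000.0


def transfer_time_us(num_bytes):
    """Estimated time of one transfer: constant overhead plus size / bandwidth."""
    return OVERHEAD_US + num_bytes / BYTES_PER_US


# Compare a few sizes, from a single byte up to 200 MB.
for n in (1, 100, 1024 * 1024, 200 * 1024 * 1024):
    t = transfer_time_us(n)
    # The overhead share shows how much of the total is the fixed launch cost.
    print(f"{n:>12d} bytes: {t:12.2f} us, overhead share {OVERHEAD_US / t:6.1%}")
```

Under these assumptions, 1 byte and 100 bytes cost almost the same because the fixed overhead dominates, while for 200 MB the overhead is negligible. The absolute numbers depend entirely on the assumed bandwidth.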

kangshiyin
  • I don't understand this chart very well. For example, which one takes more time (in total time, not in speed): a transfer of 1 byte or a transfer of 100 bytes? – étale-cohomology Aug 17 '17 at 13:07
  • @étale-cohomology A 1-byte transfer and a 100-byte transfer take almost the same total time, because the constant overhead accounts for most of it. – kangshiyin Oct 17 '17 at 05:36
-1

A latency plot would be clearer in this case. Small transactions aren't more expensive than big ones in absolute terms. The only problem with them is that they can't saturate the bus, so a bigger message can be transferred in almost the same time. That is why transferring one 512 KB transaction is 120 times faster than transferring 512 transactions of 1 KB each. The saturation point of PCIe depends on the number of lanes. You can find more details about PCIe features from the CUDA point of view here.
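
As a rough sketch of why many small transactions lose to one large one, the snippet below compares the two cases under a simple fixed-overhead model. The default overhead of 4.9 µs is borrowed from the H2D regression in the other answer, the 6 GB/s bandwidth is an assumed, hypothetical value, and `split_slowdown` is a name invented for this illustration. With these inputs the ratio will not match the 120x quoted above, since that figure depends on the hardware that was measured.

```python
def transfer_time_us(num_bytes, overhead_us, bytes_per_us):
    """Estimated time of one transfer: constant overhead plus size / bandwidth."""
    return overhead_us + num_bytes / bytes_per_us


def split_slowdown(total_bytes, num_chunks, overhead_us=4.9, bytes_per_us=6000.0):
    """How many times slower num_chunks equal transfers are than one big transfer.

    ASSUMPTION: overhead_us and bytes_per_us are illustrative defaults,
    not measurements of any particular PCIe configuration.
    """
    one = transfer_time_us(total_bytes, overhead_us, bytes_per_us)
    chunk = transfer_time_us(total_bytes / num_chunks, overhead_us, bytes_per_us)
    return num_chunks * chunk / one


# 512 KB sent as 512 transfers of 1 KB each: the per-transfer overhead dominates.
print(split_slowdown(512 * 1024, 512))
# 200 MB sent as 200 transfers of 1 MB each: the overhead is almost negligible.
print(split_slowdown(200 * 1024 * 1024, 200))
```

The penalty for splitting is large only when each chunk is small enough that the fixed overhead is comparable to, or larger than, its pure transfer time.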