Many CUDA operations can be crudely modeled as an "overhead" plus a "duration". The duration is often predictable from the operation characteristics - e.g. the size of the transfer divided by the bandwidth. The "overhead" can be crudely modeled as a fixed quantity - e.g. 5 microseconds.
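As a made-up numerical illustration of that model (the 5 microsecond overhead and 12GB/s effective bandwidth below are assumptions, not measurements from your system):

```cpp
// Crude cost model: time = fixed overhead + size / bandwidth.
// 5 us overhead and 12 GB/s effective bandwidth are assumptions, not measurements.
#include <cstdio>

int main() {
    const double overhead_us = 5.0;         // assumed fixed per-operation overhead
    const double bw_bytes_per_s = 12e9;     // assumed effective transfer bandwidth

    const double sizes_bytes[] = {1e3, 1e5, 1e7};   // 1 KB, 100 KB, 10 MB
    for (double bytes : sizes_bytes) {
        double duration_us = bytes / bw_bytes_per_s * 1e6;   // size / bandwidth
        printf("%10.0f bytes: %8.2f us total (overhead %4.1f us + duration %8.2f us)\n",
               bytes, overhead_us + duration_us, overhead_us, duration_us);
    }
    return 0;
}
```

For the small size the overhead term dominates; for the large size the size/bandwidth term dominates.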
Your graph consists of several measurements:
The "overhead" associated with initiating a transfer or "cycle". CUDA async ops generally have a minimum duration on the order of 5-50 microseconds. This is indicated in the "flat" left hand side of the blue curve. A "cycle" here represents two transfers, plus, in the case of the "kernel" version, the kernel launch overhead. The combination of these "overhead" numbers, represents the y-intercept of the blue and orange curves. The distance from the blue curve to the orange curve represents the addition of the kernel ops (which you haven't shown). On the left hand side of the curve, the operation sizes are so small that the contribution from the "duration" portion is small compared to the "overhead" constribution. This explains the approximate flatness of the curves on the left hand side.
The "duration" of operations. On the right hand side of the curves, the approximately linear region corresponds to the "duration" contribution as it becomes large and dwarfs the "overhead" cost. The slope of the blue curve should correspond to the PCIE transfer bandwidth. For a Gen4 system that should be approximately 20-24GB/s per direction (it has no connection to the 600GB/s of GPU memory bandwidth - it is limited by the PCIE bus.) The slope of the orange curve is also related to PCIE bandwidth, as this is the dominant contributor to the overall operation.
The "kernel" contribution. The distance between the blue and orange curves represent the contribution of the kernel ops, over/above just the PCIE data transfers.
> What I don't understand is why the memory transfer only tests start ramping up exponentially at nearly the same data size point as the core limitations. The memory bandwidth for my device is advertised as 600 GB/s. Transferring 10 MB here takes on average ~1.5 milliseconds which isn't what napkin math would suggest given bandwidth.
The dominant cost here is the PCIE transfer, and that bandwidth is not 600GB/s but something like 20-24GB/s per direction. Furthermore, unless you are using pinned memory as the host memory for your transfers, the actual bandwidth will be about half of the maximum achievable. This lines up pretty well with your measurement: 10MB/1.5ms = 6.6GB/s. Why does this make sense? You are transferring 10MB at a rate of ~10GB/s on the first transfer. Unless you are using pinned memory, the operation will block and will not execute concurrently with the second transfer. Then you transfer 10MB at a rate of ~10GB/s on the second transfer. That is 20MB at ~10GB/s, so we would expect to witness about a 2ms transfer time. Your actual transfer speed might be closer to 12GB/s, which would put the expectation very close to 1.5ms.
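If you want to see the pinned-vs-pageable effect directly, you can time the same round trip with both kinds of host allocation. This is a minimal sketch, not your benchmark; the 10MB size and default-stream usage are illustrative assumptions:

```cpp
// Minimal sketch: time a host->device + device->host round trip with pageable
// vs. pinned host memory. Size and stream usage are illustrative choices.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

static float time_roundtrip(void *h_src, void *h_dst, void *d_buf, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    // With pageable host memory these "async" copies are staged through an
    // internal pinned buffer and block the host; with pinned memory they run
    // closer to the full PCIe rate (and could overlap if issued in separate streams).
    cudaMemcpyAsync(d_buf, h_src, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyAsync(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 10 * 1000 * 1000;   // ~10 MB, matching the example size
    void *d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);

    // Pageable host memory (plain malloc).
    void *pg_src = malloc(bytes), *pg_dst = malloc(bytes);
    memset(pg_src, 1, bytes);                        // touch pages so they are resident
    time_roundtrip(pg_src, pg_dst, d_buf, bytes);    // warm-up (context/first-call costs)
    printf("pageable: %.3f ms\n", time_roundtrip(pg_src, pg_dst, d_buf, bytes));

    // Pinned host memory (cudaMallocHost).
    void *pin_src = nullptr, *pin_dst = nullptr;
    cudaMallocHost(&pin_src, bytes);
    cudaMallocHost(&pin_dst, bytes);
    memset(pin_src, 1, bytes);
    printf("pinned:   %.3f ms\n", time_roundtrip(pin_src, pin_dst, d_buf, bytes));

    cudaFreeHost(pin_src); cudaFreeHost(pin_dst);
    free(pg_src); free(pg_dst);
    cudaFree(d_buf);
    return 0;
}
```

In a real measurement you would average over many iterations; a single timed round trip like this just illustrates the gap between the two allocation types.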
> My expectation was that time would be nearly constant around the memory transfer latency, but that doesn't seem to be the case.
I'm not sure what that statement means, exactly, but for a reasonably large transfer size the time is not expected to be constant, independent of the transfer size. The time should scale with the transfer size: roughly the transfer size divided by the bandwidth, plus the fixed overhead.
> I ran the memory only version with NSight Compute and confirmed that going from N=1000 KB to N=10000 KB increased average async transfer time from ~80 us to around ~800 us.
That is the expectation. Transferring more data takes more time. This is generally what you would observe when the "duration" contribution is significantly larger than the "overhead" contribution, which is the case on the right-hand side of your graph. (At ~12GB/s, 1000KB works out to ~83 microseconds and 10000KB to ~833 microseconds, in line with what you measured.)
Here is a spreadsheet showing a specific example, using 12GB/s for PCIE bandwidth and 5 microseconds for the fixed operation overhead. The "total for 2 ops" column tracks your blue curve pretty closely:
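The same calculation can be scripted; it is the model from the sketch near the top of the answer, doubled for the two operations per cycle, using the stated 12GB/s and 5 microsecond figures:

```cpp
// Model behind a "total for 2 ops" style column: two operations per cycle,
// each costing 5 us of fixed overhead plus size / 12 GB/s.
#include <cstdio>

int main() {
    const double overhead_us = 5.0;
    const double bw_bytes_per_s = 12e9;

    const double sizes_kb[] = {1, 10, 100, 1000, 10000};
    printf("%10s %14s %18s\n", "size (KB)", "1 op (us)", "total 2 ops (us)");
    for (double kb : sizes_kb) {
        double bytes = kb * 1000.0;
        double one_op_us = overhead_us + bytes / bw_bytes_per_s * 1e6;
        printf("%10.0f %14.2f %18.2f\n", kb, one_op_us, 2.0 * one_op_us);
    }
    return 0;
}
```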
