Yes, in caching mode a single 128-byte transaction will be generated (as seen from the L1 cache level). In uncached mode, four 32-byte transactions will be generated (as seen from the L2 cache level; it's still a single 128-byte request coming from the warp, due to coalescing). In the case you describe, the four 32-byte transactions are not any slower for a fully coalesced access, regardless of cached or uncached mode. The memory controller (on a given GPU) should generate the same transactions to satisfy the warp's request in either case.

Since the memory controller is composed of a number of "partitions" (up to 6), each of which has a 64-bit wide path, multiple memory transactions (possibly spread across multiple partitions) will ultimately be used to satisfy either request (4x32-byte or 1x128-byte). The specific number of transactions and their organization across partitions may vary from GPU to GPU. (This isn't part of your question, but a GPU with DDR-pumped memory will return 16 bytes per partition per memory transaction, and one with QDR-pumped memory will return 32 bytes per partition per memory transaction.) This isn't specific to CUDA 5, either.

You might want to review one of NVIDIA's webinars on this material, in particular "CUDA Optimization : Memory Bandwidth Limited Kernels". Even if you don't want to watch the video, a quick review of the slides will remind you of the various differences between so-called "cached" and "uncached" accesses (this refers to L1), and also give you the compiler switches needed to try each case.
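For reference, the switches in question are the `-dlcm` options passed to `ptxas` through `nvcc` (the file names below are just placeholders for your own build):

```shell
# "Cached" mode (the default): global loads go through L1,
# producing 128-byte transactions at the L1 level.
nvcc -Xptxas -dlcm=ca -o app_cached kernel.cu

# "Uncached" mode: global loads bypass L1 and are serviced from L2
# in 32-byte transactions.
nvcc -Xptxas -dlcm=cg -o app_uncached kernel.cu
```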
Another reason to review the slides is that they will remind you of the circumstances under which you might want to try "uncached" mode. In particular, if you have a scattered (uncoalesced) access pattern coming from your warps, uncached mode may yield an improvement, because there is less "wastage" when requesting 32-byte quantities from memory to satisfy a single thread's request, as compared to 128-byte quantities. However, in response to your final question, it's fairly difficult to be analytical about it, because presumably your code is a mix of ordered and disordered access patterns. Since uncached mode is turned on via a compiler switch, the suggestion given in the slides is simply to "try your code both ways" and see which runs faster. In my experience, running in uncached mode rarely yields a perf improvement.
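To make the two access patterns concrete, here's a sketch (the kernel names and the `stride` parameter are made up for illustration; assumes 4-byte `float` elements):

```cuda
// Fully coalesced: adjacent threads in a warp read adjacent 4-byte words,
// so the warp's request is satisfied by one 128-byte line (cached mode)
// or four 32-byte segments (uncached mode) -- same bytes moved either way.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Scattered: each thread reads a word far from its neighbors'. In cached
// mode each 4-byte read drags in a full 128-byte line; in uncached mode
// only a 32-byte segment is fetched, so the "wastage" per thread is
// roughly 8x instead of 32x.
__global__ void strided_read(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[((size_t)i * stride) % n];
}
```

Profiling both kernels compiled with `-dlcm=ca` and `-dlcm=cg` is one way to see the effect the slides describe on your own GPU.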
EDIT: Sorry I had the link and title for the wrong presentation. Fixed slide/video link and webinar title.