
First, I would like to confirm the following: an elementary transaction from global memory to shared memory is either 32, 64, or 128 bytes, but only if the memory accesses can be coalesced. The latencies of these transactions are all equal. Is that right?

Second question: if the memory reads can't be coalesced, each thread reads only 4 bytes (is that right?), and will all the threads' memory accesses be made sequentially?

talonmies
  • You may wish to review some of the webinars available [here](https://developer.nvidia.com/gpu-computing-webinars). In particular, there are webinars that cover memory-efficient operations and coalescing for [global memory](http://developer.download.nvidia.com/CUDA/training/cuda_webinars_GlobalMemory.pdf) (and [video](http://developer.download.nvidia.com/CUDA/training/globalmemoryusage_june2011.mp4)) and [shared memory (video)](http://developer.download.nvidia.com/CUDA/training/sharedmemoryusage_july2011.mp4). Memory transactions occur at either 32-byte or 128-byte granularity. – Robert Crovella Feb 10 '13 at 00:50

1 Answer


It depends on the architecture you are working on. However, on Fermi and Kepler you have:

  • Memory transactions are always 32 bytes or 128 bytes; these units are called segments
  • 32-byte segments are used when only the L2 cache is involved; 128-byte segments when both L1 and L2 are
  • If two threads of the same warp fall into the same segment, the data is delivered in a single transaction
  • If, on the other hand, a segment you fetch contains data that no thread requested, it is read anyway and you (probably) waste bandwidth
  • Whole segments land in the L1 and L2 caches, which may reduce bandwidth pressure when neighbouring warps need the same segment
  • L1 and L2 are fairly small compared to the number of threads they usually serve. That is why you should not expect a piece of data to stay in the cache for long (contrary to CPU programming)
  • You can disable L1 caching, which may help if you overfetch with random memory access patterns

As you can see, several variables decide how long your memory access is going to take. The general rule of thumb is: the denser your access pattern, the better! Strides and misalignment are not as costly now as they were in the past, so don't worry too much about them unless you are doing some late-stage optimizations.
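To make the rule of thumb concrete, here is a minimal sketch (my illustration, not part of the original answer) contrasting a dense, coalesced pattern with a strided one. The kernel names and the `stride` parameter are hypothetical; the `nvcc -Xptxas -dlcm=cg` option shown in the comments is the Fermi/Kepler-era switch for disabling L1 caching of global loads, per the last bullet above.

```cuda
// Illustration only: hypothetical kernels showing the access patterns above.
// Build (Fermi/Kepler era):           nvcc -arch=sm_20 coalesce.cu
// Bypass L1, use 32-byte L2 segments: nvcc -arch=sm_20 -Xptxas -dlcm=cg coalesce.cu

// Coalesced: thread k of a warp reads element base+k, so the warp's
// 32 four-byte loads fall into a single 128-byte segment (with L1 enabled).
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: with stride = 32, consecutive threads read addresses 128 bytes
// apart, so each thread touches a different segment and the warp needs up
// to 32 transactions; most of each fetched segment is unrequested data.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```

Both kernels load the same 4 bytes per thread; the difference is purely how many segments the hardware must fetch to satisfy each warp, which is the wasted-bandwidth case from the list above.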

CygnusX1