
I'm currently writing a small project in OpenCL, and I'm trying to find out what really causes the need for memory coalescing. Every book on GPGPU programming says it's how GPGPUs should be programmed, but not why the hardware would prefer it.

So is it some special hardware component that merges data transfers? Or is it simply about better utilizing the cache? Or is it something completely different?

Mads Y
  • For one, a hardware memory fetch operation is as wide as 128 bits, so if you use 32-bit values then 4 of them (if they are at consecutive addresses) can be fetched from memory in a single operation to feed 4 work-items at once. Another: if all work-items access the same address, they all get the same data in a single operation via broadcasting, at least on GCN. Also, those 128 bits are served by multiple channels, which work best when the addresses are different and don't fall into the same channel (same modulus value). Finally, memory reads/writes are pipelined in hardware, so it is better to use independent addresses and independent banks. – huseyin tugrul buyukisik Oct 01 '17 at 13:47
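
To make the comment above concrete, here is a minimal OpenCL sketch (the kernel names and the stride are made up for illustration) contrasting a coalesced access pattern with a strided one. In the first kernel, a group of 4 consecutive work-items reads 4 consecutive 32-bit values, which fits in one wide hardware fetch; in the second, neighbouring work-items touch addresses far apart, so each needs its own transaction.

```c
// Coalesced: work-item i reads element i, so consecutive work-items
// read consecutive 32-bit values that one wide fetch can serve.
__kernel void copy_coalesced(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];
}

// Strided: neighbouring work-items read addresses 64 floats (256 bytes)
// apart, so every work-item needs a separate memory transaction.
__kernel void copy_strided(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid * 64];
}
```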

2 Answers


Memory coalescing makes several different things more efficient. It is usually done before the requests hit the cache. Like the SIMT execution model, it is an architectural trade-off: it enables GPUs to have a more efficient and very high-performance memory system, but it also forces programmers to think carefully about their data layout.

Without coalescing, either the cache would need to be able to serve a huge number of requests at the same time, or memory access would take a lot longer as the different data transfers would need to be handled one at a time. This is relevant even when just checking whether something is a hit or a miss.

Merging requests is rather easy to do: you pick one request and then merge all requests whose upper address bits match. A single transaction is generated per cycle, and the load or store instruction is replayed until all threads have been handled.
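
A rough sketch of that merging logic in C, assuming a 32-wide group and a 128-byte coalescing granularity (both numbers are assumptions, not taken from the answer):

```c
#include <stdbool.h>
#include <stdint.h>

#define GROUP_SIZE 32u
#define LINE_BYTES 128u   /* assumed coalescing granularity */

/* Each "cycle", pick the first still-pending lane, take the upper bits of
 * its address (the line address), and retire every lane whose address
 * shares those upper bits. Returns the number of transactions generated. */
unsigned coalesce(const uint64_t addr[GROUP_SIZE], const bool active[GROUP_SIZE])
{
    bool pending[GROUP_SIZE];
    for (unsigned i = 0; i < GROUP_SIZE; ++i)
        pending[i] = active[i];

    unsigned transactions = 0;
    for (;;) {
        /* find a lane that still needs service */
        int leader = -1;
        for (unsigned i = 0; i < GROUP_SIZE; ++i)
            if (pending[i]) { leader = (int)i; break; }
        if (leader < 0)
            break;                      /* all lanes handled */

        uint64_t line = addr[leader] / LINE_BYTES;
        for (unsigned i = 0; i < GROUP_SIZE; ++i)
            if (pending[i] && addr[i] / LINE_BYTES == line)
                pending[i] = false;     /* served by this transaction */

        ++transactions;                 /* one request per cycle, replay the rest */
    }
    return transactions;
}
```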

Caches also store consecutive bytes (32/64/128 bytes per line). This fits most applications well, is a good fit for modern DRAM, and reduces the overhead of cache bookkeeping: the cache is organized in cache lines, and each cache line has a tag that indicates which addresses are stored in the line.
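
As an illustration (the field names and the 128-byte line size are made up), a single tag and valid bit cover a whole line, so the bookkeeping cost is amortized over all of its bytes:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache line: one tag identifies which 128-byte-aligned
 * block of addresses the data array currently holds. */
struct cache_line {
    uint64_t tag;         /* upper address bits of the cached block */
    bool     valid;       /* does this line hold live data? */
    uint8_t  data[128];   /* the consecutive bytes themselves */
};
```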

Modern DRAM uses wide interfaces and long bursts: the memory of a GPU is typically organized in 32-bit or 64-bit wide channels, with GDDR5 memory that has a burst length of 8. This means that every transaction at the DRAM interface has to transfer at least 32 bits × 8 = 32 bytes or 64 bits × 8 = 64 bytes at a time, even if only a single one of those bytes is actually needed. Designing data layouts that lead to coalesced requests helps to use the DRAM interface efficiently.
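
A back-of-the-envelope sketch of what this means for bus efficiency, assuming a 64-bit channel with burst length 8 and 32 work-items each reading one 4-byte float (the 64-byte minimum comes from the answer; the other numbers are illustrative assumptions):

```c
#include <stdio.h>

int main(void)
{
    const unsigned burst_bytes    = 64;  /* 64-bit channel * burst length 8 */
    const unsigned workitems      = 32;
    const unsigned bytes_per_item = 4;   /* one float each */

    /* Worst case: every work-item hits a different burst. */
    unsigned scattered = workitems * burst_bytes;            /* 2048 bytes moved */
    /* Coalesced case: 32 consecutive floats = 128 bytes = 2 bursts. */
    unsigned coalesced = ((workitems * bytes_per_item + burst_bytes - 1)
                          / burst_bytes) * burst_bytes;      /* 128 bytes moved */

    unsigned useful = workitems * bytes_per_item;             /* 128 bytes needed */
    printf("scattered: %u bytes moved for %u useful (%.1f%%)\n",
           scattered, useful, 100.0 * useful / scattered);
    printf("coalesced: %u bytes moved for %u useful (%.1f%%)\n",
           coalesced, useful, 100.0 * useful / coalesced);
    return 0;
}
```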

GPUs also have a huge number of parallel threads active at the same time and rather small caches. CPUs are often able to use their caches to reorder their memory requests into DRAM-friendly patterns. The larger number of threads and smaller caches on GPUs make this "cache-based coalescing" less effective, as the data will often not stay in the cache long enough to be merged there with other requests to the same cache line.

Jan Lucas

Despite the "random access" in the name of RAM (Random-Access Memory), Double Data Rate 3 Random-Access Memory (DDR3 RAM) is faster at accessing consecutive locations than at accessing random ones.

Case in point: "CAS latency" is the amount of time that DDR3 RAM stalls when you access a new "column", as the RAM chip is literally charging up to serve the new data from another location on the chip.

EDIT: Jan Lucas argues that RAS Latency is more important in practice. See his comment for details.

There's roughly a 10 ns delay whenever you switch columns. So if you keep your memory accesses "close" to each other, you don't invoke a CAS delay.

So if you have 20 words to access at a particular location, it's more efficient to access those 20 words before moving to a new memory location (invoking a CAS delay). Otherwise, you'll have to invoke ANOTHER CAS delay to "switch back" to the first location.

It's only around 10 nanoseconds per switch, but that amount of time adds up.
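
A minimal C sketch of the same idea (the array size and element type are arbitrary): the row-by-row walk touches consecutive addresses and stays within the same DRAM row for long stretches, while the column-by-column walk strides by 8 KB and lands in a new row on almost every access.

```c
#include <stdio.h>

enum { ROWS = 2048, COLS = 2048 };
static float grid[ROWS][COLS];

/* Consecutive addresses: long runs inside one DRAM row. */
float sum_row_major(void)
{
    float s = 0.0f;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            s += grid[r][c];
    return s;
}

/* 8 KB stride between consecutive accesses: frequent row switches. */
float sum_column_major(void)
{
    float s = 0.0f;
    for (int c = 0; c < COLS; ++c)
        for (int r = 0; r < ROWS; ++r)
            s += grid[r][c];
    return s;
}

int main(void)
{
    printf("%f %f\n", sum_row_major(), sum_column_major());
    return 0;
}
```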

Dragontamer5788
  • You seem to be confusing CAS and RAS. Rows need to be opened, and staying within the same row avoids RAS delays. CAS latency is, however, always there, but it is pipelined. CAS latency almost does not matter for GPUs. – Jan Lucas Oct 02 '17 at 21:14
  • Thanks for the info. I think you might be right; it's been a while since I've taken my Computer Architecture class. – Dragontamer5788 Oct 02 '17 at 22:06