Is local memory access coalesced?

Question

Suppose I declare a local variable for each thread in a CUDA kernel function:

float f = ...; // some calculations here

Suppose also that the compiler places this variable in local memory (which, as far as I know, is the same as global memory except that it is visible to one thread only). My question is: will accesses to f be coalesced when it is read?

score 2 · Accepted Answer · answered Sep 06 '11 at 08:47

I don't believe there is official documentation of how local memory (or the stack on Fermi) is laid out in memory, but I am pretty certain that multiprocessor allocations are accessed in a "striped" fashion, so that non-diverging threads in the same warp get coalesced access to local memory. On Fermi, local memory is also cached using the same L1/L2 access mechanism as global memory.
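To make the "striped" idea concrete, here is a small host-side C++ model (not device code). Everything in it is an assumption for illustration: the function names, the 4-byte word size, the 128-byte segment size, and the address formulas themselves are hypothetical, since the real layout is not documented. It counts how many 128-byte segments a warp would touch if every lane read the same word of its own local storage, under a striped layout and, for contrast, under a per-thread contiguous layout.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumptions of this model only; none of these come from NVIDIA documentation.
constexpr std::size_t kWarpSize = 32;       // threads per warp
constexpr std::size_t kWordBytes = 4;       // size of one float
constexpr std::size_t kSegmentBytes = 128;  // assumed size of one memory segment

// Hypothetical striped layout: word `word` of all 32 lanes is stored back to back.
std::size_t striped_offset(std::size_t lane, std::size_t word) {
    assert(lane < kWarpSize);
    return (word * kWarpSize + lane) * kWordBytes;
}

// Hypothetical contiguous layout: each lane owns one block of `words_per_thread` words.
std::size_t contiguous_offset(std::size_t lane, std::size_t word,
                              std::size_t words_per_thread) {
    assert(lane < kWarpSize && word < words_per_thread);
    return (lane * words_per_thread + word) * kWordBytes;
}

// Distinct segments touched when every lane reads the same word index (striped).
std::size_t segments_striped(std::size_t word) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        segments.insert(striped_offset(lane, word) / kSegmentBytes);
    }
    return segments.size();
}

// Same count for the contiguous layout.
std::size_t segments_contiguous(std::size_t word, std::size_t words_per_thread) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        segments.insert(contiguous_offset(lane, word, words_per_thread) / kSegmentBytes);
    }
    return segments.size();
}
```

In this model, a non-diverging warp reading the same word index touches a single segment under the striped layout, whereas the contiguous layout spreads the same read across many segments. That is the sense in which a striped layout would make local memory access coalesced.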

score -2 · Answer 2 · answered Sep 22 '11 at 13:15

CUDA cards don't have memory allocated for local variables. All local variables are stored in registers. Complex kernels with lots of variables reduce the number of threads that can run concurrently, a condition known as low occupancy.

John Gordon

That simply isn't true. Every thread can have a statically allocated local memory allocation of up to 16 KB. This memory is stored in off-chip SDRAM and is not cached. See [this](http://drdobbs.com/high-performance-computing/215900921) or the CUDA programming guide for more information. – talonmies Sep 22 '11 at 13:51
You are quite right. I've looked through the programming guide multiple times and somehow never noticed that before. – John Gordon Sep 22 '11 at 17:09
