Questions tagged [gpu-local-memory]

The memory space in GPU programming that is thread-specific in terms of access but physically located in global GPU memory; it is perhaps better named "thread-local global memory"

By default, automatic variables in GPU kernels (which are local to a single GPU execution thread) are placed in the large register file available on GPU cores. However, such placement is not always possible; for example:

  • Local arrays may require dynamically-indexed access, which GPU register files do not support; if an array index cannot be determined at compile time, the array cannot be placed in registers (see the sketch after this list).
  • The kernel may need more register space than the register file provides, in which case some values are moved out to thread-local memory ("register spilling").

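As a concrete illustration, here is a minimal CUDA sketch (the kernel names and array sizes are made up for this example) contrasting an array the compiler can keep in registers with one that is forced into local memory. Compiling with "nvcc -Xptxas -v" prints each kernel's register, stack-frame and spill statistics, which reveal the placement.

    // Indices known at compile time: the fully unrolled array can live
    // in registers.
    __global__ void registers_ok(float* out)
    {
        float a[4];
        #pragma unroll
        for (int i = 0; i < 4; ++i)       // fixed trip count, unrolled
            a[i] = i * 2.0f;
        out[threadIdx.x] = a[0] + a[3];   // constant indices only
    }

    // Index depends on run-time data: registers cannot be indexed
    // dynamically, so 'a' is demoted to (thread-)local memory.
    __global__ void local_memory_forced(const int* idx, float* out)
    {
        float a[64];
        for (int i = 0; i < 64; ++i)
            a[i] = i * 2.0f;
        out[threadIdx.x] = a[idx[threadIdx.x] & 63];  // dynamic index
    }
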
When register placement is not possible, thread-specific data is placed not in a GPU core's shared memory, but in the much larger global device memory. Other than its address not being available to the programmer at compile time, "local" memory behaves mostly the same as "global" memory: low bandwidth and high latency relative to shared memory or registers. The kernel compiler will typically lay out the local memory of a warp's threads in global memory in an automatically interleaved pattern, so that threads accessing the same relative offset touch consecutive addresses and the accesses can be coalesced (see the model below).
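
As a rough mental model only (the exact layout is an implementation detail of the compiler and hardware, not a documented API), the interleaving can be pictured like this:

    // Hypothetical address model, for intuition only: consecutive words of
    // per-thread local memory are assigned to consecutive lanes of a warp.
    // When all 32 lanes of a warp read the same element index, they touch
    // 32 consecutive words, so the accesses coalesce into few transactions.
    size_t local_element_address(size_t base, size_t element_index,
                                 size_t lane_id, size_t element_size)
    {
        return base + element_index * 32 * element_size   // 32 = warp size
                    + lane_id * element_size;
    }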

More information regarding local memory in CUDA can be found in NVIDIA's CUDA Programming Guide.

In OpenCL parlance, this memory space is named "private memory", while OpenCL "local memory" is actually work-group-local, i.e. the equivalent of shared memory in CUDA.
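
For reference, a minimal OpenCL sketch (the kernel is hypothetical, written only to show the two qualifiers side by side):

    // "__private": per-work-item, the analogue of CUDA's thread-local
    // memory. "__local": per-work-group, the analogue of CUDA __shared__.
    __kernel void naming_demo(__global float* out, __local float* scratch)
    {
        __private float tmp = out[get_global_id(0)];  // per-work-item value
        scratch[get_local_id(0)] = tmp;               // shared by the group
        barrier(CLK_LOCAL_MEM_FENCE);                 // group-wide sync
        out[get_global_id(0)] = scratch[0];           // read lane 0's value
    }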

11 questions
35 votes · 5 answers

In a CUDA kernel, how do I store an array in "local thread memory"?

I'm trying to develop a small program with CUDA, but since it was SLOW I made some tests and googled a bit. I found out that while single variables are by default stored in local thread memory, arrays usually aren't. I suppose that's why it…
Matteo Monti · 8,362

7 votes · 1 answer

Amount of local memory per CUDA thread

I read in the NVIDIA documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications, table #12) that the amount of local memory per thread is 512 KB for my GPU (GTX 580, compute capability 2.0). I…
devel484 · 165

3 votes · 1 answer

CUDA Local memory register spilling overhead

I have a kernel which uses a lot of registers and spills them into local memory heavily: 4688 bytes stack frame, 4688 bytes spill stores, 11068 bytes spill loads. ptxas info: Used 255 registers, 348 bytes cmem[0], 56 bytes cmem[2]. Since the…
user1096294 · 829

3 votes · 1 answer

Local, global, constant & shared memory

I read some CUDA documentation that refers to local memory. (It is mostly the early documentation.) The device properties report a local-mem size (per thread). What does 'local' memory mean? What is 'local' memory? Where is 'local' memory? …
Doug · 2,783

2 votes · 2 answers

Is local memory access coalesced?

Suppose I declare a local variable in a CUDA kernel function for each thread: float f = ...; // some calculations here Suppose also that the declared variable was placed by the compiler in local memory (which is the same as the global one except it…
AdelNick · 982

2 votes · 2 answers

Does OpenCL local memory exist on Mali/Adreno GPUs?

Does OpenCL local memory really exist on Mali/Adreno GPUs, or does it only exist on some special mobile phones? If it exists, in which cases should we use local memory, e.g. for GEMM/convolution or other CL kernels?
irasin · 145

2 votes · 1 answer

Force all threads in a work group to execute the same if/else branch

I would like to use the local/shared memory optimization to reduce global memory access, so I basically have this function: float __attribute__((always_inline)) test_unoptimized(const global float* data, ...) { // ... for(uint j=0;…
ProjectPhysX · 4,535

2 votes · 1 answer

Why does a simple CUDA function need so much local memory?

I've written a simple function in CUDA. It resizes an image to double its scale. For an image at 1920*1080, this function needs ~20 ms to complete. I've tried some different ways to optimize the function, and I found that local memory may be the key…
cs512 · 23

1 vote · 1 answer

Does Vulkan support local subgroup memory sharing and manipulation?

In OpenCL I could use __local whenever I wanted to manipulate subgroup memory. Analogously, CUDA has the __shared__ keyword. Does Vulkan have something equivalent? I cannot see anything in the subgroup…
alagris · 1,838

0 votes · 1 answer

CUDA efficient usage of shared/local memory?

I am still a little unsure when it comes to shared/local memory in CUDA. Currently I have a kernel; within the kernel, each thread allocates a list object. Something like this: __global__ void TestDynamicListPerThread() { //Creates a dynamic list…
Twiltie · 572

-1 votes · 2 answers

Local memory: CUDA presentation

I was reading this presentation document: http://on-demand.gputechconf.com/gtc-express/2011/presentations/register_spilling.pdf On page 3 of the presentation, the author states: "A store always happens before a load. Only GPU threads can access LMEM"…