The memory space in GPU programming that is thread-specific in access but physically located in global GPU memory; it is perhaps better described as "thread-local global memory".
By default, automatic variables in GPU kernels (which are local to a single GPU execution thread) are placed in the large register file available on GPU cores. However, such placement is not always possible; for example:
- Local arrays may require dynamically indexed access, which most/all GPU register files do not support; if the index cannot be determined at compile time, such arrays cannot be placed in registers (see the sketch after this list).
- The kernel may need more space than the register file provides, in which case the excess is spilled to local memory ("register spilling").
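The following is a minimal sketch of the first case (the kernel and parameter names are illustrative, not from any particular codebase): because the index into `buf` is only known at run time, the compiler typically cannot map the array onto registers and places it in local memory instead.

```cuda
__global__ void indexed_local_array(const int* indices, float* out)
{
    float buf[16];                       // automatic per-thread array
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 16; ++i)
        buf[i] = static_cast<float>(i) * tid;

    // Run-time-dependent index: on most compilers/architectures this
    // forces `buf` into local memory, since register files cannot be
    // indexed dynamically.
    out[tid] = buf[indices[tid] & 15];
}
```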
When register placement is not possible, the thread-specific memory is not placed in a GPU core's shared memory, but rather in the much larger global device memory. Other than its address not being available to the programmer at compile time, "local" memory behaves mostly the same as "global" memory: low bandwidth and high latency relative to shared memory or registers. The kernel compiler will typically interleave the local memory of a warp's threads in global memory automatically, so that consecutive threads' accesses to the same local variable coalesce into fewer memory transactions, improving access speed.
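One way to observe how much local memory a kernel actually uses is to query the CUDA runtime, which reports the per-thread local memory footprint in `cudaFuncAttributes::localSizeBytes`. Below is a self-contained sketch under that approach (the `probe_kernel` name is hypothetical); alternatively, compiling with `nvcc -Xptxas -v` prints per-kernel register and local-memory usage at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel with a dynamically indexed per-thread array, likely to be
// assigned local memory (illustrative, not from the CUDA toolkit).
__global__ void probe_kernel(const int* idx, float* out)
{
    float buf[64];
    for (int i = 0; i < 64; ++i) buf[i] = static_cast<float>(i);
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = buf[idx[tid] & 63];
}

int main()
{
    // cudaFuncGetAttributes reports, among other things, the per-thread
    // local memory footprint (localSizeBytes) and register count (numRegs).
    cudaFuncAttributes attrs;
    cudaError_t status = cudaFuncGetAttributes(&attrs, probe_kernel);
    if (status != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    printf("registers per thread   : %d\n", attrs.numRegs);
    printf("local memory per thread: %zu bytes\n", attrs.localSizeBytes);
    return 0;
}
```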
More information regarding local memory in CUDA can be found in NVIDIA's CUDA Programming Guide.
In OpenCL parlance, this memory space is named "private memory", while OpenCL's "local memory" is actually work-group-local, i.e. the equivalent of CUDA's shared memory.