Questions tagged [gpu-shared-memory]

The memory space on a GPU computational core that is shared by all threads of a block in the launch grid (by all "work-items" of a "work-group", in OpenCL parlance).

"Shared Memory" is one of the kinds of memory spaces in a GPU's memory model.

This memory space, for a specific GPU thread ("work-item" in OpenCL parlance), is located on the computational core of the GPU chip on which the thread executes (e.g. an SMX core on an NVIDIA Kepler GPU).

Shared memory is the "closest" (or fastest, if you will) memory space for a thread which is shared with other threads - all other threads in its block ("work-group" in OpenCL parlance), which are also executing on the same computational core.

Shared memory is similar in structure and behavior to the L1 cache, and on some architectures the two are partially interchangeable, being carved out of the same on-chip storage. Accessing it is therefore slower than using a thread's own registers, and concurrent accesses by different threads may conflict and degrade performance (so-called bank conflicts).

In OpenCL this memory space is called "local memory", which can cause confusion, since "local memory" means something entirely different in CUDA.

OpenGL, on the other hand, uses the term "shared memory" in the same sense as CUDA.
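As a minimal illustration of the typical usage pattern, here is a sketch (hypothetical kernel and names) in which each block stages a tile of the input in shared memory, synchronizes, and then reads its neighbours from the fast on-chip copy rather than from global memory:

```
#include <cuda_runtime.h>

#define TILE 256  // assumed block size: one element of shared memory per thread

__global__ void blur1d(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];               // on-chip, visible to the whole block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        tile[threadIdx.x] = in[gid];           // cooperative load: global -> shared
    __syncthreads();                           // all threads see the finished tile

    if (gid < n) {
        float centre = tile[threadIdx.x];
        float left   = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : centre;
        float right  = (threadIdx.x < TILE - 1 && gid + 1 < n) ? tile[threadIdx.x + 1] : centre;
        out[gid] = (left + centre + right) / 3.0f;   // neighbours come from shared memory
    }
}
```

(The sketch ignores halo elements at block boundaries; it is only meant to show the declare-load-sync-use pattern.)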

339 questions
46
votes
3 answers

What's the difference between CUDA shared and global memory?

I’m getting confused about how to use shared and global memory in CUDA, especially with respect to the following: When we use cudaMalloc(), do we get a pointer to shared or global memory? Does global memory reside on the host or device? Is there…
mchen
  • 9,808
  • 17
  • 72
  • 125
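A short sketch of the distinction asked about above (hypothetical names): cudaMalloc() returns a pointer into global memory, which lives in device DRAM and persists across kernel launches, while __shared__ variables live on-chip, are private to a block, and exist only for that block's lifetime.

```
#include <cuda_runtime.h>

__global__ void scaleViaShared(float* g)       // g points into *global* memory
{
    __shared__ float s[128];                   // *shared* memory: on-chip, per block
    s[threadIdx.x] = g[threadIdx.x];           // global -> shared
    __syncthreads();
    g[threadIdx.x] = 2.0f * s[threadIdx.x];    // shared -> global
}

int main()
{
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, 128 * sizeof(float));   // allocates global memory on the device,
                                               // not shared memory and not host memory
    cudaMemset(d_buf, 0, 128 * sizeof(float));
    scaleViaShared<<<1, 128>>>(d_buf);
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}
```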
44
votes
5 answers

allocating shared memory

I am trying to allocate shared memory by using a constant parameter but getting an error. My kernel looks like this: __global__ void Kernel(const int count) { __shared__ int a[count]; } and I am getting an error saying error: expression must…
lina
  • 1,679
  • 4
  • 21
  • 25
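The usual resolution of the error above is that a statically declared __shared__ array needs a compile-time constant size; when the size is only known at runtime, dynamic shared memory is used instead. A sketch, assuming the byte count is passed as the third launch-configuration parameter:

```
#include <cuda_runtime.h>

__global__ void Kernel(int count)
{
    extern __shared__ int a[];       // unsized: the byte count is chosen at launch
    if (threadIdx.x < count)
        a[threadIdx.x] = threadIdx.x;
    __syncthreads();
    // ... use a[] ...
}

int main()
{
    int count = 100;                                  // runtime value
    Kernel<<<1, 128, count * sizeof(int)>>>(count);   // third parameter: bytes of shared memory
    cudaDeviceSynchronize();
    return 0;
}
```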
25
votes
3 answers

CUDA: When to use shared memory and when to rely on L1 caching?

Since Compute Capability 2.0 (Fermi) was released, I've wondered whether there are any use cases left for shared memory. That is, when is it better to use shared memory than just let L1 perform its magic in the background? Is shared memory simply there…
Roger Dahl
  • 15,132
  • 8
  • 62
  • 82
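Related to the question above: on Fermi and later, shared memory and L1 are carved from the same on-chip storage, and the split can be hinted per kernel with cudaFuncSetCacheConfig (the driver may ignore the hint). A small sketch with a hypothetical kernel:

```
#include <cuda_runtime.h>

__global__ void myKernel(float* data)
{
    data[threadIdx.x] *= 2.0f;       // placeholder work
}

int main()
{
    // Ask for a larger L1 (and smaller shared memory) for this kernel...
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    // ...or the opposite:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    float* d = nullptr;
    cudaMalloc(&d, 256 * sizeof(float));
    myKernel<<<1, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```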
23
votes
1 answer

CUDA: Tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size

I'm trying to familiarize myself with CUDA programming, and having a pretty fun time of it. I'm currently looking at this pdf which deals with matrix multiplication, done with and without shared memory. Full code for both versions can be found here.…
Mike
  • 263
  • 1
  • 2
  • 9
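The standard way to handle a matrix size that is not a multiple of the block size is to zero-fill out-of-range tile elements and guard the final store. A sketch for square N x N matrices (hypothetical names):

```
#include <cuda_runtime.h>

#define TILE 16   // block is TILE x TILE threads; assumed for this sketch

__global__ void matMulTiled(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements of the partial edge tiles are loaded as 0.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)       // guarded store for the partial tiles
        C[row * N + col] = acc;
}
```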
20
votes
2 answers

When to use volatile with shared CUDA Memory

Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache any values, but my question is about the behavior with a shared array: __shared__ float…
Taj Morton
  • 1,588
  • 4
  • 18
  • 26
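The classic case behind the question above is warp-synchronous reduction on pre-Volta GPUs: marking the shared array volatile stops the compiler from caching partial sums in registers, so each thread observes its neighbours' writes. A sketch (on current GPUs, explicit __syncwarp() or warp shuffles are the safer choice):

```
// Final reduction step, called with tid < 32 after the block-wide part is done.
__device__ void warpReduce(volatile float* sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```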
18
votes
3 answers

Cuda Shared Memory array variable

I am trying to declare a variable for matrix multiplication as follows: __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; I am trying to make it so the user could input the size of the matrix to calculate, however that would mean changing the…
Dan
  • 1,041
  • 1
  • 12
  • 32
18
votes
3 answers

GPU shared memory size is very small - what can I do about it?

The size of the shared memory ("local memory" in OpenCL terms) is only 16 KiB on most NVIDIA GPUs of today. I have an application in which I need to create an array that has 10,000 integers. So the amount of memory I will need to fit 10,000 integers…
rana
  • 181
  • 1
  • 1
  • 3
17
votes
3 answers

Is there a way of setting default value for shared memory array?

Consider the following code: __global__ void kernel(int *something) { extern __shared__ int shared_array[]; // Some operations on shared_array here. } Is it possible to initialize the whole shared_array to some value - e.g. 0 -…
fsh
  • 319
  • 1
  • 2
  • 8
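Shared memory is not zero-initialized; the usual pattern is to have the block's threads clear it cooperatively before use. A sketch, assuming the array length is passed in and the kernel is launched with that many ints of dynamic shared memory:

```
__global__ void kernel(int* something, int sharedCount)
{
    extern __shared__ int shared_array[];

    // Each thread clears a strided subset of the array.
    for (int i = threadIdx.x; i < sharedCount; i += blockDim.x)
        shared_array[i] = 0;
    __syncthreads();          // the whole block sees the zeroed array from here on

    // Some operations on shared_array here.
}
```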
16
votes
1 answer

Is local memory slower than shared memory in CUDA?

I have only found a remark that local memory is slower than register memory, the two per-thread kinds of memory. Shared memory is supposed to be fast, but is it faster than local memory [of the thread]? What I want to do is kind of a median filter, but with a…
JohnKay
  • 233
  • 2
  • 7
15
votes
1 answer

Dynamic Shared Memory in CUDA

There are similar questions to what I'm about to ask, but I feel like none of them get at the heart of what I'm really looking for. What I have now is a CUDA method that requires defining two arrays into shared memory. Now, the size of the arrays is…
zephyr
  • 2,182
  • 3
  • 29
  • 51
14
votes
1 answer

How is 2D Shared Memory arranged in CUDA

I've always worked with linear shared memory (load, store, access neighbours), but I made a simple 2D test to study bank conflicts, and the results have confused me. The code below reads data from a one-dimensional global memory array into shared memory…
pQB
  • 3,077
  • 3
  • 23
  • 49
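A common illustration of the 2D layout asked about above is the padded transpose tile: shared memory is laid out row-major across 32 four-byte banks, so reading a column of a 32 x 32 float tile makes every thread hit the same bank, and padding each row by one element removes the conflict. A sketch, assuming a square matrix whose side is a multiple of the tile size:

```
#define TILE 32   // one float per bank; assumed for this sketch

__global__ void transposeTile(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding per row

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // row-wise write: conflict-free
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read: conflict-free
                                                          // thanks to the padding
}
```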
13
votes
1 answer

Use dynamic shared memory allocation for two different vectors

In a kernel function, I want two vectors of shared memory, both of length size (actually sizeof(float)*size bytes each). Since it is not possible to allocate memory directly in the kernel function if a variable size is needed, I had to allocate it dynamically,…
BobCormorano
  • 650
  • 1
  • 7
  • 14
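Only one dynamically sized extern __shared__ array may be declared per kernel, so the usual answer to the question above is to allocate a single block and carve both vectors out of it by pointer arithmetic. A sketch, with size as a hypothetical runtime parameter:

```
__global__ void kernel(int size)
{
    extern __shared__ float smem[];
    float* a = smem;           // first  `size` floats
    float* b = smem + size;    // second `size` floats
    // ... use a[] and b[] ...
}

// Launched with room for both vectors:
//   kernel<<<grid, block, 2 * size * sizeof(float)>>>(size);
```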
12
votes
2 answers

GPU Shared Memory Bank Conflict

I am trying to understand how bank conflicts take place. I have an array of size 256 in global memory and I have 256 threads in a single block, and I want to copy the array to shared memory. Therefore every thread copies one…
scatman
  • 14,109
  • 22
  • 70
  • 93
11
votes
3 answers

Is there a limit to OpenCL local memory?

Today I added four more __local variables to my kernel to dump intermediate results in. But just adding the four more variables to the kernel's signature and adding the corresponding Kernel arguments renders all output of the kernel to "0"s. None of…
Framester
  • 33,341
  • 51
  • 130
  • 192
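There is indeed a per-device limit on local (OpenCL) / shared (CUDA) memory per work-group or block; in OpenCL it is reported by clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE. For reference, the CUDA-side query looks like this (a sketch for device 0):

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Typically on the order of 16-96+ KiB, depending on the architecture.
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```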
11
votes
2 answers

Templated CUDA kernel with dynamic shared memory

I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write: template <typename T> __global__ void kernel(T* ptr) { extern __shared__ T smem[]; …
piripiri
  • 1,925
  • 2
  • 18
  • 35
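The usual problem behind the question above is that extern __shared__ T smem[]; inside a template produces conflicting declarations once the kernel is instantiated for more than one T. One common workaround (a sketch, not the only fix) is to declare the dynamic shared memory once with a fixed type and reinterpret it per instantiation:

```
// Single untyped declaration, reinterpreted to the requested element type.
template <typename T>
__device__ T* sharedMemory()
{
    extern __shared__ unsigned char smemRaw[];
    return reinterpret_cast<T*>(smemRaw);
}

template <typename T>
__global__ void kernel(T* ptr, int n)
{
    T* smem = sharedMemory<T>();          // points at the dynamic shared memory block
    if (threadIdx.x < n)
        smem[threadIdx.x] = ptr[threadIdx.x];
    __syncthreads();
    // ... work on smem ...
}
```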