Questions tagged [gpu-shared-memory]

The memory space on a GPU computational core that is shared by all threads of a block in the launch grid (by all "work-items" of a "work-group", in OpenCL parlance).

"Shared Memory" is one of the kinds of memory spaces in a GPU's memory model.

This memory space, for a specific GPU thread ("work-item" in OpenCL parlance), is located on the computational core of the GPU chip on which the thread executes (e.g. an SMX core on an NVIDIA Kepler GPU).

Shared memory is the "closest" (or fastest, if you will) memory space for a thread which is shared with other threads - all other threads in its block ("work-group" in OpenCL parlance), which are also executing on the same computational core.

Shared memory is similar in structure and behavior to the L1 cache, and on some architectures the two are partially interchangeable, being carved out of the same on-chip storage. Accessing it is therefore slower than using a thread's own registers, and concurrent accesses by different threads may conflict and degrade performance (so-called bank conflicts).

In OpenCL this memory space is called "local memory", which can cause confusion, since "local memory" means something entirely different in CUDA.

OpenGL, on the other hand, uses the term "shared memory" in the same sense as CUDA.
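As a minimal illustration of the typical usage pattern, here is a sketch (hypothetical kernel and names) in which each block stages a tile of the input in shared memory, synchronizes, and then reads its neighbours from the fast on-chip copy rather than from global memory:

```
#include <cuda_runtime.h>

#define TILE 256  // assumed block size: one element of shared memory per thread

__global__ void blur1d(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];               // on-chip, visible to the whole block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        tile[threadIdx.x] = in[gid];           // cooperative load: global -> shared
    __syncthreads();                           // all threads see the finished tile

    if (gid < n) {
        float centre = tile[threadIdx.x];
        float left   = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : centre;
        float right  = (threadIdx.x < TILE - 1 && gid + 1 < n) ? tile[threadIdx.x + 1] : centre;
        out[gid] = (left + centre + right) / 3.0f;   // neighbours come from shared memory
    }
}
```

(The sketch ignores halo elements at block boundaries; it is only meant to show the declare-load-sync-use pattern.)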

339 questions
46
votes
3 answers

What's the difference between CUDA shared and global memory?

I’m getting confused about how to use shared and global memory in CUDA, especially with respect to the following: When we use cudaMalloc(), do we get a pointer to shared or global memory? Does global memory reside on the host or device? Is there…
mchen
  • 9,808
  • 17
  • 72
  • 125
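A short sketch of the distinction asked about above (hypothetical names): cudaMalloc() returns a pointer into global memory, which lives in device DRAM and persists across kernel launches, while __shared__ variables live on-chip, are private to a block, and exist only for that block's lifetime.

```
#include <cuda_runtime.h>

__global__ void scaleViaShared(float* g)       // g points into *global* memory
{
    __shared__ float s[128];                   // *shared* memory: on-chip, per block
    s[threadIdx.x] = g[threadIdx.x];           // global -> shared
    __syncthreads();
    g[threadIdx.x] = 2.0f * s[threadIdx.x];    // shared -> global
}

int main()
{
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, 128 * sizeof(float));   // allocates global memory on the device,
                                               // not shared memory and not host memory
    cudaMemset(d_buf, 0, 128 * sizeof(float));
    scaleViaShared<<<1, 128>>>(d_buf);
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}
```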
44
votes
5 answers

allocating shared memory

I am trying to allocate shared memory by using a constant parameter but getting an error. My kernel looks like this: __global__ void Kernel(const int count) { __shared__ int a[count]; } and I am getting an error saying error: expression must…
lina
  • 1,679
  • 4
  • 21
  • 25
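The usual resolution of the error above is that a statically declared __shared__ array needs a compile-time constant size; when the size is only known at runtime, dynamic shared memory is used instead. A sketch, assuming the byte count is passed as the third launch-configuration parameter:

```
#include <cuda_runtime.h>

__global__ void Kernel(int count)
{
    extern __shared__ int a[];       // unsized: the byte count is chosen at launch
    if (threadIdx.x < count)
        a[threadIdx.x] = threadIdx.x;
    __syncthreads();
    // ... use a[] ...
}

int main()
{
    int count = 100;                                  // runtime value
    Kernel<<<1, 128, count * sizeof(int)>>>(count);   // third parameter: bytes of shared memory
    cudaDeviceSynchronize();
    return 0;
}
```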
25
votes
3 answers

CUDA: When to use shared memory and when to rely on L1 caching?

Since Compute Capability 2.0 (Fermi) was released, I've wondered whether there are any use cases left for shared memory. That is, when is it better to use shared memory than just let L1 perform its magic in the background? Is shared memory simply there…
Roger Dahl
  • 15,132
  • 8
  • 62
  • 82
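Related to the question above: on Fermi and later, shared memory and L1 are carved from the same on-chip storage, and the split can be hinted per kernel with cudaFuncSetCacheConfig (the driver may ignore the hint). A small sketch with a hypothetical kernel:

```
#include <cuda_runtime.h>

__global__ void myKernel(float* data)
{
    data[threadIdx.x] *= 2.0f;       // placeholder work
}

int main()
{
    // Ask for a larger L1 (and smaller shared memory) for this kernel...
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    // ...or the opposite:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    float* d = nullptr;
    cudaMalloc(&d, 256 * sizeof(float));
    myKernel<<<1, 256>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```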
23
votes
1 answer

CUDA: Tiled matrix-matrix multiplication with shared memory and matrix size which is non-multiple of the block size

I'm trying to familiarize myself with CUDA programming, and having a pretty fun time of it. I'm currently looking at this pdf which deals with matrix multiplication, done with and without shared memory. Full code for both versions can be found here.…
Mike
  • 263
  • 1
  • 2
  • 9
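The standard way to handle a matrix size that is not a multiple of the block size is to zero-fill out-of-range tile elements and guard the final store. A sketch for square N x N matrices (hypothetical names):

```
#include <cuda_runtime.h>

#define TILE 16   // block is TILE x TILE threads; assumed for this sketch

__global__ void matMulTiled(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements of the partial edge tiles are loaded as 0.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)       // guarded store for the partial tiles
        C[row * N + col] = acc;
}
```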
20
votes
2 answers

When to use volatile with shared CUDA Memory

Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache any values, but my question is about the behavior with a shared array: __shared__ float…
Taj Morton
  • 1,588
  • 4
  • 18
  • 26
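The classic case behind the question above is warp-synchronous reduction on pre-Volta GPUs: marking the shared array volatile stops the compiler from caching partial sums in registers, so each thread observes its neighbours' writes. A sketch (on current GPUs, explicit __syncwarp() or warp shuffles are the safer choice):

```
// Final reduction step, called with tid < 32 after the block-wide part is done.
__device__ void warpReduce(volatile float* sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```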
18
votes
3 answers

Cuda Shared Memory array variable

I am trying to declare a variable for matrix multiplication as follows: __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; I am trying to make it so the user could input the size of the matrix to calculate, however that would mean changing the…
Dan
  • 1,041
  • 1
  • 12
  • 32
18
votes
3 answers

GPU shared memory size is very small - what can I do about it?

The size of the shared memory ("local memory" in OpenCL terms) is only 16 KiB on most NVIDIA GPUs of today. I have an application in which I need to create an array that has 10,000 integers. So the amount of memory I will need to fit 10,000 integers…
rana
  • 181
  • 1
  • 1
  • 3
17
votes
3 answers

Is there a way of setting default value for shared memory array?

Consider the following code: __global__ void kernel(int *something) { extern __shared__ int shared_array[]; // Some operations on shared_array here. } Is it possible to initialize the whole shared_array to some value - e.g. 0 -…
fsh
  • 319
  • 1
  • 2
  • 8
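Shared memory is not zero-initialized; the usual pattern is to have the block's threads clear it cooperatively before use. A sketch, assuming the array length is passed in and the kernel is launched with that many ints of dynamic shared memory:

```
__global__ void kernel(int* something, int sharedCount)
{
    extern __shared__ int shared_array[];

    // Each thread clears a strided subset of the array.
    for (int i = threadIdx.x; i < sharedCount; i += blockDim.x)
        shared_array[i] = 0;
    __syncthreads();          // the whole block sees the zeroed array from here on

    // Some operations on shared_array here.
}
```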
16
votes
1 answer

Is local memory slower than shared memory in CUDA?

I have only found a remark that local memory is slower than register memory, the two per-thread kinds of memory. Shared memory is supposed to be fast, but is it faster than local memory [of the thread]? What I want to do is kind of a median filter, but with a…
JohnKay
  • 233
  • 2
  • 7
15
votes
1 answer

Dynamic Shared Memory in CUDA

There are similar questions to what I'm about to ask, but I feel like none of them get at the heart of what I'm really looking for. What I have now is a CUDA method that requires defining two arrays into shared memory. Now, the size of the arrays is…
zephyr
  • 2,182
  • 3
  • 29
  • 51
14
votes
1 answer

How is 2D Shared Memory arranged in CUDA

I've always worked with linear shared memory (load, store, access neighbours), but I made a simple 2D test to study bank conflicts, and the results have confused me. The code below reads data from a one-dimensional global memory array into shared memory…
pQB
  • 3,077
  • 3
  • 23
  • 49
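A common illustration of the 2D layout asked about above is the padded transpose tile: shared memory is laid out row-major across 32 four-byte banks, so reading a column of a 32 x 32 float tile makes every thread hit the same bank, and padding each row by one element removes the conflict. A sketch, assuming a square matrix whose side is a multiple of the tile size:

```
#define TILE 32   // one float per bank; assumed for this sketch

__global__ void transposeTile(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding per row

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // row-wise write: conflict-free
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                  // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read: conflict-free
                                                          // thanks to the padding
}
```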
13
votes
1 answer

Use dynamic shared memory allocation for two different vectors

In a kernel function, I want two vectors of shared memory, both of length size (actually sizeof(float)*size bytes each). Since it is not possible to allocate memory directly in the kernel function if a variable size is needed, I had to allocate it dynamically,…
BobCormorano
  • 650
  • 1
  • 7
  • 14
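Only one dynamically sized extern __shared__ array may be declared per kernel, so the usual answer to the question above is to allocate a single block and carve both vectors out of it by pointer arithmetic. A sketch, with size as a hypothetical runtime parameter:

```
__global__ void kernel(int size)
{
    extern __shared__ float smem[];
    float* a = smem;           // first  `size` floats
    float* b = smem + size;    // second `size` floats
    // ... use a[] and b[] ...
}

// Launched with room for both vectors:
//   kernel<<<grid, block, 2 * size * sizeof(float)>>>(size);
```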
12
votes
2 answers

GPU Shared Memory Bank Conflict

I am trying to understand how bank conflicts take place. I have an array of size 256 in global memory and I have 256 threads in a single block, and I want to copy the array to shared memory. Therefore every thread copies one…
scatman
  • 14,109
  • 22
  • 70
  • 93
11
votes
3 answers

Is there a limit to OpenCL local memory?

Today I added four more __local variables to my kernel to dump intermediate results in. But just adding the four more variables to the kernel's signature and adding the corresponding Kernel arguments renders all output of the kernel to "0"s. None of…
Framester
  • 33,341
  • 51
  • 130
  • 192
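There is indeed a per-device limit on local (OpenCL) / shared (CUDA) memory per work-group or block; in OpenCL it is reported by clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE. For reference, the CUDA-side query looks like this (a sketch for device 0):

```
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Typically on the order of 16-96+ KiB, depending on the architecture.
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```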
11
votes
2 answers

Templated CUDA kernel with dynamic shared memory

I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write: template <typename T> __global__ void kernel(T* ptr) { extern __shared__ T smem[]; …
piripiri
  • 1,925
  • 2
  • 18
  • 35
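The usual problem behind the question above is that extern __shared__ T smem[]; inside a template produces conflicting declarations once the kernel is instantiated for more than one T. One common workaround (a sketch, not the only fix) is to declare the dynamic shared memory once with a fixed type and reinterpret it per instantiation:

```
// Single untyped declaration, reinterpreted to the requested element type.
template <typename T>
__device__ T* sharedMemory()
{
    extern __shared__ unsigned char smemRaw[];
    return reinterpret_cast<T*>(smemRaw);
}

template <typename T>
__global__ void kernel(T* ptr, int n)
{
    T* smem = sharedMemory<T>();          // points at the dynamic shared memory block
    if (threadIdx.x < n)
        smem[threadIdx.x] = ptr[threadIdx.x];
    __syncthreads();
    // ... work on smem ...
}
```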