Bank-Conflict-Free Access in shared memory

Question

I have to use shared memory that is 64 elements in size, twice the number of banks and threads in a warp. How should I address them to yield a bank-conflict-free access?

How do you access your shared memory? Do you need one 64-bit access per thread, or two 32-bit accesses? – geek, Mar 31 '12 at 15:11

score 2 · Answer 1 · edited Mar 31 '12 at 15:27


For 32-bit accesses you can use the default access pattern:

__shared__ int shared[32];
int data = shared[base + stride * tid];

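where stride is odd. An odd stride shares no common factor with the number of banks (16 or 32, both powers of two), so the threads of a warp land in different banks. The array must be large enough to hold base + stride * tid for the largest tid.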
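The odd-stride rule can be checked on the host with a small model. This is only a sketch: it assumes that 32-bit word i of shared memory lives in bank i % banks, and the function name `worst_bank_load` is made up for illustration. It counts how many threads fall into the busiest bank, so a result of 1 means the access is conflict-free.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side model (an assumption, not a hardware query): shared memory is
// split into `banks` banks of 32-bit words, and word i lives in bank i % banks.
// Returns the largest number of threads that hit the same bank when thread
// `tid` reads word `base + stride * tid`. A result of 1 means conflict-free.
int worst_bank_load(int base, int stride, int threads, int banks) {
    assert(base >= 0 && stride > 0 && threads > 0 && banks > 0);
    std::vector<int> hits(banks, 0);
    for (int tid = 0; tid < threads; ++tid) {
        ++hits[(base + stride * tid) % banks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With 32 threads and 32 banks, `worst_bank_load(0, 3, 32, 32)` gives 1, while `worst_bank_load(0, 2, 32, 32)` gives 2, a two-way conflict.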

For 64-bit accesses you can use a trick like this:

struct type
{
   int x, y, z;   // 3 words per element, so consecutive elements are an odd stride apart
};
__shared__ struct type shared[32];
struct type data = shared[base + tid];
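To see why a three-word struct helps, the sketch below compares a two-word element with a three-word one. It assumes each field is fetched with its own 32-bit load and that word i sits in bank i % banks; real behavior depends on the compute capability and on the loads the compiler emits. The names `Pair`, `Padded` and `worst_bank_load_per_field` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Pair   { int x, y; };     // 2 words per element (even stride)
struct Padded { int x, y, z; };  // 3 words per element (odd stride)

// Simplified model: thread `tid` reads element `tid` of an array of T, one
// 32-bit field at a time, and word i lives in bank i % banks.
// Returns the worst number of threads sharing a bank over all fields.
template <typename T>
int worst_bank_load_per_field(int threads, int banks) {
    static_assert(sizeof(T) % sizeof(int) == 0, "element must be whole words");
    assert(threads > 0 && banks > 0);
    const int words = static_cast<int>(sizeof(T) / sizeof(int));
    int worst = 0;
    for (int field = 0; field < words; ++field) {
        std::vector<int> hits(banks, 0);
        for (int tid = 0; tid < threads; ++tid) {
            ++hits[(tid * words + field) % banks];
        }
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
    }
    return worst;
}
```

Under this model, `worst_bank_load_per_field<Pair>(32, 32)` is 2 and `worst_bank_load_per_field<Padded>(32, 32)` is 1, which is the effect the extra word is after.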

answered Mar 31 '12 at 15:23 by geek, edited Mar 31 '12 at 15:27 by Ashwin Nanjappa

Thanks a lot. As an example, I have an array of 32 elements shared among 16 threads, so each thread has to access two elements of the array. What is the proper addressing in that case? – Behzad Baghapour Mar 31 '12 at 16:01
The first pattern works well in your case. Could I ask one small question: why 16? Any array with the `__shared__` qualifier is shared by all threads in the thread block. – geek Mar 31 '12 at 16:25
It is just an example. I actually deal with FEM simulations, which share some arrays, such as mass matrices, among the computations performed by the threads. – Behzad Baghapour Mar 31 '12 at 16:37

score 0 · Answer 2 · answered Apr 03 '12 at 08:57

Let's assume you're using compute capability 1.x, so your shared memory has 16 banks, and each thread has to access 2 elements in shared memory.

What you want is for a thread to access the same memory bank for both elements, so if you organize it such that the required elements are 16 away from each other, you should avoid bank conflicts.

__shared__ int shared[32];
int data0 = shared[base + stride * tid];       // first element
int data1 = shared[base + stride * tid + 16];  // second element, same bank as the first
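The arithmetic behind the 16-apart layout can be sketched on the host. The model assumes 16 banks with word i in bank i % 16, and it takes base = 0 and stride = 1 for simplicity. On compute capability 1.x, shared memory requests are issued per half-warp, so 16 threads are checked at a time. The names `kBanks`, `bank_of` and `half_warp_conflict_free` are illustrative only.

```cpp
#include <cassert>
#include <vector>

// Model assumption: 16 banks of 32-bit words, word i lives in bank i % 16.
constexpr int kBanks = 16;

constexpr int bank_of(int word_index) { return word_index % kBanks; }

// True when the 16 threads of a half-warp, each reading word `tid + offset`,
// touch 16 different banks (base = 0 and stride = 1 are assumed here).
bool half_warp_conflict_free(int offset) {
    assert(offset >= 0);
    std::vector<bool> used(kBanks, false);
    for (int tid = 0; tid < kBanks; ++tid) {
        const int b = bank_of(tid + offset);
        if (used[b]) return false;  // two threads landed in the same bank
        used[b] = true;
    }
    return true;
}
```

Each thread's two elements share a bank, since `bank_of(tid) == bank_of(tid + 16)`, but the two reads are separate instructions. Within each read the threads spread over all 16 banks, so `half_warp_conflict_free(0)` and `half_warp_conflict_free(16)` both hold.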

I used this pattern to store complex floats. Since I had a 2D array of them, it looked like this:

#define TILE_WIDTH 16

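__shared__ float shared[TILE_WIDTH][2*TILE_WIDTH + 1];
// row: row index within the tile (assumed, e.g. threadIdx.y); a 2D array needs both indices
float real = shared[row][base + stride * tid];
float imag = shared[row][base + stride * tid + TILE_WIDTH];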

The +1 pads each row by one word, which avoids serialization in transposed (column-wise) access patterns.
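The effect of the padding can be illustrated with a host-side model of a column walk. It assumes a row-major array of 32-bit floats with word i in bank i % banks, and that thread tid reads element [tid][col]. The function name `column_read_worst_load` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Model: a row-major array of 32-bit floats with `row_width` words per row,
// word i in bank i % banks. Thread `tid` reads element [tid][col], i.e. a
// column walk as in a transposed access. Returns the number of threads in the
// busiest bank (1 means conflict-free).
int column_read_worst_load(int row_width, int col, int threads, int banks) {
    assert(row_width > 0 && col >= 0 && col < row_width);
    assert(threads > 0 && banks > 0);
    std::vector<int> hits(banks, 0);
    for (int tid = 0; tid < threads; ++tid) {
        ++hits[(tid * row_width + col) % banks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With TILE_WIDTH = 16 and 16 banks, an unpadded row width of 32 gives `column_read_worst_load(32, 0, 16, 16) == 16`, so every thread hits the same bank. The padded width of 33 gives 1, because the row stride is then odd.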
