1

In my program I use shared memory to do prefetching of data. A 2D block of threads, dimentions 8 by 4 (32), gets 8 * 4 * 8 * sizeof(float4) bytes of shared memory. Each thread copies 8 float4s in a loop:

inline __device__ void pack(const float4 *g_src, float4 *s_dst, const unsigned int w, const unsigned int d) {
    uint2 indx = { blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y };
    uint2 sindx = { threadIdx.x, threadIdx.y };
    int i;

    for (i = 0; i < d; ++i) s_dst[(sindx.y * blockDim.x + sindx.x) * d + i] = g_src[(w * indx.y + indx.x) * d + i];
} 

where w is set to width of the global memory buffer (in number of float4s) and d is set to 8 (number of float4s copied).

Can such configuration and further usage of the memory, lead to bank conflicts, or will broadcasting be applied? Will this be a case also when threads copy only, say 5 float4s, not 8?

MK

P.S. Same topic in Nvidia forum

paleonix
  • 2,293
  • 1
  • 13
  • 29
Dori
  • 675
  • 1
  • 7
  • 26

1 Answers1

1

During prefetching phase bank conflicts will occur. E.g. threads within first warp with IDs (computed as threadIdx.x + threadIdx.y * blockDim.x) 0, 4, 8, ... 28 access same bank. You can see it as thread (0,0) and thread (4,0) for i equals 0 access s_dst[0] and s_dst[32] belonging to the same bank.

If bank conflicts occur during further usage depends on the scheme you will access s_dst.

Broadcast mechanism is applied only when threads simultaneously read the same address.

How many bank conflicts occur depends on the value of d. If d mod 32 == 1 there won't be any conflicts.

EDIT: IMHO the best way to avoid bank conflicts in prefetching phase, specially if d is changing, is to equaly split the work among the warps. Lets say you need to prefetch n values to shared memory, w_id is ID of warp and l_id is ID of thread within warp (from 0 to 31). Than prefetching should look like this:

for(int i = l_id + w_id*WARP_SIZE; i < n; i += WARP_SIZE*COUNT_OF_WARPS_IN_BLOCK)
{
    s_dst[i] = ...;
}

But this helps only to avoid bank conflicts during prefetching. As I have already said to avoid conflicts during further usage depends on the scheme you will access s_dst.

stuhlo
  • 1,479
  • 9
  • 17
  • Can it be done without bank conflicts? Would unrolling the 'for' loop do better? – Dori Feb 27 '13 at 12:17
  • w_id = (threadIdx.x + threadIdx.y * blockDim.x) / 32, l_id = (threadIdx.x + threadIdx.y * blockDim.x) % 32 – stuhlo Feb 27 '13 at 13:16
  • `n` is equal to entire count of values that are going to be prefetched. – stuhlo Feb 27 '13 at 14:37
  • And how, according to this, address global memory? – Dori Feb 28 '13 at 09:04
  • You have to add some offset to `i` determined by ID of block which represents beginning of data in global memory belonging to that block. Something like `g_src[BLOCK_ID * n + i]` – stuhlo Feb 28 '13 at 11:55