CUDA: Why does padding the shared memory array by one column decrease the speed of the kernel by 250%?

Question

__global__ void setRowReadColPad(int *out)
{
    // static shared memory
    __shared__ int tile[32][33];

    // flat thread index within the block, also used as the global memory offset
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;

    // shared memory store: each thread writes its own row-major element
    tile[threadIdx.y][threadIdx.x] = idx;

    // wait for all threads in the block to finish storing
    __syncthreads();

    // shared memory load: each thread reads the transposed (column-major) element
    out[idx] = tile[threadIdx.x][threadIdx.y];
}

With shared memory padding (`tile[32][33]`), the kernel takes 13.473us.

Without padding (`tile[32][32]`), it takes 5.025us.

Can someone please explain why padding makes such a large difference to the kernel runtime?

I see the opposite. When I use the last dimension of the shared memory as 33 on a Tesla P100 I see a kernel execution time of 2.1us, and when I use a dimension of 32 there, I see 2.9us. In any event, the discrepancy is because in the non-padded case, you have columnar access to shared memory which creates bank conflicts. In the padded case (dimension = 33) you have no bank conflicts. — Robert Crovella, Mar 05 '19 at 19:20
With padding, I've checked that there is only one shared memory access per load/store request, but the kernel takes almost 3 times as long as without padding. I have a GTX 1070Ti. — Lluis Beltrán Rovira, Mar 06 '19 at 06:48
I do not know what's going on: with `__shared__ int tile[32][32]` the accesses go to shared memory, but with `__shared__ int tile[32][33]` they seem to go to the L2 cache. That's the question: how the compiler handles the memory accesses. — Lluis Beltrán Rovira, Mar 06 '19 at 08:15
I cannot reproduce it on an NVIDIA 960. The GPU activity times reported by nvprof are: stride 32 --> 3.1360us; stride 33 --> 1.4720us. — johnjohnlys, Mar 06 '19 at 14:33
I'm still investigating, but have found nothing so far. Another user at [link](https://devtalk.nvidia.com/default/topic/1029015/share-memory-bank-conflict-no-conflict-is-slower-than-all-conflict-/) had the same problem but got no replies. I'm new to CUDA. — Lluis Beltrán Rovira, Mar 07 '19 at 06:59
I've just checked on a GT525M and the results are as expected: `248.39ms [32][32] bank conflicts......140.25ms [32][33] free bank conflicts` I don't know what happened with my GTX1070 Ti. A mystery... — Lluis Beltrán Rovira, Mar 07 '19 at 17:23
