__global__ void setRowReadColPad(int *out)
{
    // static shared memory, padded by one column
    __shared__ int tile[32][33];

    // mapping from thread index to global memory offset
    unsigned int idx = threadIdx.y * blockDim.x + threadIdx.x;

    // shared memory store operation (row-major write)
    tile[threadIdx.y][threadIdx.x] = idx;

    // wait for all threads to complete
    __syncthreads();

    // shared memory load operation (column-major read)
    out[idx] = tile[threadIdx.x][threadIdx.y];
}
With shared memory padding (the extra column in tile[32][33]), the kernel takes 13.473 us.
Without padding, it takes 5.025 us.
Can someone please explain why padding makes such a large difference to the kernel runtime?
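
For context, here is a minimal host-side sketch of the bank arithmetic that padding is meant to change. It is not part of the kernel above. It assumes 32 shared memory banks that are each 4 bytes wide and a block with blockDim.x = 32, so that one warp covers threadIdx.x = 0..31 at a fixed threadIdx.y. The names kBanks, kWarpSize and distinctBanksForColumnRead are hypothetical helpers made up for this illustration.

```cpp
#include <cassert>
#include <set>

// Assumption: 32 banks, each 4 bytes wide, so the int at word offset w
// lives in bank w % 32.
constexpr int kBanks = 32;
// Assumption: one warp is 32 threads with threadIdx.x = 0..31.
constexpr int kWarpSize = 32;

// Counts how many distinct banks one warp touches when it reads a column
// of tile[32][row_width], i.e. tile[threadIdx.x][col] for every thread
// in the warp. 1 means a 32-way conflict, 32 means no conflict.
int distinctBanksForColumnRead(int row_width, int col)
{
    assert(row_width > 0 && col >= 0 && col < row_width);
    std::set<int> banks;
    for (int tx = 0; tx < kWarpSize; ++tx) {
        // Row-major layout: element [tx][col] sits at tx * row_width + col.
        int word_offset = tx * row_width + col;
        banks.insert(word_offset % kBanks);
    }
    return static_cast<int>(banks.size());
}
```

Under these assumptions, distinctBanksForColumnRead(32, 0) returns 1 (every thread in the warp hits the same bank) and distinctBanksForColumnRead(33, 0) returns 32 (every thread hits a different bank). On that basis the padded kernel would normally be expected to be the faster one, which is the opposite of the timings above.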