I was trying to detect shared memory bank conflicts for matrix transposition kernels. The first kernel performs matrix transposition without padding, and hence should have bank conflicts, while the second kernel uses padding, and should not have bank conflicts.
However, profiling with NSight Compute in the memory workload section shows 0 bank conflicts for both kernels.
I implemented the kernels as device functions like so
// tiled, with padding (expecting no bank conflicts)
template <class value_type, class container_type = value_type*>
__device__
void
transpose_padded(container_type m1, container_type m2, size_t width)
{
__shared__ value_type tile[BLOCK_WIDTH][BLOCK_WIDTH+1];
// BLOCK_WIDTH = 32, global scope constant
auto row = blockDim.y*blockIdx.y + threadIdx.y;
auto col = blockDim.x*blockIdx.x + threadIdx.x;
auto index = row * width + col;
auto tr_row = blockDim.y * blockIdx.x + threadIdx.y;
auto tr_col = blockDim.x * blockIdx.y + threadIdx.x;
auto tr_index = tr_row * width + col;
auto local_x = threadIdx.x;
auto local_y = threadIdx.y;
tile[local_x][local_y] = m1[index];
__syncthreads();
if (tr_row < width && tr_col < width)
{
m2[tr_index] = tile[local_y][local_x];
}
return;
}
// tiled, without padding (expecting bank conflicts)
template <class value_type, class container_type = value_type*>
__device__
void
transpose_tiled(container_type input, container_type output, size_t width)
{
// assuming square blocks
extern __shared__ value_type input_tile[];
auto row = blockDim.y*blockIdx.y + threadIdx.y;
auto col = blockDim.x*blockIdx.x + threadIdx.x;
auto matrix_index = row*width + col;
auto tr_row = col;
auto tr_col = row;
auto tr_index = tr_row*width + tr_col;
// coalesced global memory access
auto shared_index = threadIdx.y*blockDim.x+threadIdx.x;
input_tile[shared_index]= input[matrix_index];
__syncthreads();
if (tr_row < width && tr_col < width)
output[tr_index] = input_tile[shared_index];
return;
}
The input matrix that I used had dimensions 100x100. In both kernels, the block sizes are 32x32 threads. The instantiations have value type double.
Are there really no bank conflicts, or is this caused by something else altogether? What other information from other sections can I use to determine whether there could be bank conflicts or not?