
I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to benchmark a warp (32 threads) accessing 32 integers of 32 bits each in the following two scenarios:

  • When there is no bank conflict (offset=1)
  • When there is a bank conflict (offset=32, all threads are accessing bank 0)

Here is a sample of the code (only the kernel, plus the definitions it needs):

#include <cstdint>

#define MEMORY_SIZE 2048   // matches the "2048 / 32 = 64" comment in the loop below

__global__ void kernel(int offset) {

    __shared__ uint32_t shared_memory[MEMORY_SIZE];

    // init shared memory (only thread 0 does the work)
    if (threadIdx.x == 0) {
        for (int i = 0; i < MEMORY_SIZE; i++)
            shared_memory[i] = i;
    }

    __syncthreads();

    // offset = 1  -> each lane of the warp stays on its own bank
    // offset = 32 -> every lane maps to bank 0
    uint32_t index = threadIdx.x * offset;

    // 2048 / 32 = 64
    for (int i = 0; i < 64; i++)
    {
        shared_memory[index] += index * 10;

        index += 32;
        index %= MEMORY_SIZE;

        __syncthreads();
    }
}

I expected the version with offset=32 to run slower than the one with offset=1, since the accesses should be serialized, but I found that both take about the same time. How is that possible?

Ferdinand Mom
  • Please share your code for the two versions here on the question page. – einpoklum Nov 20 '22 at 23:48
  • The code is quite long, but it is in the GitHub gist; you can change the parameter `offset` through the command line – Ferdinand Mom Nov 20 '22 at 23:49
  • Indeed, I tried to put cudaDeviceSynchronize() inside the for loop after every kernel call and found an average time of 0.0231144 ms (for offset=1) and 0.0278107 ms (offset=32). I would have expected a noticeable difference – Ferdinand Mom Nov 20 '22 at 23:54
  • @einpoklum He is. It is part of `elapsed_time()`, which isn't ideal either. Just use [nvbench](https://github.com/NVIDIA/nvbench) instead of trying to roll your own benchmarking code. – paleonix Nov 20 '22 at 23:54
  • @paleonix: My [gpu-kernel-runner](https://github.com/eyalroz/gpu-kernel-runner/) may also be suited to something like this. But regardless - that might not be the issue here. – einpoklum Nov 20 '22 at 23:56
  • Another thing to check is what the PTX (and perhaps even the SASS) looks like. Perhaps something gets optimized away? Also, try performing _many_ writes, to drown out the effect of other things, like thread setup, shared mem initialization, block synchronization etc. – einpoklum Nov 20 '22 at 23:58
  • If you really need to initialize the shared memory for this, you should also let the threads cooperate on that to cut down that overhead (see the sketch after these comments). – paleonix Nov 21 '22 at 00:06
  • After adding many writes, I start to see a difference, even if it is not very noticeable. I guess that is enough to illustrate the concept. Thanks to both of you for your help! – Ferdinand Mom Nov 21 '22 at 00:12
  • Edit: I previously meant `shared_efficiency`, not `shared_utilization`. Anyway, I now have the expected results: for `offset=1` I get a `shared_efficiency` of 41.67, vs 3.68 for `offset=32`. – Ferdinand Mom Nov 21 '22 at 01:00
  • Only 1 thread in a warp writes to shared memory in your test? Then there should not be conflicts and differences. Ok, I see. – Mikhail M Nov 25 '22 at 06:58
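
For reference, a minimal sketch of the cooperative initialization suggested in the comments, which could replace the single-threaded init in the kernel above (it uses only names already present in the question):

    // every thread fills a strided slice of the array instead of thread 0 doing all of it
    for (int i = threadIdx.x; i < MEMORY_SIZE; i += blockDim.x)
        shared_memory[i] = i;

    __syncthreads();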

1 Answer


You have only 1 working warp, so the biggest problem for your performance is that each (or most) GPU instruction waits for the previous one to finish, and that latency hides most of the slowdown from shared memory conflicts. You also do a lot of work per shared memory access: how many small instructions are there in `cosf`? Try simple integer arithmetic instead.

Mikhail M
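
A minimal sketch of what this suggestion could look like, assuming a single block of 256 threads (8 warps instead of 1), a larger shared array so that every thread keeps its own slot even with offset=32, and an out buffer whose only purpose is to stop the compiler from optimizing the loop away (BLOCK_SIZE, N_ITER, and out are assumptions, not part of the original code):

#include <cstdint>

constexpr int BLOCK_SIZE  = 256;              // 8 warps, so the scheduler can overlap latencies
constexpr int MEMORY_SIZE = 32 * BLOCK_SIZE;  // 32 KiB of shared memory, one slot per thread even at offset=32
constexpr int N_ITER      = 100000;           // enough accesses for shared memory traffic to dominate

__global__ void conflict_kernel(int offset, uint32_t *out) {
    __shared__ uint32_t shared_memory[MEMORY_SIZE];

    // cooperative initialization: every thread fills a strided slice
    for (int i = threadIdx.x; i < MEMORY_SIZE; i += blockDim.x)
        shared_memory[i] = i;
    __syncthreads();

    // offset == 1  -> the 32 lanes of a warp hit 32 different banks (no conflict)
    // offset == 32 -> all 32 lanes of a warp hit bank 0 (32-way conflict)
    uint32_t index = (threadIdx.x * offset) % MEMORY_SIZE;

    for (int i = 0; i < N_ITER; i++)
        shared_memory[index] += i;            // simple integer read-modify-write, no cosf

    __syncthreads();

    // write something back so the loop is not optimized away
    out[threadIdx.x] = shared_memory[index];
}

Launched as conflict_kernel<<<1, BLOCK_SIZE>>>(offset, d_out), the offset=32 version has to replay every shared memory access up to 32 times, so the difference between the two offsets should become much easier to measure.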