
I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to benchmark a warp (32 threads) accessing 32 integers of 32 bits each in the following two scenarios:

  • When there is no bank conflict (offset=1)
  • When there is a bank conflict (offset=32, all threads are accessing bank 0)

Here is a sample of the code (only the kernel, plus the definitions it needs):

#include <cstdint>

#define MEMORY_SIZE 2048   // matches the "2048 / 32 = 64" comment in the loop below

__global__ void kernel(int offset) {

    __shared__ uint32_t shared_memory[MEMORY_SIZE];

    // init shared memory (only thread 0 does the work)
    if (threadIdx.x == 0) {
        for (int i = 0; i < MEMORY_SIZE; i++)
            shared_memory[i] = i;
    }

    __syncthreads();

    // offset = 1  -> each lane of the warp stays on its own bank
    // offset = 32 -> every lane maps to bank 0
    uint32_t index = threadIdx.x * offset;

    // 2048 / 32 = 64
    for (int i = 0; i < 64; i++)
    {
        shared_memory[index] += index * 10;

        index += 32;
        index %= MEMORY_SIZE;

        __syncthreads();
    }
}

I expected the version with offset=32 to run slower than the one with offset=1, since the accesses should be serialized, but I found that both take about the same time. How is that possible?

Ferdinand Mom
  • Please share your code for the two versions here on the question page. – einpoklum Nov 20 '22 at 23:48
  • The code is quite long, but it is in the GitHub gist; you can change the parameter `offset` through the command line – Ferdinand Mom Nov 20 '22 at 23:49
  • Indeed, I tried to put cudaDeviceSynchronize() inside the for loop after every kernel call and found an average time of 0.0231144 ms (for offset=1) and 0.0278107 ms (offset=32). I would have expected a noticeable difference – Ferdinand Mom Nov 20 '22 at 23:54
  • @einpoklum He is. It is part of `elapsed_time()`, which isn't ideal either. Just use [nvbench](https://github.com/NVIDIA/nvbench) instead of trying to roll your own benchmarking code. – paleonix Nov 20 '22 at 23:54
  • @paleonix: My [gpu-kernel-runner](https://github.com/eyalroz/gpu-kernel-runner/) may also be suited to something like this. But regardless - that might not be the issue here. – einpoklum Nov 20 '22 at 23:56
  • Another thing to check is what the PTX (and perhaps even the SASS) looks like. Perhaps something gets optimized away? Also, try performing _many_ writes, to drown out the effect of other things, like thread setup, shared mem initialization, block synchronization etc. – einpoklum Nov 20 '22 at 23:58
  • If you really need to initialize the shared memory for this, you should also let the threads cooperate on that to cut down that overhead (see the sketch after these comments). – paleonix Nov 21 '22 at 00:06
  • After adding many writes, I start to see a difference, even if it is not very noticeable. I guess that is enough to illustrate the concept. Thanks to both of you for your help! – Ferdinand Mom Nov 21 '22 at 00:12
  • Edit: I previously meant `shared_efficiency`, not `shared_utilization`. Anyway, I now have the expected results: for `offset=1` I get a `shared_efficiency` of 41.67, vs 3.68 for `offset=32`. – Ferdinand Mom Nov 21 '22 at 01:00
  • Only 1 thread in a warp writes to shared memory in your test? Then there should not be conflicts and differences. Ok, I see. – Mikhail M Nov 25 '22 at 06:58
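
For reference, a minimal sketch of the cooperative initialization suggested in the comments, which could replace the single-threaded init in the kernel above (it uses only names already present in the question):

    // every thread fills a strided slice of the array instead of thread 0 doing all of it
    for (int i = threadIdx.x; i < MEMORY_SIZE; i += blockDim.x)
        shared_memory[i] = i;

    __syncthreads();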

1 Answer


You have only 1 working warp, so the biggest problem for your performance is that each (or most) GPU instruction waits for the previous one to finish, and that latency hides most of the slowdown from shared memory conflicts. You also do a lot of work per shared memory access: how many small instructions are there in `cosf`? Try simple integer arithmetic instead.

Mikhail M
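
A minimal sketch of what this suggestion could look like, assuming a single block of 256 threads (8 warps instead of 1), a larger shared array so that every thread keeps its own slot even with offset=32, and an out buffer whose only purpose is to stop the compiler from optimizing the loop away (BLOCK_SIZE, N_ITER, and out are assumptions, not part of the original code):

#include <cstdint>

constexpr int BLOCK_SIZE  = 256;              // 8 warps, so the scheduler can overlap latencies
constexpr int MEMORY_SIZE = 32 * BLOCK_SIZE;  // 32 KiB of shared memory, one slot per thread even at offset=32
constexpr int N_ITER      = 100000;           // enough accesses for shared memory traffic to dominate

__global__ void conflict_kernel(int offset, uint32_t *out) {
    __shared__ uint32_t shared_memory[MEMORY_SIZE];

    // cooperative initialization: every thread fills a strided slice
    for (int i = threadIdx.x; i < MEMORY_SIZE; i += blockDim.x)
        shared_memory[i] = i;
    __syncthreads();

    // offset == 1  -> the 32 lanes of a warp hit 32 different banks (no conflict)
    // offset == 32 -> all 32 lanes of a warp hit bank 0 (32-way conflict)
    uint32_t index = (threadIdx.x * offset) % MEMORY_SIZE;

    for (int i = 0; i < N_ITER; i++)
        shared_memory[index] += i;            // simple integer read-modify-write, no cosf

    __syncthreads();

    // write something back so the loop is not optimized away
    out[threadIdx.x] = shared_memory[index];
}

Launched as conflict_kernel<<<1, BLOCK_SIZE>>>(offset, d_out), the offset=32 version has to replay every shared memory access up to 32 times, so the difference between the two offsets should become much easier to measure.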