I am designing a CUDA kernel that will be launched with 16 threads per thread block. I have an array of N ints in shared memory (i.e. per thread block) that I wish to process.
If the threads access consecutive elements of the array, does that mean there will be no bank conflicts? I understand that there would be bank conflicts if the array were a char array, but I'm not entirely sure what happens if it's an int array. I'm guessing there will be bank conflicts, because each set of 4 consecutive ints shares the same memory bank?
If this is true, what is the correct way to prevent bank conflicts? Address scrambling, as in the histogram sample?
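
One way to check the guess above is to model the bank mapping on the host. The following is only a sketch under assumptions that are not stated in the question: banks are 4 bytes wide, successive 32-bit words map to successive banks, the array starts at bank 0, and the number of banks is passed in (16 on older devices, 32 on newer ones). The names `bankOf` and `maxThreadsPerBank` are made up for illustration and are not part of any CUDA API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: each shared memory bank is 4 bytes (one 32-bit word) wide.
constexpr std::size_t kBankWidthBytes = 4;

// Bank holding element `index` of an array whose elements are `elemSize`
// bytes each, assuming the array starts at the beginning of bank 0.
std::size_t bankOf(std::size_t index, std::size_t elemSize, std::size_t numBanks) {
    assert(elemSize > 0 && numBanks > 0);
    return (index * elemSize / kBankWidthBytes) % numBanks;
}

// Largest number of threads that land in the same bank when thread t
// accesses element t * stride. A result of 1 means every thread uses a
// distinct bank.
std::size_t maxThreadsPerBank(std::size_t numThreads, std::size_t stride,
                              std::size_t elemSize, std::size_t numBanks) {
    std::vector<std::size_t> hits(numBanks, 0);
    for (std::size_t t = 0; t < numThreads; ++t) {
        ++hits[bankOf(t * stride, elemSize, numBanks)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, `maxThreadsPerBank(16, 1, 4, 16)` (16 threads, consecutive 4-byte ints, 16 banks) gives 1, so each thread lands in its own bank. The same call with 1-byte elements gives 4, because four consecutive chars fall in one 32-bit word, and an int access with a stride of 2 gives 2. Note that the model only counts threads per bank; whether several threads reading the same word actually serialize depends on the hardware generation.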