I have been told in my CUDA course that the following global memory access is coalesced if the elements of my "a" array have a size of 4, 8 or 16 bytes.
int i = blockIdx.x*blockDim.x + threadIdx.x;
a[i];
The 2 conditions for coalescing are: (1) the threads of the warp must access a chunk of 32, 64 or 128 bytes; (2) the warp's first thread must access an address that is a multiple of 32, 64 or 128.
But in this example, nothing guarantees the first condition, i.e. that the warp will access a chunk of 32, 64 or 128 bytes.
If I assume that a's elements are floats (4 bytes) and I set blockDim.x to 5, then every warp will access a chunk of only 20 bytes (4 x 5), not 32, even though the element size is one of 4, 8 or 16 bytes. So is the very first claim about coalescing false?
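To make my arithmetic concrete, here is a small host-side C sketch (plain index arithmetic, not a kernel) of the two conditions as I understand them. The function names are hypothetical helpers of mine, not part of CUDA. It assumes a warp size of 32, that byte offsets are measured from the start of "a", that the base address of "a" is itself suitably aligned, and that "multiple of 32, 64 or 128" means a multiple of the chunk size, which is only my reading of the course.

```c
#include <stddef.h>

/* Assumption: 32 threads per warp. */
enum { WARP_SIZE = 32 };

/* Byte offset, relative to the start of a, of the element read by the
 * first thread of warp `warp_in_block` in block `block_idx`,
 * for i = blockIdx.x * blockDim.x + threadIdx.x. */
size_t warp_first_byte(unsigned block_idx, unsigned block_dim,
                       unsigned warp_in_block, size_t elem_size)
{
    size_t first_i = (size_t)block_idx * block_dim
                   + (size_t)warp_in_block * WARP_SIZE;
    return first_i * elem_size;
}

/* Number of contiguous bytes read by that warp: the number of threads
 * of the block that fall into this warp, times the element size. */
size_t warp_span_bytes(unsigned block_dim, unsigned warp_in_block,
                       size_t elem_size)
{
    unsigned first_thread = warp_in_block * WARP_SIZE;
    if (first_thread >= block_dim)
        return 0;                      /* this warp does not exist */
    unsigned active = block_dim - first_thread;
    if (active > WARP_SIZE)
        active = WARP_SIZE;
    return (size_t)active * elem_size;
}

/* Returns 1 if both conditions hold under my reading of the course:
 * the chunk is 32, 64 or 128 bytes, and it starts at a multiple of
 * its own size. Returns 0 otherwise. */
int meets_course_conditions(size_t first_byte, size_t span)
{
    if (span != 32 && span != 64 && span != 128)
        return 0;
    return first_byte % span == 0;
}
```

With blockDim.x = 5 and 4-byte elements, warp_span_bytes(5, 0, 4) gives 20 and warp_first_byte(1, 5, 0, 4) gives 20, so meets_course_conditions(20, 20) returns 0. With blockDim.x = 32 the chunk is 128 bytes and block 1 starts at offset 128, so the same check returns 1.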
Thank you for your answer.