CUDA coalescing and global memory

Question

I have been told in my CUDA course that the following access (global memory) is coalescaled if elements of my "a" array have a size of 4,8 or 16 bytes.

int i = blockIdx.x*blockDim.x + threadIdx.x;
a[i];

The 2 conditions for coalescing are : Threads of the warp must access a chunk of 32, 64 or 128 bytes. Warp's first thread must be accessing an address which is a multiple of 32, 64 or 128

But in this example(first condition), nothing guarantees that the warp will access a chunk of 32 bytes.

If I assume that a's elements are floats (4 bytes), and if I define blockDim.x as 5, then every warp will access chunks of 20 (4x5) bytes even though elements of my "a" array have a size of 4,8 or 16 bytes, and not 32. So, is the very first claim about coalescing false ?

Thank you for your answer.

talonmies · Accepted Answer · 2020-05-04T12:21:48.487

But in this example(first condition), nothing guarantees that the warp will access a chunk of 32 bytes.

Because of thread ordering, it guarantees that each warp accesses 128 bytes (32 threads x 4 bytes). Which is a necessary condition of coalesced memory access.

If I assume that a's elements are floats (4 bytes), and if I define blockDim.x as 5, then every warp will access chunks of 20 (4x5) bytes even though elements of my "a" array have a size of 4,8 or 16 bytes, and not 32.

Warps are always 32 threads. If you define blockDim.x as 5, each block will consist of 1 warp with 27 nulled threads. The coalescing rules will still apply and transactions will be coalesced, but you are wasting 27/32 of your potential computational capacity and memory bandwidth.

So, is the very first claim about coalescing false ?

No.

"Because of thread ordering, it guarantees that each warp accesses 128 bytes (32 threads x 4 bytes)". Is this only true for 4-byte elements? — If_You_Say_So, May 04 '20 at 15:17
Obviously yes. The question is specifically asking about float elements — talonmies, May 04 '20 at 15:26

CUDA coalescing and global memory

1 Answers1