I've read that CUDA can read from global memory 128 bytes at a time, so it makes sense that each thread in a warp can read/write 4 bytes in a coalesced pattern for a total of 128 bytes.
Reading/writing with the vector types like int4 and float4 is faster.
But what I don't understand is why this is faster. If each thread in the warp is requesting 16 bytes, and only 128 bytes can move across the bus at a time, where does the performance gain come from?
Is it because there are fewer memory instructions being issued, i.e. the warp says "grab 16 bytes for each thread" once, as opposed to saying "grab 4 bytes for each thread" 4 times? I can't find anything in the literature that states exactly why the vector types are faster.
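For concreteness, here is a minimal sketch of the two access patterns I'm comparing (the kernel names are mine, just for illustration):

```cuda
#include <cuda_runtime.h>

// Scalar copy: each thread moves 4 bytes per load/store instruction.
// Copying 16 bytes per thread this way takes 4 separate load and
// 4 separate store instructions per thread.
__global__ void copy_scalar(const float* __restrict__ in,
                            float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Vectorized copy: each thread moves 16 bytes with a single 128-bit
// load and a single 128-bit store (n4 = n / 4 float4 elements).
__global__ void copy_vec4(const float4* __restrict__ in,
                          float4* __restrict__ out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}
```

In both cases the same total number of bytes crosses the bus, which is why I'm asking whether the advantage comes purely from issuing fewer instructions/requests rather than from moving more data per transaction.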