
I've read that CUDA can read from global memory 128 bytes at a time, so it makes sense that each thread in a warp can read/write 4 bytes in a coalesced pattern for a total of 128 bytes.

Reading/writing with the vector types like int4 and float4 is faster.

But what I don't understand is why this is faster. If each thread in the warp is requesting 16 bytes, and only 128 bytes can move across the bus at a time, where does the performance gain come from?

Is it because there are fewer memory requests happening, i.e. it is saying "grab 16 bytes for each thread in this warp" once, as opposed to "grab 4 bytes for each thread in this warp" four times? I can't find anything in the literature that gives the exact reason why the vector types are faster.
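
For concreteness, here is a minimal sketch of the two access patterns being compared (kernel names are illustrative, not from any particular source; the float4 version additionally requires the pointers to be 16-byte aligned):

```cuda
// Scalar: each thread in a warp reads/writes one 4-byte float,
// so a fully coalesced warp moves 32 * 4 = 128 bytes per load instruction.
__global__ void copy_scalar(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Vectorized: each thread reads/writes one 16-byte float4 with a single
// load/store instruction, so the warp moves 32 * 16 = 512 bytes per instruction.
__global__ void copy_vec4(const float4 *in, float4 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}
```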

user13741

2 Answers


Your last paragraph is basically the answer to your question. The performance improvement comes from efficiency gains, in two ways:

  1. At the instruction level, a multi-word vector load or store only requires a single instruction to be issued, so the bytes-per-instruction ratio is higher and the total instruction latency for a particular memory transaction is lower.
  2. At the memory controller level, a vector-sized transaction request from a warp results in a larger net memory throughput per transaction, so the bytes-per-transaction ratio is higher. Fewer transaction requests reduce memory controller contention and can produce higher overall memory bandwidth utilisation.

So you get efficiency gains both at the multiprocessor and at the memory controller by using vector memory instructions, compared with issuing individual instructions that produce individual memory transactions to fetch the same number of bytes from global memory.
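
As a rough illustration of point 1, here is a sketch (not from the answer itself) of how the same 16 bytes per thread can be fetched with either four scalar loads or one vector load; the vector form assumes the address in + 4 * i is 16-byte aligned:

```cuda
__global__ void sum16(const float * __restrict__ in, float *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;

    // (a) Four scalar loads: the compiler will typically issue four
    //     separate 32-bit load instructions per thread.
    float s = in[4 * i] + in[4 * i + 1] + in[4 * i + 2] + in[4 * i + 3];

    // (b) One vector load: a single 128-bit load instruction per thread,
    //     valid only if the address is 16-byte aligned.
    float4 v = *reinterpret_cast<const float4 *>(in + 4 * i);

    out[i] = s + v.x + v.y + v.z + v.w;
}
```

Inspecting the generated SASS (for example with cuobjdump) should show the difference between the four 32-bit loads and the single 128-bit load.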

talonmies

There is a thorough answer to this question on the Parallel Forall blog: http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

The main reason is that there is less index arithmetic per byte loaded when vector loads are used.

There is another reason: more loads in flight, which helps saturate memory bandwidth in cases of low occupancy.
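
Along the lines of what that post describes, here is a sketch of a vectorized copy (assuming d_in and d_out are 16-byte aligned, which cudaMalloc guarantees; names are illustrative, not the post's exact code). The main loop performs n/4 iterations of 16-byte loads/stores, so the index arithmetic and loop overhead per byte moved are roughly a quarter of the scalar version's, and a small scalar loop handles the remainder:

```cuda
__global__ void copy_vectorized(const float *d_in, float *d_out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // Main loop: n/4 float4 elements, one 128-bit load/store per iteration.
    for (int i = tid; i < n / 4; i += stride)
        reinterpret_cast<float4 *>(d_out)[i] =
            reinterpret_cast<const float4 *>(d_in)[i];

    // Remainder: copy the last n % 4 elements one float at a time.
    for (int i = n / 4 * 4 + tid; i < n; i += stride)
        d_out[i] = d_in[i];
}
```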

Maxim Milakov