CUDA coalesced access of FP64 data

Question

I am a bit confused about how the memory accesses issued by a warp are affected by FP64 data.

A warp always consists of 32 threads regardless of whether these threads are doing FP32 or FP64 calculations. Right?
I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?
So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?

Here is my question now:

What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

PS: I am mostly interested in Compute Capability 2.0+ architectures.

talonmies · Accepted Answer · 2017-02-09T11:54:39.457

A warp always consists of 32 threads regardless of whether these threads are doing FP32 or FP64 calculations. Right?

Correct

I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?

Not exactly. 128 bytes is not the only transaction size; there are also 32-byte transactions.

So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?

Correct

What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

Yes. The compiler will emit a 64-bit load instruction, which will be serviced by two 128-byte transactions per warp when coalesced memory access is possible.
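
The arithmetic behind this can be checked with a small host-side model. The sketch below is plain C++ rather than device code, and it assumes a simplified rule: a warp-wide access costs one transaction per distinct aligned segment that its 32 threads touch. The name `segments_touched` and its parameters are hypothetical, chosen for illustration only; they are not part of the CUDA API, and real hardware behavior depends on the architecture and cache configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Simplified model (an assumption, not a CUDA API): count the distinct
// aligned segments touched when each of the 32 threads in a warp accesses
// one element of elem_bytes, at consecutive addresses starting at base_addr.
std::size_t segments_touched(std::uint64_t base_addr,
                             std::size_t elem_bytes,
                             std::size_t segment_bytes) {
    assert(elem_bytes > 0 && segment_bytes > 0);
    const int kWarpSize = 32;
    std::set<std::uint64_t> segments;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::uint64_t first = base_addr + lane * elem_bytes;
        const std::uint64_t last = first + elem_bytes - 1;
        // A misaligned element may straddle a segment boundary,
        // so record every segment index in [first, last].
        for (std::uint64_t s = first / segment_bytes;
             s <= last / segment_bytes; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

Under this model, `segments_touched(0, 4, 128)` gives 1 for aligned floats and `segments_touched(0, 8, 128)` gives 2 for aligned doubles, matching the 128+128 split in the question. A base address that is not 128-byte aligned costs an extra segment: `segments_touched(64, 8, 128)` gives 3.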

Thanks for your answer my friend. I am willing to accept it, but I would also be grateful if you could comment on the 32-byte transactions. Under what circumstances do they happen? Thank you in advance. — AstrOne, Feb 09 '17 at 12:36
@AstrOne: If every thread in the warp needs to load 8 or 16 bit types, these can be serviced with 32 byte transactions. You can also force the compiler to emit 32 byte transactions if you wish. — talonmies, Feb 09 '17 at 12:45
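
To see why small types fit 32-byte transactions, here is a minimal host-side sketch in plain C++. It assumes an aligned, fully coalesced access in which each of the 32 threads reads one element, and it simply divides the bytes requested by the warp by the segment size, rounding up. The names `aligned_transactions` and `bytes_moved` are hypothetical helpers for illustration, not CUDA functions.

```cpp
#include <cassert>
#include <cstddef>

// Number of segments needed for one aligned, coalesced warp-wide access
// (simplified model; assumes 32 threads, one element per thread).
std::size_t aligned_transactions(std::size_t elem_bytes,
                                 std::size_t segment_bytes) {
    assert(elem_bytes > 0 && segment_bytes > 0);
    const std::size_t warp_bytes = 32 * elem_bytes;
    // Ceiling division: a partially used segment still costs a transaction.
    return (warp_bytes + segment_bytes - 1) / segment_bytes;
}

// Total bytes transferred, including any unused part of the last segment.
std::size_t bytes_moved(std::size_t elem_bytes, std::size_t segment_bytes) {
    return aligned_transactions(elem_bytes, segment_bytes) * segment_bytes;
}
```

For an 8-bit type the warp requests 32 bytes, so `bytes_moved(1, 32)` is 32 while `bytes_moved(1, 128)` is 128: the larger segment would transfer 96 bytes that no thread asked for. A 16-bit type requests 64 bytes, which this model covers with two 32-byte transactions.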
