I am a bit confused about how the memory accesses issued by a warp are affected by FP64 data.

  • A warp always consists of 32 threads, regardless of whether these threads are doing FP32 or FP64 calculations. Right?
  • I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?
  • So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?

Here is my question now:

  • What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

PS: I am mostly interested in Compute Capability 2.0+ architectures.

AstrOne

1 Answer

A warp always consists of 32 threads, regardless of whether these threads are doing FP32 or FP64 calculations. Right?

Correct

I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?

Not exactly. There are also 32-byte transactions.

So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?

Correct

What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?

Yes. The compiler will emit a 64-bit load instruction, which is serviced by two 128-byte transactions per warp when coalesced memory access is possible.
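The arithmetic behind these answers can be sketched on the host side. The snippet below is only a simplified model, not how the hardware actually decides: it assumes that thread i of the warp accesses address `base + i * elem_bytes` and that memory is served in aligned segments of a fixed size. The names `kWarpSize` and `segments_touched` are hypothetical, introduced just for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Number of threads in a warp.
constexpr std::size_t kWarpSize = 32;

// Simplified model (an assumption, not the real hardware logic):
// count the aligned segments of `segment_bytes` touched when thread i
// of a warp accesses the element at base + i * elem_bytes.
constexpr std::size_t segments_touched(std::uint64_t base,
                                       std::size_t elem_bytes,
                                       std::size_t segment_bytes) {
    // Index of the last touched segment minus index of the first, plus one.
    return static_cast<std::size_t>(
        (base + kWarpSize * elem_bytes - 1) / segment_bytes
        - base / segment_bytes + 1);
}

// 32 floats = 128 bytes: one aligned 128-byte segment.
static_assert(segments_touched(0, sizeof(float), 128) == 1, "float case");
// 32 doubles = 256 bytes: two 128-byte segments (128 + 128).
static_assert(segments_touched(0, sizeof(double), 128) == 2, "double case");
// A misaligned float access straddles two 128-byte segments.
static_assert(segments_touched(4, sizeof(float), 128) == 2, "misaligned case");
// 32 one-byte elements fit in a single 32-byte segment.
static_assert(segments_touched(0, 1, 32) == 1, "8-bit case");
```

Under this model, the coalesced FP64 case touches exactly two 128-byte segments, which matches the two transactions described above. The misaligned case shows why the starting address matters as well as the element size.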

talonmies
  • Thanks for your answer my friend. I am willing to accept it, but I would also be grateful if you could comment on the 32-byte transactions. Under what circumstances do they happen? Thank you in advance. – AstrOne Feb 09 '17 at 12:36
  • @AstrOne: If every thread in the warp needs to load 8- or 16-bit types, these can be serviced with 32-byte transactions. You can also force the compiler to emit 32-byte transactions if you wish. – talonmies Feb 09 '17 at 12:45