Cuda: async-copy vs coalesced global memory read atomicity

Question

I was reading about the memory model in CUDA. In particular, when copying data from global to shared memory, my understanding of shared_mem_data[i] = global_mem_data[i] is that it is done in a coalesced, atomic fashion, i.e. each thread in the warp reads global_mem_data[i] in a single indivisible transaction. Is that correct?

CUDA makes no statements about the order of thread execution. Therefore you should not assume any ordering between what is read by one thread and what is read by another, even if they are in the same warp. With respect to the atomicity of a single thread reading, say, a properly aligned multibyte quantity, those bytes should be coherent, even if they were written by another thread. See [here](https://stackoverflow.com/questions/52848426/how-to-execute-atomic-write-in-cuda/52877358#52877358), which includes a link to the specific point in the hardware memory model doc supporting this claim. – Robert Crovella, Oct 31 '20 at 21:47
In *general* CUDA makes no statements about the order of thread execution. There may be a few exceptions, such as in the case of warp collective intrinsics that specify a sync mask. And I haven't tried to capture every idea from the other answer I linked here in these comments. Please read it for a more complete description. — Robert Crovella, Oct 31 '20 at 21:55

einpoklum · Answer 1 · 2021-02-05T10:52:44.820

tl;dr: No.

It is not guaranteed, AFAIK, that all values are read in a single transaction. In fact, a GPU's memory bus is not even guaranteed to be wide enough for a single transaction to retrieve a full warp's width of data (32 threads × 4 bytes = 128 bytes, i.e. 1024 bits, for a full-warp read of 4 bytes each). It is theoretically possible for some values in the read-from locations in memory to change while the read is underway.
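To make this concrete, here is a minimal sketch of the global-to-shared copy from the question. The kernel name `stage_and_rotate`, the sizes, and the neighbour-read check are my own assumptions for illustration. The point is that the code relies only on per-thread loads plus an explicit `__syncthreads()` barrier, and never on the warp's reads happening as one indivisible transaction.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// 32 threads per warp, 4 bytes per thread, 8 bits per byte.
static_assert(32 * sizeof(int) * 8 == 1024, "full-warp 4-byte read is 1024 bits");

// Hypothetical kernel: stage one int per thread in shared memory, then
// read the element staged by a neighbouring thread of the same block.
__global__ void stage_and_rotate(const int* global_mem_data, int* out, int n)
{
    extern __shared__ int shared_mem_data[];
    int base  = blockIdx.x * blockDim.x;
    int i     = base + threadIdx.x;
    int count = min((int)blockDim.x, n - base);  // valid elements in this block

    // Each thread issues its own aligned 4-byte load. The hardware may
    // coalesce the warp's loads, but nothing here assumes that they are
    // served by a single indivisible transaction.
    if (i < n) shared_mem_data[threadIdx.x] = global_mem_data[i];

    // Explicit barrier: only after this point may a thread rely on values
    // that other threads of the block wrote to shared memory.
    __syncthreads();

    if (i < n) out[i] = shared_mem_data[(threadIdx.x + 1) % count];
}

int main()
{
    const int n = 1000, block = 256;           // arbitrary illustrative sizes
    const int grid = (n + block - 1) / block;
    std::vector<int> h_in(n), h_out(n);
    for (int k = 0; k < n; ++k) h_in[k] = k;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    stage_and_rotate<<<grid, block, block * sizeof(int)>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Host-side check of the rotate-by-one pattern within each block.
    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int k = 0; k < n && ok; ++k) {
        int base  = (k / block) * block;
        int count = (n - base < block) ? n - base : block;
        ok = (h_out[k] == h_in[base + (k - base + 1) % count]);
    }
    printf("%s\n", ok ? "ok" : "mismatch");
    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

The input here is never modified while the kernel runs, so the result is well defined. If another kernel or the host were writing to `global_mem_data` concurrently, different threads of the same warp could observe values from different points in time, which is exactly the case the question's "atomic" assumption would get wrong.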

edited Feb 05 '21 at 10:52

answered Oct 31 '20 at 15:18

Regarding this: "That is, the threads in a warp will not resume execution until all other threads all warp threads have their output." It's a bit hard to parse the English there, but I would disagree with that statement, based on my interpretation of what it is probably trying to convey. I also think the last statement is doubtful if you are applying it to the context of a single thread, but it may be reasonably correct if the locations referred to are locations acted on by different threads. – Robert Crovella Oct 31 '20 at 21:51
