How to execute an atomic write in CUDA?

Question

First of all, I cannot find a reliable source stating whether a write is atomic in CUDA or not. For example, "Is global memory write considered atomic in CUDA?" touches on this subject, but the last remark there shows that we are not talking about the same notion of atomicity. Consider the code:

global_mem[0] = pick_at_random_from(1, 2);
shared_mem[0] = pick_at_random_from(1, 2);

executed by a gazillion threads. Here "atomic" means that in both cases the content will be 1 or 2, and it is guaranteed that nothing else can show up (like 3). Atomic means integrity.

But as I understand it, CUDA does not guarantee this, so when I run this code I could potentially get the value 3? If that is really the case, how do I perform an atomic write? There is atomicExch, but it is overkill -- it does more than is needed.

Atomic functions I already checked: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions

Atomic operations are, as the documentation says, "read-modify-write operations" in CUDA. The definition used for CUDA is "The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads". I think (not 100% sure) that you are guaranteed to get 1 or 2 in the code you showed; you just do not know which thread wrote it, due to race conditions. -- Ander Biguri, Oct 17 '18 at 12:58
@AnderBiguri, are you quoting the part I linked? If yes, the beginning of the sentence talks about **functions**, not **operations**, so I believe this read-modify-write should be read as a sequence, not a pool, and they are referring to the functions listed below (in the doc). -- astrowalker, Oct 17 '18 at 13:09
No, you can't get 3. You will get either 1 or 2, assuming the writes you are doing are locationally consistent and naturally aligned across threads, and this has been covered elsewhere (multiple questions here on the `cuda` tag, such as [this one](https://stackoverflow.com/questions/38161819/weak-guarantees-for-non-atomic-writes-on-gpus)). Your question is maybe a duplicate of that one. -- Robert Crovella, Oct 17 '18 at 13:20
If you want a formal statement of the memory consistency model in CUDA, as opposed to my claims, you would need to parse through [the memory model definition given in the PTX manual](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#memory-consistency-model). -- Robert Crovella, Oct 17 '18 at 13:22
@RobertCrovella, thank you, but I checked your answers: in one you write that the writes are atomic, and in another you write that writes are NOT atomic. First, a write cannot be atomic and not atomic at the same time; second, with such a contradiction I still don't know whether they are atomic :-) -- astrowalker, Oct 18 '18 at 06:08
Let's not use the word atomic. As has already been explained to you, the customary usage of that (read-modify-write) does not apply to what you are asking. You are asking if writes will be coherent, and gave the example of multiple threads writing either 1 or 2. I've indicated to you already that in that circumstance, subject to a few minor conditions, you will always get 1 or 2, and not 3, and provided you a linked answer supporting that, as well as documentational support. If you find somewhere that I have contradicted that, please provide the link to it. -- Robert Crovella, Oct 18 '18 at 12:53
@RobertCrovella, why does it not apply? I could have bytes to write or 16-byte structures (not mixed though), and I really doubt that on any architecture writes larger than the register size are atomic. This was just an example to show that even with writes you have to consider whether they are atomic or not. "By the time the transactions are issued to global memory, there is no guarantee of atomicity in the CUDA programming or memory model, unless atomic instructions are used." Quoted from https://stackoverflow.com/a/20775278/6734314 -- astrowalker, Oct 18 '18 at 13:47
Yes, if you write a 4-byte quantity in one thread, and a 1-byte quantity in another thread, you will not get "atomicity" in any architecture, or using any method, that I am aware of, anywhere. That was not the example you gave. `global_mem[0] = ...` could not be doing that. And perhaps it's just me, but your question certainly seemed to be asking about that example you gave, and not unaligned writes or 16-byte structure writes. Your question is all over the map. To be clear, when you run **that** code, you cannot get 3. Period. -- Robert Crovella, Oct 18 '18 at 15:00
@RobertCrovella, in my comment I wrote "not mixed though", so that is your over-interpretation. I am asking about writes of the same size (without such a constraint the question does not make sense). -- astrowalker, Oct 19 '18 at 06:05

Robert Crovella · Accepted Answer · 2019-12-21T20:32:33.933

For a write operation in each of 2 different threads in CUDA, if:

- the writes are to the same location (address)
- that address is naturally aligned for the size of the write
- the size of the write operation is the same between each of the two threads (and is of size 1, 2, 4, or 8 bytes)

then you are guaranteed to get one of the values written by those two threads, and not any other value, considering the data type size that was written. This is provided so long as the write is done by a single SASS instruction. The correctness here is provided by current CUDA hardware, not necessarily the compiler, the CUDA programming model, and/or the C++ standard to which CUDA adheres.

This is directly extendable to any number of threads that meet the above conditions.

This assumes no other threads are doing "anything else" with respect to the written locations (i.e. they are not writing a different size quantity to that location, or any overlapping location, or of some other alignment).

Which actual value will end up in that location is generally undefined (except that it will be one and only one of the written values, and not anything else) unless the programmer enforces some ordering on the operations.
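As an illustration of the conditions above, here is a minimal sketch of the scenario from the question. The kernel name `racing_write` and the host helper `run_racing_write` are hypothetical names chosen for this example, and the odd/even choice is only a stand-in for `pick_at_random_from(1, 2)`. Every thread issues a plain 4-byte store to the same naturally aligned address, so on current hardware the value read back should be one of the written values.

```cuda
#include <cuda_runtime.h>

// Every thread stores either 1 or 2 to the same naturally aligned 4-byte location.
// No atomic function is used; this is an ordinary store.
__global__ void racing_write(int *global_mem)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Stand-in for pick_at_random_from(1, 2): odd threads write 1, even threads write 2.
    global_mem[0] = (tid & 1) ? 1 : 2;
}

// Hypothetical helper: launches the kernel and returns the final value,
// or -1 if any CUDA call reports an error.
int run_racing_write()
{
    int *d_val = nullptr;
    if (cudaMalloc(&d_val, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_val, 0, sizeof(int));

    racing_write<<<1024, 256>>>(d_val);

    int h_val = 0;
    // cudaMemcpy waits for the kernel to finish and also surfaces launch errors.
    cudaError_t err = cudaMemcpy(&h_val, d_val, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_val);
    return (err == cudaSuccess) ? h_val : -1;
}
```

Calling `run_racing_write()` from `main` and printing the result should show 1 or 2. Which of the two appears is not defined and may differ between runs.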

When writing vector quantities or structures in C/C++, care should be taken to ensure that the underlying write (store) instruction in SASS code references the appropriate size. The comments above when referring to write operations are referring to the writes as issued by the SASS code. Generally speaking, I don't expect much difference between that interpretation and "writes from C/C++ code" using POD data types. But structures could possibly be broken into multiple transactions of a smaller size, in which case the above statements would be abrogated. Nevertheless, it's possible with appropriate programming practices (e.g. careful use of vector types) in C/C++ to ensure that up to 8 byte writes will be used if relevant.
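The point about vector types can be sketched as follows. This is only an illustration under assumptions: the kernel name `write_pair` and the helper `run_write_pair` are hypothetical, and the claim that the assignment compiles to a single 64-bit store is something to confirm in the generated SASS (for example with `cuobjdump -sass`), not something the compiler promises. The built-in `int2` type is 8-byte aligned, which is what makes a single 8-byte store possible.

```cuda
#include <cuda_runtime.h>

// Each thread writes a pair (tid, -tid) through one int2 assignment.
// int2 is a built-in CUDA vector type with 8-byte alignment, so the compiler
// can emit one 64-bit store; check the SASS to confirm that it did.
__global__ void write_pair(int2 *slot)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    *slot = make_int2(tid, -tid);
}

// Hypothetical helper: returns true if the pair read back is self-consistent,
// i.e. both halves came from the same thread. Returns false on a CUDA error.
bool run_write_pair()
{
    int2 *d_slot = nullptr;
    if (cudaMalloc(&d_slot, sizeof(int2)) != cudaSuccess) return false;
    cudaMemset(d_slot, 0, sizeof(int2));

    write_pair<<<1024, 256>>>(d_slot);

    int2 h_slot = make_int2(0, 1);  // deliberately inconsistent initial value
    cudaError_t err = cudaMemcpy(&h_slot, d_slot, sizeof(int2), cudaMemcpyDeviceToHost);
    cudaFree(d_slot);
    return err == cudaSuccess && h_slot.x == -h_slot.y;
}
```

If the same two fields were declared as a plain struct of two `int` members without 8-byte alignment, the compiler could emit two separate 4-byte stores, and a reader might then see halves written by different threads.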
