For a write operation in each of 2 different threads in CUDA, if:
- the writes are to the same location (address)
- that address is naturally aligned for the size of the write
- the size of the write operation is the same in both threads (and is 1, 2, 4, or 8 bytes)
then you are guaranteed that the location will contain one of the values written by those two threads, and not any other value (such as a torn mix of bytes from the two writes), for the full size of the data type written. This holds so long as each write is performed by a single SASS instruction. The guarantee here is provided by current CUDA hardware, not necessarily by the compiler, the CUDA programming model, or the C++ standard to which CUDA adheres.
This is directly extendable to any number of threads that meet the above conditions.
This assumes no other threads are doing "anything else" with respect to the written locations (i.e. they are not writing a quantity of a different size or different alignment to that location or to any overlapping location).
Which actual value will end up in that location is generally undefined (except that it will be one and only one of the written values, and not anything else) unless the programmer enforces some ordering on the operations.
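Here is a minimal sketch (my own hypothetical example, not taken from the above) illustrating the point: every thread in the block performs a naturally aligned 4-byte store of a different value to the same location. Per the guarantee described above, the final value is exactly one of the values written (some thread's threadIdx.x), never a torn combination of bytes from different threads, but which thread "wins" is undefined:

```cuda
#include <cstdio>

__global__ void sameLocationStores(int *out)
{
    // Every thread performs a naturally aligned 4-byte store
    // of the same size to the same address.
    *out = threadIdx.x;
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, sizeof(int));
    sameLocationStores<<<1, 256>>>(d_out);

    int h_out = -1;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    // h_out will be one of the values 0..255 that was actually written;
    // which one is not defined without additional ordering.
    printf("winning value: %d\n", h_out);

    cudaFree(d_out);
    return 0;
}
```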
When writing vector quantities or structures in C/C++, care should be taken to ensure that the underlying write (store) instruction in the SASS code uses the appropriate size. The statements above about write operations refer to the writes as issued by the SASS code. Generally speaking, I don't expect much difference between that interpretation and "writes from C/C++ code" using POD data types, but a structure store could be broken into multiple smaller transactions, in which case the above guarantees would no longer apply. Nevertheless, with appropriate programming practices (e.g. careful use of vector types) in C/C++ it's possible to ensure that a single write of up to 8 bytes is used where relevant, as the sketch below illustrates.
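As a sketch of the "careful use of vector types" point (the struct and kernel names are my own, and the actual instruction selection should be confirmed by inspecting the SASS, e.g. with cuobjdump -sass): assigning the two float members separately may compile to two 4-byte stores, whereas storing the pair as a float2 to an 8-byte-aligned location makes it straightforward for the compiler to emit a single 8-byte store (e.g. ST.E.64).

```cuda
// Hypothetical 8-byte-aligned POD structure of two floats.
struct __align__(8) Pair { float x; float y; };

__global__ void storePair(Pair *out, float a, float b)
{
    // Member-by-member assignment may be emitted as two 4-byte stores:
    //   out->x = a; out->y = b;

    // Storing the pair as a float2 encourages a single 8-byte store
    // to the 8-byte-aligned location:
    float2 v = make_float2(a, b);
    *reinterpret_cast<float2 *>(out) = v;
}
```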