C++ Atomic operations within contiguous block of memory

Question

Is it possible to use atomic operations, possibly via std::atomic, when assigning values in a contiguous block of memory?

If I have this code:

uint16_t* data = (uint16_t*) calloc(num_values, size);

What can I do to make operations like this atomic:

data[i] = 5;

I will have multiple threads assigning to data, possibly at the same index, at the same time. The order in which these threads modify the value at a particular index doesn't matter to me, as long as the modifications are atomic, avoiding any possible mangling of the data.

EDIT: So, per @user4581301, I'm providing some context for my issue here. I am writing a program to align depth video data frames to color video data frames. The camera sensors for depth and color have different focal characteristics, so the frames do not come completely aligned. The general algorithm involves projecting a pixel in depth space to a region in color space, then overwriting all values in the depth frame that span that region with that single pixel. I am parallelizing this algorithm. These projected regions may overlap, so when parallelized, writes to an index may occur concurrently.

Pseudo-code looks like this:

for x in depth_video_width:
  for y in depth_video_height:
      pixel = get_pixel(x, y)
      x_min, x_max, y_min, y_max = project_depth_pixel(x, y)

      // iterate over projected region
      for x` in [x_min, x_max]:
         for y` in [y_min, y_max]:
             // possible concurrent modification here
             data[x`, y`] = pixel

The outer loop or outermost two loops are parallelized.

If you don't actually care what values go into the array then just scribble them in there. I don't see how that would work, but seems to be what you said. If you worry about torn writes where some bytes are set and some aren't, that only matters if the values are different. But if you're writing weirdly different values into the same array elements, I don't see how this plan of yours could ever work. — Zan Lynx, Aug 03 '20 at 17:49
`uint16_t* data = (uint16_t*) calloc(num_values, size);` might translate into `std::vector<uint16_t> data(num_values)` if `size` isn't a surprising value. – user4581301, Aug 03 '20 at 17:49
@Zan Lynx So the values will be weirdly different, but they do have meaning. If there are torn writes, the data will no longer have any meaning. — Matthew Ha, Aug 03 '20 at 17:51
You could have an [X-Y problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Sometimes it's better to step back and ask about the problem this is supposed to solve. — user4581301, Aug 03 '20 at 17:54
Ah, I haven't heard of that term, but it may definitely apply to my case. I'll update my question, thank you. — Matthew Ha, Aug 03 '20 at 17:57
@user997112 that's just wrong from C++ standpoint. C++ abstract machine doesn't care about atomicity on CPU level, so any unguarded access to the same object is undefined behavior. — SergeyA, Aug 03 '20 at 18:01
@ZanLynx That would have undefined behaviour. It's more than just getting unspecified values. When somebody tells you "it's all just bytes", tell them they're wrong. — Asteroids With Wings, Aug 03 '20 at 18:02
@user997112 undefined from C++ point of view, on any and all architectures. — SergeyA, Aug 03 '20 at 18:59
@user997112 I am saying that on any platform in C++ it would result in undefined behavior. You can ask a separate question if you want (I am 100% sure it's a duplicate, though). — SergeyA, Aug 04 '20 at 19:09
@user997112 because C++ is not tied to a platform. If you are still unsure, I suggest you ask a dedicated question regarding this matter. — SergeyA, Aug 05 '20 at 19:23
@user997112 you are still wrong. Compiler, for example, can optimize access to variable if such access is not atomic and just never write or read from it, if it can see that the variable is not read from in the same thread. But I am not interested in continuing this discussion. — SergeyA, Aug 06 '20 at 12:57

Asteroids With Wings · Accepted Answer · 2020-08-03T17:47:34.360

3

You're not going to be able to do exactly what you want like this.

An atomic array doesn't make much sense, nor is it what you want (you want individual writes to be atomic).

You can have an array of atomics:

#include <atomic>
#include <array>
#include <cstdint>  // for uint16_t

int main()
{
    std::array<std::atomic<uint16_t>, 5> data{};
    data[1] = 5;
}

… but now you can't just access a contiguous block of uint16_ts, which it's implied you want to do.

If you don't mind something platform-specific, you can keep your array of uint16_ts and ensure that you only use atomic operations with each one (e.g. GCC's __atomic intrinsics).
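As a rough sketch of the platform-specific route, assuming GCC or Clang, the `__atomic_store_n` and `__atomic_load_n` builtins can be applied to elements of a plain `calloc`'d buffer. The helper names `atomic_write`, `atomic_read` and `demo_builtin_writes` are made up for illustration, and relaxed ordering is an assumption based on the question saying write order doesn't matter.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <thread>
#include <vector>

// Assumption: GCC or Clang. These builtins are compiler extensions, not standard C++.
inline void atomic_write(std::uint16_t* data, std::size_t i, std::uint16_t value) {
    __atomic_store_n(&data[i], value, __ATOMIC_RELAXED);
}

inline std::uint16_t atomic_read(std::uint16_t* data, std::size_t i) {
    return __atomic_load_n(&data[i], __ATOMIC_RELAXED);
}

// Hypothetical demo: every thread writes (thread index + 1) into every slot.
// Returns true if each slot ends up holding one of the written values.
bool demo_builtin_writes(std::size_t num_values, unsigned num_threads) {
    auto* data = static_cast<std::uint16_t*>(
        std::calloc(num_values, sizeof(std::uint16_t)));
    if (data == nullptr) return false;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([data, num_values, t] {
            for (std::size_t i = 0; i < num_values; ++i)
                atomic_write(data, i, static_cast<std::uint16_t>(t + 1));
        });
    }
    for (auto& w : workers) w.join();

    bool ok = true;
    for (std::size_t i = 0; i < num_values; ++i) {
        const std::uint16_t v = atomic_read(data, i);
        if (v < 1 || v > num_threads) ok = false;
    }
    std::free(data);
    return ok;
}
```

The buffer stays a contiguous run of uint16_t, but every concurrent access has to go through the helpers; a single plain `data[i] = 5;` while other threads are running brings the data race back.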

But, generally, I think you're going to want to keep it simple and just lock a mutex around accesses to a normal array. Measure to be sure, but you may be surprised at how much of a performance loss you don't get.
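A minimal sketch of the mutex approach might look like the following. `LockedFrame` and `demo_locked_writes` are hypothetical names, and locking on every single element write is the simplest possible scheme rather than a tuned one.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper: a plain contiguous buffer whose writes all take one mutex.
class LockedFrame {
public:
    explicit LockedFrame(std::size_t n) : data_(n, 0) {}

    void write(std::size_t i, std::uint16_t value) {
        std::lock_guard<std::mutex> lock(mutex_);
        data_[i] = value;
    }

    // Unsynchronized view: only use it after all writer threads have been joined.
    const std::vector<std::uint16_t>& values() const { return data_; }

private:
    std::mutex mutex_;
    std::vector<std::uint16_t> data_;
};

// Hypothetical demo: every thread writes (thread index + 1) into every slot.
bool demo_locked_writes(std::size_t num_values, unsigned num_threads) {
    LockedFrame frame(num_values);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&frame, num_values, t] {
            for (std::size_t i = 0; i < num_values; ++i)
                frame.write(i, static_cast<std::uint16_t>(t + 1));
        });
    }
    for (auto& w : workers) w.join();

    for (std::uint16_t v : frame.values())
        if (v < 1 || v > num_threads) return false;
    return true;
}
```

For the depth-alignment use case it would probably be cheaper to take the lock once per projected region instead of once per pixel, but that is worth measuring rather than assuming.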

If you're desperate for atomics, and desperate for an underlying array of uint16_t, and desperate for a standard solution, you could wait for C++20 and keep an std::atomic_ref (this is like a non-owning std::atomic) for each element, then access the elements through those. But then you still have to be cautious about any operation accessing the elements directly, possibly by using a lock, or at least by being very careful about what's doing what and when. At this point your code is much more complex: be sure it's worthwhile.

edited Aug 03 '20 at 17:47

answered Aug 03 '20 at 17:40


The primary motivation for the contiguous memory is so that I can `memcpy` it to another location later (that is expecting contiguous uint16_t). But I'm guessing that's not feasible since this solution would involve an array of `std::atomic`, not just `uint16_t` – Matthew Ha Aug 03 '20 at 17:45

@MatthewHa How would you synchronise the `memcpy` without a lock? – Asteroids With Wings Aug 03 '20 at 17:45
yeah, memcpy'ing into a single atomic doesn't work. The whole array thing is just distracting from the real problem. – Jeffrey Aug 03 '20 at 17:46
@AsteroidsWithWings The memcpy would happen after all threads have been joined. Alternatively, I was wondering whether it would be possible to wrap a pointer in something from `std::atomic` such that assignments to the value it points to are atomic. Then, I could create my contiguous block, and wrap pointers to every value. – Matthew Ha Aug 03 '20 at 17:49
@MatthewHa _"The memcpy would happen after all threads have been joined."_ In that case it is possible (and not even unreasonable), but as I say above, your code will be much more complicated - you should ensure that it's worth the performance gain (if any) – Asteroids With Wings Aug 03 '20 at 17:49
@MatthewHa The second thing you describe is `atomic_ref`. You could make this yourself pre-C++20 but I guess you'd have to put in all the (e.g.) GCC intrinsics code yourself. – Asteroids With Wings Aug 03 '20 at 17:54
@AsteroidsWithWings, regarding the gcc atomic builtins, the link you provided says that the atomic functions assume that the program they're called from is free of data races. My use case definitely involves data races, when the different threads modify values at the same index. I was wondering if I was interpreting this requirement incorrectly. – Matthew Ha Aug 03 '20 at 19:51
@MatthewHa That's a good point. To be honest, I don't really know. – Asteroids With Wings Aug 03 '20 at 20:16

score 1 · Answer 2 · answered Aug 03 '20 at 18:10

To add on the last answer, I would strongly advocate against using an array of atomics since any read or write to an atomic locks an entire cache line (at least on x86). In practice, it means that when accessing element i in your array (either to read or to write it), you would lock the cache line around that element (so other threads couldn't access that particular cache line).

The solution to your problem is a mutex as mentioned in the other answer.

The maximum supported size for an atomic operation currently seems to be 64 bits (see https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html):

The Intel-64 memory ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:

• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.

Any locked instruction (either the XCHG instruction or another read-modify-write instruction with a LOCK prefix) appears to execute as an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of alignment.

In other words, your processor doesn't know how to do atomic operations wider than 64 bits. And that's before even mentioning the standard library implementation of std::atomic, which may use a lock (see https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free).
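Whether a given std::atomic specialization falls back to a lock can be checked at run time with `is_lock_free()`. The snippet below is a small sketch; the `AtomicU16Info` struct and `probe_atomic_u16` function are hypothetical names, and the size comparison is included only because the question cares about memory layout.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper type holding what we learn about std::atomic<uint16_t>.
struct AtomicU16Info {
    bool lock_free;  // true if operations do not use an internal lock
    bool same_size;  // true if the atomic is no larger than a plain uint16_t
};

AtomicU16Info probe_atomic_u16() {
    std::atomic<std::uint16_t> probe{0};
    return AtomicU16Info{
        probe.is_lock_free(),
        sizeof(std::atomic<std::uint16_t>) == sizeof(std::uint16_t)};
}
```

Both results are implementation-defined, so they are worth checking on the actual target rather than assuming.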
