Smaller type for atomic that implements a synchronization primitive, such as semaphore, mutex or barrier

Question

Synchronization primitives can be implemented using std::atomic<T>, at least their userspace part. With C++20 std::atomic<T>::wait, they can be based on atomic only.

The question is whether it is worth using pointer-sized types or smaller types.

C++20 std::counting_semaphore is an example where the max value is passed as a template parameter, so the choice can be made at compile time rather than hard-coded.
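A sketch of how such a compile-time choice could look. Both aliases are hypothetical names, and neither is claimed to match what an actual standard library does; the second one encodes one possible policy (one byte for a binary primitive, at least 32 bits for anything counted):

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: the smallest unsigned type that can hold Max.
template <std::uintmax_t Max>
using smallest_counter_t =
    std::conditional_t<(Max <= UINT8_MAX),  std::uint8_t,
    std::conditional_t<(Max <= UINT16_MAX), std::uint16_t,
    std::conditional_t<(Max <= UINT32_MAX), std::uint32_t,
                                            std::uint64_t>>>;

// Hypothetical alternative policy: a byte only for the binary case,
// otherwise never go below 32 bits.
template <std::uintmax_t Max>
using counter_policy_t =
    std::conditional_t<(Max <= 1),          std::uint8_t,
    std::conditional_t<(Max <= UINT32_MAX), std::uint32_t,
                                            std::uint64_t>>;

static_assert(std::is_same_v<smallest_counter_t<1>, std::uint8_t>);
static_assert(std::is_same_v<smallest_counter_t<1000>, std::uint16_t>);
static_assert(std::is_same_v<counter_policy_t<1000>, std::uint32_t>);
```

A semaphore template could then declare its state as `std::atomic<counter_policy_t<Max>>`.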

My thoughts so far:

I think that on a platform where futex uses a specific variable size, the implementation should just use that size. On Windows, with the flexible WaitOnAddress, there's room to pick.
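Roughly, the per-platform choice could be expressed as below. The alias name `wait_word_t` is hypothetical. The Linux branch reflects that the futex syscall works on a 32-bit word, and the Windows branch picks one byte only for illustration, since WaitOnAddress accepts 1, 2, 4 or 8 byte values. The fallback is an assumption, not a recommendation:

```cpp
#include <cstdint>
#include <type_traits>

#if defined(__linux__)
// The futex syscall operates on an aligned 32-bit word.
using wait_word_t = std::uint32_t;
#elif defined(_WIN32)
// WaitOnAddress allows 1, 2, 4 or 8 bytes; one byte chosen for illustration.
using wait_word_t = std::uint8_t;
#else
// Assumption: a conservative default for platforms not considered here.
using wait_word_t = std::uint32_t;
#endif

static_assert(std::is_unsigned_v<wait_word_t>);
```

Whether `std::atomic<T>::wait` for a given `T` maps directly onto the OS primitive is an implementation detail of the standard library in use.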

On x64, 32-bit types give a smaller instruction encoding than 64-bit types with the same efficiency, so they are probably the way to go. Reducing further to 16-bit or 8-bit types is apparently not useful (16-bit types result in a larger encoding than 32-bit types).

There's also the case of Windows on ARM, about which I'm not sure.

For std::counting_semaphore, the ability to pick the integer type based on max is a side effect of specializing the binary semaphore, rather than an intention.

WaitOnAddress is part of a synchronization library (think of it as an analog of pthread), a wrapper around the system core's API; it's not really a cheap atomic. There are long-running arguments against such implementations: "why threads cannot be library". But it's the Windows concept of having wrappers around wrappers and system objects, threads, windows behaving like RAII entities, no matter the cost. It's sometimes handy, sometimes not. — Swift - Friday Pie, Aug 01 '20 at 08:50
Sure but on fast path it is not going to be called. You can also assume that Semaphore Object is used instead. I'm asking about the userspace part. — Alex Guteniev, Aug 01 '20 at 08:53
Why would you say 8-bit types wouldn't be useful on x86? 8-bit operand-size is encoded by using a different opcode, unlike 16/32/64 with prefixes or lack thereof. 8-bit seems like the obvious choice to me; x86 CPUs have fully efficient byte stores that aren't slower to commit to L1d than an aligned 32-bit or 64-bit chunk, unlike on many non-x86 uarches. (Except that packing more data into the same cache line increases the chances for false sharing.) Only downside is avoiding partial-register slowdowns in the surrounding code, I think. — Peter Cordes, Aug 01 '20 at 21:41
[Are there any modern CPUs where a cached byte store is actually slower than a word store?](https://stackoverflow.com/q/54217528) - yes, ARM, and most non-x86 apparently. So you would want an `int` sized object on ARM I think. — Peter Cordes, Aug 01 '20 at 21:43
@PeterCordes, what does this mean for 64-bit ARM, keep 64-bit or can fall back to 32-bit? — Alex Guteniev, Aug 02 '20 at 03:08
I'm pretty sure AArch64 microarchitectures can still efficiently work with 32-bit words. `int` is still 32-bit in AArch64 ABIs so it would make sense to build hardware that can handle it efficiently! (And whatever penalty exists for bytes is small compared to the cost of an atomic RMW anyway, like maybe an extra cycle or so to commit to cache, and possibly in load latency as well IDK. So even byte would not be bad.) — Peter Cordes, Aug 02 '20 at 03:18
@Peter, on the other hand, I also don't see a good advantage for using smaller variable for smaller storage size either. User is likely to pad mutexes/semaphores on separate cache line anyway, and will rarely create an array of them. — Alex Guteniev, Aug 02 '20 at 03:24
It makes a lot of sense to put the data protected by the mutex in the same cache line with it. Making it 1 byte gives max flexibility to pack it in somewhere in a struct. And if you have an array of structs with fine-grained locking, on average false-sharing will be rare enough that it's not worth exploding your cache footprint. Not every object is performance critical on its own, potentially only in aggregate. — Peter Cordes, Aug 02 '20 at 03:34
@Peter, I see. Then probably should go with smallest type on every platform, but there's a point to default to 32-bit for counted primitives like counting semaphore or recursive mutex. — Alex Guteniev, Aug 02 '20 at 04:25
I might still go with `int` on non-x86, just because I'm not 100% sure what downsides there might be for single bytes. And/or principle of least surprise; some people probably expect a mutex to be the machine's natural word size. I don't want to post that as an answer because I don't know enough about what indirect effects there might be from any microarchitectural diff, or other effects. — Peter Cordes, Aug 02 '20 at 04:31
