C++ shared_mutex implementation

Question

boost::shared_mutex or std::shared_mutex (C++17) can be used for single writer, multiple reader access. As an educational exercise, I put together a simple implementation that uses spinlocking and has other limitations (eg. fairness policy), but is obviously not intended to be used in real applications.

The idea is that the mutex keeps a reference count that is zero if no thread holds the lock. If > 0, the value represents the number of readers that have access. If -1, a single writer has access.

Is this a correct implementation (in particular with the used, minimal, memory orderings) that is free of data races ?

#include <atomic>

class my_shared_mutex {
    std::atomic<int> refcount{0};
public:

    void lock() // write lock
    {
        int val;
        do {
            val = 0; // Can only take a write lock when refcount == 0

        } while (!refcount.compare_exchange_weak(val, -1, std::memory_order_acquire));
        // can memory_order_relaxed be used if only a single thread takes write locks ?
    }

    void unlock() // write unlock
    {
        refcount.store(0, std::memory_order_release);
    }

    void lock_shared() // read lock
    {
        int val;
        do {
            do {
                val = refcount.load(std::memory_order_relaxed);

            } while (val == -1); // spinning until the write lock is released

        } while (!refcount.compare_exchange_weak(val, val+1, std::memory_order_acquire));
    }

    void unlock_shared() // read unlock
    {
        // This must be a release operation (see answer)
        refcount.fetch_sub(1, std::memory_order_relaxed);
    }
};

Peter Cordes · Accepted Answer · 2019-08-15T00:51:28.037

(CAS = Compare And Swap = C++ compare_exchange_weak function, which on x86 will typically compile to an x86 lock cmpxchg instruction which can only run when it owns the cache line in Exclusive or Modified MESI state).

lock_shared looks good: spinning read-only attempting a CAS only when it looks possible is better for performance than spinning on CAS or atomic increment. You already needed to do a read-only check to avoid changing -1 to 0 and unlocking a write-lock.

On x86, put a _mm_pause() in the retry path of the spin loop to avoid memory-order mis-speculation pipeline nukes when exiting the read-only spin loop, and steal fewer resources from the other hyperthread while spinning. (Use a while() loop, not do{}while(), so the pause only runs after failing once. pause on Skylake and later waits about 100 cycles so avoid it in the fast-path.)

I think unlock_shared should be using mo_release, not mo_relaxed, since it needs to order the loads from the shared data structure to make sure a writer doesn't start writing before the loads from the reader critical section happen. (LoadStore reordering is a thing on weakly-ordered architectures, even though x86 only does StoreLoad reordering.) A Release operation will order preceding loads and keep them inside the critical section.

(in write lock): // can memory_order_relaxed be used if only a single thread takes write locks ?

No, you still need to keep the writes inside the critical section, so the CAS still needs to Synchronize With (in C++ terminology) release-stores from unlock_shared.

https://preshing.com/20120913/acquire-and-release-semantics/ has a nice image that shows the 1-way barrier effect of a release-store or acquire-load.

I wasn't sure about the memory order in unlock_shared, but my reasoning was that it does not really 'release' anything since it has read-only access and cannot change the data it protects — LWimsey, Dec 17 '16 at 05:15
@LWimsey: yeah, it's harder to think about load ordering than store ordering, but it's a real thing. A load becomes globally visible at the moment when it reads data from L1 cache. (Because that's when it makes a copy from globally-coherent cache into the out-of-order core of a single CPU.) — Peter Cordes, Dec 17 '16 at 05:29
@LWimsey Are you saying that lock_shared could immediately do an unlock operation? Crazy, but you seem to imply that ordering WRT to unlock_shared doesn't matter... — curiousguy, Aug 11 '19 at 22:08
@curiousguy `lock_shared` cannot unlock the shared mutex; It can only take a read-lock if already in unlocked state. You are right about the ordering in `unlock_shared`, that definitively has to be a release operation. — LWimsey, Aug 12 '19 at 05:50

C++ shared_mutex implementation

1 Answers1