
I am trying to implement writer and reader methods such that the reader always reads the value most recently written by the writer, i.e. reads are ordered after writes. For this specific example, I am using a very simple fixed-size queue with two pointers, _enqPtr and _deqPtr. Both are std::atomic<size_t> and represent the next index at which an element is enqueued or dequeued. The full code I am using is below:

#include <atomic>
#include <cassert>
#include <cstddef>

class Queue {
public:
    Queue(size_t size): _enqPtr(0),
                        _deqPtr(0),
                        _size(size),
                        _data(new std::atomic<long>[_size])
    {
        for (size_t i = 0; i < _size; ++i)
        {
            _data[i].store(0);
        }
    }

    ~Queue() { delete[] _data; }  // release the slot array

    bool enqueue(long val) {
        size_t place = _enqPtr.fetch_add(1, std::memory_order_relaxed);
        if (place >= _size)
            return false;
        _data[place].store(val, std::memory_order_release); // DESIRED TO BE BEFORE
        return true;
    }

    bool dequeue(long& val) {
        size_t place = _deqPtr.fetch_add(1, std::memory_order_relaxed);
        size_t enq = _enqPtr.load();
        if (place >= _size)
            return false;
        if (place >= enq) {
            // went too far
            _deqPtr.store(enq - 1);
            return false;
        }

        val = _data[place].load(std::memory_order_acquire); // DESIRED TO BE AFTER 
        assert(val != 0); // keeps catching this, despite acquire / release.
        return true;
    }                    
private:
    std::atomic<size_t> _enqPtr;
    std::atomic<size_t> _deqPtr;
    size_t _size;
    std::atomic<long>* _data;
};

The fetch_add calls ensure that every thread calling enqueue or dequeue gets its own index. However, the issue I am running into is that one thread calls enqueue and performs its fetch_add, but before it can store its value, another thread calls dequeue and completes the whole call, and thus reads a null or non-value (in this case, the initial zero).
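For example, with a fresh queue and one enqueuing thread A racing one dequeuing thread B, the interleaving I believe is happening looks like this (a reconstruction of the failure, not an actual trace):

// Thread A: place = _enqPtr.fetch_add(1);  // A claims slot 0; _enqPtr is now 1
// Thread B: place = _deqPtr.fetch_add(1);  // B claims slot 0
// Thread B: enq = _enqPtr.load();          // reads 1, so both checks pass
// Thread B: val = _data[0].load();         // reads the initial 0; the assert fires
// Thread A: _data[0].store(val);           // runs too late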

The threading code I am using to test this is below:

#include <thread>

class QueueTester {
public:
    QueueTester(size_t queueSize, size_t threadCount):
                                            _threadCount(threadCount),
                                            _threads(new std::thread[_threadCount]),
                                            _queue(queueSize)
                                            {}

    ~QueueTester() { delete[] _threads; }  // release the thread array
    void run() {
        for (size_t i = 0; i < _threadCount; ++i)
        {
            if (i % 2 == 0)  {
                _threads[i] = std::thread([this]{
                    long k = 1;
                    while (_queue.enqueue(k)) {
                        k = k == 0 ? 1 : k + 1;
                    }
                }); 
            } else {
                _threads[i] = std::thread([this]{
                    long k = 1;
                    while (_queue.dequeue(k));
                });             
            }

        }

        for (size_t i = 0; i < _threadCount; ++i)
        {
            _threads[i].join();
        }
    }
private:
    size_t _threadCount;
    std::thread* _threads;
    Queue _queue;
};

What do I need to adjust in terms of memory ordering and fences so that val = _data[place].load(std::memory_order_acquire); always reads the value written by the corresponding _data[place].store(val, std::memory_order_release);?

  • Your problems are deeper than memory ordering and fences. Those only help prevent reordering of loads and stores *within a given thread*. But your code has a race *even if each thread executes in strict program order*. Namely, the entirety of `dequeue` could execute in Thread B in between the `_enqPtr.fetch_add(1)` and `_data[place].store(val)` of `enqueue` running in Thread A, and it would read a value that hadn't been written. No amount of fencing will fix that. So you need to go back to the drawing board with your algorithm. – Nate Eldredge Feb 05 '21 at 03:54
  • AFAIK, designing a lock-free queue with multiple producers and consumers is a hard problem in general, though it is solved and you can find algorithms in the literature. So unless you're looking for a substantial project, you may want to use an off-the-shelf algorithm or implementation, or use a mutex. Beware that bugs may be very subtle and difficult to test for. – Nate Eldredge Feb 05 '21 at 14:28
  • @NateEldredge In this particular case I am actually looking to implement a wait-free queue, such that no thread is ever blocked at any point (a step above lock-free, where only one thread is guaranteed to always make progress). I was trying to see if FAA and indexing into an array could be used for this, but this seems not to be the right way to use memory fences. I am aware of the available ones, but those have fast- and slow-path inserts, which is not always optimal. – Josh Weinstein Feb 05 '21 at 21:13
  • Feel free to answer why this doesn't work with memory fences – Josh Weinstein Feb 05 '21 at 21:15
  • Well in short, that's just not what memory fences do. They help ensure that loads and stores made by one thread become visible to other threads in the same order in which you wrote them in your source code. But they can never force loads and stores made by *different* threads to happen in the particular order that you wish. That is not their function; only your program logic can do that. – Nate Eldredge Feb 05 '21 at 21:52
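To make that last comment concrete: the "program logic" usually takes the form of a per-slot publication flag (or a per-slot sequence number, as in Dmitry Vyukov's bounded MPMC queue). The producer writes the payload first and then sets the flag with release semantics; the consumer waits on the flag with acquire semantics before reading the payload. Below is a minimal sketch of that idea applied to the Queue above. The _ready member is hypothetical (one std::atomic<bool> per slot, initialized to false), the overshoot back-off from the original dequeue is dropped because the wait loop supersedes it, and that wait loop also means this version is no longer wait-free:

// Sketch only: assumes Queue gains a hypothetical companion array
//   std::atomic<bool>* _ready;  // one flag per slot, all initialized to false
bool enqueue(long val) {
    size_t place = _enqPtr.fetch_add(1, std::memory_order_relaxed);
    if (place >= _size)
        return false;
    _data[place].store(val, std::memory_order_relaxed);
    _ready[place].store(true, std::memory_order_release); // publish the slot
    return true;
}

bool dequeue(long& val) {
    size_t place = _deqPtr.fetch_add(1, std::memory_order_relaxed);
    if (place >= _size)
        return false;
    // Program logic, not a fence, orders this read after the write:
    // wait until the producer that claimed this slot has published it.
    while (!_ready[place].load(std::memory_order_acquire))
        std::this_thread::yield(); // requires <thread>
    val = _data[place].load(std::memory_order_relaxed);
    return true;
}

Once the acquire load observes true, it synchronizes-with the matching release store, so the payload write is guaranteed to be visible to the payload read even though both payload accesses are relaxed.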
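And if wait-freedom is not a hard requirement, the mutex route suggested in the comments is far simpler and easy to get right. A sketch of a locked equivalent with the same one-shot, fixed-capacity semantics (my own illustration; LockedQueue and its members are made-up names, not code from the thread):

#include <cstddef>
#include <mutex>
#include <vector>

class LockedQueue {
public:
    explicit LockedQueue(size_t size) : _size(size) {}

    bool enqueue(long val) {
        std::lock_guard<std::mutex> lock(_mtx);
        if (_items.size() >= _size)
            return false;              // capacity exhausted
        _items.push_back(val);
        return true;
    }

    bool dequeue(long& val) {
        std::lock_guard<std::mutex> lock(_mtx);
        if (_head >= _items.size())
            return false;              // nothing left to read
        val = _items[_head++];
        return true;
    }

private:
    std::mutex _mtx;
    std::vector<long> _items;  // filled up to _size; _head indexes the next element to pop
    size_t _head = 0;
    size_t _size;
};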

0 Answers