4

I really don't understand how you can make some data structures lock-free. For example, if you have a linked list, then either you surround the operations with mutexes, or you can end up with a race condition if another thread executes whilst you are busy re-linking nodes together.

The concept of "lock free" (I appreciate it doesn't mean "no locks" but rather that threads can progress without waiting for other threads to finish) just doesn't make sense to me.

Could somebody please show me a simple example using a stack, queue or linked list, etc., which is implemented as "lock-free"? I cannot understand how you can prevent the race condition without interfering with another thread's productivity. Surely these two aims contradict each other?

user997112
  • 29,025
  • 43
  • 182
  • 361
  • Containers, in general, cannot be "lock-less" because anything that invalidates other items in the container (e.g. having to reallocate space and move the current data to a new location) would break every other thread trying to use it. – Zac Howland Mar 01 '14 at 20:54
  • Use immutable objects... – Loïc Faure-Lacroix Mar 01 '14 at 20:55
  • 4
    you should read "C++ concurrency in action" by Anthony Williams – Walter Mar 01 '14 at 20:56
  • 2
    The term you want is "lock free". – Ben Jackson Mar 01 '14 at 20:57
  • 3
    @ZacHowland Non-sense. You too should read that book! – Walter Mar 01 '14 at 20:57
  • @Walter I've read the book. What I was getting at is that any data structure where elements can be invalidated by other operations requires synchronization. – Zac Howland Mar 01 '14 at 21:01
  • @ZacHowland okay, but that's not what you said in your earlier comment. – Walter Mar 01 '14 at 21:48
  • @Zac Still not true. See [here](http://high-scale-lib.sourceforge.net/) for an example of a lock-free, growable HashMap. – Voo Mar 02 '14 at 00:22
  • @Voo I fail to see what an Open Source Java project has to do with concurrency in C++. – Zac Howland Mar 02 '14 at 03:39
  • @Zac I fail to see how a claim such as "it's impossible to do X" is *not* invalidated by a single counterexample. There is no algorithm that you can implement in Java that you cannot also implement in C++ - it may just be much harder to do so. – Voo Mar 02 '14 at 12:44
  • @Voo First of all, data structures are different from algorithms. Second, you completely twisted what I said. Third, the "example" you gave uses synchronization (mutexes). – Zac Howland Mar 02 '14 at 20:55
  • @Zac 1) You need algorithms to implement operations on those data structures which is what we're interested in. 2) Direct quote from you "any data structure where elements can be invalidated by other operations requires synchronization" - which is wrong as the growable lock-free HashMap demonstrates (yes you can implement it in c++ too, someone actually wrote a port I think) 3) Umn no it doesn't - show me the line of code? – Voo Mar 02 '14 at 21:05
  • @Voo You are now attempting to further twist my words around. If you read what I said carefully, you'll find it much more difficult to continue this line of discussion. Perhaps I should have added the phrase "for safety". I was going off of the project description for the example you have, but the actual code is even worse: In one implementation, a `rehash` function is an empty function. In another, the `resize` operation will return the new (incompletely created) hash table for subsequent threads. – Zac Howland Mar 02 '14 at 22:10
  • @Zac I'm twisting your words by directly quoting you (and not just parts, but the whole sentence)? Since the "uses mutexes" claim is off the table, on to the next ones: `rehash` is an internal function that's necessary for JCK tests to pass - clearly a protected method cannot influence the correctness of user code. The FSM that underlies the given system was model checked by several people without anybody showing such a bug as you claim for resize. I'd be very interested in seeing one example timeline that demonstrates the problem you seem to have found. – Voo Mar 03 '14 at 00:34
  • The resize part is rather complicated though so you probably want to get [familiar](http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf) with the underlying FSM. If you can show where the FSM is broken, you probably have excellent chances to give a talk at JavaOne on that - a bug report where the implementation disagrees with the FSM would also make quite some waves. – Voo Mar 03 '14 at 00:50
  • @Voo We are getting off in the weeds here. I said, " What I was getting at is that any data structure where elements can be invalidated by other operations requires synchronization," and "anything that invalidates other items in the container (e.g. having to reallocate space and move the current data to a new location) would break every other thread trying to use it." Take `std::vector` for example. If you were to try to access element X in thread 1 at the same time thread 2 was causing the vector to be resized, you have a race condition that must be synchronized. – Zac Howland Mar 03 '14 at 01:44
  • @Voo If you looked at the HashMap code you linked, you would see `// Since this routine has a fast cutout for copy-already-started, callers // MUST 'help_copy' lest we have a path which forever runs through // 'resize' only to discover a copy-in-progress which never progresses.` The fast cut-out they refer to is the returning of an incomplete hashmap (which is noted in the comments). – Zac Howland Mar 03 '14 at 01:47
  • @Zac I did, I also read the presentation and thought the FSM through and why that's not a problem (the old hashmap is still available until the resize is finished and readers look first in the old one - the presentation also shows how to guarantee that we don't miss updates). And clearly if you can implement a HashMap correctly that allows resizing you already have a working resizable vector (a vector can be thought of as a HashMap with indizes as keys). The only limitation you have is that you need atomic reads/writes of the values, so you're basically limited to working with pointers. – Voo Mar 03 '14 at 07:26

7 Answers

4

Lock-less data structures use atomic operations and may impose additional requirements. For example, the data structure might only be safe for one reader and one writer thread, or some other combination. A simple linked list, for instance, would use atomic reads and writes on the node pointers to guarantee that multiple threads can safely read and write it at the same time.

You may or may not get away with just that. If you need additional guarantees about the content of the data structure, or validation, you are probably not able to achieve this without some form of higher-level locking. Also, not every data structure can be rewritten to be lock-free, even when taking into account additional requirements on how the data structure is used. In those cases, immutable objects might be a solution, but they usually come with performance penalties due to copying, which is not always preferable to locking the object and then mutating it.

JustSid
  • 25,168
  • 7
  • 79
  • 97
3

What I find easy and explainable is this: first write pseudocode for the lock-based (mutex) version of the data structure, then look at how the variables you held a lock on can be modified in a lock-free way with CAS operations. Though others have given great answers, I would like to add that you only get a feel for it if you implement it yourself, ideally by reading the pseudocode from the research paper it was published in.

Here's a queue I implemented in C++ with validation testing for multi-threaded runs:

#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
#define N 1000
using namespace std;

// Michael-Scott style lock-free queue. Note: dequeued nodes are never
// freed here; safe memory reclamation would need hazard pointers or similar.
class lf_queue
{
private:
    struct node
    {   int data;
        atomic<node*> next;
        node(int d):data(d)
        {}
    };
    atomic<node*> Head;
    atomic<node*> Tail;
public:
    lf_queue()
    {
        node *nnode= new node(-1);
        nnode->next=NULL;
        Head=nnode;
        Tail=nnode;
    }
    void enqueue(int data)
    {
        node *nnode= new node(data);
        nnode->next=NULL;
        node *tail,*next_p;
        while(true)
        {
            tail=Tail.load();
            next_p=tail->next;
            if(tail==Tail.load())
            {
                if(next_p==NULL)
                {
                    if((tail->next).compare_exchange_weak(next_p,nnode))
                    break;
                }
                else
                {
                    Tail.compare_exchange_weak(tail,next_p);
                }
            }
        }
        Tail.compare_exchange_weak(tail,nnode);
    }
    bool dequeue(int &res)
    {
        while(true)
        {
            node *head,*tail,*next_p;
            head=Head.load();
            tail=Tail.load();
            next_p=head->next;
            if(head==Head.load())
            {
                if(head==tail)
                {
                    if(next_p==NULL)
                        return false;
                    Tail.compare_exchange_weak(tail,next_p);
                }
                else
                {
                    res=next_p->data;
                    if(Head.compare_exchange_weak(head,next_p))
                        break;
                }
            }
        }//end loop
        return true;
    }
};
void producer(lf_queue &q)
{   //cout<<this_thread::get_id()<<"Inside producer\n";
    for(int i=0;i<N;i++)
    {
       q.enqueue(1);
     }
    //cout<<this_thread::get_id()<<" "<<"Finished producing\n";
}
void consumer(lf_queue &q,atomic<int>& sum)
{   //cout<<this_thread::get_id()<<" "<<"Inside consumer\n";
    for(int i=0;i<N;i++)
    {
        int res=0;
        while(!q.dequeue(res));
        sum+=res;
    }
    //cout<<this_thread::get_id()<<" "<<"Finished consuming\n";
}
int main()
{
    lf_queue Q;
    atomic<int> sum;
    sum.store(0);
    vector<thread> thread_pool;
    for(int i=0;i<10;i++)
    {   if(i%2==0)
        {   thread t(consumer,ref(Q),ref(sum));
            thread_pool.push_back(move(t));
        }
        else
        {
            thread t(producer,ref(Q));
            thread_pool.push_back(move(t));    
        }
    }
    for(int i=0;i<thread_pool.size();i++)
    thread_pool[i].join();
    cout<<"Final sum "<<sum.load()<<"\n";
    return 0;
}

I tried implementing a lock-free linked list using Harris's paper, but ran into complications: with C++11, you can only perform CAS on atomic<> types, and an atomic<node*> can't be cast to long for the bit stealing that Harris's implementation uses to logically mark deleted nodes. However, there are C implementations available on the internet that use low-level cas_ptr operations, which give more flexibility for casting between addresses and longs.

Lasani
  • 41
  • 5
2

There are different primitives available that allow one to construct such lock-free data structures. For example, compare-and-swap (CAS for short) that atomically executes the following code:

CAS(x, o, n)
  if x == o:
    x = n
    return o
  else:
    return x

With this operation, you can do atomic updates. Consider, for example, a very simple linked-list that stores elements in a sorted order, allows you to insert new elements and to check whether an element already exists. The find operation will work as before: it will traverse all the links until it either finds an element, or finds a larger element than the query. Insertion needs to be a little more careful. It could work as follows:

insert(lst, x)
  xn = new-node(x)
  n = lst.head
  while True:
    n = find-before(n, x)
    xn.next = next = n.next
    if CAS(n.next, next, xn) == next:
      break

find-before(n, x) just finds the element that precedes x in the order. This is, of course, just a sketch: things get more complicated once you want to support deletions. I recommend Herlihy and Shavit's "The Art of Multiprocessor Programming." I should also point out that it is often advantageous to switch to a different data structure implementing the same abstraction in order to make it lock-free. For example, if you want a lock-free equivalent of std::map, it would be a pain to do it with a red-black tree, but a skip-list is much more manageable.

foxcub
  • 2,517
  • 2
  • 27
  • 27
1

Lockless structures use atomic instructions to acquire ownership of resources. An atomic instruction locks the variable it is working on at the CPU cache level, which assures you that other cores can't interfere with the operation.

Let's say you have these atomic instructions:

  • read(A) -> A
  • compare_and_swap(A, B, C) -> oldA = A; if (A == B) { A = C }; return oldA;

With these instructions you can simply create a stack:

template<typename T, size_t SIZE>
struct LocklessStack
{
public:
  LocklessStack() : top(0)
  {
  }
  void push(const T& a)
  {
     int slot;
     do
     {
       do
       {
         slot = read(top);
         if (slot == SIZE)
         {
           throw StackOverflow();
         }
       }while(compare_and_swap(top, slot, slot+1) != slot); // retry while the CAS fails
       // NOTE: If this thread stops here, another thread can pop and push
       //       a value, and this thread will overwrite that value [ABA problem].
       //       This solution is for illustrative purposes only.
       data[slot] = a;
     }while( compare_and_swap(top, slot, slot+1) == slot );
  }
  T pop()
  {
     int slot;
     T temp;
     do
     {
       slot = read(top);
       if (slot == 0)
       {
         throw StackUnderflow();
       }
       temp = data[slot-1];
     }while(compare_and_swap(top, slot, slot-1) != slot); // retry while the CAS fails
     return temp;
  }
private:
  volatile int top;
  T data[SIZE];
};

volatile is required so the compiler doesn't reorder the operations during optimization. Two concurrent pushes occur:

The first one enters the while loop and reads slot; then the second push arrives, reads top, and its compare-and-swap (CAS) succeeds and increments top. The other thread wakes up, its CAS fails, and it reads top another time.

Two concurrent pops occur:

Really similar to the previous case; the value needs to be read as well.

One pop and one push occur simultaneously:

pop reads the top and reads temp; push enters and modifies top, pushing a new value. pop's CAS fails, so pop() runs the cycle again and reads the new value,

or

push reads the top and acquires a slot; pop enters and modifies the top value. push's CAS fails and has to cycle again, pushing at a lower index.

Obviously, the following is not guaranteed to hold in a concurrent environment:

stack.push(A);
B = stack.pop();
assert(A == B); // may fail

because while push is atomic and pop is atomic, the combination of the two is not atomic.

The first chapter of Game Programming Gems 6 is a nice reference.

Note: the code is NOT TESTED, and atomics can be really nasty.

ilmale
  • 418
  • 3
  • 12
  • Is the while loop necessary? I thought c_a_s would just use the updated value? – user997112 Mar 01 '14 at 22:06
  • What if you have a stack with 100 elements, you issue a push, and 10 pops. The push halts execution right before `data[slot] = a`, and then 10 pops execute. I think you'll lose the value you are pushing. – foxcub Mar 01 '14 at 22:12
  • @foxcub: yeah, you are right... probably another outer while may help but then the ABA kick in. (push() [10 pop and 10 push()] and you are overriding another value. Atomic are really nasty.) – ilmale Mar 01 '14 at 22:35
  • @user997112: CAS can only check if the variable is changed and then it assign. But is limited to POD value (is really low level, no copy constructor are called), so usually is used to compare pointer or index. – ilmale Mar 01 '14 at 22:39
  • Apart from the other problems, `volatile` doesn't do what you think it does and is **not** enough to guarantee correct functioning! You need `std::atomic`` or other compiler specific builtins (or get your compiler to give you additional guarantees for volatile - MSVC has a flag for that) – Voo Mar 02 '14 at 00:27
  • @Voo: I'm pretty sure to know what volatile do. _"Let's say you have these atomic instruction:"_ suppose that you have these instruction and they are atomic. But even if you have the atomic instruction the optimizer can skip a read or rearrange the read/write order. Also CAS is compiler dependent, so is not a good idea to give an answer that works only on certain standard or with certain compiler. I don't have std::atomic on Ps3 and btw std::atomics have a volatile internal representation, and any intrisic I know require a volatile input. – ilmale Mar 02 '14 at 16:01
  • @ilmale volatile doesn't give any ordering guarantees from the CPU which makes them basically completely useless for concurrent programming. See [Herb's post](http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484). YOu can make volatile useful by setting some flags for VC++ but the standard doesn't give you any useful guarantees. If you need ordering guarantees (and you do for correct multi-threaded programming, that's the core of it!), use `std::atomic` (if volatile in c++ really did what you think it does, `std::atomic` would be useless) – Voo Mar 02 '14 at 16:19
0

Assume a simple operation, that increments a variable by one. If you implement this using "read the variable from memory to the cpu, add 1 to the cpu register, write the variable back", then you have to put some kind of mutex around the whole thing because you want to make sure the 2nd thread won't read the variable until after the first has written it back.

If your processor has an atomic "increment memory location" assembly instruction, you don't need the lock.

Or, assume you want to insert an element into a linked list, which means you need to make the start pointer point to the new element, then make the new element point to the element that was the previous first one. With an atomic "exchange two memory cells" operation, you could write the current start pointer into the "next" pointer of the new element, then swap the two pointers - now, depending on which thread runs first, the elements will be in different order in the list, but the list data structure remains intact.

Basically, it's always about "do several things at once, in one atomic operation, so the operation can't be broken into single parts that might be interrupted".

Guntram Blohm
  • 9,667
  • 2
  • 24
  • 31
0

Your definition of lock-freedom is wrong.

Lock-freedom allows individual threads to starve but guarantees system-wide throughput. An algorithm is lock-free if it satisfies that when the program threads are run sufficiently long, at least one of the threads makes progress (for some sensible definition of progress). https://en.wikipedia.org/wiki/Non-blocking_algorithm

This means that with multiple threads accessing the data structure, only one is guaranteed to succeed; the rest may fail and have to retry.

The important thing about lock-freedom is the probability of a memory collision. A data structure secured with locks will generally be faster than an implementation with atomic variables, but it doesn't scale well when the chance of a collision is small.

Example: multiple threads constantly push_back data into your list. This leads to many collisions, and classical mutexes are fine. However, if you have one thread pushing data at the end of the list and one thread popping data at the front, the situation is different. If the list is not empty, push_back() and pop_front() won't collide (depending on the implementation), because they don't work on the same object. But there is still the chance of an empty list, so you still need to secure the access. In this scenario lock-freedom is the better solution, since you can call both functions simultaneously without having to wait.

In short: lock-free is designed for large data structures, where multiple writers are mostly separated and rarely collide.

I tried to implement a lock-free list container on my own a while ago: https://codereview.stackexchange.com/questions/123201/lock-free-list-in-c

Domso
  • 970
  • 1
  • 10
  • 22
0

Here you go - a very basic (push_front) lock-free list:

template <class T>
class LockFreeList {    
public:
    struct Node {
        T value_;
        Node* next_;
        Node(const T& value) : value_(value), next_(nullptr) {
        }
    };
    
    void push_front(const T& value) {
        Node* new_node = new Node(value);
        Node* old_head = head_;
        do { new_node->next_ = old_head; }
        while (!head_.compare_exchange_weak(old_head, new_node));
    }
    
private:
    std::atomic<Node*> head_{nullptr}; // must be initialized explicitly; a default-constructed std::atomic is uninitialized before C++20
}; 

Inspired by Fedor Pikus's CppCon 2017 talk, see: https://youtu.be/ZQFzMfHIxng?t=2432

With a minor change: push_front here uses compare_exchange_weak.

dwto
  • 278
  • 1
  • 7