Linux 3.0: futex-lock deadlock bug?

Question

// SubFetch(x,y) = atomically x-=y and return x (__sync_sub_and_fetch)
// AddFetch(x,y) = atomically x+=y and return x (__sync_add_and_fetch)
// CompareWait(x, y) = futex(&x, FUTEX_WAIT, y) wait on x if x == y
// Wake(x, y) = futex(&x, FUTEX_WAKE, y) wake up y waiters

struct Lock
{
Lock() : x(1) {}

void lock()
{
    while (true)
    {
        if (SubFetch(x, 1) == 0)
            return;

        x = -1;

        CompareWait(x, -1);
    }
}

void unlock()
{
    if (AddFetch(x, 1) == 1)
        return;

    x = 1;

    Wake(x, 1);
}

private:
    int x;
};

Linux 3.0 provides a system call called futex, upon which many concurrency utilities are based including recent pthread_mutex implementations. Whenever you write code you should always consider whether using an existing implementation or writing it yourself is the better choice for your project.

Above is an implementation of a Lock (mutex, 1 permit counting semaphore) based upon futex and the semantics description in man futex(7)

It appears to contain a deadlock bug whereby after multiple threads are trying to lock and unlock it a few thousand times, the threads can get into a state where x == -1 and all the threads are stuck in CompareWait, however noone is holding the lock.

Can anyone see where the bug is?

Update: I'm a little surprised that futex(7)/semantics is so broken. I completely rewrote Lock as follows... is this correct now?

// CompareAssign(x,y,z) atomically: if (x == y) {x = z; ret true; } else ret false;

struct Lock
{
Lock() : x(0) {}

void lock()
{
    while (!CompareAssign(x, 0, 1))
        if (x == 2 || CompareAssign(x, 1, 2))
            CompareWait(x, 2);
}

void unlock()
{
    if (SubFetch(x, 1) == 0)
        return;

    x = 0;

    Wake(x, 1);
}

private:
int x;
};

The idea here is that x has the following three states:

0: unlocked
1: locked & no waiters
2: locked & waiters

I've heard futexes seem deceivingly simple, it's quite hard to properly use them. Is there any reason you're doing this? — Karoly Horvath, Mar 08 '12 at 16:43
What do you mean by futex sematic "so broken". It is an interface between user space atomic operations and kernel based thread waiting and signalling. It is minimal to achieve exactly that, and to do that very efficiently. It is not meant to be an application level programming API. — Jens Gustedt, Mar 08 '12 at 19:14
As the futex documentation states it is subject to spurious wakeups, do you check for and correctly handle EINTR ? — nos, Mar 08 '12 at 19:27
@JensGustedt: I mean the implementation description in futex(7) / section "semantics" is incorrect and full of race conditions. The rest of your comment is irrelevant. — Andrew Tomazos, Mar 08 '12 at 19:33
@nos: Yes, CompareWait and Wake throw in the case of an error. — Andrew Tomazos, Mar 08 '12 at 19:34
There is this paper from drepper, explaining a lot about futexes, and what pifalls there are with them, and why trying to create mutexes with them the naive way wont work. I suggest you search for that one and read it. — PlasmaHH, Mar 13 '12 at 21:48

score 4 · Accepted Answer · answered Mar 08 '12 at 16:49

4

The problem is that you explicitly -1 assign to x if the SubFetch fails to acquire the lock. This races with the unlock.

Thread 1 acquires the lock. x==0.
Thread 2 tries to acquire the lock. The SubFetch sets x to -1, and then thread 2 is suspended.
Thread 1 releases the lock. The AddFetch sets x to 0, so the code then explicitly sets x to 1 and calls Wake.
Thread 2 wakes up and sets x to -1, and then calls CompareWait.

Thread 2 is now stuck waiting, with x set to -1, but there is no one around to wake it, as thread 1 has already released the lock.

answered Mar 08 '12 at 16:49

Anthony Williams

66,628
14
133
155

I have clearly misread futex(7) where it says "should set the counter to -1". What is the fix? – Andrew Tomazos Mar 08 '12 at 17:13
There is code for a futex-based mutex at http://www.justsoftwaresolutions.co.uk/files/c_mutexes.zip Basically, you cannot blindly just set the value of `x`, you have to use an atomic compare-and-swap operation, and check the return value in all cases. – Anthony Williams Mar 08 '12 at 17:30

MSB · Answer 2 · 2012-03-13T21:30:45.003

The proper implementation of a futex-based Mutex is described in Ulrich Drepper's paper "Futexes are tricky"

http://people.redhat.com/drepper/futex.pdf

It includes not only the code but also a very detailed explanation of why it is correct. The code from the paper:

class mutex
{
 public:
 mutex () : val (0) { }
 void lock () {
   int c;
   if ((c = cmpxchg (val, 0, 1)) != 0)
     do {
       if (c == 2 || cmpxchg (val, 1, 2) != 0)
         futex_wait (&val, 2);
     } while ((c = cmpxchg (val, 0, 2)) != 0);
 }
 void unlock () {
//NOTE: atomic_dec returns the value BEFORE the operation, unlike your SubFetch !
   if (atomic_dec (val) != 1) {
     val = 0;
     futex_wake (&val, 1);
   }
 }
 private:
   int val;
};

Comparing the code in the paper with your code, I spot a difference

You have

if (x == 2 || CompareAssign(x, 1, 2))

using the futex's value directly whereas Drepper uses the return value from the previous CompareAssign(). That difference will probably affect performance only.

Your unlock code is different, too, but seems to be semantically equivalent.

In any case I would strongly advise you to follow Drepper's code to the letter. That paper has stood the test of time and received a lot of peer review. You gain nothing from rolling your own.

score 0 · Answer 3 · answered Mar 08 '12 at 16:55

How about this scenario with three threads, A, B , and C.

The initial state of this scenario has:

thread A holding the lock
thread B not contending for the lock just yet
thread C in CompareWait()
x == -1 from when C failed to acquire the lock

        A                  B                C
    ==============   ================  ===============
     AddFetch()
     (so x == 0)
                       SubFetch()
                       (so x == -1)

     x = 1

                       x = -1

     Wake()

At this point whether B or C are unblocked, they will not get a result of 0 when they SubFetch().

Ok, either I have completely misunderstood semantics section of futex(7) or the description is completely broken. What is the fix here? — Andrew Tomazos, Mar 08 '12 at 17:16

Linux 3.0: futex-lock deadlock bug?

3 Answers3