
I'm hacking on a fifo/queue for a multithreaded program using the following example as a guide. One thread writes to the fifo and another reads from it. The fifo will only be shared between these two threads. The synchronization overhead is massive; adding and removing 10 million items from the fifo takes about three seconds on my machine if I use mutexes and 1.5 seconds if I use spinlocks instead.

My blocking add function is as follows; the blocking remove function works analogously. The functions were updated after John Bollinger pointed out that the spinlock code was racy when there is more than one writer or reader.

static void
spin_while_full(volatile synced_queue *me) {
    while (me->queue->n_elements == me->queue->capacity) {
    }
}

void
synced_queue_add(synced_queue *me, void *value) {
    synced_queue_lock_type tp = me->lock_type;
    if (tp == SYNCED_QUEUE_SPIN_LOCK) {
        /* Take the lock before checking; release and retry while the
           queue is still full. */
        while (true) {
            pthread_spin_lock(&me->lock);
            if (me->queue->n_elements < me->queue->capacity) {
                break;
            }
            pthread_spin_unlock(&me->lock);
        }
    } else if (tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        /* Spin on the element count without holding the lock, then
           lock. This is the variant flagged as racy. */
        spin_while_full(me);
        pthread_spin_lock(&me->lock);
    } else if (tp == SYNCED_QUEUE_MUTEX) {
        /* Sleep on the condition variable until there is room. */
        pthread_mutex_lock(&me->mutex);
        while (me->queue->n_elements == me->queue->capacity) {
            pthread_cond_wait(&me->var_prod, &me->mutex);
        }
    } else {
        assert(false);
    }
    queue_add(me->queue, value);
    if (tp == SYNCED_QUEUE_SPIN_LOCK ||
        tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        pthread_spin_unlock(&me->lock);
    } else {
        pthread_mutex_unlock(&me->mutex);
        pthread_cond_signal(&me->var_cons);
    }
}
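
For reference, the relevant parts of the struct look roughly like this (a minimal sketch showing only the fields the code above uses; types are inferred from the calls and the queue type name is assumed):

typedef enum {
    SYNCED_QUEUE_SPIN_LOCK,
    SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED,
    SYNCED_QUEUE_MUTEX
} synced_queue_lock_type;

typedef struct {
    queue *queue;                     /* fixed-capacity fifo with n_elements and capacity */
    synced_queue_lock_type lock_type; /* chosen by the library's user */
    pthread_spinlock_t lock;
    pthread_mutex_t mutex;
    pthread_cond_t var_prod;          /* producer waits here while the queue is full */
    pthread_cond_t var_cons;          /* consumer waits here while the queue is empty */
} synced_queue;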

The library's user sets the lock_type flag to whatever synchronization method they want. The benchmark results for the three lock types are:

=== benchmark_synced_queue
spinlock               1024 elements  0.17 us/job
uncontested spinlock   1024 elements  0.14 us/job
mutex                  1024 elements  0.31 us/job
spinlock                512 elements  0.17 us/job
uncontested spinlock    512 elements  0.14 us/job
mutex                   512 elements  0.31 us/job
spinlock                256 elements  0.15 us/job
uncontested spinlock    256 elements  0.14 us/job
mutex                   256 elements  0.28 us/job
spinlock                128 elements  0.15 us/job
uncontested spinlock    128 elements  0.14 us/job
mutex                   128 elements  0.29 us/job
spinlock                 64 elements  0.16 us/job
uncontested spinlock     64 elements  0.14 us/job
mutex                    64 elements  0.28 us/job
spinlock                 32 elements  0.18 us/job
uncontested spinlock     32 elements  0.15 us/job
mutex                    32 elements  0.15 us/job
spinlock                 16 elements  0.21 us/job
uncontested spinlock     16 elements  0.16 us/job
mutex                    16 elements  0.30 us/job
spinlock                  8 elements  0.29 us/job
uncontested spinlock      8 elements  0.16 us/job
mutex                     8 elements  0.60 us/job
spinlock                  4 elements  0.43 us/job
uncontested spinlock      4 elements  0.17 us/job
mutex                     4 elements  1.21 us/job

So while the "correct" spinlock is a little slower than the incorrect one, the difference between the spinlock types and the mutex is much bigger.

The pthread manual pretty much discourages the use of spinlocks:

Spin locks should be employed in conjunction with real-time scheduling policies (SCHED_FIFO, or possibly SCHED_RR). Use of spin locks with nondeterministic scheduling policies such as SCHED_OTHER probably indicates a design mistake. ... User-space spin locks are not applicable as a general locking solution. They are, by definition, prone to priority inversion and unbounded spin times. A programmer using spin locks must be exceptionally careful not only in the code, but also in terms of system configuration, thread placement, and priority assignment.

My question is: why is locking so much slower with mutexes than with spinlocks? The problem with spinlocks is of course that they tax the CPU. So is there a way to avoid using spinlocks but still get good performance?

Björn Lindqvist
  • How many CPU cores do you run on? How many threads do you have contesting for the lock? Try with a bunch of threads competing for the lock, many more than you have CPU cores... – Andrew Henle Aug 23 '23 at 22:01
  • As stated, I only have two threads; one that writes to the fifo and another that reads from it. The machines I'm benchmarking on have between four and 24 cores. – Björn Lindqvist Aug 23 '23 at 22:29
  • But if you try your experiment with many more threads than you have cores, you might get some insight into the tradeoffs between mutexes and spinlocks. – Andrew Henle Aug 23 '23 at 22:48
  • What OS are you using? – Shawn Aug 23 '23 at 23:10

2 Answers


Why are pthread spinlocks so much faster than mutexes?

Marco has described several good reasons for the specific performance difference you observed, but he didn't make one important observation: the relative timing of your spinlock version vs your mutex + cv version is meaningless because these are not equivalent to each other.

Your mutex + cv alternative correctly waits for the queue to have an available element / have available capacity before proceeding to dequeue or enqueue an element, without any data races.

On the other hand, your spinlock version is racy. It has data races when it spins waiting for the queue to be in a suitable state to proceed. And if there is more than one reader or more than one writer, your queue can break, trying to dequeue an element when there aren't any enqueued or to enqueue an element when the queue is already at capacity, because of the gap between testing whether the queue is ready and successfully acquiring the spinlock.

To avoid those races, the spinlock version must acquire the spinlock before testing the state of the queue, and if that is found unsuitable, it must release the spinlock and reacquire it before testing again. You can't substitute a spinlock for a mutex for use in conjunction with a CV, so that's going to be a tight loop. For example:

void synced_queue_remove(synced_queue *me, void *value) {
    while (1) {
        pthread_spin_lock(&me->lock);
        if (me->queue->n_elements > 0) break;
        pthread_spin_unlock(&me->lock);
    }
    queue_remove(me->queue, value);
    pthread_spin_unlock(&me->lock);
}

I predict that that will tend to perform very poorly indeed, but such a result speaks to the code overall. It does not make a direct comparison between mutex performance and spinlock performance itself any more than the original example does.
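
If you do want such a direct comparison, sub mutexes into the same structure, abandoning the blocking wait. Roughly (a sketch, not benchmarked):

/* Same shape as above, with the mutex substituted for the spinlock:
   busy-wait by repeatedly locking, checking, and unlocking. */
void synced_queue_remove(synced_queue *me, void *value) {
    while (1) {
        pthread_mutex_lock(&me->mutex);
        if (me->queue->n_elements > 0) break;
        pthread_mutex_unlock(&me->mutex);
    }
    queue_remove(me->queue, value);
    pthread_mutex_unlock(&me->mutex);
}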

John Bollinger
  • I updated my question. But even with your correct spinlocking (which isn't needed as there is only one writer and one reader), the mutex variant is way slower. – Björn Lindqvist Aug 24 '23 at 15:57
  • @BjörnLindqvist correct spinlocking is required even with only one reader and one writer. Without, your example has undefined behavior as a result of the data races, which in practice can very much manifest as *mis*behavior. Having only one of each protects you only from the queue state becoming incorrect in the gap between checking it and acquiring the spinlock. – John Bollinger Aug 24 '23 at 17:34
  • @BjörnLindqvist, in any case, your revised code still is not an apples-to-apples comparison. If you want that then take the CV out and in its place structure the mutex version the same as the `SYNCED_QUEUE_SPIN_LOCK` version. – John Bollinger Aug 24 '23 at 17:39
  • Isn't it? How would you implement the blocking wait without condition variables or spinning? – Björn Lindqvist Aug 24 '23 at 17:52
  • No, it isn't, @BjörnLindqvist. As I already said, if you want an apples-to-apples comparison of the performance of spinlocks *vs*. that of mutexes then the mutex version should be structured exactly the same way as the spinlock version. This answer already demonstrates such a form (for spinlocks). Sub in mutexes. Among other things, that abandons the blocking wait in the mutex case. – John Bollinger Aug 24 '23 at 17:58
  • I forgot to address your other comment. There is only one producer and one consumer, so as soon as the thread has checked that the fifo is non-empty/non-full, it will continue to stay that way until the thread has added to or removed from it. There is no possibility of a race. Yes, I could use busy loops instead of condition variables for the mutex case, but then it would waste CPU just like a spinlock, so what would be the point? Might as well use spinlocks... – Björn Lindqvist Aug 24 '23 at 21:05
  • @BjörnLindqvist, that is why the gap between checking the queue state and acquiring the mutex or spinlock is not problematic in the single-producer, single-consumer case. But it has nothing to do with the data races involving `me->queue->n_elements`. You have two threads accessing that object, and some of the accesses are writes. If the accesses by the different threads are not synchronized then that is the (C) *definition* of "data race". The behavior of a program containing a data race is undefined, and in practice, such programs often do misbehave. – John Bollinger Aug 24 '23 at 22:21

I'll give you my educated guess. It needs some testing to be proven right/wrong, which you should be able to perform yourself.

First of all let me highlight the fundamental implementation difference between spinlocks and mutexes:

  1. Pthread spinlocks are implemented almost entirely in userspace (though they may fall back to actually sleeping through syscalls in case of high contention, depending on the implementation). As the name suggests, they make the current CPU "spin" while waiting to acquire the lock. This is usually implemented with atomic compare-and-exchange CPU instructions (a toy version is sketched right after this list).
  2. Pthread mutexes are implemented through the futex(2) syscall (at least on Linux) or in any case through a set of syscalls. Unlike spinlocks, they avoid CPU usage while waiting for a lock by making the current thread sleep at least until the lock is ready to be acquired.
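
As a toy illustration of the spinning in point 1 (simplified; the actual pthread implementation is more elaborate, and the names below are made up):

#include <stdatomic.h>

typedef struct {
    _Atomic int held; /* 0 = free, 1 = taken; initialize to 0 */
} toy_spinlock;

static void toy_spin_lock(toy_spinlock *l) {
    int expected = 0;
    /* Atomically swap 0 -> 1. On failure, `expected` is overwritten
       with the current value, so reset it and keep spinning in
       userspace; no syscall is ever made. */
    while (!atomic_compare_exchange_weak_explicit(
               &l->held, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;
    }
}

static void toy_spin_unlock(toy_spinlock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}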

Your implementation uses a busy loop in the case of spinlocks vs. a pthread condition variable in the case of mutexes. Again, the first case is likely a userspace-only job and the second case implies sleeping/waking threads through syscalls. This further polarizes the two cases you are testing: in the first one (spinlocks) you will have two threads that almost entirely hog the CPU for their exclusive use, while in the second one (mutexes) you will have two threads that "intelligently" sleep and wake, trying to minimize CPU usage when it is unneeded.

By itself, this already introduces a big difference in overhead between the two implementations: the performance of the first one will more or less only be limited by the speed of your CPUs, while the second one also has to rely on the underlying kernel for task sleep/wakeup. Performing a syscall (mutex) is a much more expensive operation than a few hundred CMPXCHG instructions (spinlock).

On top of this, modern CPUs and operating systems use dynamic CPU frequency scaling, which essentially means that the CPU clock speed of each core is not fixed, but continuously adjusted (independently from other cores) based on its utilization either by the kernel, the hardware itself, or a combination of the two (see e.g. Intel P-state).

Therefore, what is likely happening is that the intermittent sleeping caused by mutexes in your second implementation is further degrading performance by making the CPU constantly change clock speed like a roller coaster. I have talked about this phenomenon in more detail (specifically focusing on Linux) in this other answer of mine.

Marco Bonelli
  • You may be correct, but the question is then what to do about it? I don't want to waste CPU time with spinlocks, but I also don't want DVFS and syscall overhead ruining my day. – Björn Lindqvist Aug 24 '23 at 16:02
  • Well unfortunately you gotta pick one :') there isn't much else to do except altering the scaling driver/governor to improve the second case. Can you afford to saturate 2 CPUs of your system with this? Then go for the spinlock variant. Do you have to distribute this software for general purpose usage? Go with the mutex variant or provide an option to the users. – Marco Bonelli Aug 24 '23 at 16:50