I'm hacking on a fifo/queue for a multithreaded program using the following example as a guide. One thread writes to the fifo and another reads from it. The fifo will only be shared between these two threads. The synchronization overhead is massive; adding and removing 10 million items from the fifo takes about three seconds on my machine if I use mutexes and 1.5 seconds if I use spinlocks instead.
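Concretely, the benchmark is just two threads pumping items through the queue. Simplified, it looks like the following sketch (synced_queue_init and synced_queue_remove here stand in for the real constructor and pop, and the real harness also varies the capacity and lock type and divides wall-clock time by the job count):

#include <pthread.h>
#include <stdint.h>

enum { N_JOBS = 10 * 1000 * 1000 };

/* Producer thread: pushes N_JOBS non-NULL tokens. */
static void *
producer_main(void *arg) {
    synced_queue *q = arg;
    for (uintptr_t i = 1; i <= N_JOBS; i++) {
        synced_queue_add(q, (void *)i);
    }
    return NULL;
}

/* Consumer thread: pops the same number of items. */
static void *
consumer_main(void *arg) {
    synced_queue *q = arg;
    for (uintptr_t i = 0; i < N_JOBS; i++) {
        (void)synced_queue_remove(q);
    }
    return NULL;
}

int
main(void) {
    synced_queue *q = synced_queue_init(1024, SYNCED_QUEUE_MUTEX);
    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer_main, q);
    pthread_create(&cons, NULL, consumer_main, q);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return 0;
}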
My blocking add function is as follows; the blocking remove function works analogously (a condensed version of it follows the add function below). The functions were updated after John Bollinger pointed out that the spinlock code was racy with more than one writer or reader.
static void
spin_while_full(volatile synced_queue *me) {
    /* Busy-wait outside the lock; safe only with a single writer. */
    while (me->queue->n_elements == me->queue->capacity) {
    }
}

void
synced_queue_add(synced_queue *me, void *value) {
    synced_queue_lock_type tp = me->lock_type;
    if (tp == SYNCED_QUEUE_SPIN_LOCK) {
        /* Correct variant: re-check the capacity under the lock. */
        while (true) {
            pthread_spin_lock(&me->lock);
            if (me->queue->n_elements < me->queue->capacity) {
                break;
            }
            pthread_spin_unlock(&me->lock);
        }
    } else if (tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        /* Racy with multiple writers: the queue can fill again
           between the unlocked check and taking the lock. */
        spin_while_full(me);
        pthread_spin_lock(&me->lock);
    } else if (tp == SYNCED_QUEUE_MUTEX) {
        pthread_mutex_lock(&me->mutex);
        while (me->queue->n_elements == me->queue->capacity) {
            pthread_cond_wait(&me->var_prod, &me->mutex);
        }
    } else {
        assert(false);
    }
    queue_add(me->queue, value);
    if (tp == SYNCED_QUEUE_SPIN_LOCK ||
        tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        pthread_spin_unlock(&me->lock);
    } else {
        pthread_mutex_unlock(&me->mutex);
        pthread_cond_signal(&me->var_cons);
    }
}
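Condensed, the remove side mirrors the add side (queue_remove is the underlying unsynchronized pop):

static void
spin_while_empty(volatile synced_queue *me) {
    while (me->queue->n_elements == 0) {
    }
}

void *
synced_queue_remove(synced_queue *me) {
    synced_queue_lock_type tp = me->lock_type;
    if (tp == SYNCED_QUEUE_SPIN_LOCK) {
        /* Re-check emptiness under the lock. */
        while (true) {
            pthread_spin_lock(&me->lock);
            if (me->queue->n_elements > 0) {
                break;
            }
            pthread_spin_unlock(&me->lock);
        }
    } else if (tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        spin_while_empty(me);
        pthread_spin_lock(&me->lock);
    } else if (tp == SYNCED_QUEUE_MUTEX) {
        pthread_mutex_lock(&me->mutex);
        while (me->queue->n_elements == 0) {
            pthread_cond_wait(&me->var_cons, &me->mutex);
        }
    } else {
        assert(false);
    }
    void *value = queue_remove(me->queue);
    if (tp == SYNCED_QUEUE_SPIN_LOCK ||
        tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        pthread_spin_unlock(&me->lock);
    } else {
        pthread_mutex_unlock(&me->mutex);
        pthread_cond_signal(&me->var_prod);
    }
    return value;
}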
The library's user sets the lock_type flag to whatever synchronization method they want. The benchmark results for the three lock types are:
=== benchmark_synced_queue (all times in us/job; element counts are queue capacities)

capacity   spinlock   uncontested spinlock   mutex
1024       0.17       0.14                   0.31
512        0.17       0.14                   0.31
256        0.15       0.14                   0.28
128        0.15       0.14                   0.29
64         0.16       0.14                   0.28
32         0.18       0.15                   0.15
16         0.21       0.16                   0.30
8          0.29       0.16                   0.60
4          0.43       0.17                   1.21
So while the "correct" spinlock is a little slower than the racy "uncontested" one, the gap between either spinlock and the mutex is much larger.
The pthread manual pretty much discourages the use of spinlocks:
Spin locks should be employed in conjunction with real-time scheduling policies (SCHED_FIFO, or possibly SCHED_RR). Use of spin locks with nondeterministic scheduling policies such as SCHED_OTHER probably indicates a design mistake. ... User-space spin locks are not applicable as a general locking solution. They are, by definition, prone to priority inversion and unbounded spin times. A programmer using spin locks must be exceptionally careful not only in the code, but also in terms of system configuration, thread placement, and priority assignment.
My question is: why is locking so much slower with mutexes than with spinlocks? The problem with spinlocks, of course, is that they tax the CPU. So is there a way to avoid spinlocks but still get good performance?
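For the single-producer/single-consumer case, one direction I've considered is dropping locks entirely in favor of a ring buffer with C11 atomics. A minimal sketch (the names and the power-of-two capacity are my own, not part of the library):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAPACITY 1024 /* must be a power of two */

typedef struct {
    void *slots[RING_CAPACITY];
    /* head is written only by the consumer, tail only by the
       producer; they live on separate cache lines to avoid
       false sharing. */
    _Alignas(64) atomic_size_t head;
    _Alignas(64) atomic_size_t tail;
} spsc_ring;

/* Called only from the producer thread; returns false if full. */
static bool
spsc_push(spsc_ring *r, void *value) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAPACITY) {
        return false; /* full */
    }
    r->slots[tail & (RING_CAPACITY - 1)] = value;
    /* Release: the slot write becomes visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Called only from the consumer thread; returns false if empty. */
static bool
spsc_pop(spsc_ring *r, void **value) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (tail == head) {
        return false; /* empty */
    }
    *value = r->slots[head & (RING_CAPACITY - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

The acquire/release pairs order the slot accesses against the index updates, so no lock is needed as long as there is exactly one producer and one consumer; but the caller still has to spin or sleep when push or pop fails, so I'm not sure this actually sidesteps the problem.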