
Where can I find documentation for "adaptive" pthread mutexes? The symbol PTHREAD_MUTEX_ADAPTIVE_NP is defined on my system, but the only documentation I can find online says nothing about what an adaptive mutex is, or when it's appropriate to use.

So... what is it, and when should I use it?

For reference, my version of libc is:

GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.5) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.50 system on 2013-09-30.
Available extensions:
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    Native POSIX Threads Library by Ulrich Drepper et al
    BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.

and "uname -a" gives

Linux desktop 3.2.0-55-generic #85-Ubuntu SMP Wed Oct 2 12:29:27 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
– laslowh

3 Answers


PTHREAD_MUTEX_ADAPTIVE_NP is something that I invented while working as a glibc contributor on making LinuxThreads more reliable and perform better. LinuxThreads was the predecessor of glibc's NPTL library; it was originally developed as a stand-alone library by Xavier Leroy, who is also well known as one of the creators of OCaml.

The adaptive mutex survived into NPTL in essentially unmodified form: the code is nearly identical, including the magic constants for the estimator smoothing and the maximum spin relative to the estimator.

Under SMP, when you go to acquire a mutex and see that it is locked, it can be sub-optimal to simply give up and call into the kernel to block. If the owner of the lock only holds the lock for a few instructions, it is cheaper to just wait for the execution of those instructions, and then acquire the lock with an atomic operation, instead of spending hundreds of extra cycles by making a system call.

The kernel developers know this very well, which is one reason why we have spinlocks in the Linux kernel for fast critical sections. (Among the other reasons is, of course, that code which cannot sleep, because it is in an interrupt context, can acquire spinlocks.)

The question is, how long should you wait? If you spin forever until the lock is acquired, that can be sub-optimal. User space programs are not as well-written as kernel code (cough). They can have long critical sections. They also cannot disable pre-emption, so a critical section can sometimes blow up in duration because of a context switch. (POSIX threads now provide real-time tools to deal with this: you can give threads a real-time priority with FIFO scheduling, and configure processor affinity.)
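
To make that parenthetical concrete, here is a minimal sketch of those real-time knobs using the standard pthreads attribute calls. The priority value, the chosen CPU, and the `worker` function are arbitrary illustrations, and SCHED_FIFO normally requires root or CAP_SYS_NICE.

```c
#define _GNU_SOURCE              /* for pthread_attr_setaffinity_np */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Placeholder worker; imagine short, contended critical sections inside. */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 10 };  /* arbitrary RT priority */
    cpu_set_t cpus;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);    /* real-time FIFO class */
    pthread_attr_setschedparam(&attr, &sp);

    CPU_ZERO(&cpus);                                   /* pin to CPU 0 */
    CPU_SET(0, &cpus);
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    int err = pthread_create(&tid, &attr, worker, NULL);
    if (err != 0)      /* fails without CAP_SYS_NICE or appropriate rlimits */
        fprintf(stderr, "pthread_create: %s\n", strerror(err));
    else
        pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    return 0;
}
```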

I think we experimented with fixed iteration counts, but then I had this idea: why should we guess when we can measure? Why not implement a smoothed estimator of the lock duration, similar to the TCP retransmission time-out (RTO) estimator? Each time we spin on a lock, we measure how many spins it actually took to acquire it. Moreover, we should not spin forever: we should spin at most, say, twice the current estimator value. When we take a measurement, we can smooth it exponentially, in just a few instructions: take a fraction of the previous value and a fraction of the new value and add them together, which is the same as adding a fraction of their difference back to the estimator: say, estimator += (new_val - estimator)/8 for a 1/8-to-7/8 blend of the new and old values.

You can think of this as a watchdog. Suppose that the estimator tells you that the lock, on average, takes 80 spins to acquire. You can be quite confident, then, that if you have executed 160 spins, then something is wrong: the owner of the lock is executing some exceptionally long case, or maybe has hit a page fault or was otherwise preempted. At this point the waiting thread cuts its losses and calls into the kernel to block.
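
Putting the estimator and the watchdog together, here is a rough C11 sketch of the technique as described above. It is an illustration, not the glibc code: the struct, the initial estimate of 10, and the futex helpers are assumptions; only the 1/8 smoothing and the 2x spin cap come from the description.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

struct adaptive_lock {
    atomic_int locked;        /* 0 = free, 1 = held */
    int        spin_estimate; /* smoothed estimate of spins needed to acquire */
};
#define ADAPTIVE_LOCK_INIT { 0, 10 }   /* arbitrary initial estimate */

static void futex_wait(atomic_int *addr, int val)
{
    /* Sleep only if *addr still equals val, so a racing unlock is not lost. */
    syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void adaptive_lock_acquire(struct adaptive_lock *l)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(&l->locked, &expected, 1))
        return;                                   /* uncontended fast path */

    /* Contended: spin, but never more than twice the current estimate. */
    int max_spins = 2 * l->spin_estimate;
    for (int spins = 1; spins <= max_spins; spins++) {
        /* A real implementation would execute a pause/nop here as well. */
        if (atomic_load_explicit(&l->locked, memory_order_relaxed) == 0) {
            expected = 0;
            if (atomic_compare_exchange_strong(&l->locked, &expected, 1)) {
                /* Acquired by spinning: estimator += (new - estimator) / 8.
                 * The update is deliberately racy; it is only a heuristic. */
                l->spin_estimate += (spins - l->spin_estimate) / 8;
                return;
            }
        }
    }

    /* Watchdog tripped: the owner is taking unusually long (long critical
     * section, page fault, preemption). Cut our losses and block in the
     * kernel until the lock is released. */
    for (;;) {
        expected = 0;
        if (atomic_compare_exchange_strong(&l->locked, &expected, 1))
            return;
        futex_wait(&l->locked, 1);
    }
}

static void adaptive_lock_release(struct adaptive_lock *l)
{
    atomic_store(&l->locked, 0);
    futex_wake(&l->locked);   /* a real implementation tracks waiters and
                                 skips this syscall when nobody is waiting */
}
```

The real thing lives inside glibc's pthread_mutex_lock for the adaptive type; the sketch above just makes the control flow visible.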

Without measurement, you cannot do this accurately: there is no "one size fits all" value. Say, a fixed limit of 200 spins would be sub-optimal in a program whose critical sections are so short that a lock can almost always be fetched after waiting only 10 spins. The mutex locking function would burn through 200 iterations every time there is an anomalous wait time, instead of nicely giving up at, say, 20 and saving cycles.

This adaptive approach is specialized, in the sense that it will not work for all locks in all programs, which is why it is packaged as a special mutex type. For instance, it will not work well for programs that hold mutexes for long periods: periods so long that more CPU time is wasted spinning up to the large estimator values than would have been spent going into the kernel. The approach is also not suitable for uniprocessors: only one thread can run at a time, so while the waiter spins, the lock owner cannot make any progress toward releasing the lock. Nor is the approach suitable when fairness is important: it is an opportunistic lock. No matter how many other threads have been waiting, for how long, or at what priority, a new thread can come along and snatch the lock.

If you have very well-behaved code with short critical sections that are highly contended, and you're looking for better performance on SMP, the adaptive mutex may be worth a try.
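
If you want to try it, the type is selected through a mutex attribute. A minimal usage sketch (the _NP constant is a GNU extension, so _GNU_SOURCE is needed, and you compile with -pthread):

```c
#define _GNU_SOURCE            /* exposes PTHREAD_MUTEX_ADAPTIVE_NP */
#include <pthread.h>

static pthread_mutex_t lock;

int main(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&lock);
    /* short, highly contended critical section goes here */
    pthread_mutex_unlock(&lock);

    pthread_mutex_destroy(&lock);
    return 0;
}
```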

– Kaz
  • In the source code I see that it busy loops using nops. Is there any reason why you do not busy loop with `sched_yield` like [WebKit](https://webkit.org/blog/6161/locking-in-webkit/) does? – Hongli May 11 '16 at 01:21
  • @HongLi Using `sched_yield` in a loop is obsolete. Linux developed futexes about a decade and a half ago. With a futex, we can use an atomic operation (without kernel intervention) to get a lock. If that fails, we can go into the kernel to wait on the futex (rather than executing wasteful `sched_yield` calls which don't wait for anything). We can also spin on a futex: that is, try the atomic operation a number of times, before calling the futex wait operation. The WebKit approach is obsolete, based on outdated concepts. – Kaz May 11 '16 at 16:28
  • @Kaz Thanks for the reply. I understand that `sched_yield` does not wait for anything, but the WebKit blog post makes the case that they busy loop on `sched_yield` in order to optimize for microcontended cases: contended cases in which the critical section is short. If a `sched_yield` has less constant time overhead than a futex syscall, then it seems plausible to me that busy looping on `sched_yield` for a short number of iterations is beneficial when optimizing for microcontended cases. Do you agree with this? If not, then are you saying that a futex syscall's constant overhead is very low? – Hongli May 18 '16 at 01:45
  • @Hongli With `futex` you have to make a syscall in both the locking thread _and_ the unlocking thread. With `sched_yield` you only need one syscall, in the locking thread. The problem is that `sched_yield` is unpredictable and might give you different results due to the scheduler being hesitant about migrating threads from one core to another that is far away in the sense of cache transfer time. –  Jun 26 '21 at 17:48

The symbol is mentioned here:

http://elias.rhi.hi.is/libc/Mutexes.html

"LinuxThreads supports only one mutex attribute: the mutex type, which is either PTHREAD_MUTEX_ADAPTIVE_NP for "fast" mutexes, PTHREAD_MUTEX_RECURSIVE_NP for "recursive" mutexes, PTHREAD_MUTEX_TIMED_NP for "timed" mutexes, or PTHREAD_MUTEX_ERRORCHECK_NP for "error checking" mutexes. As the NP suffix indicates, this is a non-portable extension to the POSIX standard and should not be employed in portable programs.

The mutex type determines what happens if a thread attempts to lock a mutex it already owns with pthread_mutex_lock. If the mutex is of the "fast" type, pthread_mutex_lock simply suspends the calling thread forever. If the mutex is of the "error checking" type, pthread_mutex_lock returns immediately with the error code EDEADLK. If the mutex is of the "recursive" type, the call to pthread_mutex_lock returns immediately with a success return code. The number of times the thread owning the mutex has locked it is recorded in the mutex. The owning thread must call pthread_mutex_unlock the same number of times before the mutex returns to the unlocked state.

The default mutex type is "timed", that is, PTHREAD_MUTEX_TIMED_NP."
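
To see the re-locking behaviour described above directly, here is a small sketch using the non-portable "error checking" type from the quote (modern code would normally use the portable PTHREAD_MUTEX_ERRORCHECK instead):

```c
#define _GNU_SOURCE            /* for the *_NP mutex type constants */
#include <pthread.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK_NP);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&m);
    int rc = pthread_mutex_lock(&m);    /* re-lock by the owning thread */
    if (rc == EDEADLK)
        puts("error-checking mutex: re-lock returns EDEADLK");

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return 0;
}
```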

EDIT: updated with info found by jthill (thanks!)

A little more info on the mutex flags and PTHREAD_MUTEX_ADAPTIVE_NP can be found here:

"The PTHRED_MUTEX_ADAPTIVE_NP is a new mutex that is intended for high throughput at the sacrifice of fairness and even CPU cycles. This mutex does not transfer ownership to a waiting thread, but rather allows for competition. Also, over an SMP kernel, the lock operation uses spinning to retry the lock to avoid the cost of immediate descheduling."

Which basically suggests the following: where high throughput is desirable, such a mutex can be used, but it requires extra consideration in the thread logic due to its very nature. You will have to design an algorithm that can take advantage of these properties, resulting in high throughput: something that load-balances itself from within (as opposed to "from the kernel"), where the order of execution is unimportant.

There was a very good book on Linux/Unix multithreaded programming whose name escapes me. If I find it I'll update.

– Sebastien
  • Yeah, I meant to link to that page in my question, fixed. My problem with that is that it tells me nothing about what each of those types is or does. And further down the page, it gives the impression that the only differences are what happens when a thread tries to re-lock a mutex. I'm pretty sure there are other differences. – laslowh Nov 08 '13 at 16:28
  • Well the following paragraph on the page seems pretty clear to me. The value picked changes the behavior of the pthread_mutex_lock function when applied to a mutex held by the calling thread. (edited the answer accordingly) – Sebastien Nov 08 '13 at 16:33
  • Clear, but incomplete. How does a "timed" mutex behave under those circumstances? Are there other differences between the mutex types? Why is it called an "adaptive" or "fast" mutex if its defining characteristic is that it deadlocks threads upon multiple locks? – laslowh Nov 08 '13 at 16:37

Here you go. As I read it, it's a brutally simple mutex that doesn't care about anything except making the no-contention case run fast.

– jthill
  • Mind if I merge your answer with mine? I would at least copy paste the relevant paragraph here. Thanks for the info! – Sebastien Nov 15 '13 at 19:54
  • @Sebastien happy to help, go for it. Near as I can tell (though I haven't dug all that hard), the doc for this is going to be the source. – jthill Nov 15 '13 at 21:21
  • @jthill: Nice find. It's crazy that the only documentation that we can collectively find is an email on the Linux kernel list. – laslowh Nov 18 '13 at 19:53
  • This is little more than a link only answer, I am afraid. – Suma Sep 16 '21 at 17:41