
I'm implementing pthread condition variables (based on Linux futexes) and I have an idea for avoiding the "stampede effect" on pthread_cond_broadcast with process-shared condition variables. For non-process-shared cond vars, futex requeue operations are traditionally (i.e. by NPTL) used to requeue waiters from the cond var's futex to the mutex's futex without waking them up, but this is in general impossible for process-shared cond vars, because pthread_cond_broadcast might not have a valid pointer to the associated mutex. In the worst-case scenario, the mutex might not even be mapped in the broadcasting process's address space.

My idea for overcoming this issue is to have pthread_cond_broadcast only directly wake one waiter, and have that waiter perform the requeue operation when it wakes up, since it does have the needed pointer to the mutex.

Naturally there are a lot of ugly race conditions to consider if I pursue this approach, but if they can be overcome, are there any other reasons such an implementation would be invalid or undesirable? One potential issue I can think of, which may not be surmountable, is the race where the waiter (in a separate process) responsible for the requeue gets killed before it can act; but it might be possible to overcome even this by putting the condvar futex on the robust mutex list, so that the kernel performs a wake on it when the process dies.
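For concreteness, here is a minimal sketch of that scheme, assuming a hypothetical layout in which the cond var and the mutex each expose a plain int futex word in shared memory; all of the race handling this question is actually about is omitted:

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* thin wrapper over the raw futex syscall (glibc has no futex() function) */
static long futex_op(int *uaddr, int op, int val,
                     unsigned long val2, int *uaddr2, int val3)
{
    return syscall(SYS_futex, uaddr, op, val, val2, uaddr2, val3);
}

/* broadcaster: wake exactly one waiter; note it never touches the mutex */
static void cond_broadcast_sketch(int *cond_futex)
{
    futex_op(cond_futex, FUTEX_WAKE, 1, 0, NULL, 0);
}

/* the woken waiter, which does know the mutex: requeue all remaining
 * cond waiters onto the mutex's futex word (wake 0, requeue up to
 * INT_MAX), then contend for the mutex itself as usual */
static void waiter_requeue_sketch(int *cond_futex, int *mutex_futex)
{
    futex_op(cond_futex, FUTEX_REQUEUE, 0, INT_MAX, mutex_futex, 0);
}
```

In a real implementation, FUTEX_CMP_REQUEUE (which additionally compares the cond futex against an expected value) would probably be preferable to plain FUTEX_REQUEUE.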

R.. GitHub STOP HELPING ICE

2 Answers


There may be waiters belonging to multiple address spaces, each of which has mapped the mutex associated with the futex at a different address in memory. I'm not sure whether FUTEX_REQUEUE is safe to use when the requeue point may not be mapped at the same address in all waiters; if it is, then this isn't a problem.

There are other problems that won't be detected by robust futexes; for example, if your chosen waiter is busy in a signal handler, you could be kept waiting an arbitrarily long time. [As discussed in the comments, these are not an issue]

Note that with robust futexes, you must set the futex value so that (value & 0x3FFFFFFF) is the TID of the thread to be woken up; you must also set the FUTEX_WAITERS bit if you want a wakeup. This means that you must choose which thread to awaken from the broadcasting thread, or you will be unable to deal with thread death immediately after the FUTEX_WAKE. You'll also need to deal with the possibility of the thread dying immediately before the waker thread writes its TID into the state variable - perhaps having a 'pending master' field that is also registered in the robust mutex system would be a good idea.
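As a rough illustration of that convention (the constants come from <linux/futex.h>; the function and variable names here are invented for the sketch, and the comments below refine exactly whose TID the kernel actually compares against on thread death):

```c
#include <linux/futex.h>   /* FUTEX_TID_MASK, FUTEX_WAITERS */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* Before the FUTEX_WAKE, try to install the chosen waiter's TID (plus
 * FUTEX_WAITERS) into the cond var's futex word, so the robust-futex
 * machinery can flag the word if that thread dies before it requeues.
 * The word must also be reachable from a robust list (set_robust_list)
 * for the kernel to look at it on thread exit. */
static bool encode_pending_requeuer(_Atomic uint32_t *cond_futex,
                                    uint32_t expected, pid_t chosen_tid)
{
    uint32_t desired = ((uint32_t)chosen_tid & FUTEX_TID_MASK) | FUTEX_WAITERS;
    /* CAS so a concurrent change to the word (new waiter, thread death)
     * is noticed instead of being silently overwritten */
    return atomic_compare_exchange_strong(cond_futex, &expected, desired);
}
```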

I see no reason why this can't work, then, as long as you make sure to deal with the thread exit issues carefully. That said, it may be best to simply define in the kernel an extension to FUTEX_WAIT that takes a requeue point and comparison value as an argument, and let the kernel handle this in a simple, race-free manner.

bdonlan
  • Useful info but some mistakes. 1st para, it is safe AFAIK; non-private futexes are always resolved to the underlying page rather than by virtual address. 2nd para, the "chosen" waiter would be chosen by the kernel (futex wake operation), so it couldn't choose a thread that wasn't actually suspended waiting. However, it could end up just as bad, because a signal handler could be invoked immediately after the futex wait returns. I think this is the big show-stopping fault in my design. – R.. GitHub STOP HELPING ICE Sep 24 '11 at 04:42
  • 3rd para, value is not the tid to be woken up, but the tid of the "owner", i.e. the wake happens iff the low bits of the futex value match the tid of the terminating thread and bit 31 (waiters flag) is also set. I agree it's still a problem though since you can't set it atomically with the thread getting woken up. – R.. GitHub STOP HELPING ICE Sep 24 '11 at 04:44
  • And in any case, I think your post answers the question in the negative. Accepted. – R.. GitHub STOP HELPING ICE Sep 24 '11 at 05:08
  • @R.., when I say "chooses", I refer to choosing the thread from the waker side so we can put its TID into the robust futex variable before waking it. – bdonlan Sep 24 '11 at 07:19
  • Looking again, I'm not sure the signal handler issue matters. Even with a "normal" implementation, a signal handler getting invoked right after one thread wakes and takes the mutex will prevent others from making progress. – R.. GitHub STOP HELPING ICE Sep 25 '11 at 01:44
  • @R.., good point, but if you _know_ you're going to pick up the mutex, you can at least _in principle_ block signals for the duration. If you're getting a directed FUTEX_WAIT, you can be signalled right after awakening, and before you have a chance to block signals (if you block signals over the entire cvar wait then no problem of course) – bdonlan Sep 25 '11 at 02:00
  • `pthread_cond_wait` itself picks up the mutex in the process of returning, and it does not block signals. (Blocking them just temporarily would be useless because they'd have to be unblocked before it returns, and the mutex is still locked when it returns..) – R.. GitHub STOP HELPING ICE Sep 25 '11 at 02:20
  • Ahh, you mean the outer mutex, not the one internal to the cvar. Good point; I guess it does block things then. – bdonlan Sep 25 '11 at 03:10
  • Yep. My cvars don't have an inner mutex; everything is built on atomics. They also avoid the sequence-number ABA issue by using the thread ID of the newly-arrived waiter as the value stored in the futex and atomically compared in the futex wait call (see the sketch after these comments). This might have to change if I make them more complex, but I'd rather it not, so that signal/bcast operations are realtime-suitable without resorting to priority inheritance for the lock inside the cvar... – R.. GitHub STOP HELPING ICE Sep 25 '11 at 03:14
  • @R.., I guess my answer will have to change to be in the positive then :) – bdonlan Sep 25 '11 at 03:16
  • Regarding the end of your answer, `FUTEX_WAIT_REQUEUE_PI` already exists for this purpose, but it seems like it could be bad to use it with a non-prio-inheritance mutex... – R.. GitHub STOP HELPING ICE Sep 25 '11 at 03:28
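A hedged sketch of the TID-as-futex-value idea from these comments, assuming a hypothetical single-word cond var and assuming that signal/broadcast changes the word before issuing the wake; all mutex handling is elided:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Each arriving waiter publishes its own TID as the futex value.  If a
 * signal/broadcast (or a newer arrival) changes the word between the
 * store and the FUTEX_WAIT, the compare fails and the wait returns
 * immediately instead of sleeping through the wakeup, which sidesteps
 * the sequence-number ABA issue described above. */
static void cond_wait_arrival_sketch(_Atomic uint32_t *cond_futex)
{
    uint32_t self = (uint32_t)syscall(SYS_gettid);   /* raw kernel TID */
    atomic_store(cond_futex, self);                  /* announce arrival */
    /* ... unlock the user's mutex here ... */
    syscall(SYS_futex, cond_futex, FUTEX_WAIT, self, NULL, NULL, 0);
    /* ... re-acquire the user's mutex before returning ... */
}
```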

I just don't see why you assume that the corresponding mutex might not be known. It is clearly stated:

The effect of using more than one mutex for concurrent pthread_cond_timedwait() or pthread_cond_wait() operations on the same condition variable is undefined; that is, a condition variable becomes bound to a unique mutex when a thread waits on the condition variable, and this (dynamic) binding shall end when the wait returns.

So even for process-shared mutexes and conditions this must hold, and any user-space process must always have mapped the same, unique mutex that is associated with the condition.

Allowing users to associate different mutexes with a condition at the same time is not something I would support.

Jens Gustedt
  • Any thread calling `pthread_cond_wait` (or timedwait) must use the same mutex as all the others, but the mutex pointer is not passed to `pthread_cond_signal`, and it's possible that it's not even mapped in the signaling thread's memory space. Even if it is mapped, if the pointer to the mutex were stored in the condition variable, it would be a pointer in (one of) the waiting process's address space and not necessarily valid in the signaling thread. – R.. GitHub STOP HELPING ICE Sep 24 '11 at 16:01
  • I'm not sure I follow what you mean, but the problem is that you want the waiters to remain asleep until the mutex is unlocked, and then wake up one at a time each time the mutex is unlocked. If you're using a futex other than the one contained in the mutex, I don't see how you could set things up so that unlocking the mutex wakes them... – R.. GitHub STOP HELPING ICE Sep 24 '11 at 17:12
  • My idea was that a thread that is woken up (either directly by `cond_signal` or from the second futex) first queues up for the mutex, signals the next one after obtaining it, and then returns to the caller. That next thread then does the same, but stays blocked until the application code of the first one unlocks the mutex (a rough sketch of this follows below). But I figured that this was roughly what you had been asking in your question, so I deleted my comment :) – Jens Gustedt Sep 24 '11 at 18:36
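A rough sketch of that chain-wakeup idea, with invented names (cond_futex, seen, a single-word cond var layout) and with the futex compare value and all race handling glossed over:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Chain wakeup: each waiter, once woken from the cond futex, queues up
 * on the user's (process-shared) mutex and only after acquiring it wakes
 * the next cond waiter, so exactly one thread is released per unlock
 * instead of the whole stampede. */
static void cond_wait_chain_sketch(uint32_t *cond_futex, uint32_t seen,
                                   pthread_mutex_t *mutex)
{
    /* sleep until signalled; returns at once if the word already changed */
    syscall(SYS_futex, cond_futex, FUTEX_WAIT, seen, NULL, NULL, 0);

    pthread_mutex_lock(mutex);                 /* queue behind the owner */

    /* pass the baton: the next waiter wakes, blocks on the mutex, and
     * repeats this once the application code unlocks it */
    syscall(SYS_futex, cond_futex, FUTEX_WAKE, 1, NULL, NULL, 0);

    /* return with the mutex held, as pthread_cond_wait does */
}
```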