
I'm writing a latency-sensitive app that in effect wants to wait on multiple condition variables at once. I've read about several ways to get this functionality on Linux (apparently it's built in on Windows), but none of them seem suitable for my app. The methods I know of are:

  1. Have one thread wait on each of the condition variables you want to wait on, which when woken will signal a single condition variable which you wait on instead.

  2. Cycling through multiple condition variables with a timed wait.

  3. Writing dummy bytes to files or pipes instead, and polling on those.

#1 & #2 are unsuitable because they cause unnecessary sleeping. With #1, you have to wait for the dummy thread to wake up, then signal the real thread, then for the real thread to wake up, instead of the real thread just waking up to begin with -- the extra scheduler quantum spent on this actually matters for my app, and I'd prefer not to have to use a full-fledged RTOS. #2 is even worse: you potentially spend N * timeout time asleep, or your timeout will be 0, in which case you never sleep (endlessly burning CPU and starving other threads is also bad).

For #3, pipes are problematic because if the thread being 'signaled' is busy or even crashes (I'm in fact dealing with separate processes rather than threads -- the mutexes and conditions would be stored in shared memory), then the writing thread will get stuck once the pipe's buffer fills up, as will any other clients. Files are problematic because they would keep growing endlessly the longer the app ran.

Is there a better way to do this? I'm also curious about answers appropriate for Solaris.

Joseph Garvin
  • I'm encountering this limitation in the C++0x threading primitives too, which seem to be heavily based on a pthreads least common denominator. – Marsh Ray May 19 '11 at 23:38
  • Can't you use a single semaphore instead? Once the waiting thread gets a unit, it can poll the various sources to find one that has 'fired' (maybe an array of volatile booleans?). – Martin James Jul 22 '13 at 15:15

4 Answers


Your #3 option (writing dummy bytes to files or pipes instead, and polling on those) has a better alternative on Linux: eventfd.

Instead of a limited-size buffer (as in a pipe) or an infinitely-growing buffer (as in a file), with eventfd you have an in-kernel unsigned 64-bit counter. An 8-byte write adds a number to the counter; an 8-byte read either zeroes the counter and returns its previous value (without EFD_SEMAPHORE), or decrements the counter by 1 and returns 1 (with EFD_SEMAPHORE). The file descriptor is considered readable to the polling functions (select, poll, epoll) when the counter is nonzero.

Even if the counter is near the 64-bit limit, the write will just fail with EAGAIN if you made the file descriptor non-blocking. The same happens with read when the counter is zero.
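
For illustration, here is a minimal sketch of how that can look (Linux-only; the two-source setup and the names ev1/ev2 are made up for the example, and error checking is omitted):

#include <sys/eventfd.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // One eventfd per event source; both can be watched with a single poll().
    int ev1 = eventfd(0, EFD_NONBLOCK);
    int ev2 = eventfd(0, EFD_NONBLOCK);

    // A signaller (possibly another process, if the descriptor was inherited
    // or passed over a Unix domain socket) adds to the counter with an
    // 8-byte write.
    uint64_t one = 1;
    write(ev1, &one, sizeof one);

    struct pollfd fds[2] = {
        { .fd = ev1, .events = POLLIN },
        { .fd = ev2, .events = POLLIN },
    };

    if (poll(fds, 2, -1) > 0) {
        for (int i = 0; i < 2; i++) {
            if (fds[i].revents & POLLIN) {
                uint64_t count;
                // Without EFD_SEMAPHORE the read returns the accumulated
                // count and resets the counter to zero.
                read(fds[i].fd, &count, sizeof count);
                printf("source %d fired %llu time(s)\n",
                       i + 1, (unsigned long long)count);
            }
        }
    }

    close(ev1);
    close(ev2);
    return 0;
}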

CesarB
  • Awesome, this has a lot of use cases. – Joseph Garvin Oct 03 '11 at 14:21
  • It's too bad that eventfd doesn't allow "naming". That is, you can't use eventfd (or any of the other *fd mechanisms) between processes that do not have a parent-child relationship with handle sharing. Otherwise, these mechanisms would completely replace the outdated, functionally deficient, performance-wise inefficient POSIX named crap we have right now (which does similar things) but can't be mixed with other fd waits or waited on by epoll/poll/select. – Michael Goldshteyn Nov 07 '11 at 13:55
  • @Michael Goldshteyn: can't you use file descriptor passing via Unix domain sockets to pass the eventfd descriptor (or any other file descriptor) to an unrelated process? – CesarB Nov 08 '11 at 13:38
  • @CesarB, yes that is an option, but it adds a lot of complexity for what should be a simple feature: Named fd based events and timers. – Michael Goldshteyn Nov 08 '11 at 14:00
  • This however costs one round trip to the kernel per read, even if the eventfd is already "signaled". Not the "best" method compared to a semaphore. – xryl669 Aug 14 '14 at 17:58
  • Where could I find an example of using eventfd to signal pthread condition variables? Thank you. – Frank Apr 21 '16 at 06:12

If you are talking about POSIX threads, I'd recommend using a single condition variable plus a set of event flags or something similar. The idea is to use the condition variable's companion mutex to guard the event notifications; you have to re-check for events after pthread_cond_wait() returns anyway. Here is some old code of mine to illustrate this, from a training session (yes, I checked that it runs, but please note it was prepared some time ago and in a hurry for newcomers).

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_cond_t var;
static pthread_mutex_t mtx;

unsigned event_flags = 0;
#define FLAG_EVENT_1    1
#define FLAG_EVENT_2    2

void signal_1()
{
    pthread_mutex_lock(&mtx);
    event_flags |= FLAG_EVENT_1;
    pthread_cond_signal(&var);
    pthread_mutex_unlock(&mtx);
}

void signal_2()
{
    pthread_mutex_lock(&mtx);
    event_flags |= FLAG_EVENT_2;
    pthread_cond_signal(&var);
    pthread_mutex_unlock(&mtx);
}

void* handler(void* arg)
{
    (void)arg;  // unused

    // Mutex is unlocked only when we wait or process received events.
    pthread_mutex_lock(&mtx);

    // Here should be race-condition prevention in real code.

    while(1)
    {
        if (event_flags)
        {
            unsigned copy = event_flags;
            // Consume the pending events while we still hold the mutex;
            // otherwise the next iteration would process the same flags again.
            event_flags = 0;

            // We unlock mutex while we are processing received events.
            pthread_mutex_unlock(&mtx);

            if (copy & FLAG_EVENT_1)
            {
                printf("EVENT 1\n");
                copy ^= FLAG_EVENT_1;
            }

            if (copy & FLAG_EVENT_2)
            {
                printf("EVENT 2\n");
                copy ^= FLAG_EVENT_2;

                // And let EVENT 2 be the 'quit' signal.
                // For consistency we break out of the loop with the mutex locked.
                pthread_mutex_lock(&mtx);
                break;
            }

            // Note we should have mutex locked at the iteration end.
            pthread_mutex_lock(&mtx);
        }
        else
        {
            // Mutex is locked. It is unlocked while we are waiting.
            pthread_cond_wait(&var, &mtx);
            // Mutex is locked.
        }
    }

    // ... as we are dying.
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int main()
{
    pthread_mutex_init(&mtx, NULL);
    pthread_cond_init(&var, NULL);

    pthread_t id;
    pthread_create(&id, NULL, handler, NULL);
    sleep(1);

    signal_1();
    sleep(1);
    signal_1();
    sleep(1);
    signal_2();
    sleep(1);

    pthread_join(id, NULL);
    return 0;
}
Roman Nikitchenko
  • This is a sensible answer but unfortunately the semantics are different. If I poll on a file for example, and there are 10 bytes written to that file before I wake up, then when I wake up I discover that 10 bytes were written. Under this scheme, if an event happens ten times before I wake up, I only learn about the last one. – Joseph Garvin Jun 03 '10 at 17:03
  • You could try to extend the scheme so that instead of event flags there is a list of event flags, and the reader thread keeps track of where it is in the list, but that doesn't scale to multiple threads -- how do you know when you can delete elements of the list? In order for it to be fast you now need a lockless linked list and reference counted implementation. Probably still faster than waiting a scheduler quantum but far from ideal... – Joseph Garvin Jun 03 '10 at 17:05
  • Instead of copying the event flags you can 'pop' from an event list. The list stays unchanged until you release mtx (of course, only if every modification happens under the same mutex). This is one of the biggest advantages of this scheme. You can for example use an event queue and yes, that queue is protected. While the 'reader' checks it and 'pops', anybody wanting to 'push' will wait for a short time. But please note you should NOT process events under the lock, only 'extract' them. – Roman Nikitchenko Jun 19 '10 at 23:23
  • I didn't end up using this answer because it's still not quite right for my app, but for the info available in the question this is the right way to go :) – Joseph Garvin Jun 08 '11 at 03:44
  • Isn't this answer missing a `pthread_mutex_lock(&mtx);` after processing the received events? – jotik Mar 14 '16 at 10:10
  • @jotik Yes, you are right. The answer was written as a concept. Original code was a little bit different with queue of events, not just flags. – Roman Nikitchenko Mar 14 '16 at 13:57
  • What is the point of copy if you don't use it? Maybe you meant to take a copy and unlock the mutex and then do ops on that copy? – Martin Jan 12 '17 at 14:35

If you want maximum flexibility under the POSIX condition variable model of synchronization, you must avoid writing modules which communicate events to their users only by means of exposing a condition variable. (You have then essentially reinvented a semaphore.)

Active modules should be designed such that their interfaces provide callback notifications of events, via registered functions: and, if necessary, such that multiple callbacks can be registered.

A client of multiple modules registers a callback with each of them. These can all be routed into a common place where they lock the same mutex, change some state, unlock, and hit the same condition variable.

This design also offers the possibility that, if the amount of work done in response to an event is reasonably small, perhaps it can just be done in the context of the callback.

Callbacks also have some advantages in debugging. You can put a breakpoint on an event which arrives in the form of a callback, and see the call stack of how it was generated. If you put a breakpoint on an event that arrives as a semaphore wakeup, or via some message passing mechanism, the call trace doesn't reveal the origin of the event.


That being said, you can make your own synchronization primitives with mutexes and condition variables which support waiting on multiple objects. These synchronization primitives can be internally based on callbacks, in a way that is invisible to the rest of the application.

The gist of it is that for each object that a thread wants to wait on, the wait operation queues a callback interface with that object. When an object is signaled, it invokes all of its registered callbacks. The woken threads dequeue all the callback interfaces, and peek at some status flags in each one to see which objects signaled.
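
For the sake of illustration, here is a very rough sketch of that idea in pthreads terms (all the names -- waiter, source, and so on -- are invented for this example, and registration/cleanup are omitted):

#include <pthread.h>
#include <stdbool.h>

#define MAX_SOURCES 8

// One per waiting thread: a private mutex/condvar pair plus per-source flags.
struct waiter {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool fired[MAX_SOURCES];
};

// One per waitable object. A real implementation would keep a list of
// registered waiters; a single pointer keeps the sketch short.
struct source {
    pthread_mutex_t lock;
    struct waiter  *registered;
    int             id;
};

// The "callback" a source invokes when it is signaled: mark the flag for
// this source and wake the waiter's condition variable.
void source_signal(struct source *s)
{
    pthread_mutex_lock(&s->lock);
    struct waiter *w = s->registered;
    if (w) {
        pthread_mutex_lock(&w->lock);
        w->fired[s->id] = true;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->lock);
    }
    pthread_mutex_unlock(&s->lock);
}

// Block until any registered source has fired; return its id.
int waiter_wait_any(struct waiter *w)
{
    pthread_mutex_lock(&w->lock);
    for (;;) {
        for (int i = 0; i < MAX_SOURCES; i++) {
            if (w->fired[i]) {
                w->fired[i] = false;
                pthread_mutex_unlock(&w->lock);
                return i;
            }
        }
        pthread_cond_wait(&w->cond, &w->lock);
    }
}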

Kaz

For waiting on multiple condition variables, there is an implementation for Solaris that you could port to Linux if you're interested: WaitFor API

Alex Bitek
Enforcer