What's a good strategy for clean/reliable shutdown of threads that use pthread barriers for synchronization?

Question

I've got a pthread-based multithreaded program that has four threads indefinitely executing this run-loop (pseudocode):

while(keepRunning)
{
   pthread_barrier_wait(&g_stage_one_barrier);

   UpdateThisThreadsStateVariables();  

   pthread_barrier_wait(&g_stage_two_barrier);

   DoComputationsThatReadFromAllThreadsStateVariables();
}

This works pretty well, in that during stage one each thread updates its own state variables, and that's okay because no other thread is reading any other thread's state variables during stage one. Then during stage two it's a free-for-all as far as threads reading each others' state is concerned, but that's okay because during stage two no thread is modifying its local state variables, so they are effectively read-only.

My only remaining problem is, how do I cleanly and reliably shut down these threads when it's time for my application to quit? (By "cleanly and reliably", I mean without introducing potential deadlocks or race conditions, and ideally without having to send any UNIX-signals to force threads out of a pthread_barrier_wait() call)

My main() thread can of course set keepRunning to false for each thread, but then how does it get pthread_barrier_wait() to return for each thread? AFAICT the only way to get pthread_barrier_wait() to return is to have all four threads' execution-location inside pthread_barrier_wait() simultaneously, but that's difficult to do when some threads may have exited already.

Calling pthread_barrier_destroy() seems like what I'd want to do, but it's undefined behavior to do that while any threads might be waiting on the barrier.

Is there a good known solution to this problem?

What is the two 'wait' events that every thread is waiting for on each loop? Perhaps have the main() function simply reset the 'keeprunning' variable and repeatedly trigger each of the 'wait' events. — user3629249, Mar 03 '15 at 23:56

Ulfalizer · Accepted Answer · 2015-03-04T15:25:33.393

2

Having two flags and using something like the following should work:

for (;;)
{
    pthread_barrier_wait(&g_stage_one_barrier);           +
                                                          |
    UpdateThisThreadsStateVariables();                    |
                                                          |
    pthread_mutex_lock(&shutdownMtx);                     | Zone 1
    pendingShutdown = !keepRunning;                       |
    pthread_mutex_unlock(&shutdownMtx);                   |
                                                          |
    pthread_barrier_wait(&g_stage_two_barrier);           +
                                                          |
    if (pendingShutdown)                                  |
        break;                                            | Zone 2
                                                          |
    DoComputationsThatReadFromAllThreadsStateVariables(); |
}

shutdownMtx should protect the setting of keepRunning too, though it's not shown.

The logic is that by the time pendingShutdown gets set to true, all the threads must be within Zone 1. (This is true even if only some of the threads saw keepRunning being false, so races on keepRunning should be okay.) It follows that they will all reach pthread_barrier_wait(&g_stage_two_barrier), and then all break out when they enter Zone 2.

It would also be possible to check for PTHREAD_BARRIER_SERIAL_THREAD -- which is returned by pthread_barrier_wait() for exactly one of the threads -- and only do the locking and updating of pendingShutdown in that thread, which could improve performance.

edited Mar 04 '15 at 15:25

answered Mar 04 '15 at 00:26

Ulfalizer

4,664
1
21
30

Racing condition: different threads may get different values from `keepRunning` at the same iteration. – Valeri Atamaniouk Mar 04 '15 at 00:31
Ah, I was assuming `keepRunning` was a global "shut down process" flag. Perhaps that was a misunderstanding. If `pendingShutdown` and `keepRunning` are globals it shouldn't matter if different threads see different values, since the invariant that all threads must be in *Zone 1* if *some* thread sets `pendingShutdown` to `true` should hold. (`keepRunning` and `pendingShutdown` might need to be protected by mutexes too.) – Ulfalizer Mar 04 '15 at 00:32
Yes, there is a race with global variable. There is no guarantee the value is changed when there are no threads splitted before and after the test. The approach mandates that all the threads are either before or after the test, which is not achieved here. – Valeri Atamaniouk Mar 04 '15 at 00:35
Hmm... could you give a concrete example of when this might break? – Ulfalizer Mar 04 '15 at 00:37
2 threads: one has already reached stage 2 and second one is doing first update. You change the flag, the second thread picks up exit value, enters the second barrier, both threadx leave it. Second thread exits and the first one block in the first barrier, as it has missed the flag update. – Valeri Atamaniouk Mar 04 '15 at 00:40
What do you mean by "has already reached stage 2"? That it's waiting in `pthread_barrier_wait(&g_stage_two_barrier)`? In that case, shouldn't it wait there until the second thread gets there too, and then pick up that `pendingShutdown` is true once it leaves the barrier? – Ulfalizer Mar 04 '15 at 00:44
Declaring `pendingShutdown` and `keepRunning` as `volatile` and adding simple critical section around `pendingShutdown != keepRunning` resolves all possible race conditions. – Valeri Atamaniouk Mar 04 '15 at 11:56
@Ulfalizer: Looks fine to me. The main advantage of doing it in just one thread is that otherwise the `shutdownMtx` will bounce between CPU caches. – caf Mar 04 '15 at 13:25
@Ulfalizer: I've delete my comments that no longer apply to this version, I suggest you do the same. – caf Oct 13 '15 at 03:28

caf · Answer 2 · 2015-03-04T04:56:17.607

You could have an additional thread that synchronises on the same barriers, but only exists as a "shutdown master". Your worker threads would use the exact code that you have in your question, and the "shutdown master" thread would do:

while (keepRunning)
{
    pthread_barrier_wait(&g_stage_one_barrier);

    pthread_mutex_lock(&mkr_lock);
    if (!mainKeepRunning)
        keepRunning = 0;
    pthread_mutex_unlock(&mkr_lock);

    pthread_barrier_wait(&g_stage_two_barrier);
}

When the main thread wants the other threads to shut down, it would just do:

pthread_mutex_lock(&mkr_lock);
mainKeepRunning = 0;
pthread_mutex_unlock(&mkr_lock);

(ie. the keepRunning variable becomes part of the shared thread state that is read-only during stage 2, and is owned by the shutdown master thread during stage 1).

Of course, you can also just pick one of your other threads to be the "shutdown master thread" rather than using a dedicated thread for that purpose.

score 1 · Answer 3 · answered Mar 04 '15 at 00:27

There is a conflict of requirements: barrier semantics require all threads to be in to continue, and shutdown requires termination when threads are shared between execution blocks (could be inside different barriers).

I suggest to replace barrier with a custom implementation that would support extern cancel call.

Example (may not run, but the idea...):

struct _barrier_entry
{
  pthread_cond_t cond;
  volatile bool released;
  volatile struct _barrier_entry *next;
};

typedef struct
{
  volatile int capacity;
  volatile int count;
  volatile struct _barrier_entry *first;
  pthread_mutex_t lock;
} custom_barrier_t;

Initialization:

int custom_barrier_init(custom_barrier_t *barrier, int capacity)
{
   if (NULL == barrier || capacity <= 0)
   {
     errno = EINVAL;
     return -1;
   }
   barrier->capacity = capacity;
   barrier->count = 0;
   barrier->first = NULL;
   return pthread_mutex_init(&barrier->lock, NULL);
   return -1;
}

Helper:

static void _custom_barrier_flush(custom_barrier_t *barrier)
{
   struct _barrier_entry *ptr;
   for (ptr = barrier->first; NULL != ptr;)
   {
     struct _barrier_entry *next = ptr->next;
     ptr->released = true;
     pthread_cond_signal(&ptr->cond);
     ptr = next;
   }
   barrier->first = NULL;
   barrier->count = 0;
}

Blocking wait:

int custom_barrier_wait(custom_barrier_t *barrier)
{
   struct _barrier_entry entry;
   int result;
   pthread_cond_init(&barrier->entry, NULL);
   entry->next = NULL;
   entry->released = false;

   pthread_mutex_lock(&barrier->lock);
   barrier->count++;
   if (barrier->count == barrier->capacity)
   {
     _custom_barrier_flush(barrier);
     result = 0;
   }
   else
   {
     entry->next = barrier->first;
     barrier->first = entry;
     while (true)
     {
       pthread_cond_wait(&entry->cond, &barrier->lock);
       if (entry->released)
       {
         result = 0;
         break;
       }
       if (barrier->capacity < 0)
       {
         errno = ECANCELLED;
         result = -1;
         break;
       }
     }
   }
   pthread_mutex_unlock(&barrier->lock);
   pthread_cond_destroy(&entry->cond);
   return result;
}

Cancellation:

 int custom_barrier_cancel(custom_barrier_t *barrier)
 {
   pthread_mutex_lock(barrier->lock);
   barrier->capacity = -1;
   _custom_barrier_flush(barrier);
   pthread_mutex_unlock(barrier->lock);
   return 0;
 }

So the thread code can run in the loop, until it gets ECANCELLED error after custom_barrier_wait call.

user3386109 · Answer 4 · 2015-03-04T01:21:56.203

The threads that are waiting at the barriers are not the issue, it's the threads that are still running UpdateThis... or DoComputations... that will delay the shutdown. You can reduce the shutdown time by periodically checking for shutdown inside the UpdateThis... and DoComputations... functions.

Here's the outline of one possible solution

main initializes a mutex g_shutdown_mutex
main locks the mutex
main launches the threads
the threads do their thing while periodically trying to lock the mutex, but since main has the mutex locked, the trylock function will always fail
when it's time to shutdown, main unlocks the mutex
now the trylock will succeed and the worker functions will return early
before reaching the second barrier, any thread that successfully locks the mutex sets a global variable g_shutdown_requested
after passing the second barrier, all the threads will see the same value in g_shutdown_requested and make the same decision whether to exit or not

So the while loop looks like this

while(1)
{
    pthread_barrier_wait(&g_stage_one_barrier);

    UpdateThisThreadsStateVariables();

    if ( pthread_mutex_trylock( &g_shutdown_mutex ) == 0 )
    {
        g_shutdown_requested = true;
        pthread_mutex_unlock( &g_shutdown_mutex );
        break;
    }

    pthread_barrier_wait(&g_stage_two_barrier);

    if ( g_shutdown_requested )
        break;

    DoComputationsThatReadFromAllThreadsStateVariables();
}

And the worker functions look like this

void UpdateThisThreadsStateVariables( void )
{
    for ( i = 0;; i++ )
    {
        // check the mutex once every 4000 times through the loop
        if ( (i & 0xfff) == 0 && pthread_mutex_trylock( &g_shutdown_mutex ) == 0 )
        {
            pthread_mutex_unlock( &g_shutdown_mutex );   // abnormal termination
            return; 
        }

        // do the important stuff here

        if ( doneWithTheImportantStuff )    // normal termination
            break;
    }
}

This code would deadlock when `main` unlocks mutex after one of the thread manages to pass locking test, but others do not. — Valeri Atamaniouk, Mar 04 '15 at 00:02
@ValeriAtamaniouk Any thread that passes the locking test immediately unlocks the mutex, so no, there is no deadlock. Perhaps that was not clear in the outline, but it should be clear from the code. The `mutex` is just a signal to the threads that they should shutdown (instead of the `keepRunning` variable). None of the threads will keep the mutex locked. — user3386109, Mar 04 '15 at 00:14
Still, if there are 4 total threads, 3 threads have failed to lock the mutex and continued, and one locks it, then 3 would end in indefinite wait inside first barrier block, and one would exit. — Valeri Atamaniouk, Mar 04 '15 at 00:29
@ValeriAtamaniouk Yes, I see now thanks! After the threads pass the second barrier, they must all make the same decision to either continue or exit. My code didn't guarantee that. I'll update the code. — user3386109, Mar 04 '15 at 01:13

What's a good strategy for clean/reliable shutdown of threads that use pthread barriers for synchronization?

4 Answers4