0

I have a pool of threads (QueueWorkers class) in my program that are released using this logic:

int QueueWorkers::stop()
{
  for (unsigned int ix = 0; ix < threadIds.size(); ++ix)
  {
    pthread_cancel(threadIds[ix]);
    pthread_join(threadIds[ix], NULL);
  }

  return 0;
}

where threadIds is a class variable of type std::vector<pthread_t>.

This logic works most of the times but I have checked testing that it fails with some probability. In particular, sometimes after the execution of pthread_cancel the pthread_join statement in the next line never returns and my program hangs.

As far as I understand until now, using pthread_join on a cancelled thread should always return. Are there any circumstances that could be avoiding this or any way of debugging what can be going on here? Is my approach to release threads upon termination the right one?

Additional information: Threads have a cancellation handler (registered using pthread_cleanup_push) which frees dynamic memory used by the thread to avoid leaks. Under normal circumstances, the handler is called upon pthread_cancel and works fine, but the time pthread_join fails returning I have checked that the cancellation handler is not invoked.

Thanks in advance!

EDIT: as suggested in question comments, I have modified my code to check the returned value of pthread_cancel. It's always 0, no matter if after that pthread_join works as expected or not.

EDIT2: as requested in some comment to this question, let me provide more detail of how it works.

The pool of threads is initialized by the start() method:

int QueueWorkers::start()
{
  // numberOfThreads and pQueue are class variables
  for (int i = 0; i < numberOfThreads; ++i)
  {
    pthread_t  tid;
    pthread_create(&tid, NULL, workerFunc, pQueue);  
    threadIds.push_back(tid);
  }

  return 0;
}

The start function workerFunc() is as follows (simplified):

static void* workerFunc(void* pQueue)
{
  // Initialize some dynamic objects (Foo for simplification)
  Foo* foo = initFoo();

  // Set pthread_cancel handler
  pthread_cleanup_push(workerFinishes, foo);

  // Loop forever
  for (;;)
  {
    // Wait for new item to process on pQueue
    ... paramsV = ((Queue*) pQueue)->pop();

    // Then process it
    ...
  }

  // Next statemement never executes but compilation breaks without it. See this note in pthread.h:
  // "pthread_cleanup_push and pthread_cleanup_pop are macros and must always be used in
  // matching pairs at the same nesting level of braces".
  pthread_cleanup_pop(0);
}

Note the pthread_cleanup_push() statement before starting the ethernal loop. This is done to implement the cleanup logic upon cancellation for the Foo object:

static void workerFinishes(void* curl)
{
  freeFoo((Foo*) curl);
}

I hope not having over-simplified the code. In any case, you can see the original version here.

fgalan
  • 11,732
  • 9
  • 46
  • 89
  • Have you checked if the function's cancellable type is set to asynchronous or if it is calling a cancellable point function? More details - https://man7.org/linux/man-pages/man3/pthread_cancel.3.html – Ajay Brahmakshatriya Jun 07 '21 at 13:37
  • Can you check the returned value of `pthread_cancel`? Is maybe the case that your thread is waiting on an uninterruptible I/O operation? – Jack Jun 07 '21 at 13:37
  • The only way to guarantee program exit is to have a thread sleep and `raise(SIGKILL)` or something similar. There is always the chance a thread state is corrupted so badly that it will never exit. – Zan Lynx Jun 07 '21 at 14:29
  • Thanks for the suggestions. I have edited my question post to include additional information. – fgalan Jun 07 '21 at 19:15
  • We need enough code to replicate the problem. For all we know, the code for the thread you are trying to cancel doesn't support cancellation. – David Schwartz Jun 07 '21 at 19:49
  • I have added more detail (code) in EDIT2. Thanks for the feedback! – fgalan Jun 07 '21 at 21:00

1 Answers1

2

Are sure the thread is in a cancelation or your thread cancelation_type is asynchronous?

From man of pthread_cancel:

A thread's cancellation type, determined by pthread_setcanceltype(3), may be either asynchronous or deferred (the default for new threads). Asynchronous cancelability means that the thread can be canceled at any time (usually immediately, but the system does not guarantee this). Deferred cancelability means that cancellation will be delayed until the thread next calls a function that is a cancellation point. A list of functions that are or may be cancellation points is provided in pthreads(7).

I don't think canceling threads is the best ways to make sure that a thread will finish. Perhaps you can send the thread a message that it should stop and make sure the thread does receive the message and will handle it.

silverfox
  • 1,568
  • 10
  • 27
hetepeperfan
  • 4,292
  • 1
  • 29
  • 47
  • No need to soft-pedal -- cancelling threads is in fact a *terrible* way to try to make them terminate. – John Bollinger Jun 08 '21 at 01:30
  • Thanks for the answer! After setting PTHREAD_CANCEL_ASYNCHRONOUS with pthread_setcanceltype() the pthread_join statement never blocks. Thus, from the point of view of my question here, the answer is fully correct. Although I agree that this is not a good solution (an issue has been created in my project about that https://github.com/telefonicaid/fiware-orion/issues/3877) – fgalan Jun 08 '21 at 14:46
  • At the end, I have improved my code to implement a thread termination logic based in message passing instead of phtread_cancellation :) https://github.com/telefonicaid/fiware-orion/pull/3882 – fgalan Jun 11 '21 at 10:00
  • I hope this works out for you. In your stop function, you can run the loop twice, in the first loop you send all threads the *stop* message, and in the second loop join them all. Then they can stop asynchronously. It seemed you already had everything up and running and only needed to push a stop message on your queue to break out of the worker loop. Feels much cleaner to me :) – hetepeperfan Jun 11 '21 at 10:20