As you say yourself, you could use thread cancellation to solve this.
Outside of thread cancellation, I don't think there's a "right" way to solve this within POSIX (waking up the poll call with a write isn't exactly a generic method that would work for every situation in which a thread might get blocked), because POSIX's paradigm for making syscalls and handling signals simply doesn't allow you to close the gap between a flag check and a potentially long blocking call.
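    #include <signal.h>

    volatile sig_atomic_t dont_enter_a_long_blocking_call_flg;

    void handler(int sig) { (void)sig; dont_enter_a_long_blocking_call_flg = 1; }

    int main(void)
    {
        //...
        if (!dont_enter_a_long_blocking_call_flg)
            //THE GAP; what if the signal arrives here?
            potentially_long_blocking_call();
        //...
    }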
The musl libc library uses signals for thread cancellation (because signals can break long-blocking calls that are in kernel mode), and it uses them in conjunction with global assembly labels so that, from the flag-setting SIGCANCEL handler, it can do the following (conceptually; I'm not pasting their actual code):
    void sigcancel_handler(int Sig, siginfo_t *Info, void *Uctx)
    {
        thread_local_cancellation_flag = 1;
        if_interrupted_the_gap_move_Program_Counter_to_start_cancellation(Uctx);
    }
Now if you changed if_interrupted_the_gap_move_Program_Counter_to_start_cancellation(Uctx); to if_interrupted_the_gap_move_Program_Counter_to_make_the_syscall_fail(Uctx); and exported the if_interrupted_the_gap_move_Program_Counter_to_make_the_syscall_fail function along with the thread_local_cancellation_flag, then you could use it to*:
- solve your problem robustly
- implement robust cancellation with any signal, without having to put any of that pthread_cleanup_{push,pop} stuff into your already working, thread-safe, single-threaded code
- ensure an assured normal-context reaction to a signal delivery in your target thread, even if the signal is handled.
Basically, without a libc extension like this, once you kill()/pthread_kill() a process/thread with a signal it handles, or put a function on a signal-sending timer, you cannot be sure of a reaction to the signal delivery: the target may well receive the signal in a gap like the one above and hang indefinitely instead of responding to it.
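To make that hang concrete, here is a minimal sketch of the classical timer pattern against a stock libc. The names read_with_timeout, on_alarm, and timed_out are made up for illustration; they are not part of any libc or of my extension. It usually works, and the comments mark the window where it doesn't.

```c
#define _XOPEN_SOURCE 700  /* for sigaction() and setitimer() under strict -std=c11 */
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t timed_out;  /* hypothetical flag, set by the handler */

static void on_alarm(int sig) { (void)sig; timed_out = 1; }

/* Arm a one-shot timer, then block in read().
   Returns read()'s result; on timeout, -1 with errno == EINTR.
   RACY: if SIGALRM is delivered after the flag check but before read()
   has entered the kernel, the handler only sets the flag and nothing
   is left to interrupt the blocking read(). */
ssize_t read_with_timeout(int fd, void *buf, size_t len, long timeout_ms)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;  /* no SA_RESTART, so read() fails with EINTR */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    timed_out = 0;
    struct itimerval tv;
    memset(&tv, 0, sizeof tv);
    tv.it_value.tv_sec = timeout_ms / 1000;
    tv.it_value.tv_usec = (timeout_ms % 1000) * 1000;
    if (setitimer(ITIMER_REAL, &tv, NULL) != 0)
        return -1;

    ssize_t n = -1;
    if (timed_out)
        errno = EINTR;             /* the timer fired before we got here */
    else
        /* THE GAP sits between the check above and the call below */
        n = read(fd, buf, len);

    int saved_errno = errno;
    memset(&tv, 0, sizeof tv);
    setitimer(ITIMER_REAL, &tv, NULL);  /* disarm the timer */
    errno = saved_errno;
    return n;
}
```

With an empty pipe and a 100 ms timeout, this normally returns -1 with errno set to EINTR. If SIGALRM lands in the marked gap instead, the call blocks until data arrives, which may be never.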
I've implemented such a libc extension on top of musl libc and have now published it at https://github.com/pskocik/musl. The SIGNAL_EXAMPLES directory also shows some kill(), pthread_kill(), and setitimer() examples that, under a demonstrated race condition, hang with classical libcs but don't with my extended musl. You can use that extended musl to solve your problem cleanly, and I also use it in my personal project to do robust thread cancellation without having to litter my code with pthread_cleanup_{push,pop}.
The obvious downside of this approach is that it's unportable and I only have it implemented for x86_64 musl. I've published it today in the hope that somebody (Cygwin, MacOSX?) copies it, because I think it's the right way to do cancellation in C.
In C++ and with glibc, you could utilize the fact that glibc uses exceptions to implement thread cancellation and simply use pthread_cancel (which uses a signal (SIGCANCEL) underneath) but catch it instead of letting it kill the thread.
Note:
I'm really using two thread-local flags: a breaking flag and a saved breaking flag. The breaking flag breaks the next syscall with ECANCELED if it is set before the syscall is entered (an EINTR returned from a potentially long-blocking syscall gets turned into ECANCELED in the modified libc-provided syscall wrapper iff the breaking flag is set). The moment the breaking flag has been used, it's saved in the saved breaking flag and zeroed so that it doesn't break further potentially long-blocking syscalls.
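The bookkeeping of the two flags can be modelled in plain C roughly like this. This is only a sketch of the flag handling, not the code from my repository: model_read, breaking_flag, and saved_breaking_flag are hypothetical names, and a plain C wrapper like this still contains the gap, which the real thing closes from the signal handler with the help of assembly labels.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical per-thread state; a cancelling signal handler would set breaking_flag. */
static _Thread_local volatile sig_atomic_t breaking_flag;
static _Thread_local volatile sig_atomic_t saved_breaking_flag;

/* Record that a break was consumed and re-arm normal syscall entry. */
static void consume_break(void)
{
    saved_breaking_flag = 1;
    breaking_flag = 0;
    errno = ECANCELED;
}

/* Model of a wrapper around one potentially long-blocking syscall. */
ssize_t model_read(int fd, void *buf, size_t len)
{
    if (breaking_flag) {        /* flag set before entry: do not block at all */
        consume_break();
        return -1;
    }
    /* In this model, a signal arriving right here is still lost. */
    ssize_t n = read(fd, buf, len);
    if (n < 0 && errno == EINTR && breaking_flag)
        consume_break();        /* EINTR becomes ECANCELED iff the flag is set */
    return n;
}
```

After one call has failed with ECANCELED, the breaking flag is back at zero, so the next call (for example, one made by cleanup code) enters the syscall normally.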
The idea is that cancelling signals are handled one at a time (the signal handler can be left with all/most signals blocked; the handler code (if any) can then unblock them) and that correctly checking code starts unwinding, i.e., cleaning up while returning errors, the moment it sees an ECANCELED. Then, the next potentially long-blocking syscall could be in the cleanup code (e.g., code that writes </html> to a socket), and that syscall must be enterable (if the breaking flag stayed on, it wouldn't be). Of course, cleanup code with e.g. write(1,"</html>",...) in it could block indefinitely too, but you could write the cleanup code so that the potentially long-blocking syscall there runs under a timer when the cleanup is due to an error (ECANCELED is an error). As I've already mentioned, robust, race-condition-free, signal-driven timers are one of the things this extension allows.
The EINTR => ECANCELED translation happens so that code looping on EINTR knows when to stop looping: many EINTRs (EINTR = a signal interrupted a syscall) cannot be prevented, and the code should simply handle them by retrying the syscall. I'm using ECANCELED as an "EINTR after which you shouldn't retry."
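In calling code, that convention could look roughly like the loop below. write_all is a hypothetical helper, not something from my extension; with a stock libc, write() will not fail with ECANCELED because of a signal, so that path only matters under a libc that does the translation described above.

```c
#include <errno.h>
#include <unistd.h>

/* Write all of buf, retrying on plain EINTR.
   Returns 0 on success, -1 with errno set on failure. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* unrelated signal: just retry */
            return -1;      /* any other error, ECANCELED included: stop and unwind */
        }
        p += n;             /* short write: advance and keep going */
        len -= (size_t)n;
    }
    return 0;
}
```

The caller treats the -1 like any other error and starts cleaning up; no special case for ECANCELED is needed beyond not retrying it.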