5

Previously I had asked a question regarding how to terminate thread blocked for I/O. I have used pthread_kill() instead of pthread_cancel() or writing to pipes, considering few advantages. After implementing the code to send signal (SIGUSR2) to the target thread using pthread_kill(), initially it worked fine and didn't face any issues.

static void signalHandler(int signum) {
    LogInfo("SIGNAL HANDLER : %d",signum);  //print info message using logger utility
}

void setSignalHandler() {
    struct sigaction actions;

    memset(&actions, 0, sizeof(actions));
    sigemptyset(&actions.sa_mask);
    actions.sa_flags = 0;
    actions.sa_handler = signalHandler;

    int rc = sigaction(SIGUSR2,&actions,NULL);
    if(rc < 0) {
        LogError("Error in sigaction");  //print error using logger utility
    }
}

void stopThread() {
    fStop = 1;  //set the flag to stop the thread
    if( ftid ) { //ftid is pthread_t variable of thread that needs to be terminated.
        LogInfo("Before pthread_kill()");
        int kill_result = pthread_kill(ftid,SIGUSR2);  // send signal to interrupt the thread if blocked for poll() I/O
        LogInfo("pthread_kill() returned : %d ( %s )",kill_result,strerror(kill_result));
        int result = pthread_join( ftid, NULL );
        LogInfo("Event loop stopped for %s", fStreamName);
        LogInfo( "pthread_join() -> %d ( %s )\n", result, strerror(result) );
        if( result == 0 ) ftid = 0;
    }
}

Recently I have noticed a deadlock issue where few threads (which tried to log statements) blocked with below stacktrace

#0  0x00007fc1b2dc565c in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007fc1b2d70f0c in _L_lock_2462 () from /lib64/libc.so.6
#2  0x00007fc1b2d70d47 in __tz_convert () from /lib64/libc.so.6
#3  0x00000000004708f7 in Logger::getCurrentTimestamp(char*) ()

getCurrentTimestamp(char*) function is called from our logger module when LogInfo() or LogError() functions are invoked. This function internally calls gettimeofday() to print the current time in logs. So I'm suspecting LogInfo() function (which calls gettimeofday()) called inside signalHandler() might be causing the deadlock issue. But i'm not confident of this as this issue is not getting reproduced frequently and i did not get any information stating that gettimeofday() is not async signal safe.

gettimeofday() is thread safe and i'm considering it's not async signal safe as it was not listed under async signal safe functions.

1) Can I consider gettimeofday() is not async signal safe and it was the root cause of deadlock ?

2) Are there any known issues where use of pthread_kill() might cause deadlock ?

EDIT :

Below is the definition of getCurrentTimeStamp() function

char* Logger::getCurrentTimestamp ( char *tm_buff ) {
    char curr_time[ 32 ] = { 0 };
    time_t current = time( NULL );
    struct timeval detail_time;

    if ( tm_buff ) {
        strftime( curr_time, sizeof(curr_time), "%Y-%m-%d-%H:%M:%S", localtime( &current ) );
        gettimeofday( &detail_time, NULL );
        sprintf( tm_buff, "%s:%03d", curr_time, (int) (detail_time.tv_usec / 1000) );
        return (tm_buff);
    }
    return (NULL);
}

stack trace of the thread that was interrupted by signal.

#0  0x00007fc1b2dc565c in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007fc1b2d70f0c in _L_lock_2462 () from /lib64/libc.so.6
#2  0x00007fc1b2d70d47 in __tz_convert () from /lib64/libc.so.6
#3  0x00000000004708f7 in Logger::getCurrentTimestamp(char*) ()
#4  0x000000000046e80f in Log::write(Logger*, LogType, std::string const&, char const*) ()
#5  0x000000000046df74 in Log::i(Logger*, char const*, std::string const&, std::string const&, char const*, ...) ()
#6  0x0000000000456b67 in ?? ()
#7  <signal handler called>
#8  0x00007fc1b2da8dc5 in _xstat () from /lib64/libc.so.6
#9  0x00007fc1b2d71056 in __tzfile_read () from /lib64/libc.so.6
#10 0x00007fc1b2d703b4 in tzset_internal () from /lib64/libc.so.6
#11 0x00007fc1b2d70d63 in __tz_convert () from /lib64/libc.so.6
#12 0x00000000004708f7 in Logger::getCurrentTimestamp(char*) ()
#13 0x000000000047030e in Logger::writeLog(int, std::string&, std::string&) ()
#14 0x0000000000470633 in Logger::Write(Logger*, int, std::string, std::string const&) ()
#15 0x000000000046eb02 in Log::write(Logger*, LogType, std::string const&, char const*) ()
#16 0x000000000046df74 in Log::i(Logger*, char const*, std::string const&, std::string const&, char const*, ...) ()
Durgesh Tanuku
  • 1,124
  • 2
  • 10
  • 26
  • 2
    You might also want to check rest of the implementation of `LogError` to verify that it is async safe completely. Hint: `printf`, `fprintf`, `sprintf` or any C++ streaming API won't be async safe. The POSIX `write` function is safe. – eerorika Nov 05 '18 at 15:45
  • @user2079303 `s[n]printf()` can be async-signal-safe. It's not mandated by POSIX, but some implementations of `s[n]printf()` are async-signal-safe, such as Solaris. So it can be possible to use `s[n]printf()` to create a string and then use `write()` to emit the string. – Andrew Henle Nov 05 '18 at 17:28
  • If you're wondering how `s[n]printf` could _not_ be async-signal-safe, any C library function that isn't _mandated_ to be async-signal-safe is allowed to use `malloc` internally for scratch space (and I can attest that GNU libc's `*printf` implementation definitely does do that). – zwol Nov 05 '18 at 17:32
  • @AndrewHenle but it's not portable to use `s[n]printf` because not all systems implement it async safely. What would be safe is to copy the source of `s[n]printf` from a system that does implement it async safely, compile it with a different name, and use that instead of the system provided one. First one must verify that copying the source is license-safe operation :) – eerorika Nov 05 '18 at 17:35
  • @zwol [glibc does much worse than just merely using `malloc()`.](https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/vsnprintf.c;h=39b55005280a8a8323b5747d2405c16e61a97af2;hb=HEAD). That's the underlying implementation of glibc's `snprintf()`. It uses a static memory buffer probably obtained with something like `fmemopen()` - with locks. So it's not only non-async-signal-safe, it'll contend with itself if even just *used* from multiple threads. Ooof, that's ugly. – Andrew Henle Nov 05 '18 at 17:43
  • @AndrewHenle I don't think it's quite that bad, there's a separate `_IO_strnfile` object for every call to snprintf, allocated from snprintf's stack frame. – zwol Nov 05 '18 at 18:04

2 Answers2

5

gettimeofday itself is not signal safe, as you already found out.

However, "POSIX.1-2008 marks gettimeofday() as obsolete, recommending the use of clock_gettime(2) instead."(source) and clock_gettime is signal-safe (and also thread-safe), so that might be an alternative for you.

hellow
  • 12,430
  • 7
  • 56
  • 79
3

gettimeofday is not defined as async-signal-safe, but if you pass 0 for the second (timezone) argument, it doesn't have to do anything that time and clock_gettime (both of which are officially async-signal-safe) don't also do, so as a matter of QoI it should be async-signal-safe in that case.

Your deadlock is inside the internal libc function __tz_convert.

#2  0x00007fc1b2d70d47 in __tz_convert () from /lib64/libc.so.6
#3  0x00000000004708f7 in Logger::getCurrentTimestamp(char*) ()

It appears to have been called directly from Logger::getCurrentTimestamp, but that's because it was "tail called" from a documented API function. There are only four functions in GNU libc that call __tz_convert (I grepped the source code): localtime, localtime_r, gmtime, and gmtime_r. Therefore, your problem is not that you are calling gettimeofday, but that you are calling one of those functions.

localtime and gmtime are obviously not async-signal-safe since they write to a global variable. localtime_r and gmtime_r are not async-signal-safe either, because they have to look at the global database of timezone information (yes, even gmtime_r does this — it might be possible to change it not to need to do that, but it still wouldn't be a thing you could rely on cross-platform).

I don't think there's a good workaround. Formatted output from an async signal handler is going to trip over all kinds of other problems; my advice is to restructure your code so that you never need to call logging functions from async signal handlers.

zwol
  • 135,547
  • 38
  • 252
  • 361
  • [gettimeofday](https://github.com/lattera/glibc/blob/master/sysdeps/posix/gettimeofday.c#L49) (in glibc at least) calls `localtime_r` which as you point out calls `__tz_convert` (which has a lock and is therefore very much async unsafe). The problem is in fact (very likely) the call to `gettimeofday`. – eerorika Nov 05 '18 at 15:40
  • @user2079303 That can't be OP's immediate problem, because then `gettimeofday` would be on the stack, since that's not a tail call. However, it's a good point anyway; I forgot about the timezone out-parameter to `gettimeofday` (because I've never seen anyone pass anything other than NULL -- you will notice that `localtime_r` is only called when tz is not NULL). Will adjust answer. – zwol Nov 05 '18 at 17:19
  • Is the shown stack trace from the thread of execution, or from the signal handler? I think that the execution is in `__tz_convert` one way or another (not by the call to `gettimeofday` necessarily, as it is not in the call stack after all) and the lock has been acquired, and at that moment the execution is interrupted and the signal handler calls `gettimeofday` specifically, which also tries to acquire the lock, and blocks indefinitely. – eerorika Nov 05 '18 at 17:26
  • 1
    @user2079303 I think you're right that the deadlock is on the lock acquired inside `__tz_convert`, but, again, `gettimeofday` doesn't call `localtime_r` unless its timezone argument is nonzero, and that call isn't a tail call,. Therefore, if `Logger::getCurrentTimestamp` called `gettimeofday` with a nonzero timezone argument, we would see `gettimeofday` on the stack. We don't. Therefore I deduce that `Logger::getCurrentTimestamp` called one of localtime/localtime_r/gmtime/gmtime_r directly. – zwol Nov 05 '18 at 17:31
  • @zwol You have pointed me in the right direction. I have added definition of `Logger::getCurrentTimestamp` and stack trace of thread that received signal. It looks like call to `localtime()` might be taking lock and causing deadlock. So I have decided to remove `LogInfo()` statement from signal handler as logging is not important for me. But if I want to print logs in future within signal handler, how can I do it ? – Durgesh Tanuku Nov 06 '18 at 12:17
  • @Durgesh I'm afraid the last paragraph of my answer is the best advice I can give you on that subject. – zwol Nov 06 '18 at 12:32