Some background:

I have an application that relies on third-party hardware and a closed-source driver. The driver currently has a bug that causes the device to stop responding after a random period of time. The cause appears to be a deadlock within the driver, and it interrupts the proper functioning of my application, which runs in an always-on, 24/7, highly visible environment.

What I have found is that attaching GDB to the process and then immediately detaching it results in the device resuming functionality. This was my first indication that there was a thread locking issue within the driver itself: some kind of race condition that leads to a deadlock. Attaching GDB was presumably causing some reshuffling of threads, probably pushing them out of their wait state so that they re-evaluated their conditions, which broke the deadlock.

The question:

My question is simply this: is there a clean way for an application to trigger all threads within the program to interrupt their wait state? One thing that definitely works (at least on my implementation) is to send a SIGSTOP followed immediately by a SIGCONT from another process (i.e. from bash):

kill -19 `cat /var/run/mypidfile` ; kill -18 `cat /var/run/mypidfile`

This triggers a spurious wake-up within the process and everything comes back to life.
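For reference, the same workaround can be sketched as a small shell function. This is only a sketch: the name `nudge_process` is hypothetical, and it assumes the pid file holds a single pid. It uses symbolic signal names rather than 19 and 18, since signal numbers vary between platforms.

```shell
#!/bin/sh
# Hypothetical helper: stop and immediately continue the process named in a pid file.
nudge_process() {
    pidfile=$1
    # Read the pid; fail if the file is missing or unreadable.
    pid=$(cat "$pidfile" 2>/dev/null) || return 1
    [ -n "$pid" ] || return 1
    # Signal 0 delivers nothing; it only checks that the pid exists and can be signalled.
    kill -s 0 "$pid" 2>/dev/null || return 1
    # Symbolic names avoid depending on platform-specific signal numbers.
    kill -s STOP "$pid" && kill -s CONT "$pid"
}
```

It would be called as `nudge_process /var/run/mypidfile`. It has to run from a separate process (a watchdog script, for instance), because a stopped process cannot send itself the SIGCONT.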

I'm hoping there is an intelligent method to trigger a spurious wake-up of all threads within my process. Think pthread_cond_broadcast(...) but without having access to the actual condition variable being waited on.

Is this possible, or is relying on a program like kill my only approach?

John Hargrove
  • What are your threads blocked on? `gdb` can tell you if they're blocked in user space. `ps axlm` can tell you in the `WCHAN` field. – David Schwartz Dec 26 '12 at 22:56
  • It is difficult for me to say exactly -which- threads are the deadlocked pair. There are two threads in `pthread_cond_wait` which are my best guess as the offending threads. I could be incorrect. This is why I'm attempting to hit -every- thread. I was unaware of `ps axlm` and will use this to gather more data next time I catch the issue. It is highly elusive and there aren't any reproduction steps, unfortunately. I will report my findings. – John Hargrove Dec 26 '12 at 23:10
  • You can use a script to catch the stack of every thread. `gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof EXECUTABLE_NAME)` – David Schwartz Dec 26 '12 at 23:12
  • That script is very helpful. I'm currently logging this periodically on several servers hoping one of them exhibits the problem and I can use the callstacks to isolate which threads are causing trouble. Thank you! – John Hargrove Dec 28 '12 at 07:12
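The periodic logging mentioned in the comments above could look roughly like the sketch below. The function name `dump_stacks`, the log path argument, and the `GDB` override variable are all hypothetical; the `gdb` invocation itself is the one from David's comment. One caveat: since attaching GDB was observed to bring the device back to life, a snapshot taken this way may itself clear a stall, so the first capture after a hang is likely the only one that shows the stuck threads.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped all-thread backtrace to a log file.
# GDB may be set to another binary (an assumption, handy for testing);
# otherwise the gdb found on PATH is used.
dump_stacks() {
    pid=$1
    logfile=$2
    {
        # Header line so that snapshots can be told apart in the log.
        printf '=== %s pid %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$pid"
        # Same gdb command as in the comment above, with stderr captured too.
        "${GDB:-gdb}" -ex "set pagination 0" -ex "thread apply all bt" --batch -p "$pid" 2>&1
    } >> "$logfile"
}
```

A cron job or a `while sleep ...` loop could then call something like `dump_stacks "$(pidof EXECUTABLE_NAME)" /var/log/stacks.log`, where the log path is again just an example.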

1 Answer

The way you're doing it right now is probably the simplest correct approach. There is no "wake every futex waiter in a given process" operation in the kernel, which is what you would need to achieve this more directly.

Note that if the failure-to-wake "deadlock" is in pthread_cond_wait but interrupting it with a signal breaks out of the deadlock, the bug cannot be in the application; it must actually be in the implementation of pthread condition variables. glibc has known unfixed bugs in its condition variable implementation; see http://sourceware.org/bugzilla/show_bug.cgi?id=13165 and related bug reports. However, you might have found a new one, since I don't think the existing known ones can be fixed by breaking out of the futex wait with a signal. If you can report this bug to the glibc bug tracker, it would be very helpful.

R.. GitHub STOP HELPING ICE
  • I will explore this. Thank you. – John Hargrove Dec 27 '12 at 15:32
  • I'm currently collecting more data based on David's comments on the question above. I believe this will help me better understand the issue and whether a glibc bug is a possibility. As far as my question goes, I'll hold off on accepting this answer for a couple days to see if anyone else has any ideas. The signaling method WORKS, it just seems like it could be better. Thanks for your help. – John Hargrove Dec 28 '12 at 07:15