All my threads are stuck at one point. The thread list at that point is shown below:

(gdb) info threads
  9 Thread 0x7fa872994700 (LWP 10301)  0x000000327b60e264 in __lll_lock_wait () from /lib64/libpthread.so.0
  8 Thread 0x7fa87379c700 (LWP 10302)  0x000000327b2accdd in nanosleep () from /lib64/libc.so.6
  7 Thread 0x7fa871b7c700 (LWP 10303)  0x000000327b2db74d in read () from /lib64/libc.so.6
  6 Thread 0x7fa87117b700 (LWP 10306)  0x000000327b60e264 in __lll_lock_wait () from /lib64/libpthread.so.0
  5 Thread 0x7fa864e14700 (LWP 10307)  0x000000327b60e264 in __lll_lock_wait () from /lib64/libpthread.so.0
  4 Thread 0x7fa85ffff700 (LWP 10308)  0x000000327b2db7ad in write () from /lib64/libc.so.6
  3 Thread 0x7fa85f5fe700 (LWP 10309)  0x000000327b60e264 in __lll_lock_wait () from /lib64/libpthread.so.0
  2 Thread 0x7fa85ebfd700 (LWP 10311)  0x000000327b2accdd in nanosleep () from /lib64/libc.so.6
* 1 Thread 0x7fa87379e720 (LWP 10300)  0x000000327b60822d in pthread_join () from /lib64/libpthread.so.0

I am trying to find out whether this is caused by my code or by the system configuration. The program works on all other machines; the hang happens on only one machine, but there it happens on every run. The configuration of that machine is:

bash-4.1$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)

bash-4.1$ uname -a
Linux localhost 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

bash-4.1$ rpm -qa |grep glibc
glibc-devel-2.12-1.132.el6.x86_64
glibc-2.12-1.132.el6.x86_64
glibc-common-2.12-1.132.el6.x86_64
glibc-headers-2.12-1.132.el6.x86_64

For reference, below is the configuration of a machine where the threads do not get stuck (working fine):

> cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.3 (Santiago)

> uname -a
Linux localhost 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

> rpm -qa |grep glibc
glibc-headers-2.12-1.80.el6.x86_64
compat-glibc-headers-2.5-46.2.x86_64
compat-glibc-2.5-46.2.x86_64
glibc-devel-2.12-1.80.el6.x86_64
glibc-common-2.12-1.80.el6.x86_64
glibc-2.12-1.80.el6.i686
glibc-devel-2.12-1.80.el6.i686
glibc-2.12-1.80.el6.x86_64
  • There is one thread in `read` and one in `write`, are those also stuck? Generally, if one hits a bug like this, it is very unlikely to be a bug in a system library that's currently running on millions of machines, and very likely to be a bug in one's own code. – Karsten Koop Sep 26 '16 at 06:37
  • `__lll_lock_wait()` is usually because you're trying to lock a mutex that is already locked by another thread - so if it's working on other machines, it looks a bit like a race condition resulting in a deadlock. Your glibc looks pretty old though (current is 2.24), so if you're doing fancy stuff with for instance priority inheritance mutexes, you might also be hitting a bug of sorts (I've had a problem with that: http://stackoverflow.com/questions/11878445/cancelling-pthread-cond-wait-hangs-with-prio-inherit-mutex). Try seeing if you can isolate the problem and make a small testcase maybe? – sonicwave Sep 26 '16 at 06:58
  • @Karsten Koop, yes, both of those threads are stuck as well. – Majid Khan Sep 26 '16 at 07:19
  • Have you tried using [valgrind](http://valgrind.org/docs/manual/hg-manual.html)? – Selçuk Cihan Sep 26 '16 at 07:40

1 Answer

As suggested in this answer https://stackoverflow.com/a/3491304/108153, look at the backtrace of each thread that is waiting on a lock:

(gdb) thr 9
(gdb) bt

#0  0x00007f5e45c553dd in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007f5e45c4e7d4 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2  0x00007f5e458cc84f in gst_element_set_state_func (element=0x7f5d94461ca0, state=GST_STATE_READY) at gstelement.c:2831

Go to the stack frame that is trying to lock the mutex, and look inside the mutex for the thread ID of its current owner.

(gdb) f 2  # look at frame 2, as an example
#2  0x00007f5e458cc84f in gst_element_set_state_func (element=0x7f5d94461ca0, state=GST_STATE_READY)
    at gstelement.c:2831
2831      GST_STATE_LOCK (element);

Find the symbol of the mutex that the thread is trying to lock, and print its contents:

(gdb) p element.state_lock
$3 = {p = 0x7f5d0c03f2a0, i = {0, 0}}

(gdb) p *(struct __pthread_mutex_s *)element.state_lock.p
$6 = {__lock = 2, __count = 1, __owner = 11889, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, 
  __list = {__prev = 0x0, __next = 0x0}}

If you don't have the symbol but do have the address, you can read the same fields by examining the memory directly:

(gdb) x/4x 0x7f5d0c03f2a0   # address of the mutex
0x7f5d0c03f2a0: 0x00000002      0x00000001      0x00002e71      0x00000001
(gdb) p 0x2e71
$7 = 11889
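
The hex-to-decimal step above can also be scripted. The following is only a minimal sketch: `decode_mutex_words` and `FIELDS` are hypothetical names, and the field order is an assumption taken from the `__pthread_mutex_s` printout earlier in this answer (glibc on x86-64), so it may differ on other platforms or library versions.

```python
# Assumed order of the first four 32-bit words of the mutex, taken from
# the struct printout above (glibc on x86-64). Not a stable ABI promise.
FIELDS = ("__lock", "__count", "__owner", "__nusers")


def decode_mutex_words(gdb_line):
    """Decode one line of `x/4x` output into the first four mutex fields."""
    # Drop the leading "address:" part, keep only the hex words.
    _, _, words = gdb_line.partition(":")
    values = [int(word, 16) for word in words.split()]
    if len(values) != len(FIELDS):
        raise ValueError("expected exactly four 32-bit words")
    return dict(zip(FIELDS, values))


# The line printed by gdb in the example above.
line = "0x7f5d0c03f2a0: 0x00000002      0x00000001      0x00002e71      0x00000001"
mutex = decode_mutex_words(line)
print("owner LWP:", mutex["__owner"])
```

For the line shown, the `__owner` entry is 11889, the same value gdb printed for `p 0x2e71`.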

On the current version of Linux pthreads, the owner is the third value. That owner is an LWP id, so match it against the LWP column of the thread list and see why the owning thread is blocked. In the question's listing, for example, LWP 10311 is thread 2; in this example, LWP 11889 is thread 18.

(gdb) info thr
[ ... ]
  18   Thread 0x7f5dc9dff700 (LWP 11889) "task114"        0x00007f5e45c5203c in pthread_cond_wait@@GLIBC_2.3.2

(gdb) thr 18
(gdb) bt
#0  0x00007f5e45c5203c in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
[ ... ]
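
When there are many threads, the lookup from owner LWP to gdb thread number can be done mechanically on a saved copy of the `info threads` output. This is a rough sketch only: `lwp_to_thread` and `THREAD_RE` are hypothetical names, and the regular expression assumes the line format shown in the listings on this page.

```python
import re

# Matches lines such as
#   "  9 Thread 0x7fa872994700 (LWP 10301)  0x... in __lll_lock_wait ()"
#   "* 1 Thread 0x7fa87379e720 (LWP 10300)  0x... in pthread_join ()"
THREAD_RE = re.compile(
    r"^\*?\s*(\d+)\s+Thread\s+0x[0-9a-fA-F]+\s+\(LWP\s+(\d+)\)"
)


def lwp_to_thread(info_threads_text):
    """Return a dict mapping LWP id to gdb thread number."""
    mapping = {}
    for raw in info_threads_text.splitlines():
        match = THREAD_RE.match(raw.strip())
        if match:
            mapping[int(match.group(2))] = int(match.group(1))
    return mapping


# A few lines copied from the listings on this page.
listing = """
  9 Thread 0x7fa872994700 (LWP 10301)  0x000000327b60e264 in __lll_lock_wait () from /lib64/libpthread.so.0
  2 Thread 0x7fa85ebfd700 (LWP 10311)  0x000000327b2accdd in nanosleep () from /lib64/libc.so.6
* 1 Thread 0x7fa87379e720 (LWP 10300)  0x000000327b60822d in pthread_join () from /lib64/libpthread.so.0
  18   Thread 0x7f5dc9dff700 (LWP 11889) "task114"        0x00007f5e45c5203c in pthread_cond_wait@@GLIBC_2.3.2
"""
threads = lwp_to_thread(listing)
print("LWP 11889 is gdb thread", threads[11889])
```

The resulting thread number is the one to pass to `thr` before running `bt`.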
  • What should one do when the struct says that the owner is 0? Sorry to hijack this comment, but I tried to open a question here ( https://stackoverflow.com/questions/74501731/how-do-i-debug-a-mutex-that-is-not-locking ) but... Since I basically followed your suggestion here, I was trying to see if you had any more ideas that could help me debug my problem – ludeed Nov 19 '22 at 23:00
  • I posted an answer to your question. I hope it helps you find the issue. – codeDr Dec 07 '22 at 17:27