I have a multithreaded application where I spawn a few threads and call pthread_join() on them upon completion.

The main thread spawns the worker threads and then waits in pthread_join() for them to finish. I am facing an issue where the main thread waits indefinitely in pthread_join() even though all the worker threads have exited, so the program hangs. I confirmed that the workers have exited by running info threads in gdb, which lists only the main thread. It is known that calling pthread_join() on a thread that has already exited returns immediately, but this seems different. This is the gdb stack trace:

#0  0x00007f45fefebeec in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f45fef68a6f in _L_lock_5333 () from /lib64/libc.so.6
#2  0x00007f45fef62408 in _int_free () from /lib64/libc.so.6
#3  0x00007f45ffbe5088 in _dl_deallocate_tls () from /lib64/ld-linux-x86-64.so.2
#4  0x00007f45ff9bde67 in __free_stacks () from /lib64/libpthread.so.0
#5  0x00007f45ff9bdf7f in __deallocate_stack () from /lib64/libpthread.so.0
#6  0x00007f45ff9bff93 in pthread_join () from /lib64/libpthread.so.0
#7  0x00007f45f87a6fe1 in waitForWorkerThreadsToExit () at src/server.c:133
#8  ServerLoop (arg=<optimized out>) at src/server.c:662
#9  0x00007f45ff9bee25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f45fefde34d in clone () from /lib64/libc.so.6

I am on CentOS 7 with Linux kernel 3.10.

Can someone help? TIA

  • `_dl_deallocate_tls` sounds like it has to do with thread-local storage; do any of your threads use that feature? If so, you might try temporarily disabling it and see if that makes the fault go away. – Jeremy Friesner Jan 07 '20 at 05:45
  • @JeremyFriesner yes, we use thread-local storage extensively. I am afraid that it can't be disabled. – krithikaGopalakrishnan Jan 07 '20 at 06:09
  • You say, "the worker threads have exited." What does that mean? What caused them to "exit?" Your main thread appears to be waiting for a mutex. Is it possible that one of the "workers" was forcibly _killed_ while holding the lock? – Solomon Slow Jan 07 '20 at 14:12
  • To get help, you need to add a few things: 1. you should tell which exact version of GLIBC you are using. 2. you should install libc6-dbg or similar package (`debuginfo-install -y ...`) and get the "hang" stack trace with file/line info. With that, we'll be able to tell *which* lock `libc.so.6` is blocking on. – Employed Russian Jan 07 '20 at 14:57
  • (GNU libc) 2.17 is the glibc version @EmployedRussian – krithikaGopalakrishnan Jan 08 '20 at 07:22

1 Answer

One of the other threads is exiting without releasing the lock. As suggested here, you can check the thread ID of the mutex's owner to find out which thread is the culprit.