I have a multi-threaded C++ program that deadlocks in rare cases. The problem is hard to reproduce, and I can only reproduce it on a remote machine. My plan for tracking it down is:

  1. run the program
  2. wait for deadlock
  3. send it an abort signal (SIGABRT) to generate a core dump
  4. copy the dump back to my local machine
  5. use gdb to debug it

I do not have gdb on the remote machine and cannot install anything on it. The problem is that when I debug the core dump (obtained from either a deadlocked or a normally running process on the remote machine), the backtrace of most threads shows only:

(gdb) bt
#0  pthread_cond_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:261
#1  0x0000000000000000 in ?? ()

I am using a statically linked binary compiled with "-g -O1". When I abort a process running the same binary on my local machine, gdb can extract the entire stack from the core dump and the problem does not occur (although I cannot reproduce the deadlock there). My remote machine runs SLES and my local machine runs Ubuntu.

Any idea?

Edit:

I found someone else with the same problem, but still no solution: http://groups.google.com/group/google-coredumper/browse_thread/thread/2ca9bcf9465d1050 (I am not using google coredumper, but it seems to fail with the same error, which suggests the problem may lie with SLES 11.)

Shayan Pooya

2 Answers

Note that you can also use gcore to create a core file without aborting. Have you tried running pstack on the remote host (assuming it's installed) to see if you can get a backtrace that way?

Otherwise, if the shared objects used by your application differ between your local host and the remote host, gdb won't be able to match the memory offsets properly and the backtrace will probably come out garbled. If you can copy all the relevant .so files from the remote host to some place locally, I believe you can direct gdb to read them instead of the normally installed versions (for example with `set sysroot` or `set solib-search-path`).

EDIT: try running pstack on your build machine and see if it can pick up a stack.

Mark B
  • pstack and gcore are not installed. I am linking the binary statically (using -static), and there is only one binary, which I copy to the remote machine and run (no shared libraries). – Shayan Pooya Jul 28 '11 at 17:28
  • Are you sure the binary is completely static? I guess it is linked against libc.so. What is the output of `ldd` on the binary? – ks1322 Jul 28 '11 at 18:31
  • Yes. The libc versions on the local and remote machines are not compatible, so I had to compile with -static. The output of `ldd` on the binary is "not a dynamic executable". – Shayan Pooya Jul 28 '11 at 18:41
  • @matt, in gdb, `help gcore` prints: "Save a core file with the current state of the debugged process." So I guess it only makes sense to run it on the remote machine. But the problem is that I don't have gdb on the remote machine. If I had it, I could run the program under gdb. – Shayan Pooya Aug 01 '11 at 22:17

How old is your glibc? Are you perhaps missing this commit:

commit ad2be8527ac0f19f129fc4519d823cbe48239c78
Author: Ulrich Drepper <drepper@redhat.com>
Date:   Sun Apr 13 08:36:19 2003 +0000

    Update.

        * sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: Add unwind info.
        * sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: Likewise.
        * sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S: Likewise.
Employed Russian
  • No. GCC is kind of new (2009). My local machine is Ubuntu 9.10 with kernel 2.6.31. – Shayan Pooya Jul 29 '11 at 16:49
  • GCC has nothing to do with it, and neither does the kernel. It's glibc that needs to be new enough. – Employed Russian Jul 29 '11 at 18:16
  • That commit was made in 2003, and my GCC, glibc, kernel, and other components all carry 2009 copyright dates, so I guess they include it. However, `ldd --version` shows version 2.10.1. – Shayan Pooya Jul 29 '11 at 18:48