24

I've set breakpoints on exit and _exit, but my program (a multithreaded app running on Linux 2.6.16.46-0.12, SLES10) is somehow still exiting in a way I can't locate:

(gdb) c
...
[New Thread 47513671297344 (LWP 15279)]
[New Thread 47513667103040 (LWP 15280)]
[New Thread 47513662908736 (LWP 15281)]

Program exited with code 0177.
(gdb)

The exit functions reside in libc, so there are no deferred-load shared library issues. Does anybody know of some other mysterious trigger for exit that can't be caught?

EDIT: the problem is now academic only. I tried binary search debugging, backing out a subset of my changes (the problem went away). After I applied them again in sequence, I can no longer repro the problem, even with things restored to the original state.

EDIT2: I recently found one cause of this sort of error, which may have been the original source of this problem. For historical reasons our product uses the evil linker flag -Bsymbolic. Among its side effects is that when an undefined symbol is called, the GLIBC runtime linker bombs in exactly this way, and the debugger shows it as a process that exited with code 0177. When the runtime linker aborts like this, my guess is that it makes the _exit syscall directly (rather than going through the C runtime library's exit() or _exit()). That would be consistent with my being unable to catch this with the exit breakpoints in the debugger.

Peeter Joot

3 Answers

40

There are two common reasons for an _exit breakpoint to "miss" -- either GDB didn't set the breakpoint in the right place, or the program performs (a moral equivalent of) syscall(SYS_exit, ...).

What do info break and disassemble _exit say?

You might be able to convince GDB to set the breakpoint correctly with break *&_exit. Alternatively, GDB 7.0 and later support catch syscall. Something like this should work regardless of how the program exits (assuming Linux/x86_64; on ix86 the numbers are different):

(gdb) catch syscall 60
Catchpoint 3 (syscall 'exit' [60])
(gdb) catch syscall 231
Catchpoint 4 (syscall 'exit_group' [231])
(gdb) c

Catchpoint 4 (call to syscall 'exit_group'), 0x00007ffff7912f3d in _exit () from /lib/libc.so.6
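
To illustrate why a breakpoint on the _exit symbol can miss while catch syscall still fires, here is a minimal Python sketch (Linux only) of a process that terminates through the raw syscall entry point. The function name exit_via_raw_syscall and the per-architecture number table are my own for illustration, and this is not a claim about what the original program does; it only shows that the kernel sees exit_group even when neither exit() nor _exit() is called.

```python
import ctypes
import os
import platform

# exit_group syscall numbers on Linux; they differ per architecture,
# which is why the mnemonic names are safer in gdb.
SYS_EXIT_GROUP = {"x86_64": 231, "i386": 252, "i686": 252, "aarch64": 94}


def exit_via_raw_syscall(code: int) -> int:
    """Fork a child that exits via syscall(2) directly; return its exit status."""
    number = SYS_EXIT_GROUP[platform.machine()]  # KeyError on unlisted machines
    libc = ctypes.CDLL(None)
    pid = os.fork()
    if pid == 0:
        try:
            # The child calls neither exit() nor _exit() here, so a breakpoint
            # on those symbols would not fire; only the kernel sees the exit.
            libc.syscall(ctypes.c_long(number), ctypes.c_long(code))
        finally:
            os._exit(1)  # reached only if the raw syscall did not terminate us
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(oct(exit_via_raw_syscall(0o177)))
```

The parent observes an ordinary exit status, which matches what gdb reports as "Program exited with code ...", even though no libc exit wrapper ran in the child.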

Update:
Your comment indicates that the _exit breakpoint is set correctly, so it's likely that your process just doesn't execute _exit.

That leaves syscall(SYS_exit, ...) and one other possibility (which I missed before): all threads executing pthread_exit. You might want to set a breakpoint on pthread_exit as well (and execute info thread each time you hit it -- the last thread to do pthread_exit will cause the process to terminate).

Edit:

It's also worth noting that you can use mnemonic names rather than syscall numbers. You can also add multiple syscalls to the catch list at once, like so:

(gdb) catch syscall exit exit_group
Catchpoint 2 (syscalls 'exit' [1] 'exit_group' [252])
Parthian Shot
Employed Russian
  • I'll try building gdb 7 and seeing what it shows. the *& gives the same instruction address:
    (gdb) b _exit
    Breakpoint 2 at 0x2aeea040f250
    (gdb) b *&_exit
    Note: breakpoint 2 also set at pc 0x2aeea040f250.
    Breakpoint 3 at 0x2aeea040f250
    
    0x00002aeea040f250 <_exit+0>:   mov    %fs:0x0,%r9
    ...
    0x00002aeea040f275 <_exit+37>:  syscall
    
    (Looks like a fairly standard syscall). I think I've at least isolated the code change that leads to this mysterious exit, just don't understand the details yet.
    – Peeter Joot Nov 23 '09 at 04:05
  • 4
    Would be better to use `catch syscall exit` and `catch syscall exit_group` instead of numeric values. On my system, for example, `exit` is `[1]` not `[60]`. – Ruslan May 24 '14 at 07:23
  • Additionally, you can set both at once with `catch syscall exit exit_group`. In fact, editing it now... – Parthian Shot Jun 10 '15 at 17:51
2

Setting the breakpoint on _exit was a good idea.

You might also try linking statically, just to take a stack of potential gdb complications off the table.

0177 is suspiciously like the wait status wait(2) returns for child stopped, but gdb is printing the exit status, which is a different thing, so that's probably a real exit argument.
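
The distinction between an exit status and a wait status can be checked with a little arithmetic. The sketch below re-implements, in Python, the bit tests that glibc's bits/waitstatus.h performs; the lowercase function names are mine, chosen to mirror the C macros, and the encoding shown is the Linux/glibc one.

```python
def wifstopped(status: int) -> bool:
    # Mirrors __WIFSTOPPED: the low byte equals 0x7f for a stopped child.
    return (status & 0xFF) == 0x7F


def wifexited(status: int) -> bool:
    # Mirrors __WIFEXITED: the termination-signal bits are all zero.
    return (status & 0x7F) == 0


def wexitstatus(status: int) -> int:
    # Mirrors __WEXITSTATUS: the exit code lives in bits 8..15.
    return (status & 0xFF00) >> 8


exit_code = 0o177             # what gdb printed; the same value as 127 and 0x7f
wait_status = exit_code << 8  # how a normal exit with that code appears to wait(2)
```

Read as a raw wait status, 0177 would satisfy wifstopped; read as an exit code, it is simply 127, and the corresponding wait status (0x7f00) satisfies wifexited instead.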

DigitalRoss
  • OP said he already has breakpoints on _exit and exit. Also, 0177 is 127. How in the world did you get from 127 to SIGCHLD? – Employed Russian Nov 23 '09 at 03:21
  • Oh, missed exit. But I'm right about wait status. I'm obviously not talking about the signal number, but the status `wait(2)` returns for a stopped process. Look at this: $ grep IFSTOPPED /usr/include/bits/waitstatus.h `#define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f)`, AND, `0x7f == 0177`. But I agree that's not what is happening here. – DigitalRoss Nov 23 '09 at 03:56
1

It might be that some lazily bound references are unresolved in a shared library loaded into the process. I had exactly the same situation, where "someone somewhere" exited the process, and the cause turned out to be an unresolved reference.

Check your program with "ldd -r".

It looks like ld.so (or whatever does the lazy symbol resolution) resolves such symbols to a common exit function (which should be abort, IMHO).

My situation:

$ ldd -r ./program
undefined symbol: XXXX  (/usr/lib/libYYY.so)

$ ./program
program: started! 
...
<program is running regardless of undefined references>

The exit happened when I ran a scenario that called the undefined function. It always exited with exit code 127, and gdb reported 0177.

Zbigniew Zagórski
  • That doesn't appear to be the case here. I get no undefined symbols in our executable (not exit nor anything else). – Peeter Joot Feb 16 '10 at 21:22