
I have two C files:

a.c

void main(){
    ...
    getvtable()->function();
}

The vtable points to a function that is defined in b.c:

void function(){
    malloc(42);
}

Now, if I run the program under Valgrind, I get the following:

==29994== 4,155 bytes in 831 blocks are definitely lost in loss record 26 of 28
==29994==    at 0x402CB7A: malloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==29994==    by 0x40A24D2: (below main) (libc-start.c:226)

So the call to function is completely omitted from the stack! How is that possible? When I use GDB, the correct stack, including "function", is shown.

Debug symbols are included, Linux, 32-bit.

Update:

In response to the first answer, here is the output I get when debugging through Valgrind's GDB server. The breakpoint is never hit, whereas it is hit when I debug directly with GDB.

stasik@gemini:~$ gdb -q
(gdb) set confirm off
(gdb) target remote | vgdb
Remote debugging using | vgdb
relaying data between gdb and process 11665
[Switching to Thread 11665]
0x040011d0 in ?? ()
(gdb) file /home/stasik/leak.so
Reading symbols from /home/stasik/leak.so...done.
(gdb) break function
Breakpoint 1 at 0x110c: file ../../source/leakclass.c, line 32.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>end
(gdb) continue
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0404efcb in ?? ()
(gdb) source thread-frames.py
Stack level 0, frame at 0x42348a0:
 eip = 0x404efcb; saved eip 0x4f2f544c
 called by frame at 0x42348a4
 Arglist at 0x4234898, args:
 Locals at 0x4234898, Previous frame's sp is 0x42348a0
 Saved registers:
  ebp at 0x4234898, eip at 0x423489c
Stack level 1, frame at 0x42348a4:
 eip = 0x4f2f544c; saved eip 0x6e492056
 called by frame at 0x42348a8, caller of frame at 0x42348a0
 Arglist at 0x423489c, args:
 Locals at 0x423489c, Previous frame's sp is 0x42348a4
 Saved registers:
  eip at 0x42348a0
Stack level 2, frame at 0x42348a8:
 eip = 0x6e492056; saved eip 0x205d6f66
 called by frame at 0x42348ac, caller of frame at 0x42348a4
 Arglist at 0x42348a0, args:
 Locals at 0x42348a0, Previous frame's sp is 0x42348a8
 Saved registers:
  eip at 0x42348a4
Stack level 3, frame at 0x42348ac:
 eip = 0x205d6f66; saved eip 0x61746144
---Type <return> to continue, or q <return> to quit---
 called by frame at 0x42348b0, caller of frame at 0x42348a8
 Arglist at 0x42348a4, args:
 Locals at 0x42348a4, Previous frame's sp is 0x42348ac
 Saved registers:
  eip at 0x42348a8
Stack level 4, frame at 0x42348b0:
 eip = 0x61746144; saved eip 0x65736162
 called by frame at 0x42348b4, caller of frame at 0x42348ac
 Arglist at 0x42348a8, args:
 Locals at 0x42348a8, Previous frame's sp is 0x42348b0
 Saved registers:
  eip at 0x42348ac
Stack level 5, frame at 0x42348b4:
 eip = 0x65736162; saved eip 0x70616d20
 called by frame at 0x42348b8, caller of frame at 0x42348b0
 Arglist at 0x42348ac, args:
 Locals at 0x42348ac, Previous frame's sp is 0x42348b4
 Saved registers:
  eip at 0x42348b0
Stack level 6, frame at 0x42348b8:
 eip = 0x70616d20; saved eip 0x2e646570
 called by frame at 0x42348bc, caller of frame at 0x42348b4
 Arglist at 0x42348b0, args:
---Type <return> to continue, or q <return> to quit---
 Locals at 0x42348b0, Previous frame's sp is 0x42348b8
 Saved registers:
  eip at 0x42348b4
Stack level 7, frame at 0x42348bc:
 eip = 0x2e646570; saved eip 0x0
 called by frame at 0x42348c0, caller of frame at 0x42348b8
 Arglist at 0x42348b4, args:
 Locals at 0x42348b4, Previous frame's sp is 0x42348bc
 Saved registers:
  eip at 0x42348b8
Stack level 8, frame at 0x42348c0:
 eip = 0x0; saved eip 0x0
 caller of frame at 0x42348bc
 Arglist at 0x42348b8, args:
 Locals at 0x42348b8, Previous frame's sp is 0x42348c0
 Saved registers:
  eip at 0x42348bc
(gdb) continue
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0404efcb in ?? ()
(gdb) continue
Continuing.
Stasik

3 Answers


I see two possible reasons:

  • Valgrind is using a different stack unwind method than GDB
  • The address space layout is different while running your program under the two environments and you're only hitting stack corruption under Valgrind.

We can gain more insight by using Valgrind's built-in gdbserver.

Save this Python snippet to thread-frames.py

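import gdb

# Walk from the innermost frame outward, printing "info frame" for each one.
f = gdb.newest_frame()
while f is not None:
    f.select()                 # make this the selected frame for "info frame"
    gdb.execute('info frame')
    f = f.older()              # None once the outermost frame is reached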

t.gdb

set confirm off
file MY-PROGRAM
break function
commands
silent
end
run
source thread-frames.py
quit

v.gdb

set confirm off
target remote | vgdb
file MY-PROGRAM
break function
commands
silent
end
continue
source thread-frames.py
quit

(Change MY-PROGRAM and function in the scripts above and in the commands below as required.)

Get details about the stack frames under GDB:

$ gdb -q -x t.gdb
Breakpoint 1 at 0x80484a2: file valgrind-unwind.c, line 6.
Stack level 0, frame at 0xbffff2f0:
 eip = 0x80484a2 in function (valgrind-unwind.c:6); saved eip 0x8048384
 called by frame at 0xbffff310
 source language c.
 Arglist at 0xbffff2e8, args: 
 Locals at 0xbffff2e8, Previous frame's sp is 0xbffff2f0
 Saved registers:
  ebp at 0xbffff2e8, eip at 0xbffff2ec
Stack level 1, frame at 0xbffff310:
 eip = 0x8048384 in main (valgrind-unwind.c:17); saved eip 0xb7e33963
 caller of frame at 0xbffff2f0
 source language c.
 Arglist at 0xbffff2f8, args: 
 Locals at 0xbffff2f8, Previous frame's sp is 0xbffff310
 Saved registers:
  ebp at 0xbffff2f8, eip at 0xbffff30c

Get the same data under Valgrind:

$ valgrind --vgdb=full --vgdb-error=0 ./MY-PROGRAM

In another shell:

$ gdb -q -x v.gdb
relaying data between gdb and process 574
0x04001020 in ?? ()
Breakpoint 1 at 0x80484a2: file valgrind-unwind.c, line 6.
Stack level 0, frame at 0xbe88e2c0:
 eip = 0x80484a2 in function (valgrind-unwind.c:6); saved eip 0x8048384
 called by frame at 0xbe88e2e0
 source language c.
 Arglist at 0xbe88e2b8, args: 
 Locals at 0xbe88e2b8, Previous frame's sp is 0xbe88e2c0
 Saved registers:
  ebp at 0xbe88e2b8, eip at 0xbe88e2bc
Stack level 1, frame at 0xbe88e2e0:
 eip = 0x8048384 in main (valgrind-unwind.c:17); saved eip 0x4051963
 caller of frame at 0xbe88e2c0
 source language c.
 Arglist at 0xbe88e2c8, args: 
 Locals at 0xbe88e2c8, Previous frame's sp is 0xbe88e2e0
 Saved registers:
  ebp at 0xbe88e2c8, eip at 0xbe88e2dc

If GDB can successfully unwind the stack while connected through Valgrind's gdbserver ("valgrind --vgdb"), then the problem lies in Valgrind's stack unwinding algorithm. Inspect the "info frame" output carefully for inlined frames, tail-call frames, or anything else that could throw Valgrind off. Otherwise, it's probably stack corruption.
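One quick sanity check on "info frame" output is to see whether the saved eip values look like code addresses at all. The sketch below is only a heuristic, not part of GDB or Valgrind: it takes the saved eip values from the question's update, lays them out as they would sit in little-endian x86 memory, and checks how many bytes are printable ASCII. The helper names as_stack_bytes and printable_ratio are made up for this illustration.

```python
import struct

# Saved-eip values copied from the "info frame" output in the question's update.
saved_eips = [
    0x4f2f544c, 0x6e492056, 0x205d6f66, 0x61746144,
    0x65736162, 0x70616d20, 0x2e646570,
]

def as_stack_bytes(words):
    """Pack 32-bit words the way they are stored in little-endian x86 memory."""
    return b"".join(struct.pack("<I", w) for w in words)

def printable_ratio(data):
    """Fraction of bytes that are printable ASCII (0.0 for empty input)."""
    if not data:
        return 0.0
    return sum(32 <= b < 127 for b in data) / len(data)

raw = as_stack_bytes(saved_eips)
print(raw.decode("ascii", errors="replace"))  # LT/OV Info] Database mapped.
print(printable_ratio(raw))                   # 1.0
```

Every byte is printable and the words spell out part of a log message, so those "frames" are most likely GDB walking through a string buffer one word at a time, not real return addresses. That would be consistent with stack corruption, but it may also just mean GDB had no unwind information for the address where it stopped (0x0404efcb in ?? ()), so treat it as a hint, not as proof.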

scottt
  • if you try `nm -an file.out | grep function` what do you get ? – 0x90 May 28 '13 at 21:09
  • you and him, assuming you have already setup all the environment on your machine. I believe gcc will inline `function` by default – 0x90 May 28 '13 at 21:15
  • @0x90, since @Stasik has `function` in another `.so`. `nm` would show `U function`. Note that in my logs above I didn't have that setup. I did a quick test and moving `function` to a separate `.so` doesn't make a difference in my synthetic test code though. – scottt May 28 '13 at 21:26
  • if function is in .so it can't be inlined. – 0x90 May 28 '13 at 21:31
  • @0x90, Stasik mentioned that in a comment on the question. – scottt May 28 '13 at 21:33
  • @scott: wow thank you for the nice answer! I had some problems fiddling with file (derived load of the .so), but i got the first example working, breakpoints are coming, stack has the function "function". With your second example I get something like this: http://pastebin.com/5k5GhDsj. The breakpoint in the function "function" is not coming at all! Is it stack corruption? What's next? – Stasik May 29 '13 at 07:38
  • @Stasik what I wanted to do is useless since the function is in a `.so` – 0x90 May 29 '13 at 07:42
  • @Stasik: Could you add to your answer, what you put to *pastebin*, please. When the pastebin-link will be gone some day, it will be difficult to follow this thread. – alk May 29 '13 at 10:29
  • @alk: i hear it does not fit into 600 chars limit – Stasik May 29 '13 at 12:38
  • This talk of "frame" makes me think that maybe Valgrind unwinds the stack based on frame pointers rather than return addresses. And such small functions without local variables would not use frame pointers. – Medinoc May 31 '13 at 09:53
  • @Medinoc Unfortunately, I do have two local variables in the 'real' function. Can I read more on frame pointers vs. return addresses somewhere? – Stasik Jun 01 '13 at 17:40
  • I'm not sure where. There are some on Raymond Chen's blog, but it's usually *part* of the subject matter rather than an article entirely dedicated to it: http://blogs.msdn.com/b/oldnewthing/archive/2004/01/16/59415.aspx http://blogs.msdn.com/b/oldnewthing/archive/2011/03/09/10138401.aspx http://blogs.msdn.com/b/oldnewthing/archive/2011/03/16/10141735.aspx – Medinoc Jun 03 '13 at 10:34

OK, compiling all the .so parts and the main program with an explicit -O0 seems to solve the problem. It seems that some optimization in the 'core' program that loads the .so was breaking the stack (the .so itself was always compiled unoptimized).

Stasik

This is tail-call optimization in action.

The function function calls malloc as the last thing it does. The compiler sees this and tears down the stack frame for function before it jumps to malloc. The advantage is that when malloc returns, it returns directly to whichever function called function; i.e., it avoids malloc returning to function only to hit yet another return instruction.

In this case the optimization has avoided an unnecessary jump and made stack usage slightly more efficient, which is nice. For a recursive tail call, however, this optimization is a huge win, as it turns recursion into something more like iteration.
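To make the effect on a backtrace concrete, here is a toy model in Python. It is only a sketch of the idea, not how GCC, GDB, or Valgrind actually work: the call stack is modelled as a plain list of function names, and backtrace_at_malloc is a hypothetical helper invented for this illustration.

```python
def backtrace_at_malloc(tail_call_optimized):
    """Return the frames a stack walker would see while malloc is running."""
    stack = ["main"]           # main is running
    stack.append("function")   # main calls function through the vtable
    if tail_call_optimized:
        # The frame for function is torn down before control moves to malloc,
        # so malloc will return straight to main.
        stack.pop()
    stack.append("malloc")     # a call, or a plain jump in the optimized case
    return list(reversed(stack))  # innermost frame first, like a backtrace

print(backtrace_at_malloc(False))  # ['malloc', 'function', 'main']
print(backtrace_at_malloc(True))   # ['malloc', 'main']
```

Note that in this model main is still present in the optimized backtrace, whereas the Valgrind report in the question goes from malloc straight to "(below main)", so a tail call alone may not account for everything that report shows.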

As you've discovered already, disabling optimization makes debugging much easier. If you want to debug optimized code (for performance testing, perhaps), then, as @Zang MingJie already said, you can disable this one optimization with -fno-optimize-sibling-calls.

ams
  • It does not explain the GDB/Valgrind stack difference - remember, I was able to see the function in GDB but not in Valgrind. Following the tip of Zang MingJie did not bring the remedy. – Stasik Jun 06 '13 at 14:24
  • That's interesting and surprising. – ams Jun 06 '13 at 14:28