
I am trying to debug a core dump from my C++ application. A customer has reported a SEGFAULT core with the following thread list:

...Other threads go above here
  3 Thread 0xf73a2b70 (LWP 2120)  0x006fa430 in __kernel_vsyscall ()
  2 Thread 0x2291b70 (LWP 2212)  0x006fa430 in __kernel_vsyscall ()
* 1 Thread 0x218fb70 (LWP 2210)  0x00000000 in ?? ()

What puzzles me is that the crashed thread points to 0x00000000. If I inspect the backtrace, I get:

Thread 1 (Thread 0x1eeeb70 (LWP 27156)):
#0  0x00000000 in ?? ()
#1  0x00281da7 in SomeClass1::_someKnownMethod1 (this=..., elem=...) at path_to_cpp_file:line_number
#2  0x0028484d in SomeClass2::_someKnownMethod2 (this=..., stream=..., stanza=...) at path_to_cpp_file:line_number
#3  0x002958b2 in SomeClass3::_someKnownMethod3 (this=..., stream=..., elem=...) at path_to_cpp_file:line_number

I apologize for the redaction; it is a limitation of our NDA.

Obviously, the top frame is unknown. My first guess was that the PC register got corrupted by some stack overwrite.

I have tried reproducing the issue in my local deployment by supplying the same call that was seen in Frame #1, but the crash never happened.

I know that cores like this are very difficult to debug, but does anyone have a hint on what to try?

Update

   0x00281d8b <+171>:   mov    edx,DWORD PTR [ebp+0x8]
   0x00281d8e <+174>:   mov    ecx,DWORD PTR [ebp+0xc]
   0x00281d91 <+177>:   mov    eax,DWORD PTR [edx+0x8]
   0x00281d94 <+180>:   mov    edx,DWORD PTR [eax]
   0x00281d96 <+182>:   mov    DWORD PTR [esp+0x8],ecx
   0x00281d9a <+186>:   mov    ecx,DWORD PTR [ebp+0x8]
   0x00281d9d <+189>:   mov    DWORD PTR [esp],eax
   0x00281da0 <+192>:   mov    DWORD PTR [esp+0x4],ecx
   0x00281da4 <+196>:   call   DWORD PTR [edx+0x14]
=> 0x00281da7 <+199>:   mov    ebx,DWORD PTR [ebp-0xc]
   0x00281daa <+202>:   mov    esi,DWORD PTR [ebp-0x8]
   0x00281dad <+205>:   mov    edi,DWORD PTR [ebp-0x4]
   0x00281db0 <+208>:   mov    esp,ebp
   0x00281db2 <+210>:   pop    ebp
   0x00281db3 <+211>:   ret
   0x00281db4 <+212>:   lea    esi,[esi+eiz*1+0x0]

... should have been the one from Frame #0, but from the disassembly this makes little sense. It looks as if the program crashed while returning from Frame #1, so why am I seeing the invalid Frame #0? Or does this frame teardown belong to a function onPacket?

Update #2:

(gdb) p/x $edx
$5 = 0x1deb664
(gdb) print _listener
$6 = (jax::MyClass &) @0xf6dbf6c4: {_vptr.MyClass= 0x1deb664}
Jovan Perovic
  • Take your frame #1 and look at the line in the code there. It seems like this line wants to use a null pointer. Depending on your debug symbols, you can go into the frame #1 context with `up` and print the variables at this point. – Hayt Sep 14 '16 at 09:16
  • Hey Hayt, I followed the suggestion and all variables seem ok so far. But thanks a lot, because, I was looking at this all backwards (as I mentioned in rainer's comment) – Jovan Perovic Sep 14 '16 at 10:34
  • Your original crash likely happens when a `CALL *$r` is executed at `0x00281da7`, with some register `$r` being NULL. Most likely the actual `CALL` instruction is at `0x00281da2` or `0x00281da5`. Your update lists `0x00280da7` as the current instruction, which is `0x1000` away from where it should be. This likely means that in the update you've disassembled a different executable, making diagnosis unnecessarily difficult and confusing. – Employed Russian Sep 15 '16 at 03:41
  • Hey @EmployedRussian, let me check that. I thought I had provided a correct disassembly... – Jovan Perovic Sep 15 '16 at 08:54
  • @EmployedRussian Seems strange, but I am fairly sure that the core I have matches the executable I have analyzed it against :-/ – Jovan Perovic Sep 15 '16 at 09:51
  • @EmployedRussian: As Eirik suggested, I tried extracting the disassembly without `/m`, which returned different addresses... – Jovan Perovic Sep 15 '16 at 11:31
  • @JovanPerovic *Now* your disassembly matches. There is a `CALL *($edx+0x14)` at `0x00281da4`, and that call (to some virtual function of `jax::MyClass`) jumped to `NULL` as in rainer's answer. `x/20a 0x1deb664` should show the state of `MyClass`'s virtual table. – Employed Russian Sep 15 '16 at 14:09
  • Yes, I get all zeros :-/ – Jovan Perovic Sep 15 '16 at 14:38

3 Answers


Expanding on Hayt's comment, since the rest of the stack looks fine, I'd suspect that something is going wrong in frame #1; consider the following (obviously incorrect) program, which generates a similar stack trace:

int main() {
    void (*foo)() = 0;  // null function pointer
    foo();              // calls address 0, so the crash shows frame #0 at 0x0

    return 0;
}

Stack Trace:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000000000040056a in main ()
rainer
  • Aha, that is a good lead. I was investigating this all backwards: I thought that the registers got corrupted when a call in `Frame #0` did `return;`. On the other hand, the line of code that goes to `Frame #0` uses a reference rather than a pointer, so I guess that one cannot be null. However, there is always a possibility that the original object instance got `delete`d? – Jovan Perovic Sep 14 '16 at 09:52
  • Can you share the line and some info about the reference in frame #1? The reference might still be `0` (you can't set it to `0`, but if something else goes wrong, this could still happen). Maybe also check the `this` pointer in frame #1: if `this` is corrupt, and the reference is stored as a class member, dereferencing the corrupt `this` pointer might give you the 0 reference. – rainer Sep 14 '16 at 10:23
  • I am trying to dig that one out. It seems that the reference is valid, but I am investigating the internal state of the calling object. All arguments and `this` seem fine so far. Let me try digging a bit through some printouts... – Jovan Perovic Sep 14 '16 at 10:33

If frame 1 does not make sense at a source level, you might try looking at disassembly of frame 1. After selecting that frame, disass $pc should show you the disassembly for the entire function, with => to indicate the return address (the instruction immediately after the call to frame 0).

In the case of a null function pointer dereference, the instruction for the call to frame 0 might involve a simple register dereference, in which case you'd want to understand how that register obtained the null value. In some cases including /m in a disass command can be helpful, although it can cause confusion because of the distinction between instruction boundaries and source line boundaries. Omitting /m is more likely to display a meaningful return address.

The => in the updated disassembly (without /m) makes sense. In any frame aside from frame 0, the pc value (what the => points at in the disassembly) indicates the instruction which will execute when the next lowest numbered frame returns (which, due to the crash, did not occur in this case). The pc value in frame 1 is not the value of the pc register at the time of the crash, but rather the saved pc value pushed on the stack by the call instruction. One way to see that is to compare output from x/a $sp in frame 0 to x/i $pc in frame 1.

One way to interpret this disassembly is that edx is some object, and [edx+0x14] points into its vtable. One way the vtable might wind up with a null pointer is a memory allocation issue with a stale reference to a chunk of memory which has been deallocated and subsequently overwritten by its rightful owner (the next piece of code to allocate that chunk). If any of that is applicable here, it can work either way (the code in frame 1 might be the culprit, or it might be the victim). There are other reasons memory might be overwritten with incorrect contents, but double allocation might be a good place to start.

It probably makes sense to examine the contents of the object referenced by edx in frame 1, to see if there are any other anomalies besides what could be an incorrect vtable. Both the print command and the x command (within gdb) can be useful for this. My best guess about which object is referenced by edx, based on disass/m output (at this writing, visible only in the edit history of the question), is _listener, but it would be best to confirm that by further study of the disassembly (the excerpt available here does not seem to include the instruction that determines the value of edx).

Eirik Fuller
  • Thanks Eirik. I have listed a disassembly of `frame #1` (with the `/m` flag). I guess that `0x00280da4 <+196>: call *0x14(%edx)` is the point where the code tries to jump to `Frame #0`. But does this mean that the function actually crashed while trying to exit `Frame #1`? Because, if I am reading this right, it has passed the `return` statement of `Frame #1`... – Jovan Perovic Sep 14 '16 at 15:56
  • Whoa! Good point! Running `disass` without `/m` actually shows different addresses! I am updating my question... – Jovan Perovic Sep 15 '16 at 11:23
  • How closely have you examined `this` in frame 1? I'd expect you can find the null function pointer at offset `0x14`, though you might not see it with just the `print` command (the `x` command is more likely to show it). Perhaps I can update my answer to offer advice on how to proceed. – Eirik Fuller Sep 15 '16 at 12:55
  • You are a savior! Printing offset `0x14` in fact showed `0x00000000`. I am going to drill down the clue from this one :) – Jovan Perovic Sep 15 '16 at 13:12
  • I might have been mistaken in assuming `edx` is `this` in frame 1. Perhaps you should check whether gdb shows the same value for `_listener` as for `edx`. – Eirik Fuller Sep 15 '16 at 13:13
  • No, you are quite correct :) Register `EDX` holds `0x1deb664` and printing `_listener` also reveals the same value... (updated the question) – Jovan Perovic Sep 15 '16 at 13:18
  • Cool, so perhaps you can examine _listener to see if its contents make sense. If there are other problems evident aside from the vtable, the challenge is to determine how _listener got clobbered. – Eirik Fuller Sep 15 '16 at 13:22
  • :( yes, that is true. I dumped the memory surrounding the address inside of `$edx` and got a whole lot of zeros :( – Jovan Perovic Sep 15 '16 at 13:48
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/123449/discussion-between-eirik-fuller-and-jovan-perovic). – Eirik Fuller Sep 15 '16 at 14:52

See also the question "gdb can't access memory address error" for a case (described in one of its comments) where a rogue unmap removed the memory backing the stacks of several other threads, and the resulting crash left a core dump that was quite difficult to use.

Jacek Tomaka