3

Got some wierd occasional segfault in some code calling virtual member functions. Segfault happens approximatly on averrage once in 30k calls.

I am using virtual methods to implement a template method pattern.

Code line it occurs is the first line of

GenericDevice::updateValue()
{
     ...
     double tmpValue=getValue();
     Value=tmpValue;
     ...
}

with

class GenericDevice
{
    public: 
    void updateValue();
    void print(string& result);
    ...
    protected:
    virtual double getValue()const=0;
    ...
    private:
    std::atomic<double> Value;
    ...
}

A class GenericDevice is provided later by loading a dynamic library at runtime

class SpecializedDeviced : public
{
    ...
    virtual double getValue()const final;
    ... 
}

I was able to obtain a coredump when the problem occurred and looked at the assembly code:

0x55cd3ef036f4 GenericDevice::updateValue()+92   mov    -0x38(%rbp),%rax   
0x55cd3ef036f8 GenericDevice::updateValue()+96   mov    (%rax),%rax 
0x55cd3ef036fb GenericDevice::updateValue()+99   add    $0x40,%rax  
0x55cd3ef036ff GenericDevice::updateValue()+103  mov    (%rax),%rax 
0x55cd3ef03702 GenericDevice::updateValue()+106  mov   -0x38(%rbp),%rdx
0x55cd3ef03706 GenericDevice::updateValue()+110  mov   %rdx,%rdi         
0x55cd3ef03709 GenericDevice::updateValue()+113  callq  *%rax
0x55cd3ef0370b <GenericDevice::updateValue()+115>  movq   %xmm0,%rax          
0x55cd3ef03710 <GenericDevice::updateValue()+120>  mov    %rax,-0x28(%rbp) 
0x55cd3ef03714 <GenericDevice::updateValue()+124>  mov    -0x38(%rbp),%rax  
0x55cd3ef03718 <GenericDevice::updateValue()+128>  lea    0x38(%rax),%rdx     
0x55cd3ef0371c <GenericDevice::updateValue()+132>  mov    -0x28(%rbp),%rax    
0x55cd3ef03720 <GenericDevice::updateValue()+136>  mov    %rax,-0x40(%rbp)    
0x55cd3ef03724 <GenericDevice::updateValue()+140>  movsd  -0x40(%rbp),%xmm0  

The segfault is exspected to have occured in 0x55cd3ef03709 GenericDevice::updateValue()+113.

where
#0  0x000055cd3ef0370a in MyNamespace::GenericDevice::updateValue (this=0x55cd40586698) at ../src/GenericDevice.cpp:22
#1  0x000055cd3ef038d2 in MyNamespace::GenericDevice::print (this=0x55cd40586698,result="REDACTED"...) at ../src/GenericDevice.cpp:50
...

The function GenericDevice::updateValue() was called as intended

<GenericDevice::print(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+301>  callq  0x55cd3ef03698 <GenericDevice::updateValue()>

Reason being rax being set to 0x0.

Register group: general
rax            0x0              0  
rbx            0x5c01b8a2       1543616674  
rcx            0x2              2  
rdx            0x28             40  
rsi            0x2              2  
rdi            0x55cd40586630   94340036191792  
rbp            0x7ffe39086e60   0x7ffe39086e60  
rsp            0x7ffe39086e20   0x7ffe39086e20  
r8             0x7fbb06e7e8a0   140441251473568  
r9             0x3              3  
r10            0x33             51  
r11            0x206            518                       
r12            0x55cd3ef19438   94340012676152  
r13            0x7ffe39089010   140729855283216   
r14            0x0              0   
r15            0x0              0  
rip            0x55cd3ef0370a  0x55cd3ef0370a<GenericDevice::updateValue()+114>                     eflags         0x10206  [ PF IF RF ]               
cs             0x33     51
ss             0x2b     43
ds             0x0      0  
es             0x0      0  
fs             0x0      0   
gs             0x0      0 

By performing the calculations from the assembly excerpt I as able to confirm that the assembly code and the data it uses matches the expected virtual function call and starts with correct data:

  1. this pointer of the object is used

    (gdb) x /g $rbp-0x38  
    0x7ffe39086e28: 0x000055cd40586698   
    (gdb) p this  
    $1 = (GenericDevice * const) 0x55cd40586698
    
  2. pointer to vtable is correct (first element of *this)

    (gdb) x 0x000055cd40586698  
    0x55cd40586698: 0x00007fbb070c1aa0
    (gdb) info vtbl this  
    vtable for 'GenericDevice' @ 0x7fbb070c1aa0 (subobject @ 0x55cd40586698):
    
  3. vtable contains address of method we are looking for.

    (gdb) info vtbl this  
    vtable for 'GenericDevice' @ 0x7fbb070c1aa0 (subobject @ 0x55cd40586698):  
    ...  
    [8]: 0x7fbb06e7bf50 non-virtual thunk to MyNamespace::SpecializedDevice::getValue() const.
    
  4. correct offset for vtable is used

    (gdb) x 0x00007fbb070c1aa0+0x40  
    0x7fbb070c1ae0 <_ZTVN12MyNamespace11SpecializedDeviceE+168>: 0x00007fbb06e7bf50
    

Conclusion so far: By stepping through the assembler code use of correct data and instruction was validated.

  • Correct data was used: Memory corruption can be ruled out.
  • Assemble instructions seem correct: Coding/Compile error can be ruled out
  • vtable looks ok: error when loading library at runtime can be excluded: Also function usuallly runs fine for tens thousand of times.

Please feel free to point out any errors in my reasoning.

Yet still the value in register rax is zero instead of the exspected 0x7fbb070c1ae0

  • Could this indicate a Hardware error in one (rarely used) cpu core? Would explain rare and random occurence but I would expect problems with other programms and OS as well.

Processor Model is Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

Thanks in advance!

Update: I have found the $RIP marker
0x55cd3ef0370a MyNamespace::GenericDevice::updateValue()+114 shlb 0x48(%rsi)

The assembly shown by gdb seems to change after scrolling. Which is why i didn't see the marker in the first attempt. After starting gdb and typing layout asm I get:

>0x55cd3ef0370a <MyNamespace::GenericDevicer::updateValue()+114>  shlb   0x48(%rsi)           
0x55cd3ef0370d <MyNamespace::GenericDevicer::updateValue()+117>  movd   %mm0,%eax            
0x55cd3ef03710 <MyNamespace::GenericDevicer::updateValue()+120>  mov    %rax,-0x28(%rbp)     
0x55cd3ef03714 <MyNamespace::GenericDevicer::updateValue()+124>  mov    -0x38(%rbp),%rax     
0x55cd3ef03718 <MyNamespace::GenericDevicer::updateValue()+128>  lea    0x38(%rax),%rdx   
0x55cd3ef0371c <MyNamespace::GenericDevicer::updateValue()+132>  mov    -0x28(%rbp),%rax
0x55cd3ef03720 <MyNamespace::GenericDevicer::updateValue()+136>  mov    %rax,-0x40(%rbp)
0x55cd3ef03724 <MyNamespace::GenericDevicer::updateValue()+140>  movsd  -0x40(%rbp),%xmm0  

...

After scrolling the ams in gdb I get the code posted in the original question. The code in the original question matches the code from the executable file. The code posted above does partially deviate from the executable.

The shlb instruction makes no sense to me. Couldn't even find the instruction in the Intel® 64 and IA-32 Architectures Software Developer’s Manual. Closest match was shl.

TLepold
  • 57
  • 4
  • 3
    If we can believe your register dump `rdx` and `rdi` are also wrong, what's more, they are not even equal even though a `mov %rdx,%rdi` has just been performed. Either your register dump is not from that place or something very strange is happening. – Jester Dec 03 '18 at 15:23
  • Your function as a non void return is'nt it? What is the return type, the assembly looks strange. – Oliv Dec 03 '18 at 15:41
  • 1
    `rip 0x55cd3ef0370a` is very wrong, it's in the middle of the instruction. That's the actual cause of your fault, execution came from somewhere else and hit the middle of that instruction. – Jester Dec 03 '18 at 16:22
  • Could you please clarify 'execution came from somewhere else' ? – TLepold Dec 03 '18 at 16:31
  • The cpu was executing code somewhere else when it came across a control transfer instruction with a broken target address that eventually ended up in the middle of the instruction around the fault site. Might not have to be the exact address, could be a few bytes earlier if the bytes decode to non-faulting instructions. This is going to be hard to debug. Even though some CPUs allow recording of the last branches taken I don't think that is enabled or saved into a core dump. – Jester Dec 03 '18 at 16:35
  • @Oliv which function are your refering to? Also the signature of both functions is provided in the question. – TLepold Dec 04 '18 at 12:17
  • GenericDevice::updateValue() – Oliv Dec 04 '18 at 12:18
  • @Oliv function signature is 'void GenericDevice::updateValue()'. Corrected function is now in source code excerpt – TLepold Dec 04 '18 at 13:40
  • @Jester: the function member void updateValue() seems to have been called as intended with the correct address. I added the info in the original post. I also have found the rip marker with gdb (also see original question) which does not seem to be correct. It is possible that some other programm part has corrupted the instruction sequence of void updateValue() ? If so can this explain the strange behaviour of gdb i described? – TLepold Dec 04 '18 at 14:05
  • Upon closer examination this looks like a memory hardware problem since it's a bit error. The return address got changed from `0x0b` to `0x0a` that is the least significant bit got zeroed. Could also be stack corruption but that's unlikely to only affect a single bit. – Jester Dec 04 '18 at 14:28
  • `0x55cd3ef03709 SpecializedDevice::updateValue()+113 callq *%rax` is the call to `double GenericDevice::getValue() const` – TLepold Dec 04 '18 at 14:45
  • 1
    Yes. That probably executed correctly but when that function tried to return it used a corrupted address hence continued in the middle of the `call` instead of at the following instruction. – Jester Dec 04 '18 at 14:47
  • @TLepold Realy this assembly does not make sense, after the call it uses `rsi` but `rsi` is not callee preserved. It stores the vtable pointer in `rdi` while I would haved expect the `this` pointer here and it stores also the vtable pointer in `rdx` while this is the third argument regisiter or the second return register. I think there is some magic here! See the [ps-abi](https://www.uclibc.org/docs/psABI-x86_64.pdf) page 21. – Oliv Dec 04 '18 at 14:57
  • @Oliv the `rsi` is just because the broken return address ends up in the middle of an instruction so it decodes to that `shl`. The vtable in `rdx` is just a harmless missed optimization since the called function takes no arguments (apart from `this`). – Jester Dec 04 '18 at 14:59
  • @Jester Ok! So it looks the callee called ret which explains why rax does not hold the callee address any more, no? Why is the vtable pointer passed as argument and not the this pointer?? – Oliv Dec 04 '18 at 15:04
  • The `this` pointer is passed since it's reloaded on line `+106 mov -0x38(%rbp),%rdx` – Jester Dec 04 '18 at 15:17
  • @Jester Ooohh!! Thank you :). – Oliv Dec 04 '18 at 15:28
  • @Jester. I can confirm that its is highly plausible thant getValue() was executed, as I was able to find some residue data in the xmm0 and xmm1 registers which implies the function was executed. I will now test the hardware problem hypothesis. I already ran the programm on a VM for 2.5 days without problem while the real thing failed occasionaly, only difference being that Specialized device is a mockup in this case. I will now try to selectively deactivate cpu cores as I supect that the fault does not come from main memory. Virtual addresses / ASLR should make a main memory as cause unlikely? – TLepold Dec 07 '18 at 09:51

2 Answers2

1

As @Jester noted, your other register values do not match the code you say the crash happened in.

I was able to obtain a coredump when the problem occurred and looked at the assembly code: ... The segfault occured in the last line of the assembly excerpt.

How do you know that? What is the output from where?

Normally, there should be a current $RIP marker, like this:

   0x55cd3ef036f4 GenericDevice::updateValue()+92   mov    -0x38(%rbp),%rax   
   0x55cd3ef036f8 GenericDevice::updateValue()+96   mov    (%rax),%rax 
   0x55cd3ef036fb GenericDevice::updateValue()+99   add    $0x40,%rax  
   0x55cd3ef036ff GenericDevice::updateValue()+103  mov    (%rax),%rax 
   0x55cd3ef03702 GenericDevice::updateValue()+106  mov   -0x38(%rbp),%rdx
   0x55cd3ef03706 GenericDevice::updateValue()+110  mov   %rdx,%rdi         
   0x55cd3ef03709 GenericDevice::updateValue()+113  callq  *%rax
=> 0x55cd3ef0370e GenericDevice::updateValue()+118  ....

Do you see that marker?

If not, your crash is likely somewhere else (but good job analyzing your data).

If you do see the marker, other details, such as the exact processor make and model may matter (see e.g. this question and answer).

Employed Russian
  • 199,314
  • 34
  • 295
  • 362
  • 2
    Not chopping off `rip` from the register dump would have also helped. – Jester Dec 03 '18 at 15:34
  • Use the [edit link](https://stackoverflow.com/posts/53596525/edit) to add that info to the original question. – Jester Dec 03 '18 at 16:02
  • 1
    @TLepold Please edit your question and supply *complete* output from `where` and `info registers`. It is likely that you are misinterpreting the actual location of your crash. – Employed Russian Dec 03 '18 at 16:10
  • @TLepold The `$RIP` being in the middle of instruction is suspiciously like https://stackoverflow.com/questions/4703844/unexplainable-core-dump/15854827#15854827 – Employed Russian Dec 03 '18 at 16:27
1

The call statement pushes the return address on the stack before executing the called function. Source Intel® 64 and IA-32 Architectures Software Developer’s Manual page 225. Another thread was holding an invalid reference to a variable on the same stack and decremented it, which was the stored return address. Basically the thread was supposed to hold a reference to a counter counting how many jobs of GenericDevice::updateValue() are still pending. Upon timeout the counter would go out of scope but the executing thread was still hold the now invalid reference. the timeout would occur rarely and only with read devices instead of mockups. Thus the return address stored on the stack was ocasionally corrupted.

TLepold
  • 57
  • 4