I am getting a weird, occasional segfault in code that calls virtual member functions. The segfault happens on average about once in 30k calls.
I am using virtual methods to implement a template method pattern.
The line where it occurs is the first line shown in the body of
void GenericDevice::updateValue()
{
...
double tmpValue=getValue();
Value=tmpValue;
...
}
with
class GenericDevice
{
public:
void updateValue();
void print(string& result);
...
protected:
virtual double getValue()const=0;
...
private:
std::atomic<double> Value;
...
};
A class derived from GenericDevice is provided later by loading a dynamic library at runtime:
class SpecializedDevice : public GenericDevice
{
...
virtual double getValue()const final;
...
};
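For reference, the setup described above can be sketched as a small self-contained example. This is only an illustration under assumptions: the namespace sketch, the body of print(), the inline definitions and the constant returned by getValue() are made up here, and the elided members ("...") are not reproduced. In the real program SpecializedDevice comes from a library loaded at runtime.

```cpp
#include <atomic>
#include <cassert>
#include <string>

namespace sketch {  // hypothetical namespace, not from the original code

class GenericDevice {
public:
    virtual ~GenericDevice() = default;

    // Template method: the algorithm is fixed here, one step is customizable.
    void updateValue() {
        double tmpValue = getValue();  // virtual call, dispatched via the vtable
        Value.store(tmpValue);
    }

    // Assumed behaviour for illustration: refresh the value, then format it.
    void print(std::string& result) {
        updateValue();
        result = std::to_string(Value.load());
    }

protected:
    virtual double getValue() const = 0;

private:
    std::atomic<double> Value{0.0};
};

class SpecializedDevice : public GenericDevice {
protected:
    // Placeholder reading; the real implementation is not shown in the question.
    double getValue() const final { return 42.5; }
};

}  // namespace sketch
```

Calling print() on a sketch::SpecializedDevice goes through updateValue() and from there through the virtual getValue(), which is the call path seen in the backtrace.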
I was able to obtain a coredump when the problem occurred and looked at the assembly code:
0x55cd3ef036f4 GenericDevice::updateValue()+92 mov -0x38(%rbp),%rax
0x55cd3ef036f8 GenericDevice::updateValue()+96 mov (%rax),%rax
0x55cd3ef036fb GenericDevice::updateValue()+99 add $0x40,%rax
0x55cd3ef036ff GenericDevice::updateValue()+103 mov (%rax),%rax
0x55cd3ef03702 GenericDevice::updateValue()+106 mov -0x38(%rbp),%rdx
0x55cd3ef03706 GenericDevice::updateValue()+110 mov %rdx,%rdi
0x55cd3ef03709 GenericDevice::updateValue()+113 callq *%rax
0x55cd3ef0370b <GenericDevice::updateValue()+115> movq %xmm0,%rax
0x55cd3ef03710 <GenericDevice::updateValue()+120> mov %rax,-0x28(%rbp)
0x55cd3ef03714 <GenericDevice::updateValue()+124> mov -0x38(%rbp),%rax
0x55cd3ef03718 <GenericDevice::updateValue()+128> lea 0x38(%rax),%rdx
0x55cd3ef0371c <GenericDevice::updateValue()+132> mov -0x28(%rbp),%rax
0x55cd3ef03720 <GenericDevice::updateValue()+136> mov %rax,-0x40(%rbp)
0x55cd3ef03724 <GenericDevice::updateValue()+140> movsd -0x40(%rbp),%xmm0
The segfault is expected to have occurred at 0x55cd3ef03709 GenericDevice::updateValue()+113.
where
#0 0x000055cd3ef0370a in MyNamespace::GenericDevice::updateValue (this=0x55cd40586698) at ../src/GenericDevice.cpp:22
#1 0x000055cd3ef038d2 in MyNamespace::GenericDevice::print (this=0x55cd40586698,result="REDACTED"...) at ../src/GenericDevice.cpp:50
...
The function GenericDevice::updateValue() was called as intended
<GenericDevice::print(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)+301> callq 0x55cd3ef03698 <GenericDevice::updateValue()>
The reason seems to be that rax is set to 0x0:
Register group: general
rax 0x0 0
rbx 0x5c01b8a2 1543616674
rcx 0x2 2
rdx 0x28 40
rsi 0x2 2
rdi 0x55cd40586630 94340036191792
rbp 0x7ffe39086e60 0x7ffe39086e60
rsp 0x7ffe39086e20 0x7ffe39086e20
r8 0x7fbb06e7e8a0 140441251473568
r9 0x3 3
r10 0x33 51
r11 0x206 518
r12 0x55cd3ef19438 94340012676152
r13 0x7ffe39089010 140729855283216
r14 0x0 0
r15 0x0 0
rip 0x55cd3ef0370a 0x55cd3ef0370a <GenericDevice::updateValue()+114>
eflags 0x10206 [ PF IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
By redoing the calculations from the assembly excerpt, I was able to confirm that the assembly code and the data it uses match the expected virtual function call and that it starts with correct data:
this pointer of the object is used
(gdb) x /g $rbp-0x38
0x7ffe39086e28: 0x000055cd40586698
(gdb) p this
$1 = (GenericDevice * const) 0x55cd40586698
pointer to vtable is correct (first element of *this)
(gdb) x 0x000055cd40586698
0x55cd40586698: 0x00007fbb070c1aa0
(gdb) info vtbl this
vtable for 'GenericDevice' @ 0x7fbb070c1aa0 (subobject @ 0x55cd40586698):
vtable contains address of method we are looking for.
(gdb) info vtbl this
vtable for 'GenericDevice' @ 0x7fbb070c1aa0 (subobject @ 0x55cd40586698):
...
[8]: 0x7fbb06e7bf50 non-virtual thunk to MyNamespace::SpecializedDevice::getValue() const.
correct offset for vtable is used
(gdb) x 0x00007fbb070c1aa0+0x40
0x7fbb070c1ae0 <_ZTVN12MyNamespace11SpecializedDeviceE+168>: 0x00007fbb06e7bf50
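The address arithmetic behind these checks can be written down as a short sketch. All constants are copied from the gdb session; the names (kVtableAddr, slotAddress and so on) are made up for illustration, and the pointer size of 8 bytes is the usual x86-64 assumption. If I read the listing correctly, rax holds the slot address after the add and the slot contents after the second mov, so the slot contents are what callq *%rax should see.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the gdb output; names are hypothetical.
constexpr std::uint64_t kVtableAddr   = 0x00007fbb070c1aa0ULL;  // first word of *this
constexpr std::uint64_t kSlotOffset   = 0x40;                   // add $0x40,%rax
constexpr std::uint64_t kSlotContents = 0x00007fbb06e7bf50ULL;  // x 0x7fbb070c1aa0+0x40
constexpr std::uint64_t kPointerSize  = 8;                      // assumed: x86-64

// Address of the vtable slot that the second "mov (%rax),%rax" reads.
constexpr std::uint64_t slotAddress() { return kVtableAddr + kSlotOffset; }

// Index of that slot as shown by "info vtbl".
constexpr std::uint64_t slotIndex() { return kSlotOffset / kPointerSize; }

// Value that should be in rax when "callq *%rax" executes.
constexpr std::uint64_t expectedCallTarget() { return kSlotContents; }

static_assert(slotAddress() == 0x00007fbb070c1ae0ULL, "slot address matches gdb");
static_assert(slotIndex() == 8, "offset 0x40 is entry [8] of the vtable");
```

Both static_asserts hold for the values above: offset 0x40 is entry [8], which info vtbl reports as the thunk to SpecializedDevice::getValue().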
Conclusion so far: stepping through the assembly code confirmed that the correct data and instructions are used.
- Correct data was used: memory corruption can be ruled out.
- The assembly instructions look correct: a coding or compiler error can be ruled out.
- The vtable looks OK: an error when loading the library at runtime can be excluded. Also, the function usually runs fine tens of thousands of times.
Please feel free to point out any errors in my reasoning.
Yet the value in register rax is still zero instead of the expected 0x7fbb06e7bf50 (the contents of the vtable slot at 0x7fbb070c1ae0).
- Could this indicate a hardware error in one (rarely used) CPU core? That would explain the rare and random occurrence, but I would expect problems with other programs and the OS as well.
Processor Model is Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Thanks in advance!
Update:
I have found the $RIP marker
0x55cd3ef0370a MyNamespace::GenericDevice::updateValue()+114 shlb 0x48(%rsi)
The assembly shown by gdb seems to change after scrolling, which is why I didn't see the marker on the first attempt. After starting gdb and typing layout asm I get:
>0x55cd3ef0370a <MyNamespace::GenericDevice::updateValue()+114> shlb 0x48(%rsi)
0x55cd3ef0370d <MyNamespace::GenericDevice::updateValue()+117> movd %mm0,%eax
0x55cd3ef03710 <MyNamespace::GenericDevice::updateValue()+120> mov %rax,-0x28(%rbp)
0x55cd3ef03714 <MyNamespace::GenericDevice::updateValue()+124> mov -0x38(%rbp),%rax
0x55cd3ef03718 <MyNamespace::GenericDevice::updateValue()+128> lea 0x38(%rax),%rdx
0x55cd3ef0371c <MyNamespace::GenericDevice::updateValue()+132> mov -0x28(%rbp),%rax
0x55cd3ef03720 <MyNamespace::GenericDevice::updateValue()+136> mov %rax,-0x40(%rbp)
0x55cd3ef03724 <MyNamespace::GenericDevice::updateValue()+140> movsd -0x40(%rbp),%xmm0
...
After scrolling the asm view in gdb, I get the code posted in the original question. The code in the original question matches the code in the executable file; the code posted above partially deviates from the executable.
The shlb instruction makes no sense to me. I couldn't even find it in the Intel® 64 and IA-32 Architectures Software Developer's Manual; the closest match was shl.
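One possible reading of the shlb line, offered as a sketch rather than a diagnosis: shlb is how AT&T syntax writes shl with a byte-sized operand, and the two deviating lines are what a disassembler produces when it starts decoding at +114, one byte into callq *%rax. The byte values below are an assumption: they are the standard encodings of callq *%rax (ff d0) and movq %xmm0,%rax (66 48 0f 7e c0), not bytes read from the core dump. The struct and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bytes for +113..+119: "callq *%rax" followed by "movq %xmm0,%rax".
// Standard encodings, NOT taken from the core dump.
constexpr std::uint8_t kBytes[] = {0xff, 0xd0, 0x66, 0x48, 0x0f, 0x7e, 0xc0};

// Fields of an x86 ModRM byte.
struct ModRM {
    unsigned mod;  // addressing mode (1 = register + 8-bit displacement)
    unsigned reg;  // opcode extension for group opcodes such as 0xd0
    unsigned rm;   // base register (6 = rsi when no REX.B prefix is present)
};

constexpr ModRM decodeModRM(std::uint8_t b) {
    return ModRM{unsigned(b >> 6), unsigned((b >> 3) & 7), unsigned(b & 7)};
}

// Decoding from offset 1 (+114) instead of offset 0 (+113):
//   kBytes[1] = 0xd0  -> opcode "shift r/m8 by 1", operation chosen by ModRM.reg
//   kBytes[2] = 0x66  -> ModRM: mod=1, reg=4 (shl), rm=6 (rsi)
//   kBytes[3] = 0x48  -> 8-bit displacement, giving shlb 0x48(%rsi)
//   kBytes[4..6] = 0f 7e c0 -> movd %mm0,%eax at +117
constexpr ModRM misalignedModRM() { return decodeModRM(kBytes[2]); }
```

Under that assumption the listing at +114 and +117 is the same machine code as in the original question, read from a shifted start address. With rsi = 0x2 from the register dump, shlb 0x48(%rsi) would access address 0x4a. This does not explain why rip would be at +114 rather than at the return address +115.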