"Unexplainable" core dump

Question

I've seen many core dumps in my life, but this one has me stumped.

Context:

multi-threaded Linux/x86_64 program running on a cluster of AMD Barcelona CPUs
the code that crashes is executed a lot
running 1000 instances of the program (the exact same optimized binary) under load produces 1-2 crashes per hour
the crashes happen on different machines (but the machines themselves are pretty identical)
the crashes all look the same (same exact address, same call stack)

Here are the details of the crash:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000017bd9fd in Foo()
(gdb) x/i $pc
=> 0x17bd9fd <_Z3Foov+349>: rex.RB orb $0x8d,(%r15)

(gdb) x/6i $pc-12
0x17bd9f1 <_Z3Foov+337>:    mov    (%rbx),%eax
0x17bd9f3 <_Z3Foov+339>:    mov    %rbx,%rdi
0x17bd9f6 <_Z3Foov+342>:    callq  *0x70(%rax)
0x17bd9f9 <_Z3Foov+345>:    cmp    %eax,%r12d
0x17bd9fc <_Z3Foov+348>:    mov    %eax,-0x80(%rbp)
0x17bd9ff <_Z3Foov+351>:    jge    0x17bd97e <_Z3Foov+222>

You'll notice that the crash happened in the middle of the instruction at 0x17bd9fc, right after returning from the call at 0x17bd9f6 to a virtual function.
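The odd-looking instruction gdb prints at the crash address is what you would expect if decoding started one byte into the mov. Here is a minimal sketch, assuming the standard x86-64 encodings for the instructions at 0x17bd9fc and 0x17bd9ff. The raw bytes are reconstructed from the disassembly, not read from the core, and `decode_group1_imm8` is a hypothetical helper that handles only this one instruction form:

```python
# Bytes reconstructed from the standard x86-64 encodings (an assumption;
# the core dump's raw bytes are not shown in the post):
#   0x17bd9fc: 89 45 80             mov  %eax,-0x80(%rbp)
#   0x17bd9ff: 0f 8d 79 ff ff ff    jge  0x17bd97e
CODE_BASE = 0x17BD9FC
CODE = bytes([0x89, 0x45, 0x80, 0x0F, 0x8D, 0x79, 0xFF, 0xFF, 0xFF])
CRASH_PC = 0x17BD9FD

GROUP1_OPS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
          "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_group1_imm8(code, offset):
    """Decode `[REX] 80 /r ib` with a plain register-indirect operand."""
    pos = offset
    rex = 0
    if (code[pos] & 0xF0) == 0x40:      # 0x40..0x4f are REX prefixes
        rex = code[pos]
        pos += 1
    if code[pos] != 0x80:               # opcode 80: group 1, r/m8, imm8
        raise ValueError("not an 0x80 group-1 instruction")
    modrm = code[pos + 1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):        # SIB and RIP-relative not handled
        raise ValueError("only the plain (%reg) form is handled")
    base = REGS64[rm | ((rex & 1) << 3)]  # REX.B extends the r/m field
    imm = code[pos + 2]
    return f"{GROUP1_OPS[reg]}b $0x{imm:x},(%{base})"


# Sanity check on the reconstructed jge bytes: the rel32 displacement is
# relative to the end of the 6-byte jge, which starts 3 bytes after CODE_BASE.
rel32 = int.from_bytes(CODE[5:9], "little", signed=True)
jge_target = (CODE_BASE + 3 + 6) + rel32

print(hex(jge_target))                                   # 0x17bd97e
print(decode_group1_imm8(CODE, CRASH_PC - CODE_BASE))    # orb $0x8d,(%r15)
```

Under these assumptions the 0x45 ModRM byte of the mov is reinterpreted as a REX.RB prefix, and the mov's displacement byte 0x80 becomes the opcode, which agrees with the `rex.RB orb $0x8d,(%r15)` that gdb shows at $pc.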

When I examine the virtual table, I see that it is not corrupted in any way:

(gdb) x/a $rbx
0x2ab094951f80: 0x3f8c550 <_ZTI4Foo1+16>
(gdb) x/a 0x3f8c550+0x70
0x3f8c5c0 <_ZTI4Foo1+128>:  0x2d3d7b0 <_ZN4Foo13GetEv>

and that it points to this trivial function (as expected by looking at the source):

(gdb) disas 0x2d3d7b0
Dump of assembler code for function _ZN4Foo13GetEv:
   0x0000000002d3d7b0 <+0>: push   %rbp
   0x0000000002d3d7b1 <+1>: mov    0x70(%rdi),%eax
   0x0000000002d3d7b4 <+4>: mov    %rsp,%rbp
   0x0000000002d3d7b7 <+7>: leaveq 
   0x0000000002d3d7b8 <+8>: retq   
End of assembler dump.

Further, when I look at the return address that Foo1::Get() should have returned to:

(gdb) x/a $rsp-8
0x2afa55602048: 0x17bd9f9 <_Z3Foov+345>

I see that it points to the right instruction, so it's as if during the return from Foo1::Get(), some gremlin came along and incremented %rip by 4.
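The arithmetic behind "incremented by 4" can be checked directly from the addresses in the gdb output above. A small sketch (the helper name `locate` is made up for illustration):

```python
import bisect

# Instruction start addresses copied from the `x/6i $pc-12` listing.
INSTRUCTION_STARTS = [0x17BD9F1, 0x17BD9F3, 0x17BD9F6,
                      0x17BD9F9, 0x17BD9FC, 0x17BD9FF]
RETURN_ADDRESS = 0x17BD9F9   # value found at $rsp-8
CRASH_PC = 0x17BD9FD         # $pc at the time of the SIGSEGV


def locate(pc, starts):
    """Return (start, offset) of the listed instruction that contains pc."""
    index = bisect.bisect_right(starts, pc) - 1
    if index < 0:
        raise ValueError("pc is before the first listed instruction")
    return starts[index], pc - starts[index]


skew = CRASH_PC - RETURN_ADDRESS
start, offset = locate(CRASH_PC, INSTRUCTION_STARTS)
print(f"pc is {skew} bytes past the return address")
print(f"pc is {offset} byte(s) into the instruction at {start:#x}")
```

So the faulting address is 4 bytes past where the return should have landed, and 1 byte into the mov at 0x17bd9fc.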

Plausible explanations?

Did you ever find out what caused this? If so, I'd be very interested to hear what it was! — us2012, Mar 24 '13 at 16:11

score 76 · Accepted Answer · edited Jan 03 '23 at 20:12

So, unlikely as it may seem, we appear to have hit an actual bona fide CPU bug.

AMD's revision guide for Family 10h processors (https://web.archive.org/web/20130228081435/http://support.amd.com/us/Processor_TechDocs/41322_10h_Rev_Gd.pdf) lists erratum #721:

721 Processor May Incorrectly Update Stack Pointer

Description

Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the stack pointer after a long series of push and/or near-call instructions, or a long series of pop and/or near-return instructions. The processor must be in 64-bit mode for this erratum to occur.

Potential Effect on System

The stack pointer value jumps by a value of approximately 1024, either in the positive or negative direction. This incorrect stack pointer causes unpredictable program or system behavior, usually observed as a program exception or crash (for example, a #GP or #UD).

Suggested Workaround

System software may set MSRC001_1029[0] = 1b.
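To see whether the workaround bit is already set on a given machine, one option on Linux is to read the MSR through the `msr` driver. This is a rough sketch, assuming that `MSRC001_1029` in AMD's notation means MSR address 0xC0011029, that the `msr` kernel module is loaded, and that the script runs as root. It only reads the register; setting the bit is normally a job for the BIOS or kernel. The function names are illustrative.

```python
import os
import struct

MSR_ADDRESS = 0xC0011029   # "MSRC001_1029" from the erratum text (assumed mapping)
WORKAROUND_BIT = 0         # the erratum asks for bit 0 to be set


def workaround_enabled(msr_value, bit=WORKAROUND_BIT):
    """Return True if the given bit is set in a 64-bit MSR value."""
    return (msr_value >> bit) & 1 == 1


def read_msr(cpu, address=MSR_ADDRESS):
    """Read one 64-bit MSR via /dev/cpu/<cpu>/msr (root and `modprobe msr` needed)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the MSR number.
        raw = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]
```

Calling `workaround_enabled(read_msr(0))` would then report the state for CPU 0. MSRs are per-core, so a real check would loop over every CPU.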

edited Jan 03 '23 at 20:12 by user3840170

answered Apr 06 '13 at 18:59 by Employed Russian

Ouch. Is it actually a "highly specific" condition - i.e., did you manage to fix it by slightly changing the code produced at the problematic point? – us2012 Apr 06 '13 at 20:37

@us2012 Our code and compilers are constantly changing, and the problem disappeared as suddenly as it appeared ... only to happen again 2 years later in a completely unrelated executable. – Employed Russian Apr 06 '13 at 21:46

Did you try the suggested workaround? – Robin Davies Jan 04 '23 at 04:16

score 8 · Answer 2 · answered Jan 16 '11 at 04:56

I once saw an "illegal opcode" crash right in the middle of an instruction while working on a Linux port. Long story short, Linux subtracts from the instruction pointer in order to restart a syscall, and in my case this was happening twice (if two signals arrived at the same time).

So that's one possible culprit: the kernel fiddling with your instruction pointer. There may be some other cause in your case.
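As a toy model of that double-restart bug (not kernel code): the addresses are made up, and the 2-byte length is that of the x86-64 `syscall` instruction, used here purely for illustration since the port in question may have been for a different architecture.

```python
SYSCALL_INSN_LEN = 2   # x86-64 `syscall` is 0f 05; illustrative assumption


def restart_syscall(ip):
    """Rewind the saved instruction pointer so the syscall runs again."""
    return ip - SYSCALL_INSN_LEN


syscall_address = 0x401000                       # hypothetical address
ip_after_syscall = syscall_address + SYSCALL_INSN_LEN

restarted_once = restart_syscall(ip_after_syscall)
restarted_twice = restart_syscall(restarted_once)

print(hex(restarted_once))    # back at the syscall instruction: correct
print(hex(restarted_twice))   # 2 bytes too early: inside the previous instruction(s)
```

A single rewind lands exactly on the syscall instruction. A second rewind lands in whatever bytes precede it, which is how execution can resume in the middle of an instruction.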

Bear in mind that sometimes the processor will understand the data it's processing as an instruction, even when it's not supposed to be. So the processor may have executed the "instruction" at 0x17bd9fa and then moved on to 0x17bd9fd and then generated an illegal opcode exception. (I just made that number up, but experimenting with a disassembler can show you where the processor might have "entered" the instruction stream.)

Happy debugging!

I have considered signals, but there are several "strikes" against them being the cause: 1. there are no system calls anywhere around this code; 2. this thread should not be receiving any async signals; 3. if a signal were causing this, how do you explain the crash happening at the *exact* same address in all crashed programs? — Employed Russian, Jan 16 '11 at 05:07
I didn't suggest your problem may be signals. (That was just the bug in the port that was behind my problem.) My point was that factors completely external to your program - like a kernel bug - may be causing this problem. Another thing that can mess with your instruction pointer is exception handling. — Artelius, Jan 22 '11 at 00:19