MCE Error Codes/Pink Screen - Should they be a cause for concern?

Question

I recently purchased a server-grade system along with server-grade peripherals. I'm licensed for ESXi 6 and have all recent patches installed. The system had been running for about two weeks when, all of a sudden, it crashed completely.

I've interpreted this error code as "Internal Timer Error". I've forwarded the info to SuperMicro, but to be honest I'm not very confident in their responses so far. My expectation was that the system simply should not crash, since it's a Xeon with ECC memory running ESXi.

Is it possible that this was a one-off error that won't happen again? How would you handle this? I'm looking for advice from those who have seen these types of errors, and what they actually ended up doing.

score 3 · Accepted Answer · answered Oct 02 '15 at 22:19

You see this error (MCE, machine check exception) precisely because it has ECC RAM.

You have some broken hardware somewhere, most likely a memory stick but possibly one or more processors (CPU 10 perhaps?) or something in between. Invoke your support contract.

It can be other bits of the hardware also, but every time I have seen this it has been faulty ECC RAM experiencing multiple-bit faults. If the MCE decoded as "internal timer error", the next most likely thing is a faulty CPU or mainboard.
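To make the "decoded as" part concrete, here is a minimal sketch of how an MCE status value breaks down. It assumes the layout Intel documents for the architectural IA32_MCi_STATUS register (flag bits at the top, model-specific code in bits 31:16, MCA error code in bits 15:0) and only recognizes a few error-code families. The function names and the example value are made up for illustration; they are not taken from the crash screen in the question.

```python
# Architectural flag bits in IA32_MCi_STATUS (Intel SDM, machine-check architecture).
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # IA32_MCi_MISC holds extra information
    "ADDRV": 58,  # IA32_MCi_ADDR holds an address
    "PCC": 57,    # processor context may be corrupt
}


def classify_mca_code(code: int) -> str:
    """Map the 16-bit MCA error code to a coarse family (not exhaustive)."""
    if code == 0x0000:
        return "no error"
    if code == 0x0400:
        return "internal timer error"
    if (code & 0xFC00) == 0x0400:
        return "internal unclassified error"
    if (code & 0xFF00) == 0x0100:
        return "cache hierarchy error"
    if (code & 0xFF80) == 0x0080:
        return "memory controller error"
    if (code & 0xF800) == 0x0800:
        return "bus/interconnect error"
    return "other (check the vendor documentation)"


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit status value into flags, model-specific code and MCA family."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error": classify_mca_code(status & 0xFFFF),
    }


if __name__ == "__main__":
    # Hypothetical status value, chosen only to exercise the decoder.
    example = 0xB200000000000400
    print(decode_mci_status(example))
```

If the low 16 bits land in the memory controller or cache families, the DIMMs are the first suspect; an internal timer error points more toward the CPU, board, or firmware. UC and PCC set together usually means the error could not be contained, which fits a full crash.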

Falcon Momot


Is there any way to tell the difference between the two? I'm pretty confident that I have decoded it correctly. – davewolfs Oct 02 '15 at 22:32
I believe the codes are vendor-specific, and I don't actually see the MCE code in there. But surely your vendor (awful though Supermicro may be) has some kind of diagnostic tool you can run... either way, you should get them to fix the hardware or go fix the hardware. Just like any other time, go isolate the broken component. – Falcon Momot Oct 02 '15 at 22:43
Can a utility like memtest86+ be useful in this case, or is it unlikely to help? – davewolfs Oct 02 '15 at 22:48
It can be useful. – Falcon Momot Oct 02 '15 at 23:25
Any opinions on Intel's Product Specification Updates? I'm seeing some stuff in there related to Internal Timer Errors. I suppose the CPUs themselves (or their BIOSes) can have bugs. – davewolfs Oct 02 '15 at 23:50
They can, but if there is a bug in there, chances are there is a microcode update too. – Falcon Momot Oct 02 '15 at 23:51
Kinda odd that the stack shows "Power_Halt", and there is a reported bug with a potential BIOS fix listed here under BDE54 that also shows the same MCE code. http://www3.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-d-1500-specification-update.pdf – davewolfs Oct 03 '15 at 02:11

score 3 · Answer 2 · answered Oct 02 '15 at 22:32

Yes, it's a cause for concern. The server crashed!

Check your RAM and your CPU socket pins (if you hand-assembled the server).

That's about all the info you'll get. You can open a support case with VMware and they'll analyze the crash dump for you.

ewwhite
