
So I recently purchased a server-grade system along with all server-grade peripherals. I'm licensed for ESXi 6 and have all recent patches installed. The system has been running for around 2 weeks, and all of a sudden I had a complete crash.

I've interpreted this error code as "Internal Timer Error". I've forwarded the info to SuperMicro, but to be honest I'm not very confident in their responses so far. My expectation was that the system simply should not crash, since it's a Xeon with ECC memory running ESXi.

Is it possible that this was a one-off error that shouldn't happen again? How would you handle this? I'm looking for advice from those who have seen these types of errors, and what they actually ended up doing.

[Image: crash screenshot]

davewolfs

2 Answers

3

You see this error (MCE, machine check exception) precisely because it has ECC RAM.

You have some broken hardware somewhere, most likely a memory stick but possibly one or more processors (CPU 10 perhaps?) or something in between. Invoke your support contract.

It can be other bits of the hardware also, but every time I have seen this it has been faulty ECC RAM experiencing multiple-bit faults. If the MCE decoded as "internal timer error", the next most likely thing is a faulty CPU or mainboard.
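To check a decoding like "internal timer error" yourself, here is a minimal sketch of how the low 16 bits of a machine-check status register (IA32_MCi_STATUS) map to an error class. It assumes the architectural encoding described in the Intel SDM (Vol. 3B, machine-check architecture chapter) and ignores the model-specific bits 31:16. The names `decode_mci_status` and `classify_mca_code` are hypothetical helpers, and the status value in the usage note is synthetic, not taken from the crash above.

```python
# Sketch only: decodes the architectural fields of IA32_MCi_STATUS.
# Bit positions and code patterns follow the Intel SDM as I understand it;
# verify against the SDM for your CPU before relying on the result.

FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was uncorrected
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC holds extra information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context corrupt
}

SIMPLE_CODES = {
    0x0000: "No error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM parity error",
    0x0003: "External error",
    0x0004: "FRC error",
    0x0005: "Internal parity error",
    0x0006: "SMM handler code access violation",
    0x0400: "Internal timer error",
    0x0E0B: "I/O error",
}


def classify_mca_code(code):
    """Map the 16-bit MCA error code to a coarse error class."""
    if code in SIMPLE_CODES:
        return SIMPLE_CODES[code]
    if code & 0xFC00 == 0x0400:
        return "Internal unclassified error"
    c = code & 0xEFFF  # drop bit 12 (filter bit used by compound codes)
    if c & 0xF800 == 0x0800:
        return "Bus/interconnect error"
    if c & 0xFF00 == 0x0100:
        return "Cache hierarchy error"
    if c & 0xFF80 == 0x0080:
        return "Memory controller error"
    if c & 0xFFF0 == 0x0010:
        return "TLB error"
    if c & 0xFFFC == 0x000C:
        return "Generic cache hierarchy error"
    return "Unrecognized code"


def decode_mci_status(status):
    """Split a 64-bit status value into set flags and the MCA error class."""
    code = status & 0xFFFF
    return {
        "flags": [name for bit, name in FLAGS.items() if status & (1 << bit)],
        "mca_code": code,
        "description": classify_mca_code(code),
    }
```

For example, a synthetic status with VAL, UC, EN and PCC set and an MCA code of 0x0400 decodes as an internal timer error, while a code such as 0x009F falls in the memory controller class. That distinction is one way to tell a CPU-internal fault from a DIMM fault, though it does not replace the vendor's diagnostics.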

Falcon Momot
  • Is there any way to tell the difference between the two? I'm pretty confident that I have decoded it correctly. – davewolfs Oct 02 '15 at 22:32
  • I believe the codes are vendor-specific, and I don't actually see the MCE code in there. But, surely your vendor (awful though supermicro may be) has some kind of diagnostic tool you can run... either way, you should get them to fix the hardware or go fix the hardware. Just like any other time, go isolate the broken component. – Falcon Momot Oct 02 '15 at 22:43
  • Can a utility like memtest86+ be useful in this case or unlikely to help? – davewolfs Oct 02 '15 at 22:48
  • It can be useful. – Falcon Momot Oct 02 '15 at 23:25
  • Any opinions on Intel's Product Specification Updates? I'm seeing some stuff in there related to Internal Timer Errors. I suppose the CPUs themselves can have bugs (or their BIOSes). – davewolfs Oct 02 '15 at 23:50
  • They can, but if there is a bug in there chances are there is a microcode update too. – Falcon Momot Oct 02 '15 at 23:51
  • Kinda odd that the stack shows "Power_Halt", and there is a reported bug with a potential BIOS fix listed here under BDE54 that also shows the same MCE code. http://www3.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-d-1500-specification-update.pdf – davewolfs Oct 03 '15 at 02:11
3

Yes, it's a cause for concern. The server crashed!

Check your RAM and your CPU socket pins (if you hand-assembled the server).

That's about all the info you'll get. You can open a support case with VMware and they'll analyze the crash dump for you.

ewwhite