0

I recently read this paper that @codinghorror tweeted about, and I wonder: how can I tell that my server failed because of a memory error? In particular, how do I know whether it was a correctable or an uncorrectable error, and on which DIMM it occurred?

Andriy Volkov

4 Answers

2

SNMP traps are your best bet for proactive notification of a memory/DIMM error. Products like HP Systems Insight Manager, HP OpenView, and Dell OpenManage offer configurable rules that forward SNMP messages to email, SMS, or pagers, so you know as soon as a memory error or degradation occurs.

mctsonic
1

If your server is any good, it has a BIOS and BMC combination that tracks these errors and logs them via IPMI so you can see them. Normally the server halts on an uncorrectable ECC error: the BIOS takes over in an SMI (System Management Interrupt) and logs the error in the BMC. It then returns control to the OS, which usually has nothing better to do than reboot (sometimes it is possible to kill the affected process and carry on). An entry in the IPMI SEL (System Event Log) is the sign of an ECC error.
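
As a rough illustration of reading the SEL, here is a minimal Python sketch that sorts memory entries into correctable and uncorrectable. It assumes the pipe-separated layout that `ipmitool sel list` commonly prints (id, date, time, sensor, event, direction); the exact wording and the mapping from sensor name to physical DIMM vary by vendor, and the sample lines are invented for illustration, not taken from a real log.

```python
import subprocess


def classify_sel_line(line):
    """Return (kind, sensor) for an ECC entry, or None for anything else."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) < 5:
        return None
    sensor, event = fields[3], fields[4].lower()
    if "ecc" not in event:
        return None
    # Test "uncorrectable" first: it contains "correctable" as a substring.
    if "uncorrectable" in event:
        return ("uncorrectable", sensor)
    if "correctable" in event:
        return ("correctable", sensor)
    return ("unknown", sensor)


def ecc_events(sel_text):
    """Collect every ECC entry found in the SEL text."""
    results = (classify_sel_line(line) for line in sel_text.splitlines())
    return [r for r in results if r is not None]


def read_sel():
    """Fetch the SEL from the local BMC (needs ipmitool and root access)."""
    return subprocess.run(
        ["ipmitool", "sel", "list"],
        capture_output=True, text=True, check=True,
    ).stdout


# Hypothetical sample in the assumed layout.
SAMPLE_SEL = """\
   1 | 09/05/2009 | 23:10:02 | Power Supply #0x02 | Presence detected | Asserted
   2 | 09/05/2009 | 23:41:17 | Memory #0x01 | Correctable ECC | Asserted
   3 | 09/06/2009 | 00:12:45 | Memory #0x03 | Uncorrectable ECC | Asserted
"""

for kind, sensor in ecc_events(SAMPLE_SEL):
    print(f"{kind}: {sensor}")
```

On a real machine you would call `ecc_events(read_sel())` instead of using the sample, then look up the reported sensor in the vendor's documentation to find the DIMM slot.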

If your server doesn't have a good BMC/BIOS, you can fall back on a crash kernel: a second kernel loaded in advance, to which the host kernel jumps when it crashes, and which can save a full stack trace and the dmesg log for later review. The error will appear in the dmesg of the crashed kernel, prominently marked "HARDWARE ERROR".
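
Once the crash kernel has saved the dmesg log, finding the marker can be automated. The sketch below assumes only that the relevant lines contain the words "hardware error" in some capitalisation; the sample log text and the helper name are made up for illustration.

```python
import re

# Case-insensitive, since the capitalisation of the marker may differ.
HW_ERROR = re.compile(r"hardware error", re.IGNORECASE)


def hardware_error_lines(log_text):
    """Return (line_number, text) for every line mentioning a hardware error."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if HW_ERROR.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical saved dmesg output.
SAMPLE_DMESG = """\
[  101.000000] eth0: link up
[  202.000000] [Hardware Error]: Machine check events logged
[  202.000100] HARDWARE ERROR
[  303.000000] usb 1-1: new device
"""

for number, text in hardware_error_lines(SAMPLE_DMESG):
    print(f"line {number}: {text}")
```

To scan a real saved log, read the file into a string and pass it to `hardware_error_lines`; the lines around each hit usually carry the details of the failure.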

Baruch Even
0

Just use memtest! It will tell you exactly which DIMM is having problems. http://www.memtest86.com/

geeklin
  • Unfortunately, that requires rebooting the server and taking it offline while the test runs. With ECC FBDIMMs, you can usually keep the server running normally even if there are faults, just with reduced performance. As mctsonic said, SNMP or other vendor-specific monitoring tools are probably the way to go here. – MDMarra Sep 06 '09 at 00:53
0

Check the server's own diagnostics. As you've told us absolutely nothing about the server, that's as detailed an answer as I can give.

John Gardeniers