I'm curious whether there's a performance counter that logs the number of ECC corrections performed, which could be tracked as an early indicator of memory failure. I imagine it would be accessible in the same way that page faults from the TLB are reported to the OS?

Solutions for Windows, FreeBSD, or Linux are welcome.

Mahmoud Al-Qudsi

3 Answers

For Linux:

Install mcelog, and it will log all machine check errors to /var/log/mcelog.log.

You can also look at the counters the kernel exposes through sysfs; see the EDAC documentation for the relevant files: https://www.kernel.org/doc/Documentation/edac.txt
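As a rough illustration of the sysfs route, the sketch below walks the memory controller directories and reads the ce_count (corrected) and ue_count (uncorrected) files described in the EDAC documentation. The function name read_edac_counts is made up for this example, and it assumes an EDAC driver for your memory controller is loaded; without one, the directory will simply be missing.

```python
from pathlib import Path

# Standard EDAC location in sysfs; only present when an EDAC driver is loaded.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": int, "ue_count": int}} per mc* directory.

    read_edac_counts is a hypothetical helper name, not part of any library.
    """
    counts = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root_path.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter_file = mc / name
            if counter_file.is_file():
                entry[name] = int(counter_file.read_text().strip())
        if entry:
            counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, values in read_edac_counts().items():
        print(controller, values)
```

A steadily rising ce_count is the figure to watch for the early-warning purpose described in the question; any non-zero ue_count means an error the hardware could not correct.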

Baruch Even

Most hardware handles this logging natively. For example, HP's iLO baseboard management controller spouts ECC memory error activity to its Integrated Management Log.

So, the generic answer for the generic question is: Check your hardware management system's capabilities and resources.

Hyppy

Or read this page; it covers using the Linux kernel's EDAC subsystem to query the memory controller and provides some example scripts: http://www.admin-magazine.com/Articles/Monitoring-Memory-Errors

more /sys/devices/system/edac/mc/mc0/ue_count
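The command above reads the uncorrectable-error counter for the first memory controller. The corrected-error counter, ce_count, sits in the same directory and is closer to what the question asks about. Here is a rough Python sketch of sampling it and working out how many corrections happened since the previous sample. The function names are made up for illustration, and it assumes the counter starts again from zero after a reboot or driver reload.

```python
from pathlib import Path

# First memory controller, same directory as the ue_count file above.
MC0 = "/sys/devices/system/edac/mc/mc0"


def read_counter(path):
    """Return the integer in an EDAC counter file, or None if it is absent."""
    counter_file = Path(path)
    if not counter_file.is_file():
        return None
    return int(counter_file.read_text().strip())


def corrected_delta(previous, current):
    """Corrected errors seen since the previous sample.

    Assumption: a current value lower than the previous one means the
    counter was reset, so the current value is treated as a fresh count.
    """
    if previous is None or current < previous:
        return current
    return current - previous


if __name__ == "__main__":
    ce = read_counter(MC0 + "/ce_count")
    if ce is None:
        print("no EDAC counters found for mc0")
    else:
        # A real monitor would persist the last sample; None stands in here.
        print("corrected errors since last sample:", corrected_delta(None, ce))
```

Run from cron or a monitoring agent with the last sample stored somewhere, this gives a rate of corrections over time, which is more useful as a failure indicator than the raw total.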