The following errors show up in dmesg 10-20 times per day:

MCA: Bank 5, Status 0x8c00004000010092
MCA: Global Cap 0x0000000001000c10, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x206d7, APIC ID 0
MCA: CPU 0 COR (1) RD channel 2 memory error
MCA: Address 0xbb5561e80 (Mode: Physical Address, LSB: 6)
MCA: Misc 0x2140109086

The CPU is always 0 and the bank is always 5. The "Misc" and "Address" values vary, but the same values often recur.

The CPU and system are identified at boot thus:

CPU: Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz (3591.44-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x206d7  Family=0x6  Model=0x2d  Stepping=7
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory  = 137438953472 (131072 MB)
avail memory = 133741539328 (127545 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <LENOVO TC-A0   >
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s) x 2 hardware threads

Should I replace a DIMM (and if so, how do I identify which one?), or is ECC doing its job and there is no need to worry yet?

Adding output of mcelog:

Hardware event. This is not a software error.
MCE 458
CPU 0 BANK 5 TSC 10283dbf8f01bc 
MISC 21401e9e86 ADDR bb5561e80 
TIME 1665418335 Mon Oct 10 12:12:15 2022
MCG status:
STATUS cc00010000010092 MCGSTATUS 0
MCGCAP 1000c10 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 45 Step 7
Mikhail T.

1 Answer

Please try the following:

  1. Check the mcelog output to see whether this is a hardware or a software issue.
  2. Unplug the DIMMs, clean the DIMMs and the motherboard slots, reseat the DIMMs, and watch the logs again.
  3. Check whether any ECC lines appear in dmesg.
  4. Run memtest if possible.
  5. Try removing or swapping a DIMM to see whether the errors follow the DIMM or stay with the motherboard.
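
For step 1, the STATUS value in each record can also be decoded by hand. The sketch below is not part of `mcelog`; it is a minimal illustration that assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM (VAL, OVER, UC and related flags in bits 63-57, corrected error count in bits 52:38, MCA error code in bits 15:0) and the memory-controller error code format `0000 0000 1MMM CCCC`. The function name `decode_mca_status` and the dictionary keys are made up for this example. Note that the MCA "bank" is a set of machine-check registers inside the CPU, not a DIMM bank, and that the decoded channel only narrows the search: which physical slots belong to a channel is board-specific and has to come from the motherboard manual.

```python
"""Decode an IA32_MCi_STATUS value as printed in an MCA or mcelog record.

Illustrative sketch only: bit positions are assumed from the architectural
layout in the Intel SDM, and the names used here are hypothetical.
"""

# Architectural flag bits in IA32_MCi_STATUS.
FLAG_BITS = {
    "VAL": 63,    # register contains valid error information
    "OVER": 62,   # an earlier error was overwritten before being read
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

# MMM field of a memory-controller error code (0000 0000 1MMM CCCC).
MEMORY_OPS = {
    0: "generic",
    1: "read",
    2: "write",
    3: "address/command",
    4: "scrub",
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit status value into flags, counts and error codes."""
    mcacod = status & 0xFFFF
    result = {
        "flags": {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()},
        # Only meaningful if the CPU implements the corrected-count field.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mcacod,
        "is_memory_controller_error": False,
    }
    # Memory-controller errors have bits 15:7 equal to 0000 0000 1.
    if mcacod & 0xFF80 == 0x0080:
        channel = mcacod & 0xF
        result["is_memory_controller_error"] = True
        result["operation"] = MEMORY_OPS.get((mcacod >> 4) & 0x7, "reserved")
        # Channel value 15 means the channel was not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    # The two STATUS values quoted in the question (dmesg and mcelog).
    for raw in (0x8C00004000010092, 0xCC00010000010092):
        info = decode_mca_status(raw)
        set_flags = [name for name, on in info["flags"].items() if on]
        print(hex(raw), set_flags, "corrected:", info["corrected_count"],
              "op:", info.get("operation"), "channel:", info.get("channel"))
```

With UC clear in both values, these decode as corrected read errors on memory channel 2, which agrees with the "COR (1) RD channel 2 memory error" line in dmesg. The OVER flag in the mcelog record suggests that at least one earlier error in that bank was overwritten before it was logged.
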
asktyagi
  • I added the output of `mcelog`. The errors don't show up all the time -- only occasionally. Should the "Bank 5" correspond to some marking on the motherboard? – Mikhail T. Oct 10 '22 at 16:14
  • Check if you can see ECC lines in dmesg, you can also try memtest if possible. Or try removing/replacing dimm and check if this is related to dimm or motherboard. – asktyagi Oct 12 '22 at 02:48
  • Check the IPMI SEL too (e.g. with `ipmiutil`). It usually logs memory ECC errors as well, and it may give a clue as to which memory slot is affected. – Nikita Kipriyanov Oct 12 '22 at 04:53
  • This is a workstation, not a server -- no IPMI device... – Mikhail T. Oct 13 '22 at 01:57