
I am sporadically (twice in over a month) seeing worrying errors like:

[757706.327447] mce: [Hardware Error]: Machine check events logged
[757706.327450] [Hardware Error]: Corrected error, no action required.
[757706.327453] [Hardware Error]: CPU:1 (19:21:0) MC20_STATUS[-|CE|MiscV|-|-|-|-|-|-]: 0x8948000000282504
[757706.327457] [Hardware Error]: IPID: 0x0000000000000000
[757706.327459] [Hardware Error]: Bank 20 is reserved.
[757706.327459] [Hardware Error]: cache level: RESV, tx: DATA

I also see a bunch of (perhaps unrelated):

[725795.673933] audit: type=1400 audit(1664229606.644:1910): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=1534043 comm="cupsd" capability=12  capname="net_admin"
[725795.733042] audit: type=1400 audit(1664229606.700:1911): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/proc/sys/net/ipv6/conf/all/disable_ipv6" pid=1534044 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0

The machine is a self-built server based on an "AMD Ryzen 9 5950X 16-Core Processor" with "MemTotal: 32797136 kB" (further details available if needed), running:

mcon@ikea:~$ uname -a
Linux ikea 5.19.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.19.6-1 (2022-09-01) x86_64 GNU/Linux

What should I check?

ZioByte

1 Answer


mce: [Hardware Error]: Machine check events logged

mce, or machine-check exception, is an error generated by the CPU when it detects that a hardware error or failure has occurred.

Machine check exceptions (MCEs) can occur for a variety of reasons, ranging from out-of-spec voltages from the power supply, to cosmic radiation flipping bits in memory DIMMs or the CPU, to other miscellaneous faults, including faulty software triggering hardware errors.

[Hardware Error]: Corrected error, no action required.

Apparently this one is not fatal: it was corrected automatically by the CPU/kernel, so nothing was lost. There are, however, a few other reports of Ryzen 5000-series CPUs logging the same error, so I'd advise checking the CPU for now, e.g. by running MPrime for a few hours to see if anything comes up.
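The bracketed flags in the `MC20_STATUS[...]` line are just bits of the 64-bit status value, so they can be cross-checked by hand. Below is a minimal sketch, assuming only the architectural flag bits of the x86 `MCi_STATUS` register (bits 63 to 57) and the 16-bit MCA error code in the low bits; the AMD-specific fields are deliberately left out, and the function name `decode_mca_status` is made up for this illustration.

```python
# Architectural flag bits of an x86 MCi_STATUS register.
# Only bits 63..57 are covered; vendor-specific fields are not decoded here.
FLAG_BITS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (0 means corrected, printed as "CE")
    "En": 60,     # error reporting was enabled for this error
    "MiscV": 59,  # the MISC register holds additional information
    "AddrV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and the MCA error code."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["error_code"] = status & 0xFFFF  # bits 15:0
    return decoded


if __name__ == "__main__":
    # Status value copied from the log in the question.
    fields = decode_mca_status(0x8948000000282504)
    for name, value in fields.items():
        print(name, hex(value) if name == "error_code" else value)
```

For the value in the question, `Val` and `MiscV` are set while `UC`, `Over`, `AddrV` and `PCC` are clear, which agrees with the `[-|CE|MiscV|-|...]` summary printed by the kernel: a valid, corrected error without a recorded address.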

To see more details about the logged hardware error, you can use rasdaemon, which should be available in the Debian repository, or try one of the other MCE decoding tools.

mforsetti
  • Thanks. Unfortunately trying to run MPrime results in a "Killed" message after a few seconds (and process pointed by `mprime.pid` is actually gone). Any hint? – ZioByte Sep 28 '22 at 10:22
  • When `Killed` message shows, is there any OOM message in your syslog? – mforsetti Sep 28 '22 at 14:09
  • Yes, I found that, but I didn't really understand discussion about it on the internet. – ZioByte Sep 28 '22 at 15:59
  • When the `Killed` message shows, try running `tail -n 100 /var/log/syslog` and see whether there is an entry about `mprime` being OOM-killed. If so, try configuring the benchmark to stress the CPU only. – mforsetti Sep 28 '22 at 16:36
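The syslog check suggested in the last comment can also be scripted. This is a rough sketch, assuming the kernel's OOM-killer report contains the phrase `Killed process <pid> (<name>)`, as recent kernels print it; the helper name `find_oom_kills` and the sample line are invented for illustration and are not taken from the asker's machine.

```python
import re

# Matches the OOM-killer report, e.g. "... Killed process 1234 (name) ..."
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines, process_name):
    """Return (pid, line) pairs for OOM kills of the given process name."""
    hits = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match and match.group(2) == process_name:
            hits.append((int(match.group(1)), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up sample line, only to show the shape of the input.
    sample = [
        "kernel: Out of memory: Killed process 4242 (mprime) total-vm:1000kB\n",
        "kernel: some unrelated message\n",
    ]
    for pid, line in find_oom_kills(sample, "mprime"):
        print(pid, line)
```

To scan the real log, pass an open file instead of the sample list, for example `find_oom_kills(open("/var/log/syslog", errors="replace"), "mprime")`; reading that file may require root or membership in the `adm` group.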