Folks,

I am a total Linux n00b. I am trying to deploy mcelog on one of my computing nodes running PUIAS 6.4 (x86_64):

[root@lov3 edac]# uname -a
Linux lov3.mylab.org 2.6.32-358.18.1.el6.x86_64 #1 SMP Tue Aug 27 22:40:32 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

PUIAS is a free clone of Red Hat 6.4, and the node runs on AMD hardware:

[root@lov3 mcelog]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 2
Stepping:              0
CPU MHz:               1400.000
BogoMIPS:              4999.30
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63

My mcelog.conf file is more or less the default, except that I would like to run mcelog as a daemon and have it log errors. When I start mcelog, I get:

[root@lov3 mcelog]# mcelog --config-file mcelog.conf
AMD Processor family 21: Please load edac_mce_amd module.

However the module is present

[root@lov3 mcelog]# locate edac_mce_amd.ko
/lib/modules/2.6.32-358.18.1.el6.x86_64/kernel/drivers/edac/edac_mce_amd.ko
/lib/modules/2.6.32-358.el6.x86_64/kernel/drivers/edac/edac_mce_amd.ko

and loaded

[root@lov3 edac]# lsmod | grep mce       
edac_mce_amd           14705  1 amd64_edac_mod

Is there anything that I can do to get mcelog working? The only reference I found is this thread

http://lists.centos.org/pipermail/centos/2012-November/130226.html

Predrag Punosevac

2 Answers

mcelog doesn't work on that AMD CPU or on newer ones (see the family >= 15 check in mcelog.c). The same problem exists for AMD EPYC processors.

Instead of mcelog, use the kernel module edac_mce_amd, which writes the MCE records to the kernel log, and those ought to end up on disk via syslog. It's possible that mcelog loaded that module for you this time, but I suggest loading it at boot some other way. On Debian-based Linux, for example, you can add it to the /etc/initramfs-tools/modules file and run update-initramfs -u.

But I can't find anything documenting the format of that log, so here's a guess pieced together from the Linux source code.

In include/linux/printk.h, we see:

#define HW_ERR         "[Hardware Error]: "

In drivers/edac/mce_amd.c, some of the output starts with pr_emerg(HW_ERR ...), like this:

pr_emerg(HW_ERR "MC0 Error: ");

More lines follow via pr_cont(...), but without HW_ERR.

So I guess you can look for "[Hardware Error]:" in your logs. And maybe the lines will say edac_mce_amd too.
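To make that guess concrete, here is a minimal Python sketch that filters log lines for the HW_ERR prefix. The function name and the sample lines are made up for illustration. Only the "[Hardware Error]: MC0 Error: " text comes from the kernel source quoted above, and a real log line may well look different.

```python
MARKER = "[Hardware Error]:"  # value of HW_ERR in include/linux/printk.h


def hardware_error_lines(lines):
    """Yield every line that contains the HW_ERR prefix, without its newline."""
    for line in lines:
        if MARKER in line:
            yield line.rstrip("\n")


# Hypothetical sample input, not copied from a real log.
sample = [
    "kernel: [Hardware Error]: MC0 Error: \n",
    "kernel: some unrelated message\n",
]

for hit in hardware_error_lines(sample):
    print(hit)
```

To scan a real log, pass an open file object instead of the sample list. Which file holds the kernel messages depends on the syslog configuration.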

Here I set up an rsyslog.d rule that looks for "[Hardware Error]:". I think it will log the first pr_emerg line, but not the pr_cont parts. Note that it will also match messages that come from places other than the edac_mce_amd module.

vim /etc/rsyslog.d/09-edac_mce_amd.conf

if  ($syslogfacility-text == 'kern') and \
($msg contains '[Hardware Error]:') \
then    -/var/log/edac_mce_amd.log
#uncomment this to also remove it from the other files
#&   stop

Having only the first line is good enough for me since I'll set up a monitoring script that simply checks that the file size is 0. If anyone knows a way to do it properly, please comment.
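A monitoring check along those lines could look like the sketch below. It assumes the log path from the rsyslog rule above and treats a missing file the same as an empty one, on the assumption that rsyslog only creates the file when the first matching message arrives. The function names and the exit status of 2 (a Nagios-style "critical") are my own choices, not anything the tools require.

```python
import os

LOG_PATH = "/var/log/edac_mce_amd.log"  # file written by the rsyslog rule above


def log_has_errors(path=LOG_PATH):
    """Return True if the log file exists and is not empty."""
    try:
        return os.path.getsize(path) > 0
    except OSError:
        # Missing or unreadable file: treated as "nothing logged yet" (assumption).
        return False


def exit_status(path=LOG_PATH):
    """Return 0 when nothing has been logged, 2 otherwise (assumed convention)."""
    return 2 if log_has_errors(path) else 0
```

A cron job or monitoring agent could call sys.exit(exit_status()) and alert on a non-zero status.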

Peter

Because you are using CPU family 21, the message is expected, as the code below shows.

mcelog.c from the mcelog-1.0pre3_20110718-0.14.el6 package shows that is_cpu_supported() returns 0 for an AMD CPU family of 15 or greater:

int is_cpu_supported(void)
{
        enum {
                VENDOR = 1,
                FAMILY = 2,
                MODEL = 4,
                MHZ = 8,
                FLAGS = 16,
                ALL = 0x1f
        } seen = 0;
        FILE *f;
        static int checked;

        if (checked)
                return 1;
        checked = 1;

        f = fopen("/proc/cpuinfo","r");
        if (f != NULL) {
                int family = 0;
                int model = 0;
                char vendor[64] = { 0 };
                char *line = NULL;
                size_t linelen = 0;
                double mhz;

                while (getdelim(&line, &linelen, '\n', f) > 0 && seen != ALL) {
                        if (sscanf(line, "vendor_id : %63[^\n]", vendor) == 1)
                                seen |= VENDOR;
                        if (sscanf(line, "cpu family : %d", &family) == 1)
                                seen |= FAMILY;
                        if (sscanf(line, "model : %d", &model) == 1)
                                seen |= MODEL;
                        /* We use only Mhz of the first CPU, assuming they are the same
                           (there are more sanity checks later to make this not as wrong
                           as it sounds) */
                        if (sscanf(line, "cpu MHz : %lf", &mhz) == 1) {
                                if (!cpumhz_forced)
                                        cpumhz = mhz;
                                seen |= MHZ;
                        }
                        if (!strncmp(line, "flags", 5) && isspace(line[6])) {
                                processor_flags = line;
                                line = NULL;
                                linelen = 0;
                                seen |= FLAGS;
                        }

                }
                if (seen == ALL) {
                        if (!strcmp(vendor,"AuthenticAMD")) {
                                if (family == 15)
                                        cputype = CPU_K8;
                                if (family >= 15)  /* <-- family 21 matches here */
                                        fprintf(stderr, "AMD Processor family %d: Please load edac_mce_amd module.\n", family);
                                return 0;
                        } else if (!strcmp(vendor,"GenuineIntel"))
                                cputype = select_intel_cputype(family, model);
                        /* Add checks for other CPUs here */
                } else {
                        Eprintf("warning: Cannot parse /proc/cpuinfo\n");
                }
                fclose(f);
                free(line);
        } else
                Eprintf("warning: Cannot open /proc/cpuinfo\n");

        return 1;
}
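To check a node before deploying mcelog, the same vendor and family test can be sketched in a few lines of Python. This is only an illustration of the family >= 15 condition that triggers the message in the quoted source, not a port of mcelog. The function names and the sample text are hypothetical, and the sample merely follows the "key : value" layout of /proc/cpuinfo.

```python
def parse_cpuinfo(text):
    """Return (vendor, family) taken from the first processor entry."""
    vendor, family = None, None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processors
        key, value = key.strip(), value.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value
        elif key == "cpu family" and family is None:
            family = int(value)
    return vendor, family


def gets_edac_message(text):
    """True when the quoted check would print the edac_mce_amd message."""
    vendor, family = parse_cpuinfo(text)
    return vendor == "AuthenticAMD" and family is not None and family >= 15


# Hypothetical excerpt using the values reported by lscpu in the question.
sample = "vendor_id\t: AuthenticAMD\ncpu family\t: 21\nmodel\t\t: 2\n"
print(gets_edac_message(sample))
```

On a real machine, pass the contents of /proc/cpuinfo instead of the sample string.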
Vin
    Please don't assume the message is obvious, posting the source code does not suffice as an answer. – zymhan Mar 04 '16 at 13:30