One day, one of the memory modules started producing correctable memory errors. Then the operating system began to slow down, and the Oracle database stopped working correctly. Note that I am not using memory mirroring or sparing. The iRMC system event log contained the following entries:

| Sat 30 Jan 2021 14:34:56 PM | Major | 19001A | iRMC S4 | 'MEM3_DIMM-B1': Memory module failure predicted | Memory | Yes |
| Sat 30 Jan 2021 15:25:44 PM | Major | 190033 | BIOS | 'MEM3_DIMM-B1': Too many correctable memory errors | Memory | No |
| Sat 30 Jan 2021 15:25:45 PM | Critical | 190035 | iRMC S4 | 'MEM3_DIMM-B1': Memory module error | Memory | Yes |
| Sat 30 Jan 2021 09:59:37 PM | Major | 190033 | BIOS | 'MEM3_DIMM-B1': Too many correctable memory errors | Memory | No |
| Sat 30 Jan 2021 09:59:37 PM | Major | 190033 | BIOS | 'MEM3_DIMM-B1': Too many correctable memory errors | Memory | No |
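
For what it's worth, the same events should also be readable from the OS side over IPMI (assuming ipmitool and the ipmi kernel modules are available):

# Read the BMC (iRMC) system event log from the host
ipmitool sel elist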

At the same time, the load on the server changed noticeably:

- Load average rose from 0.4 to 1.2.
- CPU idle time dropped from 85% to 0%.
- CPU interrupt time grew from 0.2% to 0.8%.
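
If the load spikes again, I plan to sample CPU time directly to see where it goes; a rough sketch, assuming the sysstat and kernel-tools packages are installed:

# Per-CPU breakdown of user/system/irq/soft/idle time, one-second samples
mpstat -P ALL 1 5

# turbostat (run as root) also prints an SMI count per interval; a high
# SMI rate during an error storm would suggest firmware (SMM) handling
# of the correctable errors is stealing CPU time from the OS
turbostat -i 5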

It looks as though the server was heavily loaded by something. The system log contained these records:

Jan 30 14:34:55 server1 kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 14:34:55 server1 kernel: EDAC MC2: 213 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 or CPU_SrcID#1_Ha#0_Chan#1_DIMM#1 (channel:1 page:0x3833edc offset:0x0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:255)
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]: event severity: corrected
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]:  Error 0, type: corrected
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]:  fru_text: Card03, ChnB, DIMM0
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]:   section_type: memory error
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]:   node: 2 card: 1 module: 0 
Jan 30 14:35:27 server1 kernel: {1}[Hardware Error]:   error_type: 2, single-bit ECC
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]: event severity: corrected
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]:  Error 0, type: corrected
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]:  fru_text: Card03, ChnB, DIMM0
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]:   section_type: memory error
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]:   node: 2 card: 1 module: 0 
Jan 30 15:26:39 server1 kernel: {2}[Hardware Error]:   error_type: 2, single-bit ECC
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]: event severity: corrected
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]:  Error 0, type: corrected
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]:  fru_text: Card03, ChnB, DIMM0
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]:   section_type: memory error
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]:   node: 2 card: 1 module: 0 
Jan 30 21:59:52 server1 kernel: {3}[Hardware Error]:   error_type: 2, single-bit ECC
Jan 30 22:08:37 server1 kernel: perf: interrupt took too long (34740 > 34456), lowering kernel.perf_event_max_sample_rate to 5000
Jan 30 22:11:54 server1 kernel: perf: interrupt took too long (43438 > 43425), lowering kernel.perf_event_max_sample_rate to 4000
Jan 30 22:15:02 server1 kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 22:15:02 server1 kernel: EDAC MC2: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 or CPU_SrcID#1_Ha#0_Chan#3_DIMM#1 (channel:3 page:0x32bb2cd offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c3 socket:1 ha:0 channel_mask:8 rank:255)
Jan 30 22:18:05 server1 kernel: perf: interrupt took too long (54573 > 54297), lowering kernel.perf_event_max_sample_rate to 3000
Jan 30 22:24:04 server1 kernel: perf: interrupt took too long (68810 > 68216), lowering kernel.perf_event_max_sample_rate to 2000
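
The EDAC counters can also be read straight from sysfs to see how many corrected errors each memory controller has accumulated (the second command assumes the edac-utils package):

# Corrected/uncorrected error totals per EDAC memory controller
grep -H . /sys/devices/system/edac/mc/mc*/ce_count \
          /sys/devices/system/edac/mc/mc*/ue_count

# Per-DIMM report
edac-util --report=full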

I understand it this way: the memory module, for some reason, started producing correctable errors, and when the error counter reached its threshold, the server "disabled" this memory module. In theory, that should be a more or less normal situation, though perhaps I'm wrong. My assumption is that an uncorrectable memory error would have caused the server to reboot, which did not happen in my case.
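
One simple check of whether the module was really taken offline, assuming the BIOS excludes a disabled DIMM from the memory map: with 32 x 16 GB installed, the kernel should see roughly 512 GB, so a disabled DIMM should leave MemTotal about 16 GB short. Something like:

# RAM per slot according to the SMBIOS tables
dmidecode -t memory | grep -E '^\s+Size:' | sort | uniq -c

# RAM actually visible to the kernel; compare against the installed 512 GB
grep MemTotal /proc/meminfo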

After removing the memory module from the server and testing it with memtest for a few days, I did not find a single error. That seems strange to me, and it may indicate that the problem lies with the server itself rather than with the module.
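
To rule out having pulled the wrong module, the BIOS label ('MEM3_DIMM-B1') can be cross-checked against the EDAC location (socket 1, channel 1) from the kernel log, for example:

# Slot labels, part and serial numbers as reported by SMBIOS
dmidecode -t memory | grep -E 'Locator|Part Number|Serial Number'

# DIMM layout as seen by the EDAC driver (rasdaemon package)
ras-mc-ctl --layout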

The question is: should the server taking a memory module offline cause problems in the operating system? How can I prove or disprove my theory?
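
For future incidents I intend to keep an error history with rasdaemon, which records MCE/EDAC events in a local database and can report them per DIMM (a sketch for RHEL 7):

yum install -y rasdaemon
systemctl enable rasdaemon
systemctl start rasdaemon

# Later: summary and full history of recorded memory errors
ras-mc-ctl --summary
ras-mc-ctl --errors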

Server: Fujitsu PRIMERGY RX4770 M3

Memory: 32x Samsung 16 GB M393A2K40BB1-CRC 

OS: RHEL 7.9