
On a Linux server (8x Quad-Core AMD 8378), I'm getting the following errors:

[Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|CECC]: 0x9c294c00001d018b
[Hardware Error]: Northbridge Error (node 4): ECC error in L3 cache tag.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP
[Hardware Error]: Machine check events logged

This has happened three times during the last month, but never before (server running for 3 years).

From a quick Google search, it seems this is a serious matter.

However, the vendor support technician said:

I have seen these errors MANY times, and unless you are overclocking your CPU - or have had a fan failure or similar - it is VERY unlikely to be a processor problem. It is more likely that the kernel is misreporting the error.

So: is this a critical error that means I should order new parts (replace the CPU?), or can I ignore it?

Many thanks.

L3error
  • Were they all around the same time? It's very unlikely that the processor is misreporting the error. But odd things like [solar flares](https://www.google.com/search?q=solar+flair+bit+flip) can also cause these errors and are nothing to worry about. If the processor is going bad, well, I'd worry about that. – Chris S Nov 28 '12 at 20:51
  • If your system did not change in the last month (no new kernel with reporting options set vs an old one which did not log it etc) then misreporting the issue seems.... uhm... a creative answer from the vendor support technician. – Hennes Nov 28 '12 at 20:53
  • I'd believe it is a hardware error. If it is occurring frequently, then I'd get my support out there and replace the CPU. Otherwise, I might not worry about it. – mdpc Nov 28 '12 at 23:35
  • You can try swapping two CPUs. If the error goes away, you win. If it follows the CPU, I'd be pretty convinced it was a CPU problem. – David Schwartz Nov 29 '12 at 01:39

3 Answers


Best practice: Keep your own spare parts, when possible.

As for machine check exceptions, these are reported by the hardware; the kernel is just passing the message on to you, so that you can take action before the hardware problem gets out of hand and results in a real disaster.
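
To illustrate that the status word comes straight from a hardware register, here is a minimal Python sketch that decodes the MC4_STATUS value from the question. Bits 57 to 63 follow the architectural machine-check layout. The CECC/UECC positions, the 5-bit extended error code, and the lookup tables are assumptions modeled on how the Linux kernel's AMD decoder prints this family of errors, and `decode_status` is a hypothetical helper, not an existing tool.

```python
# Flag bit positions in an MCA status register.
# Bits 57-63 are architectural; CECC/UECC (46/45) are assumed to apply
# to AMD parts of this generation.
FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
    "CECC": 46, "UECC": 45,
}

# Lookup tables for a memory/cache error code of the form
# 0000 0001 RRRR TTLL (assumed to mirror the kernel's printed names).
LL = ["RESV", "L1", "L2", "L3/GEN"]
TT = ["INSN", "DATA", "GEN", "RESV"]
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status value into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "error_code": code,
        # Bits 16-20, treated here as a 5-bit field (assumption).
        "ext_error_code": (status >> 16) & 0x1F,
    }
    if (code >> 8) == 0x01:  # memory/cache error format
        rrrr = (code >> 4) & 0xF
        result["cache_level"] = LL[code & 0x3]
        result["tx"] = TT[(code >> 2) & 0x3]
        result["mem_tx"] = RRRR[rrrr] if rrrr < len(RRRR) else "unknown"
    return result


# Status value copied from the log in the question.
decoded = decode_status(0x9C294C00001D018B)
print(decoded)
```

For this value the sketch reports a valid, corrected error (UC and PCC clear, CECC set) with cache level L3/GEN, tx GEN and mem-tx SNP, which agrees with the kernel's own decoded lines.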

The only instance I was able to find of a kernel "misreporting" a machine check exception was the following. In this case, it was a flaw in the processor causing the problem, not the kernel.

Intel Xeon processor E7 family processors have an issue in which some c-state transitions can cause false correctable Machine Check Exception (MCE) errors to be reported from MCE bank 6 to the user. On some E7 processor family systems, this resulted in "floods" of MCE errors. This patch disables MCE error reporting for bank 6.

Bottom line: It sounds to me like the vendor is trying to avoid replacing your defective hardware.

Michael Hampton
  • Agreed... Very unlikely to be misreporting, but if the error only happens once there is nothing to worry about: bit flips do happen, and ECC is there to correct them before they cause real trouble. If you repeatedly get hardware errors on the same CPU or memory bank, then definitely look at getting it replaced before you get a multi-bit error that ECC cannot correct. – Thomas Guyot-Sionnest Jul 19 '22 at 10:11

I am also getting this error every time I restart the system:

[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Restarting works normally; however, the computer loses control at shutdown. When people say the CPU, do they mean the RAM? I am running a CPU with 32 nodes and 8 cards of 64 megabytes each. Should I be worried about this error? Thank you for your replies.

DCM CA

On enterprise servers we handled it like this: have the vendor replace the part if the errors are excessive or if they repeat week after week. In fact, the event monitoring service triggered that all by itself, no questions asked.

After moving to x86, we also heard the stories about EDAC/MCE being confused and so on. Even so, if the errors keep coming, the hardware should be replaced.

(There's also a low chance of it being connected with big solar events. It IS possible, but flaky PC hardware and vendors reluctant to replace something are far more common.)
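
The "replace if they repeat week after week" rule can be sketched in a few lines of Python. This is only an illustration, not the monitoring service mentioned above: the function names, the `min_weeks` threshold, and the event dates are all hypothetical, and the log line format is assumed to match the one quoted in the question.

```python
import re
from collections import defaultdict
from datetime import date

# Matches lines like the one in the question:
# "[Hardware Error]: Northbridge Error (node 4): ECC error in L3 cache tag."
NODE_RE = re.compile(r"\[Hardware Error\]: Northbridge Error \(node (\d+)\)")


def weeks_with_errors(events):
    """Map each node to the set of (ISO year, ISO week) in which it logged an error.

    `events` is an iterable of (date, log_line) pairs.
    """
    seen = defaultdict(set)
    for day, line in events:
        match = NODE_RE.search(line)
        if match:
            iso = day.isocalendar()
            seen[int(match.group(1))].add((iso[0], iso[1]))
    return dict(seen)


def nodes_to_escalate(events, min_weeks=2):
    """Return nodes with errors in at least `min_weeks` distinct weeks (hypothetical threshold)."""
    weeks = weeks_with_errors(events)
    return sorted(node for node, w in weeks.items() if len(w) >= min_weeks)


# Made-up dates for illustration only; the question gives no timestamps.
LINE4 = "[Hardware Error]: Northbridge Error (node 4): ECC error in L3 cache tag."
LINE2 = "[Hardware Error]: Northbridge Error (node 2): ECC error in L3 cache tag."
events = [
    (date(2012, 11, 5), LINE4),
    (date(2012, 11, 14), LINE4),
    (date(2012, 11, 27), LINE4),
    (date(2012, 11, 14), LINE2),
    (date(2012, 11, 14), "[Hardware Error]: Machine check events logged"),
]
print(nodes_to_escalate(events))
```

With these sample events, node 4 shows errors in three separate weeks and is flagged, while node 2's single error is not, which matches the "once is noise, repeats are a replacement case" reasoning in this thread.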

Florian Heigl