mcelog Cache Error, how to Disable L3 Cache on Intel i7 CPU

Question

I am NOT a programmer but a System Integrator with experience since DOS

I bought a used barebone PC and it has some minor issues: it sometimes crashes, and the crashes are not related to the RAM.

It's running Debian KVM (Proxmox) on the host, with CentOS and Windows VMs on top.

I get this error in mcelog on Debian:

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 8 TSC 25f5e6ef72
MISC 12dc0 ADDR 372c9000007c2f6
TIME 1614950322 Fri Mar  5 14:18:42 2021
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
Threshold based error status: green
MCA: corrected filtering (some unreported errors in same region)
**Generic CACHE Level-3 Generic Error**
STATUS 8c2000800001110b MCGSTATUS 0
MCGCAP c09 APICID 0 SOCKETID 0
MICROCODE ca
CPUID Vendor Intel Family 6 Model 142 Step 9
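
For readers who want to check the report by hand, here is a minimal sketch that decodes the `STATUS` value from the log above. It assumes the architectural `IA32_MCi_STATUS` bit layout from the Intel SDM; the threshold and corrected-count fields are only meaningful when the matching capability bits in `MCGCAP` are set. The function and key names are made up for this example and are not part of mcelog.

```python
# Decode the architectural fields of an IA32_MCi_STATUS value.
# Bit positions assume the layout in the Intel SDM machine-check chapter;
# decode_mci_status and its key names are hypothetical, for illustration only.

def decode_mci_status(status: int) -> dict:
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    mca_code = status & 0xFFFF  # bits 15:0, architectural MCA error code
    return {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "threshold_status": (status >> 53) & 0x3,    # bits 54:53, 1 means green
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mca_code,
        "filtering": bool(mca_code & (1 << 12)),     # F bit of the MCA code
    }


fields = decode_mci_status(0x8C2000800001110B)
for name, value in fields.items():
    print(f"{name:20} {value}")
```

The `uncorrected` flag comes out false and `threshold_status` is 1, which agrees with the "Corrected error" and "green" lines that mcelog printed.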

Question:

Is it generally possible to disable only the L3 cache? The CPU might otherwise work. I read another question on Stack Overflow where the caches were disabled completely (L1, L2, and L3), and the machine was too slow to run X.

I found this trick. Do I disable the cache with this?

x:~# cat /proc/mtrr
reg00: base=0x080000000 ( 2048MB), size= 2048MB, count=1: uncachable
reg01: base=0x07c000000 ( 1984MB), size=   64MB, count=1: uncachable
reg02: base=0x07b800000 ( 1976MB), size=    8MB, count=1: uncachable
x:~# echo disable=00 > /proc/mtrr
x:~# echo disable=01 > /proc/mtrr
x:~# echo disable=02 > /proc/mtrr
x:~# cat /proc/mtrr
x:~#

I am Curious, if this is my first long lasting stackoverflow post, maybe unknown will again delete it because unknown has not learned about freedom of speech :) censorship forever!

Note that the error was *corrected*. There doesn't seem to be a way to disable just the L3 on non-NetBurst Intel CPUs. If your CPU supports CAT, you can **try** to use it to force the hypervisor to use only a small fraction of the L3 (small but not null). There's a utility called `pqr` that sets the COS for the CAT, but you should beware of any independent modification made by the hypervisor itself (which seems a good candidate for using the CAT). — Margaret Bloom, Mar 05 '21 at 17:11
Thank you. After I made the mtrr disable action that I posted above when opening the question i had no further crash anymore (but a test case is: proxmox and centos 7 windows 10 uptime of at least 7 days necessary, to state that system is stable) If i get a crash i will come back here and then start to investigate about `pqr`- ps my processor is a `Intel® Core™ i7-7567U Processor (4M Cache, up to 4.00 GHz)` — ant0nwax, Mar 06 '21 at 18:14
I have further crashes :) And i checked cpu for CAT: `x:~# cat /proc/cpuinfo | grep -i cat` gives me no results — ant0nwax, Mar 15 '21 at 19:27
Cache Allocation Technology (CAT) is only on Xeon CPUs. "Client" CPUs like yours are most often used in cases where only one task at a time is really doing a lot, and there isn't enough L3 to be worth dividing for most use-cases anyway. That's on top of market segmentation, as well as any other practical reasons. — Peter Cordes, Apr 03 '21 at 04:32
@PeterCordes It's useful on client processors as well, but only if it works transparently without user intervention. On server processors with virtualization, the company in charge of managing the servers is willing to spend the effort to use technologies like CAT. But most users on client machines would never do that, so it's only useful if it works automatically and dynamically like DVFS. — Hadi Brais, Apr 03 '21 at 21:28

score 3 · Answer 1 · edited Apr 03 '21 at 21:22

Intel CPU Family 6 Model 142 (0x8E) refers to Core processors of the 7-9th generations. All of these processors have an "inclusive L3" cache -- all lines in any L1 or L2 cache must also be cached in the L3. "Disabling" the L3 could only work if there were a mode bit that prevented the L3 from caching data, while still allowing the L3 directory to perform its function in managing cache coherence.

Hadi Brais · Answer 2 · 2021-04-04T12:39:48.587

Intel has not publicly disclosed how to only disable the L3 cache on most processors, including the Core i7-7567U. Disabling the MTRRs does effectively disable all of the three levels of caches on your processor because all accesses become of type UC (meaning uncacheable), with one possible exception discussed below.

The `/proc/mtrr` file lists only the enabled variable-range MTRRs; it doesn't show you all of the MTRRs. Any other variable-range MTRRs are disabled and you don't have to worry about them. The fixed-range MTRRs are still enabled though. These specify memory types of fixed ranges in the bottom 1 MB of the physical address space. Disabling all of the MTRRs listed by `/proc/mtrr` won't disable or affect the fixed-range MTRRs. It's typical for some of the fixed ranges to have cacheable memory types.
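
To see which physical ranges the listed variable-range MTRRs actually cover, here is a small sketch that parses text in the `/proc/mtrr` format. The sample text is copied from the question; the regular expression assumes the line layout shown there, and `parse_mtrr` is a hypothetical helper name, not an existing tool.

```python
import re

# Matches lines such as:
#   reg00: base=0x080000000 ( 2048MB), size= 2048MB, count=1: uncachable
# The layout is assumed from the output shown in the question.
MTRR_LINE = re.compile(
    r"reg(?P<reg>\d+): base=0x(?P<base>[0-9a-fA-F]+) \(\s*\d+MB\), "
    r"size=\s*(?P<size>\d+)(?P<unit>[KMG])B, count=\d+: (?P<type>\S.*)"
)
UNIT = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_mtrr(text: str) -> list:
    """Return one dict per variable-range MTRR line found in text."""
    ranges = []
    for line in text.splitlines():
        m = MTRR_LINE.search(line)
        if m is None:
            continue  # skip anything that does not look like an MTRR entry
        base = int(m.group("base"), 16)
        size = int(m.group("size")) * UNIT[m.group("unit")]
        ranges.append({
            "reg": int(m.group("reg")),
            "base": base,
            "end": base + size,  # exclusive end address
            "type": m.group("type").strip(),
        })
    return ranges


SAMPLE = """\
reg00: base=0x080000000 ( 2048MB), size= 2048MB, count=1: uncachable
reg01: base=0x07c000000 ( 1984MB), size=   64MB, count=1: uncachable
reg02: base=0x07b800000 ( 1976MB), size=    8MB, count=1: uncachable
"""

ranges = parse_mtrr(SAMPLE)
for r in ranges:
    print(f"reg{r['reg']:02d}: 0x{r['base']:09x}-0x{r['end'] - 1:09x} {r['type']}")
```

On a live system the same function could be fed the contents of `/proc/mtrr`. For the sample above, the three entries form one contiguous uncachable region just below and up to the 4 GB boundary; none of them touches the fixed ranges in the bottom 1 MB.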

According to the relevant memory type resolution rules, a physical memory address not contained in the range of any enabled MTRR has the memory type specified in the lowest 8 bits of the IA32_MTRR_DEF_TYPE MSR. This type is UC on most or all x86 production systems. You can determine the default type by executing `sudo rdmsr -a 0x2ff` and checking the lowest 8 bits of the output for each logical core. Note that the MTRRs are actually per physical core, but `rdmsr` offers no switch to run on only one of the logical cores per physical core.
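
As a rough illustration of how to read the `rdmsr` output, the sketch below splits an IA32_MTRR_DEF_TYPE value into its default memory type and its two enable flags. The sample value `0xc06` is the one reported from the CentOS VM in the comments of this thread; the function name and dictionary keys are invented for the example, and the type encodings are the standard x86 ones.

```python
# Standard x86 memory type encodings used by the MTRRs.
MEMORY_TYPES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB"}


def decode_mtrr_def_type(value: int) -> dict:
    """Split an IA32_MTRR_DEF_TYPE (MSR 0x2ff) value into its fields."""
    return {
        # Lowest 8 bits: default memory type for addresses not covered by any MTRR.
        "default_type": MEMORY_TYPES.get(value & 0xFF, "reserved"),
        # Bit 10: fixed-range MTRR enable.
        "fixed_enabled": bool(value & (1 << 10)),
        # Bit 11: global MTRR enable.
        "mtrrs_enabled": bool(value & (1 << 11)),
    }


decoded = decode_mtrr_def_type(0xC06)
print(decoded)
```

For `0xc06` this reports a WB default type with both enable bits set, so that particular (virtualized) register does not default to UC.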

If you want to disable all MTRRs, the best way is to set bit 11 (the MTRR enable flag of IA32_MTRR_DEF_TYPE) to zero by executing `wrmsr -a 0x2ff 15 0x400`, which forces the entire physical address space to be UC. You don't need to change anything in `/proc/mtrr` and it's better to just keep it as is. The `-a` option is important here because you usually want memory types to not be dependent on which core the code happens to be running on.
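
To make the bit arithmetic concrete, here is a short sketch that clears only bit 11 of an existing IA32_MTRR_DEF_TYPE value and compares the result with the constant `0x400` used above. The starting value `0xc06` is just a sample taken from the comments of this thread, and `clear_mtrr_enable` is a hypothetical helper; nothing here touches the hardware.

```python
E_BIT = 1 << 11   # global MTRR enable flag
FE_BIT = 1 << 10  # fixed-range MTRR enable flag


def clear_mtrr_enable(value: int) -> int:
    """Return value with only the global enable bit (bit 11) cleared."""
    return value & ~E_BIT


current = 0xC06  # sample register value, assumed for illustration
cleared = clear_mtrr_enable(current)

print(hex(cleared))                 # enable bit cleared, other bits untouched
print(bool(0x400 & E_BIT))          # is the enable bit set in 0x400?
print(bool(0x400 & FE_BIT))         # is the fixed-range bit set in 0x400?
```

Clearing bit 11 of `0xc06` gives `0x406`, while `0x400` additionally zeroes the default-type field. As far as I understand the SDM, once bit 11 is clear the fixed-range flag and the default-type field have no effect, so both values should lead to the same all-UC behavior.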

There are still a couple of issues with this simple approach. Modern processors include additional MTRRs specific for memory ranges used in system management mode (SMM). These MTRRs can only be modified in SMM. When enabled, any accesses outside of SMM to the memory ranges configured in its MTRRs are ignored. On my system, the memory type specified for the SMM range is WB, so it's cacheable.

[Temporary notice: Thinking more about it, I'm not sure whether IA32_MTRR_DEF_TYPE[11] controls also the SMM range registers. I'll have to check with Intel. If it doesn't, then on processors that support SMM MTRRs, which include yours, the only way to disable caching entirely is by setting CR0.CD to 1. If it does, then no problem.]

Another issue is that I don't think `wrmsr -a 0x2ff 15 0x400` adheres to the recommended procedure for changing MTRRs consistently on all cores of the system, so I would only try it for experimental purposes. On production systems, you may have to write a kernel module to do it properly as described in the manuals.

I don't think it's required to write back and invalidate the caches before disabling the MTRRs, because UC accesses are looked up in the caches as well. But I'm unable to find a statement in the Intel or AMD manuals to confirm this at this time.

Thank you, Hadi. Meanwhile, running it constantly 24/7, I get fewer crashes lately. That might also be because I upgraded from 2x8GB to 1x8GB+1x16GB SoDIMM, or because of Windows updates on the virtual Windows 10 machine :) I do not really understand why rdmsr does not work on the crashing Proxmox 'server': `x:~# echo "beg-"$(rdmsr -a 0x2ff)"-end" | cat -teve beg--end$ x:~# x:~# cat /etc/debian_version 10.8` Thanks for any further advice. — ant0nwax, Apr 04 '21 at 07:46
@ant0nwax It looks like it's not printing anything, right? I'm not sure why. Modern versions of KVM do fully virtualize the MTRRs, so all of these registers should be accessible from within a VM. Another way to obtain the default memory type is by checking the kernel message buffer with `dmesg`. Note that uncorrectable cache errors may not be the cause of crashes, but I think you just want to make the system as stable as possible, and if disabling the caches helps, then it could be a temporary fix, although there may be a huge performance impact. — Hadi Brais, Apr 04 '21 at 12:36
Yes, correct, it's empty. I'm issuing this command from the Debian host (not from within a VM). From the CentOS Linux VM I can do it: `[root]# echo "beg-"$(rdmsr -a 0x2ff)"-end" | cat -teve beg-c06-end$ ` Shall I issue `[root]# wrmsr -a 0x2ff 15 0x400` on the CentOS VM? — ant0nwax, Apr 06 '21 at 06:31
@ant0nwax No, each VM probably has its own virtualized MTRRs, which may have different values from the host. It has to be done from within the VM. — Hadi Brais, Apr 06 '21 at 11:22
Your answer is not clear, i ask the question more clear again: Shall i issue `[root]# wrmsr -a 0x2ff 15 0x400` on the centos vm, on the windows vm (maybe in linux subsystem)? or on the debian host of these two vms? — ant0nwax, Apr 08 '21 at 05:15
@ant0nwax I already told you the command has to be executed inside a VM, not the host. — Hadi Brais, Apr 08 '21 at 11:12
ok, one choice of three is excluded, which VM, windows (with linux subsystem) or centos? — ant0nwax, Apr 09 '21 at 13:47
@ant0nwax `wrmsr` and `rdmsr` require a kernel module called `msr` and last time I checked, WSL doesn't support Linux kernel modules, so I don't think these commands work on WSL. Try on Centos or any OS with an actual Linux kernel. — Hadi Brais, Apr 09 '21 at 13:58
