I have a Zynq UltraScale+ board that I built and am running embedded Linux on. Everything boots fine and initially runs fine. One problem, though, is that ue_count in /sys/devices/system/edac/mc/mc0/ reads 13 (ce_count is 0) every time I boot the board. There are no EDAC messages in dmesg or syslog mentioning an uncorrectable error, and when I inspect the Zynq DDR controller's registers (https://www.xilinx.com/htmldocs/registers/ug1087/ug1087-zynq-ultrascale-registers.html, search for "DDRC Module"), the status registers containing the CE and UE counts are 0, along with all related registers.
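For reference, the boot-time state described above can be captured in one go with a small script. This is only a minimal sketch, assuming Python 3 is available on the target and that the memory controller directory exposes the usual EDAC sysfs attributes; the attribute list and the function name `read_edac_snapshot` are illustrative choices, and any file that is missing on a given kernel is simply reported as `None`.

```python
from pathlib import Path

# Attribute names assumed from the generic EDAC sysfs layout; not every
# kernel or driver necessarily exposes all of them.
EDAC_FILES = (
    "mc_name",
    "ce_count",
    "ce_noinfo_count",
    "ue_count",
    "ue_noinfo_count",
    "seconds_since_reset",
)


def read_edac_snapshot(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Return a dict of attribute name -> raw string value (None if unreadable)."""
    snapshot = {}
    for name in EDAC_FILES:
        path = Path(mc_dir) / name
        try:
            snapshot[name] = path.read_text().strip()
        except OSError:
            # File missing or unreadable: record that instead of failing.
            snapshot[name] = None
    return snapshot


if __name__ == "__main__":
    for name, value in read_edac_snapshot().items():
        print(f"{name}: {value}")
```

Running this as early as possible after boot, and again after the stress test, gives two snapshots that can be compared side by side with the DDRC register dump.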
Additionally, if I stress the system with constant read/write operations to a temp folder, I eventually (after 10-30 minutes) see EDAC errors printed to the console. This is often followed by a kernel panic. If the system does not panic, checking the same locations as above shows that ce_count and ue_count have incremented, syslog now contains EDAC error messages, and the Zynq DDRC module's registers contain values where they were 0 before. (Interestingly, the CE and UE count register is the exception: it remains 0. Perhaps EDAC clears it after reporting?)
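To pin down when during the stress run the counters actually move, they can be polled and each change logged with a timestamp, which can then be lined up against the EDAC messages in syslog. The following is a rough sketch under the same assumptions as before (Python 3 on the target, ce_count and ue_count readable as plain integers); `read_counts`, `count_deltas`, and `watch` are hypothetical helper names, not part of any existing tool.

```python
import time
from pathlib import Path


def read_counts(mc_dir, names=("ce_count", "ue_count")):
    """Read the named EDAC counters from mc_dir as integers."""
    return {
        name: int((Path(mc_dir) / name).read_text().strip())
        for name in names
    }


def count_deltas(previous, current):
    """Return only the counters that changed, mapped to their increase."""
    return {
        name: current[name] - previous.get(name, 0)
        for name in current
        if current[name] != previous.get(name, 0)
    }


def watch(mc_dir, interval_s=5.0, iterations=None, log=print):
    """Poll the counters and log a timestamped line whenever one changes.

    iterations=None polls until interrupted; a number limits the loop.
    """
    previous = read_counts(mc_dir)
    done = 0
    while iterations is None or done < iterations:
        time.sleep(interval_s)
        current = read_counts(mc_dir)
        deltas = count_deltas(previous, current)
        if deltas:
            stamp = time.strftime("%H:%M:%S")
            log(f"{stamp} EDAC counters changed: {deltas}")
        previous = current
        done += 1
```

Calling `watch("/sys/devices/system/edac/mc/mc0")` in a second shell while the read/write load runs would show whether the counts climb gradually or jump all at once shortly before the panic.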
I have tested this build on half a dozen different boards, and they all show exactly the same behavior. Because of that, I have trouble believing these ECC errors are real, but I'm not sure what other explanation there could be. Perhaps I misconfigured something in Linux?
The ue_count of 13 on boot really mystifies me, though. How can EDAC increment it without reporting any errors, and how can it do so while the Zynq module it is registered to shows no signs of ECC activity?
Any advice on things to check, diagnostics to perform, or experience with ECC errors would be helpful, as I'm mostly at a loss on this problem.