We have 500+ servers built with Supermicro motherboards and Kingston memory, and we regularly see the following alerts:
# fmdump -v
TIME UUID SUNW-MSG-ID
Oct 27 15:49:44.9379 108510ec-b4e1-c94b-dd9f-f7b2969a4725 INTEL-8001-94
100% fault.memory.intel.dimm_ce
Problem in: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=hostname:serial=180104092839051c6a:part=KINGSTON:revision=C1/motherboard=0/memory-controller=1/dram-channel=3/dimm=0/rank=1
Affects: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=hostname:serial=180104092839051c6a:part=KINGSTON:revision=C1/motherboard=0/memory-controller=1/dram-channel=3/dimm=0/rank=1
FRU: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=hostname:serial=180104092839051c6a:part=KINGSTON:revision=C1/motherboard=0/memory-controller=1/dram-channel=3/dimm=0
Location: DIMM4A
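For reference, this is roughly how we dig into the telemetry behind such a fault with the standard FMA tools (just a sketch; the exact ereport classes and payloads obviously vary from box to box, and the UUID is the one from the output above):

# fmdump -e        # one line per underlying error report (ereport) in the error log
# fmdump -eV       # same, with the full ereport payload including the detector FMRI
# fmstat           # statistics for fmd and its diagnosis modules
# fmdump -v -u 108510ec-b4e1-c94b-dd9f-f7b2969a4725   # case history for the fault above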
My question is: how trustworthy are these fault diagnoses when running on non-Oracle hardware?
We have tried almost everything (short of never using these components again), but the faults keep coming back at random: e.g. we replace DIMM4A and a few months later DIMM1B reports a fault, or we replace all the memory and the motherboard and another fault shows up after a few days.
The memory we replace is then tested for days with memtest and never shows a problem. Other teams using the same hardware with Windows and Linux don't see these faults. Is Solaris being too sensitive?
Right now we are going through another round of memory replacements, but it is becoming a pain. We haven't been able to find anything else wrong with these servers; they have been working just fine, but the randomly appearing memory faults are worrying. Should we just ignore them?
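In case it matters, this is the post-replacement cleanup we assume is correct (please point out if we are missing a step), using the UUID that fmadm faulty reports:

# fmadm faulty          # list the open cases and their UUIDs
# fmadm repair <uuid>   # mark the replaced DIMM's case as repaired
# fmadm faulty          # verify the case is now closed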
OS: OpenSolaris 2009.06 (b111)