We have 500+ servers built with Supermicro motherboards and Kingston memory, and we regularly see the following alerts:
# fmdump -v
TIME UUID SUNW-MSG-ID
Oct 27 15:49:44.9379 108510ec-b4e1-c94b-dd9f-f7b2969a4725 INTEL-8001-94
100% fault.memory.intel.dimm_ce
Problem in: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=hostname:serial=180104092839051c6a:part=KINGSTON:revision=C1/motherboard=0/memory-controller=1/dram-channel=3/dimm=0/rank=1
Affects: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=hostname:serial=180104092839051c6a:part=KINGSTON:revision=C1/motherboard=0/memory-controller=1/dram-channel=3/dimm=0/rank=1
FRU: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=hostname:serial=180104092839051c6a:part=KINGSTON:revision=C1/motherboard=0/memory-controller=1/dram-channel=3/dimm=0
Location: DIMM4A
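For reference, this is roughly how we dig into the telemetry behind such a fault with the standard FMA tools (just a sketch; the exact ereport classes and payloads obviously vary from box to box, and the UUID is the one from the output above):

# fmdump -e        # one line per underlying error report (ereport) in the error log
# fmdump -eV       # same, with the full ereport payload including the detector FMRI
# fmstat           # statistics for fmd and its diagnosis modules
# fmdump -v -u 108510ec-b4e1-c94b-dd9f-f7b2969a4725   # case history for the fault above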
My question is: how trustworthy are these fault diagnoses when running on non-Oracle hardware?
We have tried almost everything (short of never using these components again), but the faults keep coming back at random: e.g. we replace DIMM4A and a few months later DIMM1B reports a fault, or we replace all the memory and the motherboard and another fault shows up after a few days.
The memory we replace is then tested for days with memtest and never shows a problem. Other teams using the same hardware with Windows and Linux don't see these faults. Is Solaris being too sensitive?
Right now we are going through another round of memory replacements, but it is becoming a pain. We haven't been able to find anything else wrong with these servers; they have been working just fine, but the randomly appearing memory faults are worrying. Should we just ignore them?
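In case it matters, this is the post-replacement cleanup we assume is correct (please point out if we are missing a step), using the UUID that fmadm faulty reports:

# fmadm faulty          # list the open cases and their UUIDs
# fmadm repair <uuid>   # mark the replaced DIMM's case as repaired
# fmadm faulty          # verify the case is now closed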
OS: OpenSolaris 2009.06 (b111)