I have a 32GB non-ECC RAM dedicated server with CentOS.

Once a day it randomly crashes without any error in /var/log/kern.log, /var/log/messages, or the MySQL and Apache logs.

CPU/RAM/IO are not particularly high nor low.

Is there any error logged by CentOS somewhere that would conclusively say "it is now time to pay for ECC"?

wlf
  • What type of hardware? – ewwhite Sep 14 '13 at 10:08
  • Intel® Core™ i7-4770 Quadcore Haswell incl. Hyper-Threading Technology, 32 GB DDR3 RAM, 2 x 2 TB SATA 6 Gb/s 7200 rpm HDD (Software-RAID 1) Class Enterprise. This is what I know. – wlf Sep 14 '13 at 10:13
  • This is far from a full answer, but if you're running Redis, it has a built-in memory test (http://antirez.com/news/43). This is the only instance I know of server software that does that, though. – liori Sep 14 '13 at 14:07

3 Answers

What would you like it to log? CentOS can't know that the contents of non-ECC memory have become corrupt, because it's not knowable; it can only know that the contents of memory make no sense, and panic on the grounds of whatever self-inconsistency it found. That inconsistency might have arisen from RAM corruption, but it might also have arisen from a kernel bug, or some other cause.

The only way to know definitively that memory has become corrupt is to use memory that explicitly includes support for checking for such corruption; to wit, ECC memory.

Edit: that is a completely different question to the one you asked. But my strategy would be: run memtest86+ on the hardware, to see if there are any easy-to-catch repeatable errors, and enable remote syslogging on the server (as when the kernel panics, it often stops writing to the FS but can still squeeze a log message out the NIC), to see what's logged on the next panic.
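For the remote syslogging step, the receiving end can be as small as a UDP listener on a second machine. The following is only a minimal sketch, assuming the crashing server forwards its syslog over UDP in the classic `<PRI>text` format; the port 5514, the log file name, and the function names are illustrative choices, not part of any existing tool.

```python
import socket

# Severity names in syslog order (0 = most severe).
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def parse_priority(datagram: bytes):
    """Split a '<PRI>text' datagram into (facility, severity name, text).

    Returns (None, None, text) when no valid priority prefix is present.
    """
    text = datagram.decode("utf-8", errors="replace").rstrip("\n")
    if text.startswith("<") and ">" in text:
        pri, rest = text[1:].split(">", 1)
        if pri.isdigit():
            value = int(pri)
            # PRI encodes facility * 8 + severity.
            return value // 8, SEVERITIES[value % 8], rest
    return None, None, text


def listen(host="0.0.0.0", port=5514, logfile="remote-kernel.log"):
    """Append every received datagram to logfile, one line per message."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    # Line buffering, so a message is on disk as soon as it arrives.
    with open(logfile, "a", buffering=1) as out:
        while True:
            data, (addr, _port) = sock.recvfrom(8192)
            facility, severity, text = parse_priority(data)
            out.write(f"{addr} facility={facility} "
                      f"severity={severity} {text}\n")

# To run the listener (blocks until interrupted): listen()
```

In practice a real syslog daemon on the receiving host does the same job more robustly; the point of the sketch is only that the last messages before a panic end up on a machine that is still running.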

MadHatter
  • Thanks and please bear with me, I am new to such matters. So the only way for me to find out whether my server is crashing once a day because of memory corruption or not is to get ECC. No hints anywhere beforehand? – wlf Sep 14 '13 at 10:02
  • +1 I am going to use memtest for the first time. I am glad I asked this question, I got two really helpful answers. – wlf Sep 14 '13 at 10:46
  • Once you are satisfied, don't forget to accept one by clicking the tick outline. My apologies if you are familiar with this procedure. – MadHatter Sep 14 '13 at 10:51

ECC memory has two advantages:

  • It is usually registered, meaning that there is a register between the memory controller and the memory chips on the module. This is supposed to remove electrical load from the memory controller. This is true of all RDIMMs, not just ECC RAM.
  • It can detect errors and, if it cannot recover from them, at least report that they happened.

Given this, it is actually very difficult to determine whether you would have benefited from ECC RAM without having ECC RAM. By definition you cannot log the failure to detect an error, and you certainly don't have data on whether the error which may or may not have happened was the result of the memory controller messing up.

That said, if you run memtest, you will determine a couple of things. If you find no errors, either you need ECC RAM, or the problem is with something else (so if you rule absolutely every other piece of hardware and software out as the cause, you have shown the need for ECC RAM). If you find consistent errors, chances are the RAM is bad and just needs to be replaced. If you find inconsistent errors, the CPU might be bad, or you might need ECC RAM. If you find that memtest86 crashes, either the lowest-order DIMM is bad, or the CPU is bad, or you need ECC RAM.
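The four cases above can be written down as a small lookup table, which makes it easier to see that only one outcome points at a single cause. This is just a restatement of the paragraph as a sketch; the outcome labels and the function name are made up for illustration.

```python
# Outcome labels are hypothetical names chosen for this sketch.
MEMTEST_READINGS = {
    "no_errors": [
        "you need ECC RAM",
        "the problem is with something else",
    ],
    "consistent_errors": [
        "the RAM is bad and needs to be replaced",
    ],
    "inconsistent_errors": [
        "the CPU is bad",
        "you need ECC RAM",
    ],
    "memtest_crashes": [
        "the lowest-order DIMM is bad",
        "the CPU is bad",
        "you need ECC RAM",
    ],
}


def possible_causes(outcome):
    """Return the candidate explanations for one memtest outcome."""
    try:
        return MEMTEST_READINGS[outcome]
    except KeyError:
        known = ", ".join(sorted(MEMTEST_READINGS))
        raise ValueError(f"unknown outcome {outcome!r}; expected one of: {known}") from None


def is_conclusive(outcome):
    """An outcome is conclusive only if it leaves a single candidate."""
    return len(possible_causes(outcome)) == 1
```

Only `consistent_errors` is conclusive here; every other outcome still leaves "you need ECC RAM" as one possibility among several.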

Regardless, this is very tricky to show definitively. ECC RAM is most useful in applications where invisible errors in calculations are likely to cause extreme problems, or in applications where the sheer quantity of RAM combined with other conditions makes errors statistically likely. However, these criteria themselves are fuzzy and subjective, so it follows that there isn't really an objective criterion for this.

Falcon Momot
  • +1 this is going to be tremendously helpful, never used memtest before now. Thanks Falcon – wlf Sep 14 '13 at 10:44
  • Technically, there is such a thing as unregistered ECC RAM although it is comparatively rare. – Chris Smith Sep 16 '13 at 14:34
  • I don't think I've seen unregistered ECC ram in any new hardware at all, but yes, if you had some it might not help as much in this case. – Falcon Momot Sep 16 '13 at 18:05

If anywhere, it would probably log to

 /var/log/mcelog 

(this is where critical CPU events go on RHEL and its derivatives)
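A quick way to check that file is sketched below. It assumes that each record written by mcelog begins with a line starting with `MCE ` followed by a number; that prefix and the function names are assumptions of this sketch, so adjust the match to whatever your log actually contains.

```python
from pathlib import Path


def count_mce_records(log_text: str) -> int:
    """Count lines that open a record, assuming each starts with 'MCE '."""
    return sum(1 for line in log_text.splitlines()
               if line.startswith("MCE "))


def summarize(path="/var/log/mcelog") -> str:
    """Describe what, if anything, has been logged at the given path."""
    log = Path(path)
    if not log.exists():
        return f"{path} not found (is mcelog installed and running?)"
    text = log.read_text(errors="replace")
    if not text.strip():
        return f"{path} is empty: no machine check events recorded"
    return f"{path}: {count_mce_records(text)} record(s) beginning with 'MCE'"

# Example: print(summarize())
```

An empty or missing file does not prove the RAM is fine; on non-ECC hardware it only means the CPU did not report a machine check.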

Florian Heigl