
I'm facing an extremely weird issue with one server: it randomly freezes/hangs with no output on the screen, does not respond to keyboard shortcuts, and requires a cold boot. After the cold boot, there are no errors on the boot screen at all.

It does not freeze under heavy load. At the time of a crash, CPU usage is around 9-20% and the load average is around 2-5 (12-core CPU, 128 GB RAM).

We checked the logs, but nothing shows up: no kernel panics, nor anything else that relates to the issue itself.

After every freeze, when we check the log following the cold boot, we see the OOM reaper killing PHP processes (users reaching their limits) as it normally does. It is nothing too abusive, but an OOM kill is always there. Sometimes the log simply ends at the time of the freeze, and sometimes the entries from the time of the crash are followed by a few lines with an older date, and then the server freezes.
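To check how closely the OOM kills line up with each freeze, the kernel log can be scanned for OOM-kill lines. The following is a minimal sketch, not a definitive tool: it assumes the kernel writes messages of the form "Out of memory: Killed process PID (name)" or "Memory cgroup out of memory: Killed process PID (name)" (older kernels print "Kill process" in the first form), and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed message shape; matches "Kill process" and "Killed process",
# with or without the "Memory cgroup" prefix.
OOM_RE = re.compile(
    r"out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)", re.IGNORECASE
)


def find_oom_kills(lines):
    """Yield (line_number, pid, process_name) for every OOM-kill line."""
    for lineno, line in enumerate(lines, start=1):
        match = OOM_RE.search(line)
        if match:
            yield lineno, int(match.group(1)), match.group(2)


def summarize(lines):
    """Count OOM kills per process name."""
    return Counter(name for _, _, name in find_oom_kills(lines))
```

It can be fed any iterable of log lines, for example an open log file or the saved output of `journalctl -k -b -1` (kernel messages of the previous boot, if the journal is persistent). If the last OOM kill is always only seconds before the log stops, that points in a different direction than kills spread evenly over the day.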

Nothing in the logs points to a software problem or to heavy load, just normal operation. This machine is an upgrade from an old one that was stable for years. The freezes are random: they can happen after a week of uptime, after two days, after three weeks, and so on.

We also tried to capture a vmcore dump of a freeze, but nothing gets caught there either.

It just freezes with no screen output. The server is still powered on but not pingable, and SSH is unreachable. As I said, the KVM also shows no output on the screen at all.

Could it be related to faulty hardware? My suspicion is faulty RAM.

I'm extremely lost with this issue. Thanks.

Danco

2 Answers

  1. Make sure the temperatures are good (CPU, RAM, chipset, disks). I assume you are a Linux user because you mention OOM; install lm-sensors and check the temperatures with the sensors command.
  2. It's your RAM. Run memtest86, and be aware that a full test on 128 GB can take a week.
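Because the freezes are random, a one-off look at the sensors command may miss the moment that matters. A rough sketch of a logger that writes timestamped temperatures to disk, so the last reading before a freeze survives the cold boot, could look like this. It assumes the standard Linux hwmon sysfs layout (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius); the function names, the log path, and the 30-second interval are arbitrary choices for illustration.

```python
import glob
import os
import time


def read_temps(base="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM": degrees_celsius} for each readable sensor."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        hwmon = os.path.basename(chip_dir)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = "unknown"
        try:
            with open(path) as f:
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor present but not readable right now
        label = os.path.basename(path)[: -len("_input")]
        temps[f"{hwmon}:{chip}/{label}"] = millidegrees / 1000.0
    return temps


def format_line(temps, now=None):
    """Format one log line: local timestamp followed by all readings."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    readings = " ".join(f"{key}={value:.1f}C" for key, value in sorted(temps.items()))
    return f"{stamp} {readings}"


def log_forever(logfile="/var/log/temps.log", interval=30):
    """Append a reading every `interval` seconds, forcing each line to disk."""
    while True:
        with open(logfile, "a") as f:
            f.write(format_line(read_temps()) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the line survives a hard freeze
        time.sleep(interval)
```

Calling `log_forever()` from a small script started at boot is enough. After the next freeze, the last lines of the log show whether any sensor was climbing beforehand.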
Egidijus
  • Yeah, it's Linux based. Do you think it's related to temperature, or to the hardware? I was thinking of getting a new server, migrating the data, and then moving it into the old server's rack to rule out the possibility of a hardware fault. – Danco Oct 19 '21 at 00:06
  • If there are no clear signs in software, then it is very likely hardware. Temperature is hardware (software can't feel a warm touch). – Egidijus Oct 19 '21 at 05:32
  • I really doubt it relates to temperature, since the server is not under heavy load when it freezes. I don't think the CPU can reach 95 degrees at a CPU load of 9% or 20%; it reaches those loads daily and nothing happens. – Danco Oct 19 '21 at 08:47

We just migrated to another server. After a lot of searching and debugging, it looks like a hardware issue with the motherboard: in some forums about ASRock Rack motherboards and Ryzen CPUs I found a few cases of the same issue, even on Windows 10 or Windows Server, where it shows up as a blue screen of death. The OS support team advised against switching to a different motherboard brand, since the system might then refuse to boot, and suggested migrating to a new server instead, which is what we did. After we migrated to the new server, all the issues were resolved, so I guess it was a hardware issue and not software.

Danco