diagnostics for a server that keeps turning off

Question

I have a 1U Supermicro box that's a few years old and off-warranty. Recently it has begun randomly shutting down. It will stay up for anywhere from a few hours to a week and then stop responding. The IPMI console shows it as powered on but it's completely non-responsive.

I'd very much like to fix this machine as the owners are very budget constrained. It currently runs CentOS 7.

What I've looked for:

IPMI logs - empty
System logs - nothing relevant
SAR - nothing interesting
Hardware sensors - fans are on, CPU temp is nominal

What I've tried:

Supermicro diagnostics - the (UEFI) image won't boot properly on this system
memtest86+ - ran for 24 hours with no incident

Given that it has redundant power supplies, I'm thinking power isn't the issue. That leaves the CPU and mainboard.

What other tests can I run?
What other log sources could I look into?
What else might be failing?

Edit:

Started up said machine and let it run until it quit (12 hours?). The IPMI window shows that it's stuck on the boot page of all things.

It had been booted and running. This makes me think it's a main board issue. There aren't any USB devices plugged in and it's well and truly wedged.

When the system freezes, do you have to hard power it off to recover, or is resetting it sufficient? Is the OS up to date? — Michael Hampton, Sep 26 '18 at 17:06
Are the two PSUs getting power from the same source? Perhaps you're having brownouts... — Michael Hampton, Sep 27 '18 at 19:16
I suggest disabling console blanking (e.g. `setterm --blank 0` run on vt 1, or better yet, `consoleblank=0` on the kernel command line) just in case a panic does get logged to console. You'll then be able to see it in the IPMI virtual console the next time it happens. — Michael Hampton, Sep 28 '18 at 16:57
Is there some way to log kernel panic / related to the system log? Something that would persist after reboot? — ethrbunny, Sep 30 '18 at 18:22
You can't reliably log a real kernel panic, because the kernel believes itself to be in such a state that it can't write to any disk or network reliably. In this case it will only write to the console (and a serial console if one is set up). — Michael Hampton, Sep 30 '18 at 20:07
I'll have to figure out how to get a monitor connected. The datacenter trolls will get fussy if I tie up a crash cart for an extended period. — ethrbunny, Oct 01 '18 at 21:57
That's why God invented IPMI. You have no need to go to the physical console! You can view it remotely from the comfort of your workstation. — Michael Hampton, Oct 02 '18 at 03:39
Your motherboard is well and truly hosed. Time to buy a new one. — Michael Hampton, Oct 03 '18 at 15:30
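The console-blanking suggestion in the comments above can be sketched roughly as follows. This assumes a CentOS 7 install where `grubby` manages the boot entries; the helper name `cmdline_has_consoleblank_off` is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given kernel command line
# already contains the exact parameter consoleblank=0.
cmdline_has_consoleblank_off() {
    case " $1 " in
        *" consoleblank=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the affected machine (run as root; not executed here).
# The new argument only takes effect after the next reboot.
#   cmdline_has_consoleblank_off "$(cat /proc/cmdline)" \
#       || grubby --update-kernel=ALL --args="consoleblank=0"
```

With blanking disabled, any panic text printed before the hang should still be visible in the IPMI virtual console afterwards.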

score 2 · Answer 1 · answered Sep 27 '18 at 19:09

I wouldn't completely rule out the PSU. If they're redundant, you could try running with only one, then the other.

Can you get replacement CPU(s)? Used Xeons are pretty cheap, and you can still sell them afterwards. If it's a multi CPU system, try removing all but one.

Does the system have a separate, replaceable VRM for the CPU?

It could well be the mainboard, but that probably means the machine is dead.

Andreas Heinlein

Are there any OS diagnostics for a mainboard? CPU? – ethrbunny Sep 28 '18 at 09:24

score 2 · Answer 2 · answered Oct 03 '18 at 13:14

Use the process of elimination. Take out one component at a time:

Run the machine without each memory module in turn. If the crashes stop, the module you removed is the culprit.
If it's not the RAM, replace the hard drive with a temporary spare, or boot from a live USB, to rule out the drive. If the crashes stop, the hard disk is at fault.
If the CPUs are removable, try running without each one in turn.
Eliminate the power supplies in the same way.
If the NICs are removable, eliminate them too.
If the problem persists after all of these tests, it's probably a fried motherboard.
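Since the question says a hang can take anywhere from a few hours to a week to appear, it may help to record when each test run actually died. The following is only a sketch of one way to do that; the function name `heartbeat_once` and the file path in the usage note are assumptions, not something from the answers here.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped line to a heartbeat file
# and flush it to disk, so the last line survives a hard hang.
heartbeat_once() {
    printf '%s up\n' "$(date '+%Y-%m-%d %H:%M:%S')" >> "$1" && sync
}

# Usage (not executed here): run in the background during each test,
# then read the last line of the file after the next hang.
#   while :; do heartbeat_once /var/log/heartbeat.log; sleep 60; done &
```

The last timestamp in the file gives an approximate time of failure for each hardware configuration, which makes it easier to compare runs.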

score -1 · Answer 3 · answered Sep 27 '18 at 23:36

Check dmesg for kernel panics and similar errors. The syslog might also give you some hints, assuming the problem is related to the OS.
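As a rough sketch of that check: `dmesg` only covers the current boot, so after a hang and restart the on-disk logs are the place to look. The helper name `scan_for_kernel_errors` and its pattern list are assumptions for illustration, not an exhaustive catalogue of error messages.

```shell
#!/bin/sh
# Hypothetical helper: print lines in a log file that look like kernel
# oopses, machine-check events, or other hardware errors.
# The pattern list is an assumption and may need tuning.
scan_for_kernel_errors() {
    grep -i -E 'kernel panic|oops|machine check|mce:|hardware error|call trace' "$1"
}

# Usage on the affected machine (not executed here):
#   scan_for_kernel_errors /var/log/messages
#   journalctl -k -b -1   # previous boot's kernel log; needs a persistent journal
```

An empty result does not rule out a panic, since a panic may never reach the disk, but oopses and machine-check events that were logged before the hang should show up here.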

Timothy Frew

You cannot run `dmesg` when the power is off. – kasperd Sep 28 '18 at 06:30
Well I think that is plainly obvious :) the original query stated the server isn’t permanently down. It begins randomly shutting down and the OP does have the opportunity to inspect logs etc – Timothy Frew Sep 28 '18 at 10:51
If there is a kernel panic you won't be able to see it with `dmesg`. Besides a kernel panic doesn't cause the system to power off. – kasperd Sep 28 '18 at 12:16
It can cause a completely unresponsive system, including on serial. You can also see a kernel panic in the logs I mentioned. You are incorrect. – Timothy Frew Sep 28 '18 at 12:17
A true kernel panic will not be logged anywhere except to console, as the kernel will be unable to send it anywhere else before it halts the system. But many kernel oopses do not rise to this level, and can be logged locally or across the network. – Michael Hampton Sep 28 '18 at 16:55
Precisely, which is why it’s still worth checking the logs - thanks for the input @MichaelHampton – Timothy Frew Sep 28 '18 at 16:58
