I have a 1U Supermicro box that's a few years old and out of warranty. Recently it has begun locking up at random. It will stay up for anywhere from a few hours to a week and then stop responding. The IPMI console shows it as powered on, but it's completely unresponsive.
I'd very much like to fix this machine, as the owners are very budget-constrained. It currently runs CentOS 7.
What I've looked at:
- IPMI logs - empty
- System logs - nothing relevant
- SAR - nothing interesting
- Hardware sensors - fans are on, CPU temp is nominal
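
Since the logs show nothing relevant, one way to at least bracket when each hang happened is to look for the longest silences in the system log. Below is a minimal sketch of that idea, not something from the steps above. It assumes the traditional syslog line format (`Mon DD HH:MM:SS host ...`) that CentOS 7 writes to `/var/log/messages` by default. The function names and the `year` parameter are made up for illustration; the year has to be supplied because that format does not record it.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a traditional syslog line.

    Returns None for lines that do not start with such a stamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gaps(lines, year, top=3):
    """Return the `top` longest silences between consecutive stamped lines.

    Each result is (gap, last_line_before_gap, first_line_after_gap).
    Assumption: all lines fall within the same calendar year.
    """
    stamped = []
    for line in lines:
        ts = parse_syslog_time(line, year)
        if ts is not None:
            stamped.append((ts, line.rstrip("\n")))

    # Pair each stamped line with the one that follows it.
    gaps = [(b[0] - a[0], a[1], b[1]) for a, b in zip(stamped, stamped[1:])]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

Passing an open file handle for `/var/log/messages` as `lines` returns the longest gaps. The last line before a long gap is the last thing the machine managed to log before it wedged, which can then be lined up against the sar and sensor data.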
What I've tried:
- Supermicro diagnostics - the (UEFI) image won't boot properly on this system
- Memtest86+ - ran for 24 hours with no errors
Given that it has redundant power supplies, I'm thinking power isn't the issue. That leaves the CPU and mainboard.
- What other tests can I run?
- What other log sources could I look into?
- What else might be failing?
Edit:
I started the machine up and let it run until it hung (12 hours?). The IPMI console window shows that it's stuck on the boot screen, of all things.
It had been fully booted and running before that. This makes me think it's a mainboard issue. There aren't any USB devices plugged in, and it's well and truly wedged.