I am trying to replace the nodes of our old, failing blade system with old Supermicro 1U servers, which are cheap now. I am paying with my own money, so price matters (the company doesn't want to buy new equipment). The servers contain Supermicro PSUs with good capacitors from good brands, not Ablecom (I checked the values and ESR on 3 of them and they are good), Supermicro H8DGU-F boards (SR5670 + SP5100 chipset), 2x Opteron 6238 12-core CPUs, and 2 Intel Kawela LAN ports. An InfiniBand ConnectX-2 or ConnectX-3 card is installed in the board's only PCIe slot (the cards differ between servers, and the problem persists with either). We use CentOS 7 from autumn 2019 as the operating system, but the shop where I bought the servers says the problem also appears in Windows. They say they selected the best items they had and ran some tests, but today I faced this nasty problem again on this hardware...

The problem is that the operating system hangs spontaneously, mainly while CentOS is starting (while initializing hardware, before the "Welcome..." text) or when the system is under load (scientific calculation, all cores). The machine becomes inaccessible via SSH, the screen is blank, and there is no reaction to the keyboard or mouse. If it hangs while loading the OS and you didn't press Esc to show messages, the bottom progress bar keeps moving for a while, then stops. If you press Esc, you may see that it hung while checking the HDD or initializing InfiniBand...

The BIOS is updated to the newest version, 3.5c, the CMOS was cleared, and optimal defaults were loaded. I monitored temperatures with IPMI and simply with my finger; nothing looks bad. Voltages in IPMI and the BIOS are good. Ripple on the 12 V rail at high load is 200 mV at most; I don't think that can cause hangs, and the problem occurs with different power supplies. I bought 4 servers and 6 H8DGU-F boards.
Two of the 1.01-revision boards hang identically (after 2 days of load, after hours of load, when the calculation starts, or during boot). One 1.01-revision board from the same stock worked for 7 days under maximum load and rebooted successfully about 10 times. One 2.00-revision board has all the memory slots of CPU2 dead (not relevant here; a replacement has been sent). Another 2.00-revision board worked successfully for 9 days and booted successfully about 10 times. What can be the reason? I can't believe server boards are that bad. It's really frustrating. They are expensive when new; shouldn't they be reliable and durable at that price? Can someone please suggest what the cause might be?
(Sorry, this is about the IPMI version of the board, so I corrected the topic.)