I am trying to replace the nodes of our old, failing blade system with old Supermicro 1U servers, which are cheap now. I am paying for this myself, so price matters (the company doesn't want to buy new equipment). The servers contain Supermicro PSUs with good caps from good brands, not Ablecom (I checked the values and ESR of 3 of them and they are good), Supermicro H8DGU-F boards (SR5670 + SP5100 chipset), 2x Opteron 6238 12-core CPUs and 2 Intel Kawela LAN ports. An InfiniBand ConnectX-2 or ConnectX-3 card is inserted in the only PCIe slot of this board (the InfiniBand cards differ, and the problem persists).

We use CentOS 7 from autumn 2019 as the operating system, but the shop where I bought the servers says the problem also appears in Windows. They say they selected the best items they had and ran some tests, but today I faced this nasty problem again on this hardware.

The problem is that the operating system hangs spontaneously, mainly while CentOS is starting (during hardware initialization, before the "Welcome..." text) or when the system is under load (scientific calculations on all cores). The machine becomes inaccessible via SSH, the screen is blank, and there is no reaction to the keyboard or mouse. If it hangs while the OS is loading and you haven't pressed Esc to show the messages, the bottom progress bar keeps moving for some time, then stops. If you press Esc, you may see that it hung while checking the HDD or initializing InfiniBand.

The BIOS is updated to the newest version, 3.5c, the CMOS was cleared and optimal defaults were loaded. I monitored temperatures with IPMI and simply with my finger: nothing bad. Voltages in IPMI and in the BIOS are good. Ripple on the 12 V rail at high load is at most 200 mV; I don't think that can cause hangs, and there are different supplies in the servers.

I bought 4 servers and 6 H8DGU-F boards. Two of the rev. 1.01 boards hang identically (after 2 days of load, after hours of load, when the calculation starts, or during boot); one rev. 1.01 board from the same stock worked for 7 days under maximum load and was successfully rebooted about 10 times; one rev. 2.00 board has all memory slots of CPU2 dead (not relevant here, a replacement has been sent); one rev. 2.00 board worked successfully for 9 days and booted successfully about 10 times.

What can be the reason? I can't believe server boards are that bad. It's really frustrating. They are expensive when new; shouldn't they be reliable and durable for that price? Can someone please suggest what the reason might be?
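Since the hangs are sporadic, one option is to log the IPMI readings continuously from another machine, so that the last values before a hang are preserved. Below is a minimal sketch, assuming the pipe-separated table that `ipmitool sensor` normally prints (name, value, unit, status, then thresholds). The sensor names in `SAMPLE` and the helper names `parse_sensor_table` and `read_sensors` are hypothetical and only for illustration; the real names on the H8DGU-F may differ.

```python
import subprocess

# Hypothetical sample in the usual `ipmitool sensor` layout:
# name | value | unit | status | six threshold columns
SAMPLE = """\
CPU1 Temp        | 45.000     | degrees C  | ok    | na        | na        | na        | 85.000    | 90.000    | 95.000
+12 V            | 12.084     | Volts      | ok    | 10.144    | 10.272    | 10.784    | 12.944    | 13.264    | 13.392
Chassis Intru    | 0x0        | discrete   | 0x0000| na        | na        | na        | na        | na        | na
FAN 3            | na         |            | na    | na        | na        | na        | na        | na        | na
"""


def parse_sensor_table(text):
    """Return {sensor name: (value, unit, status)} for numeric readings only."""
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue  # not a sensor row
        name, raw_value, unit, status = fields[:4]
        try:
            value = float(raw_value)
        except ValueError:
            continue  # "na" or a discrete reading such as "0x0"
        readings[name] = (value, unit, status)
    return readings


def read_sensors():
    """Run `ipmitool sensor` locally and parse its output (needs ipmitool installed)."""
    result = subprocess.run(
        ["ipmitool", "sensor"], capture_output=True, text=True, check=True
    )
    return parse_sensor_table(result.stdout)
```

Calling `read_sensors()` every few seconds and appending the result with a timestamp to a file on a different host would show whether any temperature or voltage drifts just before a hang.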

(Sorry, this is about the IPMI version of the board, so I corrected the title.)

Andrew
  • I'm not sure, but I had similar issues with a DL380. I solved them by respecting exact pairs of RAM and CPUs, then setting up the BIOS, then setting up the SAS disk controllers; only after all of that did CentOS boot properly, and I was able to run the installation to the end and use the server for my own training. – francois P Feb 14 '20 at 17:47
  • I had a similar but not identical problem with an H8DGi-F that has onboard Adaptec and external PCI LSI RAID controllers. CentOS setup crashed when initializing drive partitions (at the very beginning of the main installation process, before formatting). To avoid it, use the Rescan drives (or similar) button in the setup GUI before modifying partitions; it is also better to wipe the beginning of the system drive with dd before installation if you don't need the existing partitions (data). But that is a different issue: it is a crash, not a hang, and it is reproducible, while the hangs are sporadic and seem not to depend on the OS. – Andrew Feb 15 '20 at 11:35

1 Answer

It is strange that there are no answers; maybe the board is just too old. So I am (partially) answering my own question. I talked to one of the companies I bought these boards from. They say that this model (rev. 1.01 and 2.00) is problematic; another shop confirms this for rev. 1.01. Of 4 boards of rev. 2.00, one doesn't see the memory in the CPU2 slots, one reboots and has network problems, and two are being tested under full load now. Of the rev. 1.01 boards, two hang (while booting, immediately after high load starts, or after hours or ~2 days of high load), one worked for 2.5 weeks under high load, and one worked for ~2 days under high load. (Usually 15...20 CentOS 7 boot tests are done before putting a server into operation, because this sometimes helps to identify the hanging earlier than just putting the server under load.) So avoid the H8DGU-F: these boards seem to be very unreliable, although cheap. In my situation I see no other option because of the high prices of other Opteron 6000 boards, so I will check whether there are 3...4 good boards to use, maybe together with one H8DGi or H8DG6 (these are dual-chipset versions with an onboard RAID controller, so they are 2...5 times more expensive).

Andrew