Server suddenly very sensitive to minor brown-outs

Question

We have a number of SuperMicro RAID 10 boxes with redundant PSUs, all the same model and spec and all using the same rackmount APC UPS. One has suddenly taken to rebooting if there is a minor brown-out.

Windows logs only point to 'unexpected' - as in loss of power. This has happened before, and replacing the UPS has always fixed it, so we swapped out the 4-year-old UPS for a new one. As the server was booting, the UPS decided to self-test, and the server rebooted again!

We couldn't fault the unit we took out, so unless two PSU's are playing up at the same time I can't think of anything else that could cause it. Swapping out the standard UPS for an online one would almost certainly cure it, but if something is failing...

============= FURTHER CLARIFICATION OF QUESTION =======================

The server has stayed up for nearly four years using the same UPS and configuration.
Recently, any drop in utility power that causes the UPS to switch to battery makes the server reboot.
Swapping out the UPS does not appear to have fixed the issue: the new UPS went into self-test while the server was booting, and the server then rebooted.

I assume there is a controller for the PSU's? In the last couple of months something has become more sensitive to the milliseconds it takes to switch to battery.

With downtime being a major factor, replacing the UPS again with an online one (like the APC SRT range) would cure the current problem - but is this a symptom of something that could develop into another serious issue?

"so unless two PSU's are playing up at the same time" - and both connected to the same UPS? You are aware how much standard practices are violated with this? What does the APC log say? — TomTom, Mar 26 '20 at 21:43
We don't use Powerchute but the log from the extended menu on the unit shows nothing — gchq, Mar 26 '20 at 22:20
The extended menu is only for things like battery runtime and current UPS load. It has no qualitative data on transient events and does not log them. – Rowan Hawkins, Jan 08 '21 at 02:23

Rowan Hawkins · Answer 1 · 2020-03-30T22:23:42.147


edit: You are having one of three problems, which you can't tell apart without more logging of how long the system is actually losing power. You mentioned seeing transient losses on the UPS. You need to correlate those with the logs from the Supermicro. If you don't want to use the UPS logs, you can instead attach one cord to the wall, but the UPS log will provide much better detail on the state.
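As a minimal sketch of that correlation step: assuming you have already pulled the UPS on-battery times and the server's power-loss times out of their respective logs as timestamps, something like the following would pair them up. The function name, the 60-second window and the sample timestamps are all hypothetical; nothing here comes from an APC or Supermicro tool.

```python
from datetime import datetime, timedelta


def correlate(ups_events, server_events, window=timedelta(seconds=60)):
    """Pair each server power-loss time with the UPS events within `window` of it."""
    matches = []
    for s in server_events:
        near = [u for u in ups_events if abs(s - u) <= window]
        matches.append((s, near))
    return matches


# Hypothetical sample data, for illustration only.
ups = [datetime(2020, 3, 1, 10, 0, 5), datetime(2020, 3, 9, 14, 30, 0)]
srv = [datetime(2020, 3, 1, 10, 0, 20), datetime(2020, 3, 15, 8, 0, 0)]

for when, near in correlate(ups, srv):
    label = "matches a UPS transfer" if near else "no UPS event nearby"
    print(when.isoformat(), label)
```

A server event with no UPS event nearby would point away from the UPS transfer and toward something inside the chassis. A UPS event with no matching server event means the server rode through that transfer.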

All of these would trigger the BIOS event log to record a power loss event: (1) the actual outage is longer than it appears, (2) the UPS is not providing the power needed during the transient (a regulation issue), or (3) there is a problem with the PSUs or the PDU in the system not maintaining the Power Good (PG) state to the motherboard.

Supermicro chassis draw different amounts of power depending on how many PSUs are operating. A dual-PSU system draws 50% of its load from each PSU. When the UPS output goes low, the lagging supply attempts to go to 100%, which accelerates the UPS draw-down.

You shouldn't have both supplies on the SAME UPS, and you should be looking at your UPS load via management.

You should also be aware that, with the supplies split across two UPSes, the load on the non-browning UPS will double. My guess is the UPS is not sized properly for the peak load expected. That is affecting both the systems and the UPS.
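To put rough numbers on that doubling, here is a small sketch. It assumes the two supplies share the load evenly, as described above, and that each supply sits on its own UPS. The wattages and the function name are hypothetical examples, not measurements from this system.

```python
def ups_load_percent(server_watts, ups_capacity_watts, feeds_alive):
    """Percent load on each live UPS when the live PSUs share the load evenly."""
    if feeds_alive < 1:
        raise ValueError("at least one feed must be alive")
    per_feed_watts = server_watts / feeds_alive
    return 100.0 * per_feed_watts / ups_capacity_watts


# Hypothetical figures: a 400 W server, 1000 W UPSes.
normal = ups_load_percent(400, 1000, feeds_alive=2)    # both UPSes healthy
failover = ups_load_percent(400, 1000, feeds_alive=1)  # one UPS browning out
print(f"normal: {normal:.0f}% per UPS, failover: {failover:.0f}%")
```

The point is only that a UPS sized against the shared figure needs headroom for the failover figure.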

The management GUI on the UPS has graphs of power quality as well as battery status and load information. This will tell you whether your brownouts are happening at particular times, and may let you track down why you're having them in the first place.

The supplies do appear to be operating correctly when the power is good. What is still unknown is what happens during the event. The removable internal PSUs only put out +12 V and Power Good (+5 V). Internal to the system is a power distribution unit (PDU) that has the cords onto the motherboard and splits the +12 V into all of the other voltages the motherboard needs. When the UPS has the transient, for whatever reason the power to the Supermicro is falling outside the ATX spec (quoted below), that is, below 95% of rated values, and the system is shutting down.

It is possible for that PDU to be bad, but the only way to test it is by swapping the supplies from a system not seeing the problem. Swapping the PDU is a real chore.

Per Wikipedia (ATX, Power good): "The ATX specification requires that the power-good signal ("PWR_OK") go high no sooner than 100 ms after the power rails have stabilized, and remain high for 16 ms after loss of AC power, and fall (to less than 0.4 V) at least 1 ms before the power rails fall out of specification (to 95% of their nominal value)."
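Those two figures can be turned into a quick sanity check. The sketch below uses only the 16 ms hold-up and the 95% rail threshold quoted above. The UPS transfer times and rail readings passed in are hypothetical inputs you would have to measure or look up for your own unit, and the function names are made up for illustration.

```python
HOLD_UP_MS = 16.0      # PWR_OK must stay high this long after AC loss (quoted above)
RAIL_TOLERANCE = 0.95  # rails are out of spec below 95% of nominal (quoted above)


def rides_through(transfer_ms, hold_up_ms=HOLD_UP_MS):
    """True if the supply's hold-up time covers the UPS transfer gap."""
    return transfer_ms <= hold_up_ms


def rail_in_spec(measured_v, nominal_v, tolerance=RAIL_TOLERANCE):
    """True if a rail is at or above the lower limit (upper limit not checked)."""
    return measured_v >= nominal_v * tolerance


# Hypothetical inputs, for illustration only.
print(rides_through(8.0))       # transfer gap shorter than the hold-up time
print(rides_through(20.0))      # transfer gap longer than the hold-up time
print(rail_in_spec(11.0, 12.0)) # a +12 V rail sagging to 11.0 V
```

If the transfer time is comfortably under 16 ms and the server still drops, one possible reading is that the real hold-up of the supplies or the PDU has fallen below that figure. That is consistent with possibility (3) above, but it is not proof of it.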

edited Mar 30 '20 at 22:23

answered Mar 27 '20 at 21:48


The main reason for using the same UPS is (apart from lack of space in the rack) to cover the PSU's. It's worked flawlessly for years and powered the unit for some 20 mins during complete power outages - as with the other identical server/UPS configurations. Brown-out may have been the wrong term; it's a very short drop in utility power. The drop is so short that, for example, clocks on things like the toaster oven, which reset if the drop is longer than a second or two, are not affected, but a fish tank pump has to be re-primed even after a split-second drop. – gchq Mar 28 '20 at 01:12
UPS is running at 11% load. I would not have thought that to be excessive. It's giving itself a clean bill of health during a self-test (both the old unit and replacement), but the server is rebooting during the switch to battery - something it's not done in the past. – gchq Mar 28 '20 at 01:18
You are correct, 11% is not much loading on the UPS. The term for those types of drops is transients. Without logging from the UPS, you don't know how often they are occurring; you only know when there is a full cut. The UPS log would also give you something to take back to your power supplier, because transients should not happen that often. You can also monitor the PSUs in the Supermicro over the management interface from another system with ipmicfg. It doesn't matter whether ipmicfg is running on Windows or Linux for that to work; you just need to pass it IPMI credentials, preferably by file. – Rowan Hawkins Mar 28 '20 at 22:23
What may work better is SuperDoctor 5, which is free from Supermicro and will let you log information better than dumping it to a file. I misremembered the commands for ipmicfg; they were actually for an older version, SuperDoctor II, which I ran from the command line on Linux. – Rowan Hawkins Mar 28 '20 at 22:40
Rowan - I did, recently, install SuperDoctor. I didn't install the web browser part though, so might have to go back and reinstall. It's currently running something with Java. There is a SuperMicro BMC utility that gives both PSU's a clean bill of health - along with everything else. So far there have been three drops and one self test this year. – gchq Mar 29 '20 at 01:13
I bounced this issue off another tech I know, and he suggested that if your other server has the same size PSUs, you swap them with the ones in this unit. What that tests is whether the supply is the problem or whether the power distribution unit that links the two supplies is. You would want to do it during a maintenance window, but if there are no issues with the removable supplies, the other one will handle the load while you do the swap. You'll want to let the fans spin down and stabilize before swapping out the second supply. – Rowan Hawkins Mar 29 '20 at 02:43
You don't want to do a mixed-size power supply swap though, so you need to make sure they are the exact same part number. You should be able to get that from the BMC before you do the swap. – Rowan Hawkins Mar 29 '20 at 02:44
Rowan. The issue there is that the problem child is a web service. If that reboots it's not good, but provided everything restarts it's a ten-minute loss of service. The others are DB servers, and the DBs would have to be manually started, along with a much higher risk of corruption. – gchq Mar 29 '20 at 11:53
During the outage window, verify that the issue is happening on the problem system by unplugging the UPS. Then reconnect the UPS power, swap in the supplies from the powered-off DB server, and unplug the UPS again. That will confirm whether or not the other supplies show the same issue. You have the supply SNs in your screenshot. – Rowan Hawkins Mar 30 '20 at 22:27
Yes the Power Distributor looks a major pain in the arse to replace (with downtime snapping at your ankles). According to SuperMicro if both the PSU's are showing a green LED (they are) then both the PSU's and Power Distributor should be OK. – gchq Mar 30 '20 at 23:46
I have also added a screenshot of the IPMI event log. The highlighted items are at the same dates/times that Windows reported unexpected shutdowns, and a NAS also shows those as the times when that UPS went to battery. The Drive Slot entries started after the UPS was replaced. I have no idea what 'Drive Presence' means. The RAID controller is not showing any issues. – gchq Mar 30 '20 at 23:55
