3

A client just experienced a complete failure of an APC AP7911A switched/metered rack power distribution unit (PDU). This obviously took all of the connected equipment down with it. The equipment is fine, as are the upstream UPS units.

In situations where it's not possible to balance devices across multiple power feeds/PDUs/UPS units (e.g. switches with single power supplies, lack of high-line power feeds, etc.), how do you mitigate failures like this? This was a single rack installation in a less-than-ideal computer room, but typical for most small/medium businesses. Should one plan for individual PDU failure, or is it just something that gets dealt with when it happens?

ewwhite

5 Answers

3

Multiple PSUs in a server are OK, but they are not a magic bullet. When a power component fails, it often takes out other things around it, e.g. the backplane that both of your redundant PSUs connect to. You are far more likely to keep running if you have two servers on separate UPSs.

Best of all is to build redundancy into your application or platform layer, so that machines or racks can go out without causing a problem. When you haven't got the budget for that, you can still reduce the risk by keeping spares of any non-redundant equipment ready to swap in, and by keeping things simple. A fancy managed PDU is far more likely to go down than a dumb power bar.

Also, it is worth bearing in mind that many small businesses simply can't do things the proper way, or choose to do things the cheapest way and live with the consequences if they happen. I've seen inexperienced admins go out of their way to avoid approaches that have been slated here or on similar sites, only to put something worse in place. A less-than-ideal solution is often better than nothing.

JamesRyan
  • That's why you plug the redundant power supplies into separate UPSes that are each plugged into different PDUs that are each plugged into different mains circuits. – psusi Dec 15 '11 at 15:56
  • 1
    I've had 1 UPS failure and a number of PSU failures, but in every instance it took the redundant PSU or whole machine with it. So in my experience it is an extra cost/complication with very little benefit. Personally, I build in redundancy at other levels so that it isn't needed, wherever possible. – JamesRyan Dec 15 '11 at 16:01
2

I've been in exactly the same situation, where I've done my best to have redundancy across a cluster of servers, but the situation has been let down by the failure of one power source which in turn has caused a device that has only one PSU to fail. Sometimes the single PSU device has been critical, like a backup DC, a switch or a rack cabinet fan array.

The best answer I've come up with is using a PDU with an **Automatic Transfer Switch** (ATS). This allows you to connect the PDU to two power sources, and it will switch between the two with no downtime if one fails. This is ideal for your single-PSU devices, because they stay on. An ATS typically has about 8 outlets, so it effectively takes the place of a PDU.

In a typical SME scenario you won't have two power circuits in the datacentre, but you may have a rack fed either from one UPS plus the mains, or from the mains through two UPSs. An ATS provides good protection there; otherwise you are always gambling on which power source is going to fail first. I also think that ATS units are more resilient than standard PDUs, which further mitigates disaster.

Mark Lawrence
1

As for legacy kit with a single PSU, as far as I know it's as you say: it's just something that gets dealt with when it happens. But definitely plan for it to happen.

I'd make a note of the kit which is set up like this if possible, plan for the failure, and expect it at some point.

I'd suggest making sure backups are well planned and running well, and disaster recovery plans are well thought through and tested regularly.

When it comes round to buying new kit, I'd buy servers with dual PSUs and plug each PSU into a separate UPS (via a PDU if necessary). Even cheap low-end small/medium business Dell servers can be bought with dual PSUs.
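A minimal sketch of the "make a note of the kit" idea, assuming you keep a simple mapping of each device to the PDUs its power supplies are plugged into. The device and PDU names are made up for illustration, and `single_pdu_devices` is a hypothetical helper, not part of any vendor tool. It flags every device that would go dark if one PDU failed, which covers both single-PSU kit and multi-PSU servers cabled to a single PDU.

```python
from collections import defaultdict

# Hypothetical inventory: device -> PDU that each of its PSUs is plugged into.
# All names below are invented for this example.
inventory = {
    "db-server": ["pdu-a", "pdu-a", "pdu-a", "pdu-a"],  # four PSUs, all on one PDU
    "web-server": ["pdu-a", "pdu-b"],                   # properly split
    "core-switch": ["pdu-a"],                           # single PSU
}

def single_pdu_devices(inventory):
    """Group devices that lose all power when one specific PDU fails."""
    at_risk = defaultdict(list)
    for device, feeds in inventory.items():
        distinct = set(feeds)
        if len(distinct) == 1:  # every PSU hangs off the same PDU
            at_risk[next(iter(distinct))].append(device)
    return {pdu: sorted(devices) for pdu, devices in at_risk.items()}

report = single_pdu_devices(inventory)
for pdu, devices in sorted(report.items()):
    print(f"{pdu}: {', '.join(devices)}")
# prints: pdu-a: core-switch, db-server
```

Anything that shows up in the report is a candidate for re-cabling, an ATS, or a spare on the shelf.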

Kenny
  • In this case, the server actually had 4 power supplies... They shouldn't have all been connected to the same PDU... – ewwhite Dec 15 '11 at 09:48
  • In a company where you don't use redundant equipment chances for a well-thought-through disaster recovery plan are infinitesimal. – the-wabbit Dec 15 '11 at 09:53
  • Any reason why this (connecting all PSUs to the PDU) was done? Did the client want to be able to hard-power-off the server through the PDU? – the-wabbit Dec 15 '11 at 10:05
  • It was just bad planning. I wasn't aware of the setup. – ewwhite Dec 15 '11 at 10:20
1

I'm in a slightly unusual situation: we have multiple datacentres of our own, we get to decide how things work, and we use blades. In general, we have half our PSUs go to one PDU and the other half go to another PDU, for exactly this reason.

Typically both rack PDUs are fed from the same very large PDU/UPS, each of which serves half of a 40-rack row. So we split our clusters along rows: cluster member 1 goes in one of the first 20 racks of the first row, number 2 in the second 20 racks of the first row, number 3 in the first 20 racks of the second row, and so on. This way we're covered if we lose a PSU, a PDU, a large PDU/UPS or a whole row (through flooding, fire, etc.).

As I say, this is a little unusual, but hopefully it gives some insight into how we do it. I'd always suggest different PDUs, but if you use multiple central/large PDUs and UPSs, make sure that you're not getting phases too far out, for safety reasons (search SF for previous cross-phase arguments :) )
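The row-splitting scheme can be sketched roughly as follows. This is only an illustration under assumptions: the failure-domain names are invented, each domain stands for a group of racks sharing one large PDU/UPS, and `place_members` and `survivors` are hypothetical helpers rather than anything from a real scheduler.

```python
from itertools import cycle

# Hypothetical failure domains: each is a half row of racks on one large PDU/UPS.
domains = [
    "row1-racks01-20",
    "row1-racks21-40",
    "row2-racks01-20",
    "row2-racks21-40",
]

def place_members(members, domains):
    """Assign cluster members to failure domains round-robin."""
    return dict(zip(members, cycle(domains)))

def survivors(placement, failed_domain):
    """Members still powered after one failure domain goes down."""
    return sorted(m for m, d in placement.items() if d != failed_domain)

placement = place_members(["node1", "node2", "node3"], domains)
print(survivors(placement, "row1-racks01-20"))
# prints: ['node2', 'node3']
```

As long as the cluster has no more members than there are domains, losing any one domain takes out at most one member.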

Chopper3
  • `search SF for previous cross-phase arguments` - my search did not show up anything relevant. Any pointers? – the-wabbit Dec 15 '11 at 10:07
  • Note that multiple PDUs are not an option, as stated in the question. – Roman Dec 15 '11 at 10:11
  • @Roman, I did spot the part about not being able to use multiple PDUs but if you'd read my post properly you'd have seen that I addressed that by discussing how you can split similar load-balanced equipment across racks as that's really the only option there is. – Chopper3 Dec 15 '11 at 10:21
-2

If you can't install a second PDU in the rack, you have no option other than to set up your servers so that a sudden power loss does only minimal damage.

  1. First of all, I'd make sure to use battery-backed RAID controllers, so that the on-disk data will be consistent, or at least can be brought to a consistent state when power is restored.
  2. Second, use journaling filesystems. This helps keep the filesystem consistent.
  3. Third, try to set up all running services so that they have something akin to transactions: all data structures can be brought back to a consistent state, accepting a minimal data loss if necessary (rollback). This varies greatly from service to service (databases, frequency of modifications, logs...) and may or may not require quite a lot of handiwork on your side, if it's possible at all.
  4. Fourth, adjust your backup strategy accordingly and try to have more and smaller backups (instead of few and big ones).

But I need to be honest here: the first three won't give you 100% protection. Be prepared to restore from backup at any time.
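For services you write or script yourself, one common way to get the "something akin to transactions" behaviour from point 3 is to write to a temporary file, flush it to disk, and then rename it over the original. The sketch below is an assumption about how that might look in Python, not a recipe for any particular service; `atomic_write` is a hypothetical helper name. It relies on `os.replace` being atomic on POSIX filesystems, and how durable the result really is still depends on the filesystem and on write caches in the storage path.

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Replace the file at `path` so readers see the old or new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to stable storage
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also fsync the directory so the rename itself is persisted (POSIX only).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

If power is cut mid-write, the original file is still intact and at worst a stray temporary file is left behind, which is the "minimal data loss" trade-off described above.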

Roman
  • Well you do have the option to cluster functionality across racks to mitigate against such problems of course, as I mentioned, and you failed to read when criticising me, in my post below. – Chopper3 Dec 15 '11 at 11:03