2

How do large server farms handle gracefully shutting down all or part of the farm? I'm thinking of planned and unplanned cases like:

  • "We need to shut down Rack 42"
  • "We need to do work on the power feeds to the whole block"
  • "Blackout! UPSes running out of juice! Aahh!"
  • "AC is down, air temp is 125F and climbing"

The issues I'm interested in are how people handle sequencing and how the whole thing gets kicked off. It also occurs to me that this could easily get entangled with bringing services up and down and with the software upgrade system.

(At this point I'm more asking out of curiosity than anything.)

BCS
  • Yes, I know this is not directly programming, but I can't imagine that many such systems manage this without some sort of program being in the loop. – BCS Nov 12 '08 at 22:18

4 Answers

1

Computers can draw a lot more power while booting than they do in steady operation: they have to spin up all of the platters and fans, and CPU activity is typically heavy while the applications start. Most shops have a set sequence that staggers the startups so they don't overload the circuit and have to start all over again. Sequencing also matters for dependencies, for example a bunch of applications that expect to talk to a database, or web servers that need to talk to the app servers. You usually start from the bottom up and stagger the startups by 30 seconds to a minute, depending on how many boxes are on your circuit.
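
As a rough illustration of the bottom-up, staggered sequence described above, here is a minimal Python sketch. The tier names, the host list, and the `startup_schedule` function are all hypothetical; it only computes the order and timing and does not switch anything on.

```python
from collections import defaultdict

# Hypothetical tiers, listed bottom-up: lower tiers start first.
TIER_ORDER = ["database", "app", "web"]

def startup_schedule(hosts, stagger_seconds=30):
    """Return (offset_seconds, hostname) pairs, lowest tier first.

    hosts is an iterable of (hostname, tier) pairs. Each host starts
    stagger_seconds after the previous one so the boot surges do not overlap.
    """
    by_tier = defaultdict(list)
    for name, tier in hosts:
        by_tier[tier].append(name)

    schedule = []
    offset = 0
    for tier in TIER_ORDER:
        for name in sorted(by_tier[tier]):
            schedule.append((offset, name))
            offset += stagger_seconds
    return schedule

# Assumed example inventory, not taken from any real site.
hosts = [("web1", "web"), ("db1", "database"), ("app1", "app"), ("web2", "web")]
for offset, name in startup_schedule(hosts):
    print(f"t+{offset:>3}s  power on {name}")
```

In practice the stagger would be tuned to the number of boxes on the circuit, as noted above.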

Tim Howland
  • I have a box with 5 HDDs that pull 30 W per drive on startup. I'm glad it staggers them or it would toast my UPS! – BCS Nov 12 '08 at 23:15
  • Any idea what kind of systems are used to effect the staggered start up and to pick the order? – BCS Nov 12 '08 at 23:17
  • In the implementations that I've done, it's usually been a human operator. Power outages are rare enough that when they occur, someone is there to deal with the emergency. If they happen more than once a year, it's time for a new datacenter. – Tim Howland Nov 13 '08 at 02:18
0

One method is to mirror the live machines on temporary hot-swaps and, assuming access is via network, cut over by reconfiguring the router to divert traffic to the mirrors. This process can be automated for unplanned outages.

For planned maintenance, some simply notify their users that the system will be unavailable during a certain window.

Redundant power supplies and gas-powered generators handle most power-related problems, again with automated failover.

Adam Liss
  • Good and interesting info, but not really what I'm interested in, i.e. how things get shut down, not how to avoid shutting service off. – BCS Nov 12 '08 at 22:54
0

Ah, now I understand your question more clearly.

Products such as the iBootBar from Dataprobe allow you to monitor and manage the power to remote devices. An intelligent system can monitor the current draw of each device to verify that it's functioning within nominal limits. If not, it can take the equipment offline and bring a spare online to replace it, watching for the initial surge and waiting for power to stabilize before switching the next device on.

Adam Liss
0

Keep in mind that "large server farms" are designed never to shut down unless they are forced to. A full shutdown is therefore possible but very unlikely, and when it does happen you're really in a hurry. Every other use case, such as shutting down a rack or doing work on power lines, will be planned in advance as much as possible.

You will actually be in a hurry when things go really wrong.

For example, if the generators run out of fuel (they usually keep at least one full day of reserve and have contracts to get resupplied in time, so we're talking about a big disaster here) or something similar happens, you'll know hours in advance and have that time to shut things down. If the HVAC system fails completely, you have mere minutes to shut everything down before temperatures rise too much.

I'm not an expert here, being on the other side of the barricade (a customer of data centers), but I think they'll have systems in place to command a shutdown of all the systems they control, and they will simply cut power to customer systems that they can't control and shut down cleanly.
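
To make that split concrete, here is a small Python sketch of an emergency plan: systems the operator controls get a shutdown command first, and everything loses power once a grace period runs out. The `System` type, the action names, and the assumption that managed systems are also cut off after the grace period are all mine, not a description of any real data center's procedure.

```python
from dataclasses import dataclass

@dataclass
class System:
    name: str
    managed: bool  # True if the operator can command an OS-level shutdown

def emergency_plan(systems, grace_seconds):
    """Return (offset_seconds, action, name) steps for an emergency shutdown.

    Managed systems are told to shut down at t=0. At t=grace_seconds power
    is cut to every system, managed or not (an assumption of this sketch).
    """
    steps = [(0, "soft-shutdown", s.name) for s in systems if s.managed]
    steps += [(grace_seconds, "cut-power", s.name) for s in systems]
    return steps

# Hypothetical inventory: one operator-owned box, one customer box.
inventory = [System("own-db", managed=True), System("cust-1", managed=False)]
for offset, action, name in emergency_plan(inventory, grace_seconds=120):
    print(f"t+{offset:>3}s  {action:<13} {name}")
```

The grace period would be set by how much time the emergency leaves: hours when generator fuel is running low, minutes when cooling has failed.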

The farm will eventually be powered up again one zone at a time, one rack at a time, once all supporting systems (UPSes, generators, HVAC, etc.) are back online and ready to run at full capacity. When they have full control of the systems (i.e., private farms rather than customer equipment), they will usually bring AC power gradually to all circuits, and servers will either power up automatically (if configured to do so; many servers even have a setting like "power up after a random delay of at most X minutes") or be commanded to power up via lights-out management such as IPMI or similar systems.
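
A minimal Python sketch of that power-up pattern, combining "one zone at a time" with a per-server random delay, might look like the following. The zone names, the gap between zones, and the `power_up_plan` function are hypothetical; a real setup would issue the actual power-on through the servers' own firmware setting or a lights-out management tool rather than just printing a plan.

```python
import random

def power_up_plan(zones, zone_gap_seconds, max_delay_seconds, seed=None):
    """Return (start_seconds, zone, host) tuples sorted by start time.

    zones maps a zone name to its list of hosts, in the order the zones
    should be energized. Each zone begins zone_gap_seconds after the
    previous one, and each host adds a random delay of at most
    max_delay_seconds so that boot surges within a zone are spread out.
    """
    rng = random.Random(seed)
    plan = []
    for index, (zone, hosts) in enumerate(zones.items()):
        zone_start = index * zone_gap_seconds
        for host in hosts:
            plan.append((zone_start + rng.uniform(0, max_delay_seconds), zone, host))
    return sorted(plan)

# Assumed layout: two zones, ten minutes apart, up to five minutes of jitter.
zones = {"zone-a": ["a1", "a2"], "zone-b": ["b1"]}
plan = power_up_plan(zones, zone_gap_seconds=600, max_delay_seconds=300, seed=1)
for start, zone, host in plan:
    print(f"t+{start:6.1f}s  {zone}  power on {host}")
```

Keeping the jitter smaller than the gap between zones, as here, means one zone finishes its surge before the next one starts.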

Luke404