control startup order of several servers

Question

I'm looking for recommendations on how to control the startup order of a rack of machines should they all need to be booted back up. In more detail:

Group1:
  DHCP/DNS/LDAP Server
Group2:
  NFS servers
Group3:
  Web Servers
  Compute Nodes

I'm currently configuring apcupsd to handle emergency shutdown, particularly for the data servers, but I'm not sure whether I should script some complicated startup using WOL, or whether there is an early network event system for Linux that pauses the boot process until the correct events are received.

If it matters, all the servers are Dell and have iDRAC, in case that offers any alternatives, but I haven't been able to get it to work (I haven't tried that hard).

I don't think even Amazon has solved this problem. – Michael Hampton Jul 21 '12 at 04:55

ewwhite · Answer 1 · 2012-07-21T06:12:40.670

I'd question the specific scenario you're planning around...

Are you trying to plan for power-up following a power outage? Recovery from some sort of disaster? Or is your concern more power-spike related? For the latter, I've sometimes used random startup delays configured in the servers' BIOS or used a switched PDU to handle the power-on sequence to prevent overloading.

On the sequencing side, I'd engineer around the dependencies at the application level. From a cold start, your application servers should be able to tolerate the failure, delay or absence of support servers (DHCP/DNS/LDAP). Do you have backup servers running those services? Anything outside of the location?

If not, you could set application or daemon startup checks - e.g. don't start the NFS daemon if the DNS servers can't be reached. I've done a teeny bit of this dependency checking using Monit or a wrapper script... but really haven't given thought to this type of ordering in most environments.
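
A minimal sketch of such a wrapper check, in Python. It only tests that a TCP port accepts connections, which is a rough stand-in for "the service is up". The function name `wait_for_tcp` and the default timings are assumptions for illustration, not something Monit provides.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=300.0, interval=5.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable or timed out: treat as "not up yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A wrapper could call `wait_for_tcp(dns_server, 53)` (DNS servers normally listen on TCP port 53 as well as UDP) and only start the NFS daemon when it returns True, exiting non-zero otherwise so the failure is visible.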

Thanks for the suggestion on Monit. It looks useful for other stuff I've wanted to do, like monitoring whether NFS shares go stale and trying to remount them. – Cyclone Jul 22 '12 at 01:53

score 2 · Answer 2 · answered Jul 21 '12 at 05:16

What are you trying to accomplish? Is it smoothing out the spike in power demand created by a bunch of servers booting simultaneously, or something that can also trigger an emergency power-off? If it's the former, look at power sequencers: in-rack units designed to power up groups of outlets with programmable delays, say a few minutes between each of your groups to allow each to settle before the next one boots. These used to be pretty common from some vendors at the larger end of midrange.

rnxrx

I think he's saying that one group of servers will not start their services properly unless the previous group is already up and running. E.g. Web servers depend on NFS servers, which depend on LDAP servers... – Michael Hampton Jul 21 '12 at 05:20
Some kind of managed PDU might make sense. This would mean the ability to have any machine absolutely powered off, regardless of platform/vendor. At that point anything that could generate an SNMP set would be able to bring it up, even allowing the services to cascade (i.e. group 1 trips group 2, group 2 trips group 3). These are pretty commonly available from APC, etc. – rnxrx Jul 21 '12 at 05:26

score 1 · Accepted Answer · answered Jul 22 '12 at 01:50

You have several options. It may be a good idea to combine two or more of these approaches.

If each group is on dedicated UPSs then you can control the UPS start-up sequence to some degree. High-power UPSs can often be configured to delay their start-up. Stagger the start-up delays to meet your needs. You should be staggering start-up anyway to prevent the start-up load from tripping fuses or breakers on your power feed.
As others have noted, there are PDUs with delay capabilities. These would be configured as for UPSs. They may also be network controllable, so the next group can be turned on when the required services are available.
WoL is one approach you could use, as others have noted.
If you are using a single-threaded init process, you could add an init script that waits for the required service to be available before proceeding. Alternatively, you could add the checks to the appropriate init scripts. Adding guard checks for necessary services may be a good idea anyway.
You could plumb your Internet-facing IP addresses, but not enable them until all the required services are in place. This would require a guard script that verifies the required services are available.
NFS mounts can be configured to block until the mounts are available. This should delay further init processing until the NFS servers are serving the required mounts.
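
For the NFS case, a guard script might look something like the sketch below. It polls until every listed path is a mount point, or gives up after a timeout. The name `wait_for_mounts`, the timings and any paths you pass in are hypothetical. It only checks that something is mounted at each path, not that the export is the one you expect.

```python
import os
import time


def wait_for_mounts(paths, timeout=600.0, interval=10.0):
    """Return the paths that are still not mount points when the timeout expires.

    An empty list means every path was mounted in time.
    """
    deadline = time.monotonic() + timeout
    pending = list(paths)
    while True:
        # Keep only the paths that are not yet mount points.
        pending = [p for p in pending if not os.path.ismount(p)]
        if not pending or time.monotonic() >= deadline:
            return pending
        time.sleep(interval)
```

An init script could run this before starting the web server or enabling the public IP addresses, and refuse to continue (or send an alert) if the returned list is not empty.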

score 0 · Answer 4 · answered Jul 21 '12 at 21:26

In my opinion you've answered your own question. WoL is a great way of ensuring servers are booted in a set sequence. Nothing complicated about it. Just have each server send the WoL signal to the next one in the sequence. Just be sure to document that sequence for when something breaks or needs to be modified. Ideally the WoL script will first check that the relevant services are running before sending a signal. You might also have a timeout on those checks that can send you an alert if something is amiss, which can save a bit of backtracking if a server doesn't start.
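
As a rough illustration of the sending side, here is a Python sketch that builds and broadcasts a standard WoL magic packet (6 bytes of 0xFF followed by the target MAC address repeated 16 times). The function names, the broadcast address and the choice of UDP port 9 are assumptions for the example. The target NIC and BIOS also need WoL enabled.

```python
import socket


def build_magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 bytes: %r" % mac)
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (port 9 is a common convention)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Broadcasting requires this socket option to be set explicitly.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Each server would call `send_wol` with the MAC address of the next machine in the documented sequence, but only after its own service checks have passed.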
