
First, we have a Windows Server 2008 R2 two-node cluster running highly available Hyper-V and DHCP. We use a back-end Dell MD3000i iSCSI SAN for storage. All of the networking goes through redundant switches and MPIO drivers, and the data network is on a different VLAN from the primary network.

Here is the scenario we keep encountering:

We have power outages from time to time. There are dual UPS devices in the cabinet that last for about 15 minutes, but if power doesn't return by then everything goes down: cluster nodes, SAN and all.

Eventually the power comes back, and all of the devices are configured to boot when AC returns. However, after a complete outage like this the cluster never comes back online properly; we get the usual errors, such as the quorum disk being unavailable. In addition, our two primary domain controllers are virtual machines running on the cluster. We do have a physical server running as another domain controller, which we thought would help when things come back online.

What we don't understand is why the system can't recover itself when it boots: there is, eventually, a DC available for authentication, and the iSCSI network comes back online. Is there something else we are missing?

I think it may be related to the iSCSI Initiator service not starting quickly enough when the cluster service is ready to go.
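
If it is the initiator, one idea I'm toying with is making the Cluster service explicitly depend on the iSCSI Initiator service. A rough sketch from an elevated command prompt, assuming the default service names ClusSvc and MSiSCSI (I haven't applied this yet, so treat it as an idea rather than a fix):

rem Make the Cluster service wait for the iSCSI Initiator service to start.
rem Note: depend= replaces the existing dependency list, so include any
rem dependencies the Cluster service already has (check with: sc qc ClusSvc).
sc config ClusSvc depend= MSiSCSI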

Any ideas or things I can post to help?

Thanks, Brent

Brent Pabst
  • Does your standalone DC have all the necessary roles for a member machine to boot? If you're missing DNS (or DHCP, if you're using it for your clustered servers), then the cluster will most probably fail. Also, what's in the event log? – Stephane Dec 23 '11 at 13:42
  • Sounds similar to our setup (HP gear and a few more servers); works perfectly for us. We do have the host servers set to boot 2 minutes after power returns, and the DC boots immediately. The Cluster service gets pretty mad if it can't find a DC immediately. This certainly sounds like a storage issue, however; how long does that MD3000i take to start up? Our host servers boot from iSCSI, so our SAN has to be running before they'll even start booting. – Chris S Dec 23 '11 at 13:51
  • Good thoughts! I just found this old resource from MSFT: http://support.microsoft.com/kb/883397 and it looks pretty relevant. I like the boot wait time on the VM hosts, and I think setting an iSCSI dependency on the Cluster service may also help us out. What do you think? – Brent Pabst Dec 23 '11 at 14:12

2 Answers


We had the same problem with our cluster not coming back up cleanly after a power failure. Like you, our shared storage is located on iSCSI SANs. The fix for us was to delay VM host and guest startup long enough to ensure the SANs were back online first. We found that if we didn't do this, the shared volumes would reconnect but remain in an offline state, which caused the cluster to fail.
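
If your hosts don't have a BIOS or PDU option to stagger power-on, a crude alternative is to leave the Cluster service set to Manual and start it from a startup script that waits until the SAN answers. A minimal sketch (the 10.0.1.10 address is just a placeholder for your SAN's iSCSI portal):

@echo off
rem Wait until the SAN's iSCSI portal responds, then start the cluster service.
rem 10.0.1.10 is a placeholder; substitute your SAN's portal IP.
:wait
ping -n 1 10.0.1.10 >nul
if errorlevel 1 (
    timeout /t 30 /nobreak >nul
    goto wait
)
net start ClusSvc

We ended up using startup delays rather than a script, but the idea is the same: nothing cluster-related starts until the storage is reachable.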

newmanth

I ran into this problem on my own system. After a power failure the cluster just wouldn't come back up, either because the domain controller wasn't ready or because the SAN wasn't ready yet. For those who don't have managed PDUs or BIOS options to delay startup and need to add a boot delay, there's an easy method posted in this blog.

On Server 2008, open a command prompt and type:

bcdedit /copy {current} /d "Boot delay placeholder"
bcdedit /timeout 300

This creates a second boot menu option (needed for the timeout to appear) and sets the timeout to 5 minutes (300 seconds). The server will sit at the boot menu until the timeout is reached or someone presses the enter key.
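
If you want to back the change out later, the /copy command prints the identifier (a GUID) of the placeholder entry it created; something along these lines should remove it and restore the default 30-second timeout (the {guid} below is just a placeholder for that identifier):

bcdedit /enum
bcdedit /delete {guid}
bcdedit /timeout 30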

Grant