SQL Server 2008 R2 Failover Clusters behaviour on multiple failures

Question

I'm testing the resilience of one of our test systems. We have 2 x DB (SQL 2008 R2 on Server 2008 R2, running in ESXi VMs) arranged in a Failover Cluster.

Shutting down the active SQL Server service doesn't do much: the service is not restarted and no failover occurs. I understand this is by design; the system assumes that the admin had good reason to turn off the service and so will sit quietly.

However, we can simulate a failure in a number of ways - I found the simplest was to just kill the SQL service in Task Manager. Our cluster is set to allow one failure in 6 hrs, so after this first failure, it tries to restart the service - which succeeds. Kill the service a second time (within 6 hrs) and the cluster manager will decide to fail the DB over to the passive server. So far so good...

If we kill the service on the second server, it restarts again. But when we kill the service for a second time, it doesn't fail back over to the first server.

I'm assuming that this is also by design; it makes sense, in that why fail back over to a server that itself wasn't stable enough only minutes earlier? This sounds logical, but is it true? And if so, does it obey the same timeout period (i.e. 6 hrs), and can this be reset?

Basically, before I tell my colleagues the failover features are working, I just want to confirm/clarify my understanding and assumptions.

score 0 · Answer 1 · answered Apr 12 '13 at 16:08

Some other things you can test:

Try shutting down the boxes (even turn off the power to get a better simulation). Also unplug the network cables and disable the connection between the servers.

(though admittedly, it is usually the software that seems to cause a failover)

To set restart policies:

Open Cluster Administrator.

In the console tree, click the Resources folder.

In the details pane, click the resource you want.

On the File menu, click Properties.

On the Advanced tab, make the changes you want.

Seems like you want to look at the following settings: time-out, failover threshold, and failover period for resources. The time-out controls how long the Cluster service waits for the resource to shut down. The failover threshold and period control how many times the Cluster service attempts to fail over a resource in a particular period of time.
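To make the interaction between a threshold and a period concrete, here is a minimal Python sketch of a sliding-window failure counter. It is a toy model only, not the Cluster service's actual algorithm: the class `FailurePolicy`, its field names, and the rule that the restart count starts fresh after a failover are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailurePolicy:
    """Toy model of restart/failover limits (hypothetical names, not cluster properties)."""
    max_restarts: int          # restarts allowed on the current node per restart period
    restart_period_s: float
    max_failovers: int         # failovers allowed for the group per failover period
    failover_period_s: float
    restarts: List[float] = field(default_factory=list)
    failovers: List[float] = field(default_factory=list)

    def on_failure(self, now: float) -> str:
        # Drop events that have aged out of their respective windows.
        self.restarts = [t for t in self.restarts if now - t < self.restart_period_s]
        self.failovers = [t for t in self.failovers if now - t < self.failover_period_s]

        if len(self.restarts) < self.max_restarts:
            self.restarts.append(now)
            return "restart"
        if len(self.failovers) < self.max_failovers:
            self.failovers.append(now)
            # Assumption: the restart count starts fresh on the new node.
            self.restarts.clear()
            return "failover"
        # Both budgets are spent, so the resource is left in a failed state.
        return "stay failed"


def simulate(policy: FailurePolicy, failure_times: List[float]) -> List[str]:
    """Feed a list of failure timestamps (in seconds) through the policy."""
    return [policy.on_failure(t) for t in failure_times]


if __name__ == "__main__":
    six_hours = 6 * 3600
    policy = FailurePolicy(1, six_hours, 1, six_hours)
    times = [0, 60, 120, 180]
    for t, action in zip(times, simulate(policy, times)):
        print(f"t={t:>4}s -> {action}")
```

With one restart and one failover allowed per six hours, four kills a minute apart come out as restart, failover, restart, stay failed, which is the pattern described in the question. In this model a further failover only becomes possible once the earlier one has aged out of the window, or if the thresholds are raised.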

answered Apr 12 '13 at 16:08

Snowburnt

I'm trying to confirm that the behaviour observed is the behaviour expected. My question is about whether the cluster should fail over to a server that had previously failed *inside the specified failover period*, or whether it should refuse to fail over (for that very reason) – CJM Apr 16 '13 at 10:55
Yes, check the time-out, failover threshold, and failover period for the resources. Tweak them. Play with them. If it fails over properly within those thresholds after tweaking them, then you'll know that you were just causing the failover too soon after the last failover. – Snowburnt Apr 16 '13 at 12:39
In other words, yes, it appears to be by design, and here's how you tweak it. – Snowburnt Apr 16 '13 at 12:49
