I'm testing the resilience of one of our test systems. We have 2 x DB (SQL 2008 R2 on Server 2008 R2, running in ESXi VMs) arranged in a Failover Cluster.
Shutting down the active SQL Server service doesn't do much - the service is not restarted and no failover occurs. I understand this is by design: the system assumes that the admin had good reason to turn off the service, and so it sits quietly.
However, we can simulate a failure in a number of ways - I found the simplest was to just kill the SQL service in Task Manager. Our cluster is set to allow one failure in 6 hrs, so after this first failure, it tries to restart the service - which succeeds. Kill the service a second time (within 6 hrs) and the cluster manager will decide to fail the DB over to the passive server. So far so good...
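To make my understanding of that restart-then-failover behaviour concrete, here is a minimal Python sketch of it. This is only a model of what I observed and assumed, not the actual cluster logic; the class `RestartPolicy` and its field names are hypothetical and are not cluster properties.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RestartPolicy:
    """Toy model of 'allow N failures per period, then fail over' (assumption)."""
    max_restarts: int = 1        # our setting: one failure tolerated...
    period_hours: float = 6.0    # ...within a 6 hour window
    failures: List[float] = field(default_factory=list)  # failure times, in hours

    def on_failure(self, now_hours: float) -> str:
        # Forget failures that have aged out of the window.
        self.failures = [t for t in self.failures
                         if now_hours - t < self.period_hours]
        self.failures.append(now_hours)
        # Within the allowance: restart in place. Beyond it: fail over.
        if len(self.failures) <= self.max_restarts:
            return "restart"
        return "failover"


if __name__ == "__main__":
    policy = RestartPolicy()
    print(policy.on_failure(0.0))   # first kill: restart
    print(policy.on_failure(0.1))   # second kill within 6 hrs: failover
```

In this model, a second kill more than 6 hrs after the first would only trigger another restart, which is how I read the setting.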
If we kill the service on the second server, it restarts again. But when we kill the service there for a second time, it doesn't fail back over to the first server.
I'm assuming that this is also by design; it makes sense, in that why fail back over to a server that itself wasn't stable enough only minutes earlier? This sounds logical, but is it true? And if so, does it obey the same timeout period (i.e. 6 hrs), and can this be reset?
Basically, before I tell my colleagues that the failover features are working, I just want to confirm my understanding and assumptions.