Fault Tolerance with Windows Server Failover Clusters?

Question

I've attempted to read numerous articles online about this and believe I'm "seeing the light", but I'd like to have it confirmed.

Do Windows Server Failover Clusters offer a fault-tolerant feature similar to VMware FT?

We have a server application that maintains network connections to various pieces of equipment and transfers data in real-time between clients and this equipment. A large amount of configuration and state data related to the connections and equipment is kept in RAM and never written to disk.

VMware FT seems to offer a viable fault-tolerant solution for this type of app, since it keeps the paired VMs processing in lock-step with each other and preserves the data in RAM across a hardware failure. Presumably the failover will also be quick enough to maintain network connections to our equipment. We understand that it won't protect against app- or OS-level failures.

I haven't found a similar ability in WSFC, but some of the papers I've read are also several years old, so I recognize I may not have the most up-to-date information.

Thanks for any info you can provide.

score 0 · Answer 1 · edited Jun 11 '20 at 10:02

What you are describing is a stateful service, as opposed to a stateless service such as a web server.

And yes, Windows Failover Clusters support stateful services. You should read further into the documentation of both Failover Clusters and your service to understand whether your service meets the requirements.

To be appropriate for a failover cluster, a service or application must have certain characteristics. The most important characteristics include:

The service or application should be stateful. In other words, the service or application should have long-running in-memory state or large, frequently updated data states. One example is a database application. For a stateless application (such as a Web server front end), Network Load Balancing will probably be more appropriate than failover clustering.

Source: https://technet.microsoft.com/en-us/library/cc753938.aspx

answered Jan 08 '16 at 20:24 by Daniel · edited Jun 11 '20 at 10:02 by Community

I believe our service is appropriate for a failover cluster, but I don't believe it will provide the required behavior. When a failure occurs, the second instance of our service needs to be functioning within several hundred milliseconds, with the same in-memory data as the failed instance. We can certainly have the second machine start up our service, but it wouldn't retain the large amount of in-memory data, and it would take multiple seconds to take control again of our connections. – Chris S Jan 18 '16 at 13:38
This is the answer to your question. If you have specific needs, you should either ask a new question, or rephrase this question. – Daniel Jan 18 '16 at 13:44
Why do you say that *"it wouldn't retain the large amount of in-memory data"*? Have you tested it? Because that's the opposite of what the documentation says. – Daniel Jan 18 '16 at 13:46
The articles that I've read discuss how data written to disk can be shared between instances, but I've seen nothing that discusses how data in RAM is shared between instances. Perhaps I've overlooked it. – Chris S Jan 18 '16 at 13:52
I dug a little deeper, too, and couldn't find anything more useful. Perhaps you are right. It's perhaps also not a good idea to use WSFC with so little information about how it actually works :) – Daniel Jan 18 '16 at 14:28

score 0 · Accepted Answer · answered Jan 09 '16 at 18:23

First ask what kind of recovery time objective (RTO) you need for this application if a node fails. Keep in mind that closer to zero can have costs and limitations.

VMware FT does keep the VM state in sync so that the OS keeps going through a host failure. That poses large synchronization challenges: it requires a beastly network connection and is limited in the number of vCPUs.

Windows Server Failover Clusters are not equivalent to FT; I've not seen an FT solution from Microsoft. WSFC moves a service between OS instances, with a brief interruption.

VMware HA and similar hypervisor features restart a VM on another host after a failure. That also means a brief interruption. It is closer to a failover cluster, but it moves the entire VM rather than a single service.
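To make the difference concrete, here is a toy Python model of the two behaviors. It is only a sketch: `Service`, `restart_failover`, and `lockstep_failover` are made-up names for illustration, and this is not how WSFC or VMware FT is actually implemented. It just shows why a restarted instance loses RAM-only state while a shadow that processes the same inputs does not.

```python
class Service:
    """Toy stand-in for an application that keeps connection state only in RAM."""

    def __init__(self):
        self.state = {}  # in-memory only, never written to disk

    def handle(self, device, value):
        self.state[device] = value


def restart_failover(events, fail_at):
    """Cluster-style failover (assumed model): a fresh instance starts after the failure."""
    primary = Service()
    for device, value in events[:fail_at]:
        primary.handle(device, value)
    # The primary dies here; its RAM contents are gone.
    replacement = Service()
    for device, value in events[fail_at:]:
        replacement.handle(device, value)
    return replacement.state


def lockstep_failover(events, fail_at):
    """FT-style failover (assumed model): a shadow processes every input as well."""
    primary, shadow = Service(), Service()
    for i, (device, value) in enumerate(events):
        if i < fail_at:
            primary.handle(device, value)  # primary stops at the failure point
        shadow.handle(device, value)       # shadow sees every input
    return shadow.state


events = [("pump-1", "connected"), ("pump-2", "connected"), ("pump-1", "busy")]
print(restart_failover(events, fail_at=2))   # {'pump-1': 'busy'}
print(lockstep_failover(events, fail_at=2))  # {'pump-1': 'busy', 'pump-2': 'connected'}
```

In the restart case the state for `pump-2` is lost because it was only ever held by the failed instance, which is the gap the question is asking about.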

We require sub-second recovery time (hundreds of milliseconds). Since our service takes multiple seconds to initialize, it seems like the service would need to already be running on the second instance at the time of the failure. And then there is the requirement to have the same in-memory data as the first instance. We could certainly modify our service to handle these scenarios better, but our primary goal for the short term is to find an external solution (one that doesn't require modification of our service). – Chris S Jan 18 '16 at 13:47
