High Availability (HA) vs Fault Tolerance

Question

I have read a couple of articles found through Google, like this one, but I am still not clear on the difference between the two.

The purpose of both seems to be keeping the service running when one component fails (be it hardware or software): a backup/secondary component takes over operations immediately so that there is no loss of service.

My understanding:

As I understand it, the difference is that a fault-tolerant system loses no data at all, not even in-memory data, which is not the case with HA. For example, a web server cluster with sticky sessions but without session replication is an HA system but not a fault-tolerant one, because when a node fails its in-memory data is lost. If we have session replication along with sticky sessions, it can be called a fault-tolerant system. Is that correct?

score 0 · Answer 1 · answered Aug 27 '18 at 14:20

In the example you stated, the web server cluster with sticky sessions and non-replicated sessions will continue to serve the next request (the request that hit the failure will obviously be terminated or will return an error to the user). This is high availability.

However, even with replicated sessions, a truly fault-tolerant system would be one that can still give the user an acceptable response despite the current request failing, by automatically correcting the state of the data. Web servers typically do not have this kind of fault tolerance built in, but it can be added as a layer that catches any exception (before the output is sent), corrects the replicated in-memory data, and calls another server that is able to produce the correct response. The key thing is that it should all be automatic; some performance degradation is expected and acceptable while the system corrects itself.

So an HA system does not carry the burden of maintaining correct data, only of being able to serve the next request, whereas a truly fault-tolerant system also has to keep the data consistent.
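
To make the scenario from the question concrete, here is a minimal simulation of a two-node cluster with sticky sessions, run once without and once with session replication. It is only a sketch: the `Node` and `Cluster` classes, the `add_to_cart` method, and the shopping-cart data are all made up for illustration and do not correspond to any real web server API.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.sessions = {}  # in-memory session store: session_id -> list of items


class Cluster:
    def __init__(self, node_names, replicate_sessions):
        self.nodes = [Node(n) for n in node_names]
        self.replicate_sessions = replicate_sessions
        self.sticky = {}  # session_id -> node currently pinned to that session

    def _pick_node(self, session_id):
        node = self.sticky.get(session_id)
        if node is None or not node.alive:
            # Failover: pin the session to any surviving node.
            node = next(n for n in self.nodes if n.alive)
            self.sticky[session_id] = node
        return node

    def add_to_cart(self, session_id, item):
        node = self._pick_node(session_id)
        cart = node.sessions.setdefault(session_id, [])
        cart.append(item)
        if self.replicate_sessions:
            # Copy the session state to every other live node.
            for other in self.nodes:
                if other is not node and other.alive:
                    other.sessions[session_id] = list(cart)
        return list(cart)


def demo(replicate):
    cluster = Cluster(["a", "b"], replicate_sessions=replicate)
    cluster.add_to_cart("s1", "book")
    cluster.nodes[0].alive = False  # the node holding the session crashes
    return cluster.add_to_cart("s1", "pen")


print(demo(replicate=False))  # ['pen'] -> request served, earlier state lost
print(demo(replicate=True))   # ['book', 'pen'] -> request served, state kept
```

In both runs the next request is served, which is the availability part. Only the replicated run keeps the session data, and as described above, even that does not rescue the request that was in flight when the node failed.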

score 0 · Answer 2 · answered Jul 17 '23 at 23:15

A system is considered highly available when it is working 99.999% of the time. This is often called "five 9s of availability", and it corresponds to a downtime of roughly 5.26 minutes per year, or 26.30 seconds per month.
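
The downtime figures follow directly from the availability percentage. A small sketch of the arithmetic, assuming an average year of 365.25 days and twelve equal months (the function name `downtime_budget` is just an illustrative choice):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, leap days included


def downtime_budget(availability_percent):
    """Return the downtime allowed by a given availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    seconds_per_year = unavailable_fraction * SECONDS_PER_YEAR
    return {
        "minutes_per_year": seconds_per_year / 60,
        "seconds_per_month": seconds_per_year / 12,
    }


for percent in (99.9, 99.99, 99.999, 99.9999):
    budget = downtime_budget(percent)
    print(
        f"{percent}%: {budget['minutes_per_year']:.2f} min/year, "
        f"{budget['seconds_per_month']:.2f} s/month"
    )
```

For 99.999% this gives about 5.26 minutes per year and 26.30 seconds per month, matching the numbers quoted above.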

It's not easy to have "Highly Available Systems" - you need a lot of automation to recover from failures and high levels of redundancy to be able to replace broken parts of your architecture right away. In addition, your architecture must be elastic: so when you are under increased load, the architecture must grow to meet the demand.

In AWS, for example, a highly available design typically lives in one Region, with resources deployed across different Availability Zones (AZs) of that Region.

It is expensive to have a highly available system because there is a lot of redundancy in place to ensure the five 9s level of service.
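
To see why redundancy is what buys the extra nines, here is a rough sketch of the usual back-of-the-envelope calculation. It assumes that the redundant copies fail independently and that failover between them is instantaneous and perfect, which real systems only approximate; the function name is hypothetical.

```python
def combined_availability(single_availability, copies):
    """Availability of N redundant copies where one working copy is enough.

    Assumes independent failures and perfect, instant failover.
    """
    all_down = (1 - single_availability) ** copies
    return 1 - all_down


for copies in (1, 2, 3):
    a = combined_availability(0.99, copies)
    print(f"{copies} copies at 99% each -> {a * 100:.4f}% combined")
```

Under those assumptions, two copies that are each 99% available give 99.99% combined, and three give 99.9999%. Each extra copy adds cost, which is where the expense comes from.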

But even "High Availability" with five nines has downtime: very little, but it is there. What if those minutes or seconds fall on Black Friday? How is your business affected?

For higher levels of availability you have "Fault Tolerance": the system keeps working even if any of its components fails. It does not stop providing service while the broken component is being replaced, so there is no downtime.

"Fault Tolerant Systems" have higher levels of availability - they have six 9s or more (99.9999% or more) and the system can operate without downtime.

A system that is "Fault Tolerant" is obviously also "Highly Available", but the opposite is not true: if a system is highly available, that does not mean it is also fault tolerant.

A fault-tolerant system is even more expensive than a highly available one. In AWS it is typically designed with regional redundancy: the architecture uses more than one Region, each with its Availability Zones, so that if an entire Region fails, a failover redirects the load to the other Region, where an active copy of the architecture is running.
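
The failover decision itself can be sketched in a few lines. This is not an AWS API: the `route` function, the region names, and the health map are hypothetical stand-ins for whatever DNS or load-balancing layer does the real health checking.

```python
def route(regions_in_priority_order, health):
    """Return the first region whose health check passes."""
    for region in regions_in_priority_order:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


regions = ["region-a", "region-b"]  # hypothetical region names

print(route(regions, {"region-a": True, "region-b": True}))   # region-a
print(route(regions, {"region-a": False, "region-b": True}))  # region-b
```

The point is that the second region must already be running and able to take the load, otherwise the redirect only moves the outage.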

Another related concept is "Disaster Recovery". Here there are different disaster recovery strategies, and the KPIs to track are the RPO (Recovery Point Objective), the point in time of the last saved data, and the RTO (Recovery Time Objective), the time needed to get the system running again. These strategies typically rely on passive redundancy or asynchronous data copying, and more than one Region is used to support the disaster recovery architecture.
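
As a small worked example of the two KPIs, the sketch below measures the data-loss window and the recovery time of a single incident, which you would then compare against your RPO and RTO targets. The timestamps are invented for illustration and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def measure_incident(last_saved_data, failure, service_restored):
    """Return (data-loss window, recovery time) for one incident."""
    data_loss_window = failure - last_saved_data  # compare with the RPO
    recovery_time = service_restored - failure    # compare with the RTO
    return data_loss_window, recovery_time


# Invented timestamps: last copy at 12:00, failure at 12:45, back up at 14:15.
loss, recovery = measure_incident(
    last_saved_data=datetime(2023, 7, 17, 12, 0),
    failure=datetime(2023, 7, 17, 12, 45),
    service_restored=datetime(2023, 7, 17, 14, 15),
)

print(loss)      # 0:45:00
print(recovery)  # 1:30:00
print(loss <= timedelta(hours=1))      # True: within a 1-hour RPO target
print(recovery <= timedelta(hours=1))  # False: misses a 1-hour RTO target
```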

I hope this helps! Regards,
