A system is considered "High Availability" when it is up 99.999% of the time. This is often called "five 9s of availability", and it works out to roughly 5.26 minutes of downtime per year, or about 26.30 seconds per month.
It's not easy to build "Highly Available Systems": you need a lot of automation to recover from failures, and high levels of redundancy so you can replace broken parts of your architecture right away. In addition, your architecture must be elastic: when load increases, the architecture must grow to meet the demand.
In AWS, for example, a "High Availability" design typically lives in one Region, with resources deployed across different Availability Zones (AZs) in that Region.
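If you want to check those downtime numbers yourself, here is a minimal sketch of the arithmetic. The function name `downtime_budget` is just something I made up for illustration, and I'm assuming an average year of 365.25 days, which is what gives the 5.26 min and 26.30 s figures.

```python
# Assumption: an average year of 365.25 days (leap years included).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_budget(availability_percent: float) -> dict:
    """Return the allowed downtime, in seconds, per year and per month."""
    unavailable_fraction = 1 - availability_percent / 100
    per_year = SECONDS_PER_YEAR * unavailable_fraction
    return {"per_year_s": per_year, "per_month_s": per_year / 12}


five_nines = downtime_budget(99.999)
print(f"{five_nines['per_year_s'] / 60:.2f} min/year")  # 5.26 min/year
print(f"{five_nines['per_month_s']:.2f} s/month")       # 26.30 s/month
```

You can call it with any other percentage (99.9, 99.99, and so on) to see how quickly the budget shrinks with each extra 9.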
It is expensive to have a highly available system because there is a lot of redundancy in place to ensure the five 9s level of service.
But even with five nines of "High Availability" there is still downtime: very little, but it is there. What if those minutes or seconds fall on Black Friday? How would your business be affected?
For higher levels of availability you have "Fault Tolerance": the system keeps working even if any of its components fails. It does not stop serving requests while a broken component is being replaced, so there is no downtime.
"Fault Tolerant Systems" have higher levels of availability - they have six 9s or more (99.9999% or more) and the system can operate without downtime.
A system that is "Fault Tolerant" is obviously also "High Availability", but the opposite is not true: a "High Availability" system is not necessarily fault tolerant.
A fault tolerant system is even more expensive than a high availability one, and in AWS it is typically designed with regional redundancy: the architecture spans more than one Region and their Availability Zones. If an entire Region fails, a failover redirects the load to the other Region, where an active copy of the architecture is already running.
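To make the failover idea concrete, here is a very rough sketch of the decision only. It is not how any AWS service implements it: `pick_region` is a hypothetical helper, the health map is assumed to come from whatever health checks you run, and the two Region names are just examples.

```python
def pick_region(health: dict, preference: list) -> str:
    """Return the first healthy Region in order of preference."""
    for region in preference:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy Region available")


# Hypothetical health-check results: the primary Region is down.
health = {"us-east-1": False, "eu-west-1": True}
active = pick_region(health, ["us-east-1", "eu-west-1"])
print(active)  # eu-west-1
```

The point is that the second Region is already running, so redirecting traffic is only a routing decision, not a rebuild.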
Another related concept is "Disaster Recovery". Here there are different disaster recovery strategies, and the KPIs to follow are the RPO (Recovery Point Objective) and the RTO (Recovery Time Objective): that is, the point in time of the last saved data, which determines how much data you can lose (RPO), and the time it takes to get the system running again (RTO). These strategies typically rely on passive redundancy or asynchronous data copying, and more than one Region is used to support the disaster recovery architecture.
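A small sketch of how you could check RPO and RTO against a real incident. The targets and timestamps below are made-up values for illustration, and `measure_recovery` is a hypothetical helper, not part of any AWS tooling.

```python
from datetime import datetime, timedelta


def measure_recovery(last_copy: datetime, failure: datetime, restored: datetime):
    """Return (data_loss_window, recovery_time) for one incident."""
    data_loss_window = failure - last_copy  # compare this against the RPO
    recovery_time = restored - failure      # compare this against the RTO
    return data_loss_window, recovery_time


# Hypothetical targets and incident timeline.
rpo_target = timedelta(minutes=15)
rto_target = timedelta(hours=1)

loss, recovery = measure_recovery(
    last_copy=datetime(2024, 1, 1, 11, 50),  # last asynchronous copy
    failure=datetime(2024, 1, 1, 12, 0),     # primary Region fails
    restored=datetime(2024, 1, 1, 12, 45),   # service is back
)
print("RPO met:", loss <= rpo_target)      # RPO met: True
print("RTO met:", recovery <= rto_target)  # RTO met: True
```

In this example you lose 10 minutes of data and need 45 minutes to recover, so both targets are met.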
I hope this helps!
Regards,