I'm running vSphere 5 in an HA cluster across two hosts (vsphereA and vsphereB). The HA cluster is configured for host monitoring and datastore heartbeat monitoring with admission control disabled (my understanding is that datastore heartbeating prevents inadvertent and unwanted HA failovers due to management network isolation — hopefully I have that right). Each host has a single connection to a dedicated iSCSI network and iSCSI target (no MPIO), and all VMDKs for all VMs live on the iSCSI datastore.

As a test of HA, I disconnected the iSCSI connection on vsphereB and was surprised to see that the running VMs on vsphereB continued to run there. The powered-off VMs showed as inaccessible (which I expected, since they weren't running and the connection from vsphereB to the iSCSI target was severed), but the running VMs kept running and remained "owned" by vsphereB. I expected an HA failover to occur for those VMs, after which they would be "owned" by vsphereA, but no failover happened.

I'm at a loss to understand why an HA failover didn't occur for those VMs. Am I misunderstanding the cases in which an HA failover should occur?
1 Answer
You seem to be confusing vMotion and HA, which are different features that do different things.
vMotion is a feature which allows virtual machines to be migrated from one physical host to another with no downtime and minimal (milliseconds) disruption in service. It is done in advance of maintenance and requires the VM and both the source and destination hosts to already be in a healthy state. HA is a feature which restarts failed virtual machines (or inaccessible virtual machines if host isolation is configured) and does result in downtime for the VM, since the entire virtual machine is powered off and restarted.
Important take-away: a vMotion is not an HA failover. An HA failover is an HA failover.
vMotions are triggered by the following things:
- A user initiates a vMotion (manually or through the vSphere API; see the sketch after this list)
- DRS initiates a vMotion in response to load conditions (thresholds set by the DRS aggressiveness setting), affinity rule violations, or host remediation triggered through VUM (VMware Update Manager)
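For instance, a user-initiated vMotion can be kicked off through the vSphere API. Here's a minimal sketch using pyVmomi (VMware's Python SDK); the vCenter address, credentials, VM name, and target host name are all placeholders for illustration:

```python
# Illustrative sketch: triggering a vMotion via the vSphere API with pyVmomi.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first managed object of the given type with this name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()

# Lab-only certificate handling; validate certificates in production.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com",          # placeholder vCenter
                  user="administrator@vsphere.local",  # placeholder account
                  pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    vm = find_by_name(content, vim.VirtualMachine, "test-vm")  # placeholder VM
    dest = find_by_name(content, vim.HostSystem, "vsphereA")
    # MigrateVM_Task live-migrates a *running* VM and requires both the
    # source and destination hosts to already be healthy.
    task = vm.MigrateVM_Task(
        host=dest, priority=vim.VirtualMachine.MovePriority.defaultPriority)
    print("vMotion task started:", task.info.key)
finally:
    Disconnect(si)
```

If the cluster weren't healthy, this call would simply fail -- which is exactly why a vMotion can't be the mechanism behind an HA failover.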
HA failovers are triggered by the following things:
- A host in your HA cluster has detected that another host in the cluster has failed and is not responding to HA heartbeats using either the configured management networks or heartbeat datastores
- Isolation response is configured to shut down or power off VMs, and the host can no longer speak to a majority of cluster nodes; the VMs are shut down, and the remaining majority of the cluster (if there is one -- which is one of the dangers of isolation response) detects the failure and restarts them
- The cluster/VM are configured for VM Monitoring through VMware Tools, the hypervisor has not received a heartbeat for a specific amount of time, and no disk or network activity has occurred for 120 seconds (a sketch for inspecting these HA settings follows this list)
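If you want to verify how your cluster is actually configured for these cases, the HA ("das") settings are exposed through the API. A minimal pyVmomi sketch, assuming a connection `si` as in the earlier example and a hypothetical cluster name:

```python
# Illustrative sketch: reading a cluster's HA ("das") configuration.
from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "MyCluster")  # placeholder
view.DestroyView()

das = cluster.configurationEx.dasConfig  # vim.cluster.DasConfigInfo
print("HA enabled:          ", das.enabled)
print("Host monitoring:     ", das.hostMonitoring)
print("Admission control:   ", das.admissionControlEnabled)
print("VM monitoring:       ", das.vmMonitoring)
# Isolation response decides whether VMs are powered off, shut down, or
# left running when a host loses its management network.
print("Isolation response:  ", das.defaultVmSettings.isolationResponse)
# Heartbeat datastores let an isolated host prove it is still alive,
# which suppresses unnecessary failovers.
print("Heartbeat datastores:", [d.name for d in (das.heartbeatDatastore or [])])
```

Note that `hostMonitoring` comes back as a string ("enabled"/"disabled") rather than a boolean.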
Bottom line: vMotions occur because of performance events, and HA failovers happen because of availability events.
What you've done is pull the disk out from underneath a running VM. The standard behavior of vSphere, and most hypervisors, in this instance is to leave the virtual machine alone and let it handle its own disk issues. There are several good reasons for this:
- Some operating systems/distros (e.g. pfSense) will work just fine if the underlying disk stops responding
- A few dozen VMs starting up at the same time tends to create a "thundering herd" problem -- doing this on storage that's already questionable may not end up being the best idea
- Like swapping, the operating system (and applications) will usually do a better job of dealing with storage issues than the hypervisor will
- Sometimes storage just hangs -- it's the most failure-prone component in most virtualized environments. It's best to detect it, alert on it, and let an administrator figure out what to do before you kick over an entire environment (see the monitoring sketch after this list)
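Since HA won't react to this on its own, the "detect it and alert on it" approach has to come from monitoring. Here's a rough pyVmomi sketch of the idea, again assuming a connection `si` as above; the alerting here is just a print, so substitute whatever notification mechanism you actually use:

```python
# Illustrative sketch: flag datastores a host can no longer reach, plus
# VMs that have gone inaccessible (the state the question describes).
from pyVmomi import vim

content = si.RetrieveContent()

ds_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.Datastore], True)
for ds in ds_view.view:
    # ds.host lists every host mounting this datastore, with per-host status.
    for mount in ds.host:
        if not mount.mountInfo.accessible:
            print(f"ALERT: {mount.key.name} lost access to datastore {ds.name}")
ds_view.DestroyView()

vm_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
for vm in vm_view.view:
    # Anything other than 'connected' (e.g. 'inaccessible') warrants a look.
    if vm.runtime.connectionState != "connected":
        print(f"ALERT: VM {vm.name} is {vm.runtime.connectionState}")
vm_view.DestroyView()
```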
On the other hand, for many workloads (databases come to mind), it's a good idea to shut down as soon as there's a chance of corruption or lost transactions. Even in the best case, though, since you can't cleanly quiesce the database without the disk, you're probably ending up in an inconsistent state anyway.
Ultimately: there are some good use cases for having HA respond to unreliable storage, but it doesn't do that today, and the behavior you're seeing is totally normal.

- Thanks for the informative answer. I was using the term vMotion to describe the mechanism for failing over the VMs from vsphereB to vsphereA after I had pulled the iSCSI NIC cable from vsphereB. Is it not in fact the vMotion "component" or "engine" that performs failover, whether it be an HA failover or a DRS failover? In my scenario, should an HA failover have occurred? vsphereB no longer had access to the iSCSI datastore where the running VMs were located. How could the VMs continue to run on vsphereB if vsphereB no longer had access to the datastore? – joeqwerty Sep 29 '12 at 14:27
- HA failover is literally just poweroff/poweron -- if the cluster was healthy enough to do a vMotion, you wouldn't need to initiate an HA failover in the first place. The VMs continued to run because all the files they needed to run were already in memory -- though obviously any new read/write operations to the disk would time out and fail. – jgoldschrafe Sep 29 '12 at 14:29
- In addition, I reworded my question and replaced "vMotion" with "HA failover". – joeqwerty Sep 29 '12 at 14:31
- Gotcha. So only a power off will trigger an HA failover? What about pulling all of the NIC cables? That would simulate a failed host and a failed datastore heartbeat (from the perspective of the other hosts in the cluster). – joeqwerty Sep 29 '12 at 14:34
- Pulling out the management NICs alone should do it if you're configured for isolation response, but a poweroff of the host will certainly fail your VM over. – jgoldschrafe Sep 29 '12 at 15:48
- Thanks much for correcting my understanding of vMotion versus HA. – joeqwerty Sep 29 '12 at 15:53