I am trying to understand (and ideally fix) a particularly annoying phenomenon I've noticed with Failover Clustering and Cluster Shared Volumes in conjunction with Hyper-V. It appears that under high load, when the CSV backend storage is responding to all requests from all connected cluster nodes with considerable latency (>3 seconds), an "unresponsive" condition is triggered that takes down the Resource Hosting Subsystem (RHS) process holding the resource, which in turn takes the corresponding CSV offline for a short period of time (until the RHS process is restarted).
The log contains entries from the RHS with event 2051 "[RES] Physical Disk : IsAlive sanity check failed!, pending IO completed with status 170." followed by a number of entries indicating that the CSV is forcefully removed and re-attached.
As the CSVs host the Hyper-V guests' storage, Hyper-V then flags the VMs stored on the affected CSV with "cannot connect to virtual machine configuration storage" and powers them off. Even for guests it has not been quick enough to power down, disk access requests return errors for the duration of the outage, rendering them effectively unusable until their next restart.
Now since this only occurs in borderline conditions and is not easily reproducible, I do not have many occasions to verify possible solutions. After reading this and that, I have developed the hypothesis that some CSV I/O request (maybe, but not necessarily, a reservation - I understand reservations are only used for FS metadata writes, which should not be all that common) is failing due to a timeout, which then puts the entire resource into a kind of failed state (I have sketched below how I plan to confirm this from the cluster log the next time it happens). But this leaves me with some questions:
Does Hyper-V really have to power down the VMs whose storage it cannot access, or respond to the guests' storage requests with failures? Can't it simply freeze execution until the storage comes back?
How might I tune the RHS's checks so they do not fail due to a simple overload condition on the backend storage? The resource timeout properties I have found so far are sketched below.
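For reference, the per-resource properties that, as far as I understand, control these checks are LooksAlivePollInterval, IsAlivePollInterval, DeadlockTimeout and PendingTimeout; whether raising any of them would actually ride out a slow-storage episode is exactly what I cannot easily test. A minimal sketch of how I dump their current values (Python wrapping the FailoverClusters PowerShell cmdlets, simply because that is how I script things; the 'Physical Disk' filter is my assumption about how the CSV disks show up):

```python
import csv
import io
import subprocess

# Resource common properties that, as far as I understand, control the RHS
# health checks: how often LooksAlive/IsAlive are called and how long a call
# may block before RHS gives up on the resource. Values are in milliseconds.
PROPS = "Name,ResourceType,LooksAlivePollInterval,IsAlivePollInterval,DeadlockTimeout,PendingTimeout"


def dump_resource_timeouts() -> list[dict]:
    # Export every cluster resource with its current timeout settings as CSV
    # text via PowerShell and parse it on the Python side.
    ps = f"Get-ClusterResource | Select-Object {PROPS} | ConvertTo-Csv -NoTypeInformation"
    out = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", ps],
        capture_output=True, text=True, check=True,
    ).stdout
    return list(csv.DictReader(io.StringIO(out)))


if __name__ == "__main__":
    for row in dump_resource_timeouts():
        # The CSV disks should show up as 'Physical Disk' resources here;
        # drop this filter to see every resource in the cluster.
        if row["ResourceType"] == "Physical Disk":
            print(row)
```

If I read the documentation right, a poll interval of 4294967295 just means the resource type's default applies.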
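And to pin down which check actually times out the next time this happens, my plan is simply to pull the cluster log from all nodes right after an incident and grep it for the failing sanity check. Another rough sketch along the same lines (the five-minute window, the output directory and the file-name pattern are arbitrary choices of mine):

```python
import subprocess
from pathlib import Path


def dump_cluster_log(minutes: int = 5, dest: str = r"C:\Temp\ClusterLogs") -> Path:
    # Generate per-node cluster.log files covering the last few minutes;
    # the time span and destination are arbitrary choices of mine.
    Path(dest).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command",
         f"Get-ClusterLog -TimeSpan {minutes} -Destination '{dest}'"],
        check=True,
    )
    return Path(dest)


if __name__ == "__main__":
    log_dir = dump_cluster_log()
    # Grep the generated logs for the failing sanity check seen in event 2051.
    for log_file in log_dir.glob("*.log"):
        for line in log_file.read_text(errors="ignore").splitlines():
            if "IsAlive sanity check failed" in line or "status 170" in line:
                print(f"{log_file.name}: {line}")
```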