I am trying to understand (and ideally fix) a particularly annoying phenomenon I've noticed with Failover Clustering and Cluster Shared Volumes in conjunction with Hyper-V. It appears that under high load, when the CSV backend storage is responding to all requests from all connected cluster nodes with considerable latency (>3 seconds), an "unresponsive" condition is triggered that takes down the Resource Hosting Subsystem (RHS) process holding the resource, which in turn takes the corresponding CSV offline for a short period of time (until the RHS process is restarted).
The log contains entries from the RHS with event 2051 "[RES] Physical Disk : IsAlive sanity check failed!, pending IO completed with status 170." followed by a number of entries indicating that the CSV is forcefully removed and re-attached.
As the CSVs host the Hyper-V guests' storage, Hyper-V then flags the VMs stored on the affected CSV with "cannot connect to virtual machine configuration storage" and powers them off. Even for guests it has not been quick enough to power down, disk access requests return errors for the duration of the outage, rendering them effectively unusable until their next restart.
Now since this only occurs in borderline conditions and is not easily reproducible, I do not have many occasions to verify possible solutions. After reading this and that, I have developed the hypothesis that some CSV I/O request (maybe, but not necessarily, a reservation - I understand reservations are only used for FS metadata writes, which should not be all that common) is failing due to a timeout, which then puts the entire resource into a kind of failed state (I have sketched below how I plan to confirm this from the cluster log the next time it happens). But this leaves me with some questions:
Does Hyper-V really have to power down the VMs whose storage it cannot access, or respond to the guests' storage requests with failures? Can't it simply freeze execution until the storage comes back?
How might I tune the RHS's checks so they do not fail due to a simple overload condition on the backend storage? The resource timeout properties I have found so far are sketched below.
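For reference, the per-resource properties that, as far as I understand, control these checks are LooksAlivePollInterval, IsAlivePollInterval, DeadlockTimeout and PendingTimeout; whether raising any of them would actually ride out a slow-storage episode is exactly what I cannot easily test. A minimal sketch of how I dump their current values (Python wrapping the FailoverClusters PowerShell cmdlets, simply because that is how I script things; the 'Physical Disk' filter is my assumption about how the CSV disks show up):

```python
import csv
import io
import subprocess

# Resource common properties that, as far as I understand, control the RHS
# health checks: how often LooksAlive/IsAlive are called and how long a call
# may block before RHS gives up on the resource. Values are in milliseconds.
PROPS = "Name,ResourceType,LooksAlivePollInterval,IsAlivePollInterval,DeadlockTimeout,PendingTimeout"


def dump_resource_timeouts() -> list[dict]:
    # Export every cluster resource with its current timeout settings as CSV
    # text via PowerShell and parse it on the Python side.
    ps = f"Get-ClusterResource | Select-Object {PROPS} | ConvertTo-Csv -NoTypeInformation"
    out = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", ps],
        capture_output=True, text=True, check=True,
    ).stdout
    return list(csv.DictReader(io.StringIO(out)))


if __name__ == "__main__":
    for row in dump_resource_timeouts():
        # The CSV disks should show up as 'Physical Disk' resources here;
        # drop this filter to see every resource in the cluster.
        if row["ResourceType"] == "Physical Disk":
            print(row)
```

If I read the documentation right, a poll interval of 4294967295 just means the resource type's default applies.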
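And to pin down which check actually times out the next time this happens, my plan is simply to pull the cluster log from all nodes right after an incident and grep it for the failing sanity check. Another rough sketch along the same lines (the five-minute window, the output directory and the file-name pattern are arbitrary choices of mine):

```python
import subprocess
from pathlib import Path


def dump_cluster_log(minutes: int = 5, dest: str = r"C:\Temp\ClusterLogs") -> Path:
    # Generate per-node cluster.log files covering the last few minutes;
    # the time span and destination are arbitrary choices of mine.
    Path(dest).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command",
         f"Get-ClusterLog -TimeSpan {minutes} -Destination '{dest}'"],
        check=True,
    )
    return Path(dest)


if __name__ == "__main__":
    log_dir = dump_cluster_log()
    # Grep the generated logs for the failing sanity check seen in event 2051.
    for log_file in log_dir.glob("*.log"):
        for line in log_file.read_text(errors="ignore").splitlines():
            if "IsAlive sanity check failed" in line or "status 170" in line:
                print(f"{log_file.name}: {line}")
```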