3

I've been looking at some performance issues on a clustered virtual machine in our organisation. Actually this problem seems to affect most of the virtual machines I have looked at. Both host and VM are 2008R2 with SP1.

I believe - from what I have read in various articles and advice I have been given - that I/O latency is the most important metric to be looking at. I've looked at this metric in three different places:

  • LUN latency on the storage appliance
  • LogicalDisk Avg. Disk sec/Write and Avg. Disk sec/Read on the Hyper-V host
  • The same as above, but on the virtual machines themselves

This was an effort to narrow down where any latency is being introduced. Sure enough, this is what I found.

What I'm seeing is what I would consider acceptable latency (3-15 ms) on the LUNs, and up to 20 ms (still acceptable) on the Hyper-V host. When I look at the same metrics on a VM, I see regular spikes of up to 300 ms lasting up to 10 seconds at a time, with an average of about 20-30 ms.
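
For anyone wanting to quantify spikes like these, here is a minimal sketch of how runs of high-latency samples could be picked out of a counter log. It assumes the latency samples have already been exported to a plain list of millisecond values taken at a fixed interval; the function name `find_spikes`, the 100 ms threshold, and the sample values are all made up for illustration and are not measurements from this environment.

```python
def find_spikes(samples_ms, interval_s=1.0, threshold_ms=100.0):
    """Return (start_index, duration_s, peak_ms) for each run of
    consecutive samples above threshold_ms."""
    spikes = []
    start = None  # index where the current run began, if any
    for i, value in enumerate(samples_ms):
        if value > threshold_ms:
            if start is None:
                start = i
        elif start is not None:
            run = samples_ms[start:i]
            spikes.append((start, len(run) * interval_s, max(run)))
            start = None
    if start is not None:  # the log ended while still inside a run
        run = samples_ms[start:]
        spikes.append((start, len(run) * interval_s, max(run)))
    return spikes


# Hypothetical 1-second samples (ms), not real data from the VM
samples = [12, 15, 20, 300, 280, 250, 18, 14, 22, 310, 16]
for start, duration, peak in find_spikes(samples):
    print(f"spike at sample {start}: {duration:.0f} s, peak {peak} ms")
```

Running the same function over the LUN, host, and VM series would show whether the spikes line up in time across the three layers.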

This particular VM is a SQL server, but the same applies to non-SQL servers too. The relevant exceptions are added to our AV solution to avoid on-access scanning of DB files. Also, our VHDs are of a fixed size as opposed to dynamically expanding.

So for my question:

What are the likely causes of this latency, and/or what other metrics could I be using within the VM (or even on the Host) to narrow this down?

john

2 Answers

3

Measuring time within a VM can be problematic, as the virtual processors don't execute continuously. If you want to get a clear view of what's actually happening, use Performance Monitor in the management OS. Look for Hyper-V Virtual Storage Device. You can correlate that with data from Resource Monitor, too, to see what's contending for access to the disks.

In general, the response time of a particular VHD will have everything to do with what else is happening on the volume hosting that VHD.
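
Because short-interval timings taken inside a guest can be noisy, one way to compare guest and host figures is to average both over longer windows first. The following is only a sketch under that assumption: `window_means` is a hypothetical helper, and the two sample lists are invented values standing in for exported counter data.

```python
def window_means(samples_ms, window):
    """Average consecutive groups of `window` samples.
    A trailing partial group is averaged over its own length."""
    if window <= 0:
        raise ValueError("window must be a positive number of samples")
    means = []
    for i in range(0, len(samples_ms), window):
        chunk = samples_ms[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Hypothetical per-second latency samples (ms), not real measurements
guest = [10, 30, 300, 20, 10, 20, 15, 25]
host = [12, 18, 20, 15, 10, 20, 14, 16]

for g, h in zip(window_means(guest, 4), window_means(host, 4)):
    print(f"guest {g:.1f} ms, host {h:.1f} ms")
```

The window length is a judgment call: a longer window hides genuine short spikes as well as measurement noise, so it is worth trying more than one.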

Jake Oshins
  • When you say 'volume', I guess you would be referring to the CSV as opposed to any volume the LUNs might sit in? – john Oct 03 '13 at 18:31
  • I'm not sure what your terminology is getting at. Yes, CSV is the file system that is typically used for VHDs, and that sits on a volume, which is shared across the host cluster. That volume sits on top of a LUN, which is part of a disk pool, usually (with Server 2008 R2) in a SAN. What, specifically, are you asking? – Jake Oshins Oct 03 '13 at 19:43
  • Sorry, we have volumes on our storage appliances too. Are you suggesting that the logical disk metrics on the VM are useless? – john Oct 03 '13 at 20:01
  • Additionally, Microsoft assert that the same metrics *can* be used: http://technet.microsoft.com/en-us/library/cc768535%28BTS.10%29.aspx – john Oct 04 '13 at 14:35
  • No, I'm not suggesting they're useless. I'm merely pointing out that time within a VM is virtualized, and thus measurements of anything that tries to look at small time quanta will result in messy data. Data you collect over long periods of time will be generally right. Looking at any small-span time period may not be accurate. – Jake Oshins Oct 04 '13 at 17:37
  • OK, that's a very useful clarification. Do you think you could add that to your answer, please? Unfortunately, the counter set you alluded to doesn't contain a metric for latency. – john Oct 04 '13 at 18:13
  • There's a similar opinion in the [comments here](https://support.microsoft.com/en-us/kb/943556?wa=wsignin1.0). – Nick Westgate Jun 18 '15 at 05:09
1

Your 'disk latency' on the VM could be CPU latency on the host since the host has to use CPU cycles for IO requests.

Is the host heavily loaded overall? Or is it just running a lot of VMs? I'm not sure what the Hyper-V equivalent is, but the VMware metric is CPU ready time: basically, how often the VM is waiting on the host to run.

AngerClown
  • The overall idle time on the host is about 80-90%, so no, the host is not overloaded. – john Oct 04 '13 at 14:29