
When I perform heavy disk operations, like deleting 10k files at a time, the network share becomes unresponsive and won't serve out files for a short time.

Here's my configuration. I have a failover file server cluster composed of two Windows 2008 R2 Enterprise servers. Each server is a VM, and each VM runs on one of two independent Dell PowerEdge hosts running Windows Hyper-V. Both of the Dell hosts have dedicated NICs to a Dell MD3000i SAN. Each of the file server VMs routes its iSCSI connection through this dedicated NIC to the volume on the SAN where the files reside.

If I run a batch file that performs 10k deletes from a remote machine, referencing each file by share name (i.e. \\fileserver\sharename\folder\filename.jpg), it may do 1,000 or 8,000 deletes before the share gives out; it's random each time. Oddly, the batch file will continue deleting the files, but other servers accessing files on that same share get held up. The files I'm deleting are not accessed by other servers, so locking of those specific files is not an issue.
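For illustration only (the exact contents of the batch file aren't reproduced here), a minimal sketch of the kind of loop involved, using the example path and extension from above:

    @echo off
    rem Hypothetical reproduction sketch, not the actual batch file.
    rem Issues one del per file against the UNC path, i.e. ~10k individual delete operations.
    pushd \\fileserver\sharename\folder
    for %%F in (*.jpg) do del "%%F"
    popd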

If I run the same batch file on the master server of the file cluster and reference the files by their local path (i.e. x:\folder\filename.jpg), the share cuts out immediately and the other servers sit and wait. Access to that share resumes when I terminate the running batch file.

Anyone have an idea as to the cause of the share cutting out or what I could do to diagnose this issue further? Any suggestions are greatly appreciated.


Update: I've isolated this problem to occurring only within the boundaries of the host box. None of the network traffic involved in reproducing the problem with the VMs reaches the physical switch the host connects to, other than the iSCSI connection to the SAN. The iSCSI connection has its own dedicated switch and private subnet to the SAN, separate from standard network traffic.

Adam Winter
  • Question: does this happen only with the command line batch file, or does it happen from Explorer with the highlight and delete of the files as well? – Bart Silverstrim Jun 22 '10 at 15:24
  • Do you have live migration enabled? – tony roth Jun 22 '10 at 15:43
  • I've only had this happen with command line batch files because of the high number of deletes that are required. In addition, large files do not seem to suffer the problem. I can copy 3-4GB files all day with no problems and high performance, but doing 10k deletes kills the system. Also, I am not using live migration. – Adam Winter Jun 22 '10 at 16:28
  • You say a batch file; what's in the batch file? – tony roth Jun 22 '10 at 16:51

2 Answers


This screams resource depletion of some kind. If this were a Linux host I'd be thinking, "this sounds like a boatload of IO-wait." Check OS-level performance monitors, as mfinni pointed out. You have two areas that could be bottlenecking: logical/physical disk performance, and network performance on the iSCSI connection. PerfMon can give you this. I don't know Hyper-V at all, but if it is anything like VMware then you have some performance metrics on the hypervisor side you can look into as well. Do so.
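If you'd rather do this from the command line than the PerfMon GUI, typeperf can log the relevant counters to a CSV. A minimal sketch, assuming the default English counter names on 2008 R2; the (*) instance wildcards are placeholders you would narrow to your data volume and your iSCSI-facing NIC:

    rem Sample disk latency/queue depth and NIC throughput every 5 seconds for 5 minutes.
    typeperf "\PhysicalDisk(*)\Avg. Disk sec/Transfer" ^
             "\PhysicalDisk(*)\Current Disk Queue Length" ^
             "\Network Interface(*)\Bytes Sent/sec" ^
             "\Network Interface(*)\Packets Sent/sec" ^
             -si 5 -sc 60 -o perf_baseline.csv

Capture one run at idle and one while the delete batch is going, and compare.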

As a theory, my guess is that the very high rate of metadata updates you're doing is magnifying some inherent latency in your iSCSI stack. This in turn crowds out other I/O and metadata requests, which results in the symptoms you describe: other processes can't get a word in edgewise while the MFT blocks are being hammered by this one. iSCSI itself can cause this, but the VM layer is probably adding its own internal delays. If this is indeed the problem, you might want to consider presenting the iSCSI LUN to the hypervisor instead and then presenting the resulting disk to the VM; that way you're relying on a physical network adapter for iSCSI rather than a virtualized one.

Edit: It seems that you probably have this kind of fault on your hands. The PerfMon counters I'd pay attention to are "Bytes Sent/sec" and "Packets Sent/sec" for the interface running the iSCSI connection. The combination of the two should give you your average packet size. (Alternately, if you have the ability, throw a sniffer into the loop and see what the packets look like at the network switch; this is the more reliable method if you can do it.) If that packet size is pretty small (say, under 800 bytes), then there is not much you can do about this other than get down to the TCP level and see what kind of optimizations can be made between your cluster nodes and the iSCSI target. Server 2008 is picky with its TCP settings, so there may be gains to be made here.
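A rough command-line way to get at that, again assuming the default English counter names; "iSCSI NIC" below is just a placeholder for the actual interface instance name (typeperf -qx "\Network Interface" lists them). Divide Bytes Sent/sec by Packets Sent/sec per sample for the average packet size:

    rem Log the two counters for the iSCSI-facing NIC while the delete job is running.
    rem Average packet size per sample = "Bytes Sent/sec" / "Packets Sent/sec".
    typeperf "\Network Interface(iSCSI NIC)\Bytes Sent/sec" ^
             "\Network Interface(iSCSI NIC)\Packets Sent/sec" ^
             -si 1 -sc 120 -o iscsi_packets.csv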

sysadmin1138
  • What you're saying makes sense and has credibility. Any suggestions on how I can prove it? What would I monitor in perfmon to validate this theory? I can't move the iSCSI LUNs to the host box because this is a cluster. If the host box goes down, so do the LUNs, so I have to keep them attached to the VM. – Adam Winter Jun 22 '10 at 16:20
  • Large file copies do not suffer the same problem, but doing a ton of small file deletes locally on the master server of the file cluster kills the share to other servers. If the problem were with the iSCSI going through the virtual switches of Hyper-V, why would it not suffer with big data writes as it does with small requests? I agree that some type of requests are queuing up somewhere, and I need a means of finding out. – Adam Winter Jun 22 '10 at 16:32
  • 1
    If large-file options don't do it but lots of itty bitty files do, then it is almost definitely meta-data related in some way. What's doing it is probably lots of little data operations on the same few blocks of the file-system. This is the kind of thing that write-combining in RAID cards is designed to help with, but in this case you've got a network stack between you and your RAID cache, and this kind of operation is HIGHLY sensitive to network latency. – sysadmin1138 Jun 22 '10 at 16:50
  • Do you have the NICs set to auto or manual? How about the switch ports? They need to match. – tony roth Jun 22 '10 at 18:38
  • Just had another thought. We set up this file server cluster because we wanted to expand our web hosting environment from 1 server to multiple. The volume on the SAN which the file cluster points to used to be connected to a physical machine using iSCSI running Windows Web 2008. At that time, file ops to this volume were super fast. When we expanded, we mounted this volume to the file cluster running as VMs so that the data would be highly available. Could the move to a VM negatively affect the iSCSI connections? That would imply the virtual switch within Hyper-V is the main culprit. – Adam Winter Jun 22 '10 at 19:04
  • The virtual switch in Hyper-V is my top suspect right now. The kind of op you're doing will seriously magnify even small slowdowns in there. – sysadmin1138 Jun 22 '10 at 19:43
  • One more thought on this subject. Before moving this volume from the physical machine as described above, we converted that physical machine to a virtual machine on the same host as the file cluster node. The iSCSI connection on the web server VM still performed great with the batch file. What would have caused performance to drop when moving that volume from the web server VM to the file server VM on the same host, using the same NICs and virtual switch on the host? Something to do with the cluster service? – Adam Winter Jun 22 '10 at 20:35

Good lord. Is there anything in the event viewer to indicate the OS is seeing some sort of resource depletion? Can you inspect with perfmon?
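As a sketch, one way to pull recent warnings and errors from the System log without clicking through the GUI, using wevtutil (included with 2008 R2):

    rem Dump the 50 most recent critical/error/warning events from the System log, newest first.
    wevtutil qe System /q:"*[System[(Level=1 or Level=2 or Level=3)]]" /c:50 /rd:true /f:text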

mfinni