We have a number of Server 2012 systems, all of which run virtualised on Hyper-V Server 2012. We are having problems with two of these virtual machines, both of which are used as file servers: they occasionally stop responding to requests to serve files to clients. After logging on to an affected server, attempts to shut it down gracefully fail (there is no error, it just fails to acknowledge the shutdown request).

Recovery is a case of power cycling the server(s) from the Hyper-V console.

These two servers don't serve a large number of users (one serves no more than 6 users, and the other around 20). They are in the same domain, but on different physical hardware at different sites, and they don't lock up at the same time. They both use DFSR to replicate a fairly large amount of data (200GB) between themselves over ADSL connections. This is working fine, and we used DFSR in the same way on the previous two generations of server OS (Server 2008 R2 and Server 2003), although both of those were physical installs.

Today, when one of the servers crashed, I noticed an entry in the event log, which looked similar to the following:

Log Name:      Application
Source:        ESENT
Date:          27/11/2012 10:25:55
Event ID:      533
Task Category: General
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      HAL-FS-01.example.com
Description:
DFSRs (1500) \\.\E:\System Volume Information\DFSR\database_C8CC_101_CC00_EC0E\
dfsr.db: A request to write to the file "\\.\E:\System Volume Information\
DFSR\database_C8CC_101_CC00_EC0E\fsr.log" at offset 4423680 (0x0000000000438000)
for 4096 (0x00001000) bytes has not completed for 36 second(s). This problem is
likely due to faulty hardware. Please contact your hardware vendor for further
assistance diagnosing the problem.

When the server started up again, I went to find the event log entry to investigate further and found that the event log entry was no longer there (I assume it was in memory but failed to write to disk before the server was powered off, for the reason mentioned in the message). I found the above message by searching back further in the event log.
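Because the entry can vanish after a hard power-off, it may be worth copying the description text out before rebooting and recording the stall details from it. The following is a minimal sketch in Python, assuming the description has been saved as plain text with the file path on one line; the helper name `parse_esent_stall` and the regular expression are illustrative only and not part of any Microsoft tooling.

```python
import re

# Illustrative pattern for the wording of ESENT event 533 as quoted above.
# It is an assumption that other builds use exactly the same wording.
STALL_RE = re.compile(
    r'write to the file "(?P<path>[^"]+)" '
    r'at offset (?P<offset>\d+) \(0x[0-9A-Fa-f]+\) '
    r'for (?P<size>\d+) \(0x[0-9A-Fa-f]+\) bytes '
    r'has not completed for (?P<seconds>\d+) second'
)

# Sample text taken from the event description in the question.
SAMPLE = (
    r'A request to write to the file '
    r'"\\.\E:\System Volume Information\DFSR\database_C8CC_101_CC00_EC0E\fsr.log" '
    r'at offset 4423680 (0x0000000000438000) for 4096 (0x00001000) bytes '
    r'has not completed for 36 second(s).'
)


def parse_esent_stall(description):
    """Return the file path, offset, size and stall time, or None if no match."""
    # Collapse line wraps and repeated spaces before matching.
    text = " ".join(description.split())
    match = STALL_RE.search(text)
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "offset": int(match.group("offset")),
        "size": int(match.group("size")),
        "seconds": int(match.group("seconds")),
    }


if __name__ == "__main__":
    print(parse_esent_stall(SAMPLE))
```

Keeping these values alongside the time of each lock-up makes it easier to compare the stalls on the two servers later.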

Both of these virtual servers have their E: volumes fully allocated as opposed to dynamically expanding, and there are no other issues on any of the other virtual servers (which include server 2012, server 2008 R2 and Ubuntu 12.04 x64). There are no signs of IO, memory or CPU starvation on the host systems.

I've used performance counters on the affected virtual servers to monitor memory usage (including non paged pool usage), as well as CPU and network utilisation, and none of these show any signs of trouble when the issue arises.

I would have thought our configuration isn't that uncommon, so I'm wondering if anyone else has seen this, and managed to resolve the problem?

The host specifications are as follows:

hal-vm-01, running a total of 5 virtual servers (affected file server, DC + other guests), is a Dell PowerEdge R710, 16GB, 6 x 300GB SAS 15K RAID 10, PERC H700

hey-vm-01, running 2 virtual servers (affected file server and DC), is a Dell PowerEdge T620, 16GB, 2 x 3TB SATA RAID 1, PERC H310

We have a further virtual server host, hal-vm-02, running 5 guests, which is unaffected by this problem. It is a lower spec than hal-vm-01, but loaded about the same (Exchange, DC, SQL + other guests). More memory is on the way so that we can configure shared nothing failover between this host and 'hal-vm-01'.

There is AV software (MS SCEP) running on the two affected virtual servers. It is configured to scan on create only, and not to scan files created by the dfsrs.exe process. There is no AV software running on the VM hosts themselves.

We are using Windows Server 2012 backup on the host hal-vm-01 to back up all the VMs; this runs out of hours. The other affected host, hey-vm-01, isn't backed up, as it just holds an off-site DFSR copy of the data at our main office. Another backup job runs on the affected virtual guest hal-fs-01; this also uses Windows Server Backup, to take snapshots of the data stored in the DFS replicated shares. Both backup jobs run out of office hours.


Three months later...

We've had a support ticket open with Microsoft for over three months now, and have sent them lots of performance counter logs, memory dumps and event logs. Their analysis indicated a problem with one of the virtual drives of hal-fs-01 (the virtual server with the problem). The virtual drive in question was the server's E:\ drive, which happened to hold all our DFSR groups and shares. Recently, I moved all data off the E:\ drive to many smaller virtual disks that I added to the server, and moved all the shares and DFSR groups with it, leaving just the Windows Deployment Services files on the E:\ drive. Despite this, we still saw writes to the E:\ drive failing.

Last week I moved the WDS files to a new virtual disk and disabled the WDS service. I also deleted the E:\ virtual disk, just in case there was some anomaly with the disk. Since then we've not had another failure, but it's too early to know whether this has fixed the problem: our longest uptime was previously around 2 weeks and, as of this edit (20/03/2013), we are only one week into the current config. If the problem hasn't resurfaced by next week, I'll re-enable WDS, as I have a suspicion that WDS could be the culprit.

I'll keep this question updated (or provide an answer if I manage to resolve the problem).


Moved back to Server 2008 R2...

I haven't updated the question with progress, but we ended up rolling back to Server 2008 R2 and everything works fine. I'd still be interested in hearing from anyone who has had this issue and managed to find a fix.

Bryan
  • This doesn't exactly look like a misconfiguration, so I'd probably turn to Microsoft Support. If you have SA then you probably have some free calls left. – pauska Dec 05 '12 at 11:29
  • @pauska I'm inclined to think the problem lies somewhere in Server 2012's DFSR implementation, Hyper-V 2012, or a weird combination of the two. We have SA, so I'll investigate that, thanks. I can't believe I'm the only one with a configuration like this though, which is why I thought I'd ask here. Of course, copious Google searches have returned nothing of any interest. – Bryan Dec 05 '12 at 11:35
  • Are you by any chance using replicas on these VM's? – pauska Dec 05 '12 at 11:40
  • @pauska, Yes, but only on one of the two affected VMs. The hosts `hal-vm-01` and `hal-vm-02` replicate all VMs. The affected server on `hey-vm-01` doesn't have a replica. – Bryan Dec 05 '12 at 11:47
  • Can you disable HV replica on them and see if it solves it? I'm kind of thinking that DFS-R and VSS do not play nicely together in WS2012.. – pauska Dec 05 '12 at 12:21
  • I'll certainly give that a go @pauska on the one host that has a replica, but the second VM, which fails in the same way, doesn't have a replica copy (or use VSS), which kind of suggests this isn't the problem. Out of interest, have you had any experience yourself that makes you think this, or are you basing this on the info in my question? – Bryan Dec 05 '12 at 12:46
  • No, sorry, this is pure guess based on your question and the info you provided. I agree that the other server should not stop responding, as DFS isn't clustering.. but it doesn't hurt to at least test it before calling MS support. – pauska Dec 05 '12 at 13:08
  • Thanks @pauska for your help. I'll give that a try, it can't hurt. – Bryan Dec 05 '12 at 13:28
  • do you have a host based antivirus solution running on the parent partition? – tony roth Dec 05 '12 at 16:07
  • @tonyroth I've updated the question with details of the AV in use. – Bryan Dec 05 '12 at 16:16
  • @Bryan "Please contact your hardware vendor for further assistance diagnosing the problem" means the host is having problems, in this case the drive that's hosting the vhd(x) may be experiencing a problem. – tony roth Dec 05 '12 at 16:27
  • @tonyroth understood, but if the host were having problems, why are no other VMs on that server being affected? The server is easily capable of handling the load generated by our low user base. Remember we are seeing this on two virtual servers. One of the hosts has only two guest VMs: one is a file server for a maximum of 6 users and the other is a domain controller. We are talking very low usage here, and more than capable server hardware. – Bryan Dec 05 '12 at 16:34
  • What type of storage is backing the VHD's and the DFS data? iSCSI, or local storage? – longneck Dec 05 '12 at 16:40
  • Bizarrely, we have EXACTLY the same problem but a slightly more simple setup than you. Will try and find that log entry too.... –  Dec 05 '12 at 16:50
  • @longneck Storage is local - SAS 15K on one server, SATA on the less used server. – Bryan Dec 05 '12 at 16:56
  • @Julian Interesting, be sure to check your application log before rebooting, as the entry doesn't always get written to disk, and hence no longer exists when you reboot the server. – Bryan Dec 05 '12 at 16:57
  • I've now contacted Microsoft PSS regarding the issue. – Bryan Dec 11 '12 at 10:26
  • Did you get a resolution to your issue? – longneck Mar 20 '13 at 02:22
  • @longneck It's still ongoing. I'll update the question with the latest findings. – Bryan Mar 20 '13 at 12:53
  • We've started to encounter a similar issue on a Windows 2012 Hyper-V based system connected to a StarWind SAN with primary (SAS) and secondary (SATA-3) storage. The file server at site A became totally unresponsive last week, with a log full of ESENT/DFS-R errors. Then the same thing happened yesterday at site B (sites A and B replicate between each other). In both cases, we were able to shut down the virtual file server. The core problem though was that our DFS structure was impacted across all sites until this was resolved – Rob Nicholson Mar 05 '14 at 13:03
  • It is rather ironic that a technology designed to help business continuity (failover to another site) caused us so many problems ;-) The event log is full of ESENT warnings and these usually coincide with when DPM 2012 carries out synchronisation. This uses VSS and, as this is (I understand) an atomic operation, I'm not totally surprised that there is a 20 second delay in normal operation whilst the snapshot is created. This would be fine if the warnings were just that ("Took a while to respond - ohh yes, it's VSS"), but what appears to happen is that DFS-R/ESENT sometimes gets into a state... – Rob Nicholson Mar 05 '14 at 13:07
  • ...whereby it's continually reporting these errors. If delays in disk writes are an expected side effect of VSS, then I'm in the camp that there is a flaw in the error handling/retry code in ESENT/DFS-R. BTW - like the originator, we used the same system previously on Windows 2008 servers running on XenServer and a less powerful StarWind SAN. That worked fine... – Rob Nicholson Mar 05 '14 at 13:09
  • BTW - I've watched the StarWind SAN during these DPM sync/VSS windows and it hardly breaks into a sweat disk-wise: queue length of around 2 with occasional peaks of 5. The 4 x 1Gbit SAN backbone is busy (60%), which indicates we're network limited and not disk limited. During sync, it's mainly reads from the SAN to the DPM dedicated RAID-5 array - and we know how fast RAID-5 is at writing... so whilst it might be easy to point the finger at SAN/network/disks etc., I feel that ESENT/DFS-R is the cause - it's not resilient enough – Rob Nicholson Mar 05 '14 at 13:13

2 Answers


OK, I am not sure if this will be of any help, but the factor I have in common with you is that I had my drives connected to a PERC H310 controller, and I was running a file server in a virtual environment with its data drive mapped to a raw disk connected to the same H310. At random times, usually during periods of high I/O, the virtual machine would complain that it could not access the drive and would crash. I ended up connecting the drives to the onboard Intel controller and have had no problems since. I personally think the low-end PERC cards have quirks that can cause issues with I/O-sensitive operations.

RyPaul
  • We have two physical servers with this problem, one with a PERC H310 and one with a PERC H700. I'm _pretty_ sure it isn't this, as it only affects one virtual drive on each server. The common factors for me are Server 2012, DFSR, Hyper-V and, as recently noted, WDS on both servers. – Bryan Mar 20 '13 at 13:30
  • Sorry I could not help. The only other thing I could add is to make sure write caching is turned on for the PERC controllers, if you want it enabled that is. I found that with some of them, when you add disks they default to no write cache, which can hinder write speeds. Good luck with the problem, I hope you get it solved. –  Mar 20 '13 at 18:01

You're looking in the wrong place, I think. Look at the host: that smells like a host issue with the disk subsystem, which is either faulty or SIGNIFICANTLY overloaded.

TomTom
  • I'd agree but for the fact that none of the other systems on the same host are affected, and the performance counters don't suggest this is the case. – Bryan Dec 05 '12 at 10:59
  • @Bryan I tend to agree with tomtom, the perf counters won't show you the issue in this case. How frequently does the problem occur or can you get it to repeat on demand? BTW the error message you posted is probably talking about an issue with the host not the guest. – tony roth Dec 05 '12 at 16:22
  • @tonyroth Why wouldn't they? I've got performance counters running on both the host and the guest. I've not found a way of creating the problem on demand, and it happens once every 5 - 10 working days. It always happens when staff are in the office, never out of hours. – Bryan Dec 05 '12 at 16:26
  • @bryan OK, a better question is: what counters did you look at? If you look at the usual suspects you won't see the issue. Not sure if the w2k12 servers expose the storport latency yet, it wasn't under w2k8r2. – tony roth Dec 05 '12 at 17:02
  • The usual suspects I guess, Logical & Physical disk, Avg. Queue Lengths (read, write & total), Memory, CPU, Non Paged pool, on both virtual and physical host. – Bryan Dec 05 '12 at 17:11
  • 1
    yes exactly these won't show the issue at all, this article is about w2k8r2 http://support.microsoft.com/kb/978000 but you can ignore the downloading of the hotfix w2k8r2sp2+ has it by default. Just read the section on setting the values for thresholds. My guess is that the storport etw results will show quite a dropout at the physical disk level, now whats causing this will be the interesting part. btw do this at the host level. – tony roth Dec 05 '12 at 17:27
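
One possible reason averaged counters miss this kind of problem is that a single long stall gets diluted by many fast writes in the same sample interval. A rough sketch in Python with made-up numbers illustrates the effect. The 5 ms baseline, the 1,000 writes per interval and the 1 second threshold are assumptions; only the 36 second stall comes from the event in the question, and `summarize` is a hypothetical helper, not a Windows API.

```python
def summarize(latencies_s, threshold_s=1.0):
    """Return (mean, worst, count above threshold) for write latencies in seconds."""
    mean = sum(latencies_s) / len(latencies_s)
    worst = max(latencies_s)
    slow = sum(1 for value in latencies_s if value > threshold_s)
    return mean, worst, slow


# Assumed workload: 999 writes at 5 ms each plus one write stalled for 36 s.
samples = [0.005] * 999 + [36.0]

mean, worst, slow = summarize(samples)
print(f"mean = {mean * 1000:.1f} ms, worst = {worst:.1f} s, over threshold = {slow}")
```

With these assumed numbers the mean works out at roughly 41 ms, which would not look alarming on a graph, while the per-request view shows one write that took 36 seconds. Per-request latency data, such as the storport tracing mentioned above, is what would expose it.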