I have a SQL Server instance (SQL Server 2008 R2, Windows 2008 R2) that complains, for very short, random periods of about 15-20 seconds, that some of its I/O requests are taking longer than 15 seconds. ("SQL Server has encountered x occurrence(s) of I/O requests taking longer than 15 seconds to complete on file x".) The disks in question are part of a SAN. In such a scenario you would typically see IOPS or throughput demands on the disk spike, producing the latency and suggesting that the LUNs need to be beefed up to match the server's needs. In this case, however, there is no such spike. On the contrary, according to perfmon, activity on the affected disk goes from a steady state to almost nothing at all, and the reported latency actually improves a good deal. (And, I should add, we've searched on the SQL Server side for evidence of any sudden burst of activity, to no avail. The nature of the workload is such that a sudden drop in server activity is not possible.) There is a brief compensatory spike after the slow I/O incident, as requests catch up after the interruption.
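To make the pattern concrete, here is a minimal sketch of how the stall windows could be picked out of perfmon samples. It assumes the counter data has already been reduced to (seconds, transfers per second) pairs; the function name `find_stalls`, the 10% threshold, and the three-sample minimum are all arbitrary choices for illustration, not anything perfmon provides.

```python
from statistics import median

def find_stalls(samples, fraction=0.1, min_samples=3):
    """Return (start, end) times of runs where activity collapses.

    samples: list of (seconds, transfers_per_sec) pairs in time order.
    A run counts as a stall when at least min_samples consecutive
    samples fall below fraction * median of the whole series.
    (Hypothetical helper; thresholds are illustrative assumptions.)
    """
    if not samples:
        return []
    threshold = fraction * median(v for _, v in samples)
    stalls, run = [], []
    for t, v in samples:
        if v < threshold:
            run.append(t)
        else:
            if len(run) >= min_samples:
                stalls.append((run[0], run[-1]))
            run = []
    # Close out a stall that lasts until the end of the series.
    if len(run) >= min_samples:
        stalls.append((run[0], run[-1]))
    return stalls
```

Fed a steady series with a few near-zero samples in the middle, this returns the start and end time of the quiet window, which can then be compared against the timestamps of the SQL Server warnings.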

The SAN folks have gone over everything with a fine-toothed comb (including the configuration of the host) and declare that nothing is wrong from their perspective. It so happens that we are using both anti-virus on this server (with proper file exclusions) and an encryption solution that operates like a file system driver, so I am naturally suspicious that either or both of these may be the source of the problem. But I'd like to be able to present a smoking gun when I call everyone into the sitting room to reveal the murderer. Other than consulting the vendors (which naturally we are doing), any suggestions for troubleshooting intermittent latency issues that may be caused by an application intercepting file system requests? Any tools or techniques, perhaps, that might show exactly what's slowing things down? I'm afraid that turning off either the AV or the encryption to see what happens is a non-starter. Just to complicate matters, this problem, so far, cannot be reproduced on demand.

Eldergriffon
  • What brand of servers are these? I vaguely remember something like this occurring on some HP DL385 servers we used to run a couple of years ago, and I believe the required fix was a driver update. If you have HPs, let me know and I'll continue down this path. – Jeremy Lyons Mar 20 '13 at 21:18
  • These are Dells, unfortunately, but thanks. Do you remember which driver you had to update on the HPs? – Eldergriffon Mar 20 '13 at 23:06
  • It was a very specific bug in an HP driver, and I don't want to give you bad guidance on what sounds like a different scenario. You could always download the latest Dell Server Update Utility and make sure your hardware drivers and BIOS are up to date. – Jeremy Lyons Mar 21 '13 at 15:12

1 Answer

Here is another link bomb and run: http://support.microsoft.com/kb/978000 and http://blogs.msdn.com/b/ntdebugging/archive/2010/04/22/etw-storport.aspx

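These will give you more insight as to whether it's a filter driver issue or a SAN issue. Storport sits below the file system filter drivers, so if its trace shows the requests completing quickly while SQL Server still reports them as slow, the delay is being added above it (AV or encryption filter); if the trace itself shows slow completions, look at the HBA/SAN side.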
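To line a trace up against the SQL Server warnings, it may help to pull the warning timestamps out of the error log first. The following is only a sketch: it assumes each log line starts with a `YYYY-MM-DD HH:MM:SS.ff` timestamp followed by a source column (such as a spid), and that the file name appears in square brackets after the message quoted in the question. The name `slow_io_events` is made up for the example.

```python
import re
from datetime import datetime

# Assumed line layout: "<timestamp> <source>  SQL Server has encountered ..."
PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\s+\S+\s+"
    r"SQL Server has encountered (?P<count>\d+) occurrence\(s\) of I/O requests "
    r"taking longer than 15 seconds to complete on file \[(?P<file>[^\]]+)\]"
)

def slow_io_events(lines):
    """Return (timestamp, occurrence_count, file_path) for each slow-I/O warning."""
    events = []
    for line in lines:
        m = PATTERN.match(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            events.append((ts, int(m["count"]), m["file"]))
    return events
```

With the warning times in hand, you only need to look at the trace for the 15-20 seconds before each one, rather than wading through the whole capture.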

tony roth