Our ASP.NET website uses SQL Server as the session state provider. We currently host the database on SQL Server 2005 because it does not play well on SQL Server 2008 R2. We would like to know why, and how to fix it.

Hardware setup

Our current session state server runs SQL Server 2005 with the files hosted on a single local disk. It is one of our oldest servers; it has served us well, and we never felt the need to upgrade it. The database is about 2 GB, holding 6,000 sessions. (The sessions are a little big, but we need the data.) We have another server with SQL Server 2008 R2, a much faster CPU, and much more RAM. Edit: I lied about the much faster drive. The older server has a single 15K 36GB SAS drive. The new server has a RAID 1 mirror of 15K drives with 136GB of available space.

Situation

One day, we have a huge surge in traffic. Transaction log growth on SQL Server freezes the server for tens of seconds at a time, letting only a few requests through per minute. So we load up the new server with ASPState, with very large data and log files, and point all of our applications to the new server. It chugs along fine for about 5 minutes, and then CPU usage jumps to 50% of the 16 cores that Standard Edition can use and freezes for tens of seconds at a time. The files do not record any autogrowth events. The disk queue is nice and low. RAM usage is low.

CPU usage on our old server has never been higher than 5%. What happened on the new server?

Alternatively, I would like to hear success stories of ASP.NET session state running on SQL Server 2008 R2 with an average write load of 30 MB/s and bursts up to 200 MB/s.

John Tseng
  • You are asking us to spot the difference blindly, without telling us whether you ran diffs on the config files. Please edit your question to show that both servers are identical except for the differences you tell us about. – Deer Hunter Jun 25 '13 at 21:11
  • @DeerHunter "Spot the difference blindly" is a good way to put it. That's pretty much the idea. What could've possibly gone wrong? – John Tseng Jun 25 '13 at 21:16
  • @DeerHunter I agree that the number of possible places for us to screw up is huge, and we have no idea where to start. I guess I would be happy with success stories where we have failed. I've updated my question to reflect that. Thanks. – John Tseng Jun 25 '13 at 21:27

1 Answer

Your first server, with SQL Server 2005, was behaving normally for a massive transaction log increase. There are tricks you can use to speed up file growth for the mdf file, but not for the ldf:

http://www.mssqltips.com/sqlservertip/1572/configuring-windows-instant-file-initialization-for-sql-server-2005/
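
Since instant file initialization only covers data files, the usual mitigation for the ldf is to pre-size it and use a fixed autogrowth increment rather than a percentage, so growth events are rare and bounded. A minimal sketch, assuming the default ASPState logical file name (check sys.master_files for yours); the 8GB/512MB figures are illustrative, not a recommendation:

    -- Pre-grow the session state log and switch to a fixed growth increment.
    -- Logical file name ASPState_log is an assumption.
    ALTER DATABASE ASPState
        MODIFY FILE (NAME = ASPState_log, SIZE = 8GB, FILEGROWTH = 512MB);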

I'm not sure what to tell you about your second server, based on the information above. However, disk queue is probably not the metric you're looking for here. You probably want the Perfmon counter Logical Disk: Avg. Disk sec/Write or its equivalent; optimal average values for ldfs are 5 ms or under.
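
If Perfmon wasn't running during the incident, you can pull roughly the same number from inside SQL Server. A sketch using the virtual file stats DMV (requires VIEW SERVER STATE; figures are cumulative since the instance last started):

    -- Average write stall per I/O, per database file.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_writes,
           vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)
               AS avg_write_stall_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id = vfs.file_id
    ORDER BY avg_write_stall_ms DESC;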

I'm going to guess, based on "a much faster hard disk," that you're not using RAID. It's possible, despite the disk queue number, that you're experiencing disk write pressure. You might also be experiencing latch or lock waits.

Here's a query that will help diagnose waits:

http://www.brentozar.com/responder/triage-wait-stats-in-sql-server/
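
If you just want a quick look before running the full script, a bare-bones version of the same idea reads sys.dm_os_wait_stats directly (the handful of benign waits filtered here is my assumption; the linked script filters far more):

    -- Top waits since the instance last started.
    SELECT TOP (10)
           wait_type,
           wait_time_ms / 1000.0 AS wait_time_s,
           waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                            'SQLTRACE_BUFFER_FLUSH',
                            'REQUEST_FOR_DEADLOCK_SEARCH')
    ORDER BY wait_time_ms DESC;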

Katherine Villyard
  • Thanks for attempting to stab in the dark with me. I've tried the wait stats. The server didn't respond. – John Tseng Jun 25 '13 at 21:30
  • How about Perfmon? http://www.brentozar.com/archive/2006/12/dba-101-using-perfmon-for-sql-performance-tuning/ – Katherine Villyard Jun 25 '13 at 21:32
  • We weren't watching perfmon at the time. I'll see if I can replicate the load after hours and get back to you. Thanks. – John Tseng Jun 25 '13 at 21:36
  • Just set perfmon to run unattended to capture the scenario naturally, and don't try to replicate it yourself. Check disk usage, memory, and network. Chances are it's one of those three. This may be an issue of faulty hardware, or you may have reached maximum capacity for memory, disk, or the network interface. – Techie Joe Jun 26 '13 at 00:29
  • Here are the perfmon stats for a gentle load on our server: http://pastebin.com/PRQXyMPn Here's an overloaded run: http://pastebin.com/QQ33U5nS I see that when overloaded, Avg. Disk sec/Write goes up in spurts. When it does, the CPU usage falls. However, I don't understand why the CPU usage is so high when the disk is keeping up. The overloaded state is trying to push 5 times more traffic through, but I see a lot more than 5x the CPU usage. – John Tseng Jun 26 '13 at 00:30
  • Can you tell us what kind of disks you're using and the layout of the SQL Server files? – Katherine Villyard Jun 26 '13 at 00:35
  • Windows reports a RAID SCSI device. I'll ask the admin for hardware info. There's one data file and one log file, both on the same disk. – John Tseng Jun 26 '13 at 00:38
  • @TechieJoe Good idea for capturing good realistic data. However, my boss is not going to like the idea of putting production load on a server, waiting for it to stumble. =D – John Tseng Jun 26 '13 at 00:39
  • @jtseng Unfortunately, if you have problems with a production server, you need to capture the scenario naturally in real time. You can throw test scenario after test scenario at the server (or a server in a test environment), but chances are strong you may never capture the real scenario. It's better to do this now, while it's not too bad, instead of after a potentially catastrophic failure which will make things infinitely worse. I've had to do similar things working on the front lines of msn.com many, many moons ago. – Techie Joe Jun 26 '13 at 01:12
  • @jtseng Once you capture the scenario you can then take the data gained out of the production environment and engineer a solution. – Techie Joe Jun 26 '13 at 01:15
  • I've updated the disk data in the question. There's a RAID 1 mirror of 15K drives with 136GB of space available. – John Tseng Jun 26 '13 at 01:16
  • @TechieJoe We've sort of taken the data out of production and stuck it in a test environment. This new server is the one we were having problems with, so that's the same. I've copied over the session data from our production sites and stuck it on a test website running on our production machines. We just dropped the real processing that normally happens. I think the only difference now is that I have very good control over the number of requests per second without severely impacting live traffic. What else can I do to make it more realistic? – John Tseng Jun 26 '13 at 01:24
  • @jtseng Have you tried sp_blitz and sp_blitzindex? http://www.brentozar.com/blitz/ http://www.brentozar.com/blitzindex/ We've been looking at hardware, but one common factor we haven't looked at is the scalability of the code. – Katherine Villyard Jun 26 '13 at 11:37