
When doing a TCP analysis of the traffic between my web servers and database servers, I see the network buffers (TCP window) filling up frequently. The web servers then send TCP messages to the database server telling it that their buffers are full and not to send more data until given a window update.

For example, this is the size of the network buffer in bytes for one of the more long lived connections to the database server over time:

Network Buffer Graph

The web servers are running a .NET 4.0 application running in IIS integrated mode on Windows 2008 R2 web servers. The SQL server is a 2008 R2 server.

My interpretation of this is that the SQL server is returning data to the web servers faster than the application on the web server can collect the data from the buffers. I have tried tuning just about everything I can in the network drivers to work around this issue, in particular increasing the RSS queues, disabling interrupt moderation, and setting Windows 2008 R2 to increase the buffer size more aggressively.

So if my interpretation is correct that leaves me wondering about two possibilities:

  1. Is there any way in .NET to tell it to increase the size of the network buffers? The "enhanced 2008 R2 TCP stack" rarely decides to enable window scaling (making the buffer bigger than 64 KB) for this connection, probably due to the low latency. It looks like the ability to manually set this system-wide is gone in Windows Server 2008 R2 (there used to be registry entries that are now ignored). So is there a way I can force this in code?
  2. Is there anything that can be tuned that would speed up the rate at which the application reads information from the network buffers, in particular for the SQL connections?
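For reference, these are the only knobs I'm aware of that .NET exposes in this area: the receive buffer on raw sockets, and the TDS packet size on SqlClient connections (this is a sketch; the server and database names are placeholders):

```csharp
using System.Data.SqlClient;
using System.Net.Sockets;

class BufferSketch
{
    static void Main()
    {
        // For raw sockets, .NET does let you request a larger receive
        // buffer (this maps to the SO_RCVBUF socket option); the OS may
        // round or cap the value.
        var client = new TcpClient();
        client.ReceiveBufferSize = 256 * 1024; // request 256 KB

        // SqlClient does not expose the socket buffer directly, but the
        // Packet Size keyword (default 8000 bytes, max 32767) changes how
        // much data each TDS packet carries.
        var conn = new SqlConnection(
            "Data Source=DBSERVER;Initial Catalog=MyDb;" +
            "Integrated Security=SSPI;Packet Size=32767");
    }
}
```

Neither of these reaches the TCP window scaling decision itself, which is why I'm asking whether it can be forced from code at all.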

Edit:
Requested DMV query, cut off at ASYNC_NETWORK_IO:

SELECT * FROM sys.dm_os_wait_stats ORDER BY waiting_tasks_count desc;
wait_type  waiting_tasks_count  wait_time_ms  max_wait_time_ms  signal_wait_time_ms
CXPACKET            1436226309    2772827343             39259           354295135
SLEEP_TASK           231661274     337253925             10808            71665032
LATCH_EX             214958564     894509148             11855            84816450
SOS_SCHEDULER_YIELD  176997645     227440530              2997           227332659
ASYNC_NETWORK_IO     112914243      84132232             16707            16250951
    This is a problem I would be talking to Microsoft directly on. Open up a support ticket or use a free MSDN call if you have it. – rick schott Mar 03 '11 at 12:54
  • Kyle's follow up: http://blog.serverfault.com/post/views-of-the-same-problem-network-admin-dba-and-developer/ – Remus Rusanu Mar 20 '11 at 19:52

1 Answer


1) What makes you think that this is TCP flow control, as opposed to SQL Server simply not producing data in the intervals where there is no traffic? Check sys.dm_exec_requests and look at wait_type. The wait types are described in Waits and Queues. If it is indeed the client applying TCP flow control, then you'll see the wait type ASYNC_NETWORK_IO.
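Something along these lines will surface the sessions currently in that wait (a sketch; the column list is trimmed to the relevant fields):

```sql
-- Sessions currently waiting on the client to drain the network buffers.
-- ASYNC_NETWORK_IO means SQL Server produced rows faster than the client
-- consumed them, i.e. the client side is applying flow control.
SELECT r.session_id,
       r.wait_type,
       r.wait_time,       -- ms spent in the current wait
       r.last_wait_type,
       t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.wait_type = 'ASYNC_NETWORK_IO';
```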

2) If the issue is indeed a network wait type, then the solution is not to increase the bandwidth, but obviously to reduce the traffic. The client has no business requesting so much data from the server that it causes TCP flow control. That would be caused by doing horribly wrong things on the client, like counting rows or client-side paging. Move the processing to the server, and get back only small result sets with the data you need.
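For example, client-side paging can be moved to the server like this (a sketch using SQL Server 2008 syntax; the table and column names are illustrative):

```sql
-- Server-side paging: only the requested page crosses the wire, instead
-- of the whole result set being fetched and paged on the client.
WITH numbered AS
(
    SELECT Id, Title, CreatedAt,
           ROW_NUMBER() OVER (ORDER BY CreatedAt DESC) AS rn
    FROM dbo.Posts
)
SELECT Id, Title, CreatedAt
FROM numbered
WHERE rn BETWEEN 41 AND 60;   -- page 3 at 20 rows per page
```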

Edit

Consuming the DB call result set ultimately boils down to one form or another of this:

FetchNextRow;
while (not EndOfResults)
{
  ProcessRow;
  FetchNextRow;
}

In real terms this could be a foreach over an IQueryable or a SqlDataReader.Read() loop, but the basic idea is the same: the client fetches rows from the result, processes them, then gets some more rows. If the client code does anything in ProcessRow that blocks, then it will not reach the point where it fetches the next row again, and will eventually trigger TCP flow control, which in turn will cause SQL Server to suspend the query (since it has no place to write the results to). There is nothing you can do in terms of TCP to make this better. Increasing the window size can actually make matters worse: all those results that were previously suppressed at the source (the DB) are now going to be created and have to be stored somewhere, which ultimately means live memory allocated to storage, and that may make things far worse than they are now.
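One way to keep that fetch loop tight is to drain the reader first and process afterwards (a sketch; dbo.Posts and its Title column are illustrative, and the caveat about memory above applies):

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;

class DrainFirstSketch
{
    // Drain the reader into memory first so SQL Server can release the
    // connection's output buffers, then do the slow per-row work
    // (rendering, I/O) afterwards. Only sensible for result sets small
    // enough to hold in memory; otherwise this just trades the DB-side
    // stall for memory pressure on the web server.
    static List<string> ReadTitles(SqlConnection conn)
    {
        var rows = new List<string>();
        using (var cmd = new SqlCommand("SELECT Title FROM dbo.Posts", conn))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())          // FetchNextRow: keep this loop tight
                rows.Add(reader.GetString(0));
        }                                  // reader closed: DB buffers freed
        foreach (var title in rows)
            ProcessRow(title);             // slow work happens after the drain
        return rows;
    }

    static void ProcessRow(string title) { /* e.g. write to the response */ }
}
```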

If I were in your shoes right now I'd focus on identifying where that ProcessRow blocking occurs. A hypothesis I put forward was that the processing is an MVC view writing into the response buffer and being blocked in turn by TCP flow control resulting from the user agent not consuming the HTTP response (e.g. an Ajax call completed, but the browser is not running the completion code to consume the response because the main thread is busy looping on something else). As always, the best approach is to measure methodically. Some possible tools:

  • RE 1) Why TCP flow control? I see lots of zero-window messages going from the web server to the SQL server (about 500 per minute per web server). I also often see the buffer hovering around ~200-300 bytes. So I guess it is a matter of time scale, i.e. how often the window is near zero. However, when it hits zero the window update goes out pretty fast (2-3 ms). I am going to go have a look at the wait DMV now ... – Kyle Brandt Mar 03 '11 at 01:18
  • Updated my question to include the DMV query. ASYNC_NETWORK_IO is showing up, but I am a bit ignorant of whether it is higher than it should be. Going to poke through my new copy of "SQL Server 2008 Internals" to see if I can't learn how to dig down into which queries are causing this. – Kyle Brandt Mar 03 '11 at 01:32
  • A max wait time of 16707 ms means there was at least one task that had to wait 16+ seconds for the network to free up, which would corroborate your initial conclusion. But that also indicates that the application made a DB request and then did not bother to read the results (free the buffers) for as long as 16 seconds. Given that this is ASP, one thing to investigate is whether user agent congestion can block your IIS/ASP buffers, which causes your ASP thread to wait for output buffers and in turn makes it ignore the DB results. – Remus Rusanu Mar 03 '11 at 01:37
  • In other words, a user browses SO, and Internet congestion or a browser stuck in JavaScript causes it to not read the HTTP response, causing congestion (manifested as TCP flow control) all the way down to the database. In this case, looking at the TCP params or the .NET ADO SqlClient is barking up the wrong tree. Even if you manage to pull more data from the DB and free that pipe, if you're not able to push it down the other pipe to the user agent via HTTP, then the data will accumulate *somewhere*, and you'll have problems from exhausting memory. – Remus Rusanu Mar 03 '11 at 01:42
  • That sort of makes sense to me, but I'm not sure I understand it enough to know if it fits. The reason is that the load balancer will be making all the HTTP requests to the web server and reading the responses, I think. (In fact, if I am not mistaken, it maintains persistent connections between itself and clients, but not between itself and the servers.) Even counting both client-side and server-side aborts, I only see about 50 in an hour for a single server (vs the 200 or so zero-window messages I see *per minute* for the server). – Kyle Brandt Mar 03 '11 at 01:59