3

We've got a very strange issue that cropped up a few weeks ago and that we have been unable to resolve.

We are running a couple of web sites in IIS (ports 80 and 443) and in Apache (ports 8080 and 8090), all on the same Windows Server 2003 SP2 machine. We've been running this configuration for a couple of years now.

The web applications running in IIS sometimes connect to the applications running in Apache (on the same server) before responding to the client. Other times the applications will connect to a database server running on another server, and sometimes they will connect to a Windows file share on another server as well.

In all three of the above scenarios we will get the application sporadically reporting one of the following errors:

  • Unable to read data from the transport connection: An established connection was aborted by the software in your host machine.
  • The underlying connection was closed: An unexpected error occurred on a receive.

In addition, we've noticed that, when logged into the server while the problem is occurring, requests to http://localhost/etc, http://127.0.0.1/etc, and http://192.168.xxx.xxx/etc (the local IP) all fail with a "Connection was reset" error message (Firefox). Both IIS and Apache web requests fail. We are able to connect to the server from a different machine (using IP address or hostname), we can connect to external sites from the server, and a ping from the server to itself does not drop out during that time period.

The problem will magically correct itself for a random period of time. Sometimes we can go over 24 hours without a problem, other times just 20-30 minutes. While the problem is occurring it can last from a few seconds to several minutes (usually no more than 10-15).

We've also experienced no problems connecting to the database server or file share server from other servers at times when we experience it from this server.

Any ideas as to where we should be looking?

Update: So we're still getting these errors, but to add some more detail: we get them randomly on connections to multiple servers and on several different types of connections. We get them on CIFS (file sharing), SQL Server, and web connections to multiple servers both on the LAN and WAN, and to the server itself. Most of the time it is the "An established connection was aborted by the software in your host machine" error.

NETSH DUMP

#========================
# Interface configuration
#========================
pushd interface

reset all


popd
# End of interface configuration

#========================
# Interface configuration
#========================
pushd interface ipv6

uninstall


popd
# End of interface configuration



# ----------------------------------
# ISATAP Configuration
# ----------------------------------
pushd interface ipv6 isatap



popd
# End of ISATAP configuration



# ----------------------------------
# 6to4 Configuration
# ----------------------------------
pushd interface ipv6 6to4

reset



popd
# End of 6to4 configuration

#========================
# Port Proxy configuration
#========================
pushd interface portproxy

reset


popd
# End of Port Proxy configuration



# ---------------------------------- 
# Interface IP Configuration         
# ---------------------------------- 
pushd interface ip


# Interface IP Configuration for "SW-1A"

set address name="SW-1A" source=static addr=192.168.xxx.51 mask=255.255.255.0
add address name="SW-1A" addr=192.168.xxx.50 mask=255.255.255.0
set address name="SW-1A" gateway=192.168.xxx.254 gwmetric=0
set dns name="SW-1A" source=static addr=192.168.xxx.2 register=PRIMARY
add dns name="SW-1A" addr=192.168.xxx.3 index=2
set wins name="SW-1A" source=static addr=none


popd
# End of interface IP configuration


# ------------------------------------
# Bridge configuration (not supported)
# ------------------------------------

# ------------------------------------
# End of Bridge configuration
# ------------------------------------


# ----------------------------------------- 
# aaaa Configuration                         
# ----------------------------------------- 
# This script will NOT work across different versions of IAS.
# ----------------------------------------- 

# aaaa configuration script.  
# Known Issues and limitations: 
# Import/Export between different versions is not supported.
# IAS.MDB Version = 7
pushd aaaa
set config  blob=\
blob snippped
\
AA\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*\
A7ACI/wD\
\
*
popd

# End of aaaa show config


# End of aaaa configuration.                  




# ----------------------------------------- 
# Remote Access Configuration               
# ----------------------------------------- 
pushd ras

set authmode mode = standard
delete authtype type = PAP
delete authtype type = SPAP
delete authtype type = MD5CHAP
delete authtype type = MSCHAP
delete authtype type = MSCHAPv2
delete authtype type = EAP
add authtype type = MSCHAP
add authtype type = MSCHAPv2
delete link type = SWC
delete link type = LCP
add link type = SWC
add link type = LCP
delete multilink type = MULTI
delete multilink type = BACP
add multilink type = MULTI
add multilink type = BACP

set user name = ASPNET dialin = policy cbpolicy = none 
set user name = Guest dialin = policy cbpolicy = none 
set user name = IUSR_WX-WWW1 dialin = policy cbpolicy = none 
set user name = IWAM_WX-WWW1 dialin = policy cbpolicy = none 
set user name = customuser1 dialin = policy cbpolicy = none 
set user name = customuser2 dialin = policy cbpolicy = none 
set user name = customuser3 dialin = policy cbpolicy = none 
set user name = SUPPORT_388945a0 dialin = policy cbpolicy = none 
set user name = customuser4 dialin = policy cbpolicy = none 


popd

# End of Remote Access configuration.        




# ----------------------------------------- 
# Remote Access AppleTalk Configuration     
# ----------------------------------------- 
pushd ras appletalk

set negotiation mode = allow

popd

# End of Remote Access AppleTalk Configuration. 



# ----------------------------------------- 
# Remote Access Diagnostics Configuration   
# ----------------------------------------- 
pushd ras diagnostics

set rastracing component = * state = disabled

set modemtracing state = disabled

set cmtracing state = disabled

set securityeventlogs state = disabled


popd

# End of Remote Access Diagnostics Configuration.




# ----------------------------------------- 
# Remote Access IP Configuration            
# ----------------------------------------- 
pushd ras ip

delete pool

set negotiation mode = allow
set access mode = all
set addrreq mode = deny
set broadcastnameresolution mode = disabled
set addrassign method = auto

popd

# End of Remote Access IP configuration.     



# ----------------------------------------- 
# Remote Access IPX Configuration           
# ----------------------------------------- 
pushd ras ipx

set negotiation mode = allow
set access mode = all
set nodereq mode = allow
set netassign method = autosame

popd

# End of Remote Access IPX configuration.    




# ----------------------------------------- 
# Remote Access NBF Configuration           
# ----------------------------------------- 
pushd ras netbeui

set negotiation mode = allow
set access mode = all

popd

# End of Remote Access NBF configuration.   




# ----------------------------------------- 
# Remote Access AAAA Configuration          
# ----------------------------------------- 
pushd ras aaaa

set authentication provider = windows
set accounting provider = windows

delete authserver name = *
delete acctserver name = *



popd

# End of Remote Access AAAA configuration.     


# Routing Configuration
pushd routing
reset
popd
# IP Configuration
pushd routing ip
reset
set loglevel error
add preferenceforprotocol proto=LOCAL preflevel=1
add preferenceforprotocol proto=NetMgmt preflevel=10
add preferenceforprotocol proto=STATIC preflevel=3
add preferenceforprotocol proto=NONDOD preflevel=5
add preferenceforprotocol proto=AUTOSTATIC preflevel=7
add preferenceforprotocol proto=OSPF preflevel=110
add preferenceforprotocol proto=RIP preflevel=120
add interface name="SW-1B" state=enable
set filter name="SW-1B" fragcheck=disable
add interface name="SW-1A" state=enable
set filter name="SW-1A" fragcheck=disable
add interface name="Internal" state=enable
set filter name="Internal" fragcheck=disable
add interface name="Loopback" state=enable
set filter name="Loopback" fragcheck=disable
popd
# End of IP configuration



# ---------------------------------- 
# DNS Proxy configuration            
# ---------------------------------- 
pushd routing ip dnsproxy
uninstall


popd
# End of DNS proxy configuration



# ---------------------------------- 
# IGMP Configuration                 
# ---------------------------------- 
pushd routing ip igmp
uninstall


popd
# End of IGMP configuration



# ---------------------------------- 
# NAT configuration                  
# ---------------------------------- 
pushd routing ip nat
uninstall


popd




# ---------------------------------- 
# OSPF configuration                 
# ---------------------------------- 

pushd routing ip ospf
uninstall

popd
# End of OSPF configuration




# ---------------------------------- 
# DHCP Relay Agent configuration     
# ---------------------------------- 
pushd routing ip relay
uninstall


popd
# End of DHCP Relay configuration



# ---------------------------------- 
# RIP configuration                  
# ---------------------------------- 
pushd routing ip rip
uninstall


popd
# End of RIP configuration



# ---------------------------------- 
# Router Discovery Configuration     
# ---------------------------------- 
pushd routing ip routerdiscovery
uninstall
add interface name="SW-1B" disc=disable minint=7 maxint=10 life=30 level=0
add interface name="SW-1A" disc=disable minint=7 maxint=10 life=30 level=0
add interface name="Internal" disc=disable minint=7 maxint=10 life=30 level=0
add interface name="Loopback" disc=disable minint=7 maxint=10 life=30 level=0


popd


# ---------------------------------- 
# DHCP Allocator Configuration       
# ---------------------------------- 
pushd routing ip autodhcp
uninstall


popd
# End of DHCP Allocator Configuration


Loading of DLL WinsEvnt.dll failed.
Wins Operation failed with Error There are no more endpoints available from the endpoint mapper.

Update: We ended up installing Windows Server 2008 R2 on the same hardware in late July and the problem went away and we've not looked back since. There's a point where you just cut your losses, bite the bullet and run with it.

Lloyd Cotten
  • Is there any kind of software firewall in place on the machine (e.g. Windows Firewall, or that provided with or alongside some anti-virus software)? – BMDan Jul 13 '11 at 02:44
  • There is no software firewall, and we have tried disabling the anti-virus for a period. Same result, unfortunately. – Lloyd Cotten Jul 13 '11 at 11:17
  • Has the box had all applicable Microsoft updates applied to it? – BMDan Jul 13 '11 at 12:28
  • There are a number of Windows Updates that haven't been applied. Is there a specific one (KBxxxxxx) you believe has the potential to address something like this? – Lloyd Cotten Jul 13 '11 at 12:46
  • Cotten: No, just trying to eliminate possible sources. In Linux, I'd know which packages' changelogs to check to see whether this was fixed; in Windows, I tend to go with something more akin to the shotgun approach to patching. – BMDan Jul 13 '11 at 13:55
  • OK... we've applied all of the outstanding Windows Updates. We'll see how that goes over the next couple of days. – Lloyd Cotten Jul 13 '11 at 21:20
  • @BMDan: The problem appeared about 6 hours after applying all the outstanding updates to the server. – Lloyd Cotten Jul 14 '11 at 02:28
  • What's the state of attempted connections to `127.0.0.1:somelisteningport` in `netstat -an` when the server is exhibiting this problem? – BMDan Jul 16 '11 at 00:49
  • @BMDan: Using TCPView, I've managed to watch this. The connection is made, and then disappears almost immediately. Successful connections linger around a little longer. – Lloyd Cotten Jul 19 '11 at 12:35
  • Oh! Now *that* is interesting. That would suggest a firewall or other policy-based networking apparatus deciding that the connection is not acceptable. Can you post the output of `netsh dump`? I suspect you might have something you don't know about in your network stack. – BMDan Jul 19 '11 at 14:47
  • @BMDan: I've added the netsh dump info to the question. Have a look and let me know if anything jumps out at you that's not jumping at me. – Lloyd Cotten Jul 20 '11 at 00:14

4 Answers

5

One possibility: ephemeral port exhaustion. Try something like `netstat -an | find /c ":"` to count how many connections you have in all the various states. If that number is over ten thousand or so, then chances are that this is your issue.
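
To break that count down by state rather than a single total, something along these lines may help. It is only a sketch: it assumes the four-column layout (Proto, Local Address, Foreign Address, State) that Windows `netstat -an` prints for TCP rows, and the sample text is made up for illustration, not taken from the server in the question.

```python
from collections import Counter

# Made-up sample in the Windows `netstat -an` layout (illustration only).
SAMPLE = """
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING
  TCP    127.0.0.1:1045         127.0.0.1:8080         ESTABLISHED
  TCP    127.0.0.1:1046         127.0.0.1:8080         TIME_WAIT
  TCP    127.0.0.1:1047         127.0.0.1:8080         TIME_WAIT
  UDP    0.0.0.0:445            *:*
"""

def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP rows per connection state; UDP rows have no state and are skipped."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # A TCP row has exactly four whitespace-separated columns in this layout.
        if len(fields) == 4 and fields[0].upper() == "TCP":
            counts[fields[3]] += 1
    return counts

if __name__ == "__main__":
    for state, n in count_tcp_states(SAMPLE).most_common():
        print(f"{state:12} {n}")
```

On a live machine the same function could be fed the text returned by `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`. A large `TIME_WAIT` share during an incident would point toward port exhaustion.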

BMDan
  • Hi thanks for the suggestion. We've considered this, however it's pretty steady in the 100-300 range and rarely above 500. – Lloyd Cotten Jul 13 '11 at 11:10
  • It sort of seems like a limit like this, though, as it "seems" to occur after a period of fairly "heavy" connections, i.e. connections that transfer larger numbers of bytes. – Lloyd Cotten Jul 13 '11 at 11:16
  • Okay, and you *are* checking with `-a` on `netstat`, right? In other words, you're getting (and counting) entries in `TIME_WAIT`, for example? – BMDan Jul 13 '11 at 12:09
  • Yes approx half are in the TIME_WAIT state – Lloyd Cotten Jul 13 '11 at 12:23
  • Even if you only have 100-300 (with a rare max of 500) connections showing in netstat (in a listening, time_wait, closing, etc. state) you could still be suffering from ephemeral port exhaustion because by default, Windows 2003 limits ephemeral ports to 1024-5000. These ephemeral ports will normally be recycled at a good rate so you won't hit the limit - but you might be hitting it on bursts. Check the registry for MaxUserPort under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters - jump the number up (30,000 or more) and reboot to test. Hope it helps! – haymaker Jul 19 '11 at 21:02
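
As a rough back-of-the-envelope check of the point about bursts, the arithmetic can be sketched as below. The 1024-5000 range comes from the comment above. The 240-second TIME_WAIT hold is an assumption (believed to be the default `TcpTimedWaitDelay` on this Windows version), and the function name is hypothetical. The result is a crude upper bound, not a measurement.

```python
def ephemeral_port_budget(max_user_port=5000, first_port=1024, time_wait_seconds=240):
    """Return (usable ports, rough sustainable new outbound connections per second).

    Assumes every closed connection holds its local port for the full
    TIME_WAIT period (assumed 240 s here) before the port can be reused.
    """
    ports = max_user_port - first_port + 1
    return ports, ports / time_wait_seconds

if __name__ == "__main__":
    for limit in (5000, 30000):
        ports, rate = ephemeral_port_budget(max_user_port=limit)
        print(f"MaxUserPort={limit}: {ports} ports, about {rate:.1f} new connections/s")
```

Under these assumptions the default range sustains only a modest steady rate of new outbound connections, so a short burst could exhaust it even when a snapshot of `netstat` looks unremarkable.
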
1

Are you seeing anything helpful in the Event Logs? Do drivers have trouble loading at startup? Can you check the switch and see if there are transmit or checksum errors on the system's uplink interface?

If you've got errors at reboot that relate to the network in any way, I'd pursue those first. If you have switch errors, I'd replace cables and move to a different port on the switch.

If you don't have errors, consider installing a new network card. At best it fixes the issue; at worst, along with new cables and a different switch port, you can cross the physical layer entirely off the list.

UPDATED

Given that this is happening on multiple machines, I think we can rule out physical layer issues.

My thought then, given that it is intermittent and affects multiple services, would be TCP Chimney. See this KB article: http://support.microsoft.com/kb/945977 and see if it helps.

After that, turn off everything on the network connection panel other than Client, Sharing, and IP Protocol. No QOS, no firewall, no NLB, no vendor-level port combining drivers, etc.

UPDATED again

Well, with all that off, I'd say you should go through the driver level Advanced settings next. If you can, post them here. If not, write them down and then try this: turn on flow control if it is off -- RX and TX on/Respond & Generate depending on the driver. Then find anything marked offload and turn it off. Turn off jumbo frames and vlan support. Turn off anything marked QOS. Basically make the hardware do all the work and take any OS/Driver/CPU tasks away from the datapath.

And finally, if you can catch the server during an "event" follow the steps in this article to check kernel page usage and see if that helps diagnose the issue: http://blogs.msdn.com/b/david.wang/archive/2005/09/21/howto-diagnose-iis6-failing-to-accept-connections-due-to-connections-refused.aspx

My final suggestion

Consider turning off SynAttackProtect and other Kernel-level TCP protections: http://technet.microsoft.com/en-us/library/cc781167(WS.10).aspx or at least bumping up the TcpMax* settings that might cause it to kick in.

Mark
  • Event log: nothing other than application specific errors that I detailed in my question. They are all symptoms and not the cause. – Lloyd Cotten Jul 13 '11 at 17:52
  • Switch: no collisions, transmit or any type of errors at all (there are a couple of collisions reported on the link to one of the networked PDUs, but nothing else). – Lloyd Cotten Jul 13 '11 at 17:54
  • Drivers: no problems reported. – Lloyd Cotten Jul 13 '11 at 17:54
  • Also, this problem occurs on two servers that are identical. Same hardware, same os/software versions and configuration applied to both. – Lloyd Cotten Jul 13 '11 at 17:56
  • Also, we've used wireshark and when the problem is present and we cannot hit localhost, this connection does not even touch the NIC (as one would expect for the loopback). – Lloyd Cotten Jul 13 '11 at 17:59
  • You should get some of these facts up into the main body. – Mark Jul 13 '11 at 19:32
  • Hmm... very interesting. The TCP Chimney sounds plausible. I'll let you know how it goes. – Lloyd Cotten Jul 13 '11 at 21:19
  • Unfortunately, no go on the `netsh int ip set chimney DISABLED` command. I ran that command but the problem is still present. – Lloyd Cotten Jul 14 '11 at 02:27
  • I am unable to disable the Broadcom Advanced Server Program Driver. Everything else is disabled (NLB used to be enabled, but disabled that months ago). – Lloyd Cotten Jul 14 '11 at 03:13
  • I *believe* (don't quote me) that you cannot capture on the loopback adapter on Windows, or at least that that functionality has differed between versions of Windows/Winpcap. Easy enough to test on your specific platform, of course! – BMDan Jul 16 '11 at 00:48
  • @BMDan: You're correct, I've never been able to see connections on the loopback with wireshark, etc. We have been able to see some info on the connections with TCPView. – Lloyd Cotten Jul 17 '11 at 01:57
  • @Mark: After turning off some things as you suggested, here's what we have: QOS: Disabled, Ethernet@WireSpeed: Enabled, Flow Control: Rx & tx Enabled, Interrupt Moderation: Enabled, IPv4 Checksum Offload: None, IPv4 Large Send Offload: Disable, Jumbo Packet: 1500, Locally Admin Address: Not Present, number of RSS Queues: Auto, Pause on Exhausted Host Ring: Disabled, Receive Buffers: 0-Auto, Receive Side Scaling: Disable, Speed & Duplex: 1 Gb Full Auto, Transmit Buffers: 0-Auto, Wake up Capabilities: Both – Lloyd Cotten Jul 17 '11 at 20:47
  • It is a Broadcom BCM5708C NetXtreme II adapter, 6.2.8.0 driver version – Lloyd Cotten Jul 17 '11 at 20:48
  • We'll see how this goes – Lloyd Cotten Jul 17 '11 at 20:48
  • Well, we went almost 24 hours with no errors in that state, but now it's acting up again. I took a look at the non-paged memory stats with poolmon. It's only showing between 47 and 62 MB used NPP. Surely that isn't a problem? Also, incoming connections still work. It's only outgoing that get disconnected. – Lloyd Cotten Jul 18 '11 at 18:51
  • @Lloyd Do you even get "Connection refused" entries in the HTTPERR log files? – oleschri Jul 18 '11 at 19:53
  • @oleschri: Based on his most recent reply to my answer, above, the connection forms (SYN+SYN/ACK+ACK), but is then RST cleanly (had it been FIN, you'd see the socket in TIME_WAIT). – BMDan Jul 19 '11 at 14:50
  • @BMDan: I see, so it never reaches IIS – oleschri Jul 19 '11 at 15:02
  • @oleschri: On the contrary: it might very well reach IIS (or at least *something*), but it's then immediately broken down (and not by IIS, which would send a FIN, not a RST). Also, note the issue affects not only HTTP/IIS, but also CIFS, amongst other things. – BMDan Jul 19 '11 at 19:26
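
The FIN-versus-RST distinction in the comments above can be reproduced on a loopback socket. The sketch below is an illustration only, not a diagnosis of this server: it assumes a POSIX-style `SO_LINGER` structure of two C ints (Windows lays the structure out differently, so the pack format would need adjusting there), and the function name is hypothetical. A server that closes with linger enabled and a zero timeout aborts the connection with a RST instead of a FIN.

```python
import socket
import struct
import threading

def abortive_close_demo() -> str:
    """Connect over loopback, have the server abort, and report what the client sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve():
        conn, _ = listener.accept()
        # Linger on with a zero timeout: close() discards the connection
        # and sends RST rather than going through the normal FIN handshake.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
        conn.close()

    worker = threading.Thread(target=serve)
    worker.start()
    client = socket.create_connection(("127.0.0.1", port))
    worker.join()  # the server has aborted the connection by now
    try:
        data = client.recv(1024)
        result = "clean close (FIN)" if data == b"" else "data received"
    except ConnectionResetError:
        result = "reset (RST)"
    finally:
        client.close()
        listener.close()
    return result

if __name__ == "__main__":
    print(abortive_close_demo())
```

An orderly close would instead make `recv` return an empty byte string, which is the case the comments describe as leaving a socket in `TIME_WAIT`.
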
0

I can think of two things:

  • You are running out of ephemeral ports. If you are on Linux, the default settings are usually very conservative, so you should always adjust them for production use. You can check it with `cat /proc/sys/net/ipv4/ip_local_port_range`. Note that this problem may not be on the server itself but on your firewall, especially if you are using NAT.

  • You are running out of file descriptors. Each TCP connection takes 2 file descriptors, so by counting the number of open connections you can estimate the number of file descriptors you need and compare it with your system limits. `ulimit -a` will give you your current limits. Again, default Linux settings are conservative (on CentOS 5.x the default limit is 1024), so you may need to make some adjustments.
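
Both checks can also be done from a script. The following is a minimal sketch for Linux only (the `resource` module is not available on Windows); the function names are hypothetical, and the values used in the test of the parser are example values rather than anything measured.

```python
import resource
from pathlib import Path

PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")

def parse_port_range(text: str):
    """Parse 'low high' as found in ip_local_port_range; return (low, high, count)."""
    low, high = (int(value) for value in text.split())
    return low, high, high - low + 1

def current_limits() -> dict:
    """Collect the open-file limits and, when the file exists, the ephemeral port range."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    info = {"nofile_soft": soft, "nofile_hard": hard}
    if PORT_RANGE_FILE.exists():
        info["port_range"] = parse_port_range(PORT_RANGE_FILE.read_text())
    return info

if __name__ == "__main__":
    for key, value in current_limits().items():
        print(f"{key}: {value}")
```

Comparing these numbers with the connection count seen during an incident gives a quick indication of whether either limit is close to being hit.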

dtoubelis
0

Could it be that your services authenticate to each other via Kerberos (via AD) and the called services stop replying due to authentication issues? This should be detectable with NetMon or Wireshark.

oleschri