We are using a Fortibalancer in front of our web servers (Win2012 with IIS) and have run into a strange issue. IE users experience timeouts (~77s) while waiting for a response from our servers. Packet traces show ZeroWindow probes and ACKs at the time of the timeouts.

These are the facts:

  1. When we bypass the load balancers, there is no issue and no Zero Window packets (let alone probes)

  2. Packet traces on the servers show Zero Window packets to the load balancers, but not to the servers

  3. Wireshark shows the highest packet 'size' as 16KB when using the load balancers, but shows 64KB when clients connect directly to the servers.

  4. The issue is not related to load: It can happen with almost no traffic or during periods of high traffic.

  5. We cannot reproduce the problem on demand, but it tends to happen around predictable times (~9:30am or ~3:30am), though not every day. (Nothing special happens in our environment during those times).

  6. Firefox users NEVER experience the problem.

  7. IE version does not seem to matter: IE 8-11 users have the same problem.

  8. The LBs are up-to-date. They perform SSL offloading, and link and load balancing. CPU usage on the LBs has never exceeded 10%.

Because of #1, we know the servers themselves are not the issue.

Because of #2, it seems that the LB's are the bottleneck.

Number 3 gives me pause and there seems to be no way to increase the window size (we've tried and we can't increase from 16KB).

Number 6 is the real killer. Our application does not function well enough on other browsers to test, but FF is the one non-IE browser where it does, and FF users have never, ever experienced a delay. FF is so reliable that we are starting to transition clients over to it, and we still have not seen ZeroWindows with FF while IE users continue to experience them. In the FF packet traces, I can see that the packet 'size' to the LBs is 100-200 bytes larger than in the IE packet streams.

Question:

What can I test next in order to find a direction on remediating the problem? Any ideas on what the problem could be?
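One possible next step is to tabulate which host advertises a zero window to which peer on each leg (client to LB, LB to server). The sketch below assumes the captures have been exported to CSV with something like `tshark -r capture.pcap -T fields -E separator=, -e frame.time_epoch -e ip.src -e ip.dst -e tcp.window_size`; the function name, the file name, and the sample rows (including the addresses) are made up for illustration and are not taken from the real traces.

```python
import csv
import io
from collections import Counter


def zero_window_senders(csv_text):
    """Count zero-window advertisements per (sender, receiver) pair.

    Expects rows of: epoch_time,src_ip,dst_ip,window_size
    """
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        _time, src, dst, window = row
        if window.strip() == "0":
            counts[(src, dst)] += 1
    return counts


# Hypothetical sample data, NOT from the capture described above.
SAMPLE = """\
1406211720.10,10.0.0.5,10.0.0.20,16384
1406211720.35,10.0.0.5,10.0.0.20,0
1406211725.40,10.0.0.5,10.0.0.20,0
1406211725.41,10.0.0.20,10.0.0.5,65535
"""

if __name__ == "__main__":
    for (src, dst), n in zero_window_senders(SAMPLE).items():
        print(f"{src} advertised a zero window to {dst} {n} time(s)")
```

Running the same tally on an IE session and an FF session, for both the front-end and back-end captures, would show whether the zero windows originate at the client, the LB, or the server.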

schroeder
  • You say the servers are sending a zero window to the LB, right? That means the server can't keep up with the incoming rate of data. The LBs are local to the servers, so throughput is much higher than across a WAN when a client connects directly. That is one possible reason. But if the clients are also local, then that wouldn't apply. In terms of why IE and not FF, you'd want to capture and compare the two, as IE is likely behaving differently and captures would allow you to analyze what that difference is. – karyhead Jul 24 '14 at 22:22
  • @karyhead No - the servers are sending packets that fill the window, not that they are sending notice of a full window. Clients are both local and remote. – schroeder Jul 28 '14 at 14:42
  • I assume the LB acts as a proxy so it's receiving data from the server and then must proxy it to the client. If the LB is experiencing the zero window condition (sending ACKs to the server with window=0) then likely the LB can't get the data to the client fast enough on the front-end and has to tell the server to stop sending data on the back-end. If this only happens with IE and not FF, then IE is probably being slow and causing back pressure. This is all theory without seeing the data. If you can share both front-end (client side) and back-end (server side) traffic captures, I can confirm. – karyhead Jul 28 '14 at 19:31
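The back-pressure theory in the last comment can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not real TCP and says nothing about how the Fortibalancer works internally. The 16 KB buffer echoes the 16KB figure from the question; the per-tick rates and the function name are hypothetical.

```python
def simulate(server_rate, client_rate, buffer_size, ticks):
    """Return the window the proxy would advertise to the server each tick.

    Toy model: each tick the server sends as much as the free buffer
    allows, the proxy acknowledges with its remaining free space, and
    then forwards up to client_rate bytes to the client.
    """
    buffered = 0
    windows = []
    for _ in range(ticks):
        free = buffer_size - buffered
        buffered += min(server_rate, free)      # server fills the buffer
        windows.append(buffer_size - buffered)  # window advertised in the ACK
        buffered -= min(buffered, client_rate)  # proxy drains to the client
    return windows


if __name__ == "__main__":
    # Hypothetical rates in bytes per tick.
    print("slow client:", simulate(8192, 1024, 16384, 5))
    print("fast client:", simulate(8192, 8192, 16384, 5))
```

In this model a client that drains slowly drives the advertised window to zero within a few ticks, while a client that keeps up never does, which is the pattern the comment suggests looking for when comparing IE and FF captures.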

0 Answers