6

We have an external website which opens fine on some PCs, yet appears to hang on others: it shows the symptoms of timing out but never actually does.

It seems to affect only some of our newer HP Pro 3305 MT workstations, all of which are running Win7 32bit SP1 with all updates. Older PCs (Win7 32bit SP1 & WinXP) are unaffected.

Using Google Chrome & Firefox makes no difference. Opening the website in IE9 Compatibility Mode has exactly the same symptoms.

All PCs are on the same local network (Workgroup), using the same in-house DNS server & gateway, on the same internet connection and the same subnet. There is no proxy server, no content filtering, no load balancing etc. The only group policy in effect (locally) is for Update scheduling. Local firewalls are all the same (Kaspersky WP4) and our external-facing firewall has no IP-specific settings.

I have no control over the external website. Traceroute shows the same destination on all PCs. It is a fairly popular website in our industry (Horticulture) and I'm not aware of anyone else (even other sites within our sister companies) with the same problem.

Update: I used Fiddler2 to monitor the HTTP request, and it seems the request is not being fulfilled for some reason?!

Request sent:

GET http://www.rhs.org.uk/ HTTP/1.1
Host: www.rhs.org.uk
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.47 Safari/536.11
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3

Log from Fiddler 2 of the request:

This session is not yet complete. Press F5 to refresh when session is complete for updated statistics.

Request Count:   1
Bytes Sent:      567        (headers:567; body:0)
Bytes Received:  0      (headers:0; body:0)

ACTUAL PERFORMANCE
--------------
ClientConnected:    17:02:33.720
ClientBeginRequest: 17:02:39.118
GotRequestHeaders:  17:02:39.118
ClientDoneRequest:  17:02:39.118
Determine Gateway:  0ms
DNS Lookup:         0ms
TCP/IP Connect: 46ms
HTTPS Handshake:    0ms
ServerConnected:    17:02:39.165
FiddlerBeginRequest:    17:02:39.165
ServerGotRequest:   17:02:39.165
ServerBeginResponse:    00:00:00.000
GotResponseHeaders: 00:00:00.000
ServerDoneResponse: 00:00:00.000
ClientBeginResponse:    00:00:00.000
ClientDoneResponse: 00:00:00.000


RESPONSE BYTES (by Content-Type)
--------------
~headers~:  0

Log of a successful request from a working PC (done this morning, excuse the timestamps being different from above):

Request Count:   1
Bytes Sent:      493        (headers:493; body:0)
Bytes Received:  20,413     (headers:525; body:19,888)

ACTUAL PERFORMANCE
--------------
ClientConnected:    08:22:47.766
ClientBeginRequest: 08:22:47.766
GotRequestHeaders:  08:22:47.766
ClientDoneRequest:  08:22:47.766
Determine Gateway:  0ms
DNS Lookup:         26ms
TCP/IP Connect: 30ms
HTTPS Handshake:    0ms
ServerConnected:    08:22:47.828
FiddlerBeginRequest:    08:22:47.828
ServerGotRequest:   08:22:47.828
ServerBeginResponse:    08:22:48.905
GotResponseHeaders: 08:22:48.905
ServerDoneResponse: 08:22:48.905
ClientBeginResponse:    08:22:48.905
ClientDoneResponse: 08:22:48.905

    Overall Elapsed:    00:00:01.1388020

RESPONSE BYTES (by Content-Type)
--------------
text/html:  19,888
~headers~:  525

So my question has evolved into:

What is the difference between the two requests, and how do I determine why one PC is not getting a reply to its GET request?

Update 2:

See my answer below. I may well accept it in the future, but without being able to reproduce the problem (or the fix) I'd like to leave this question open.

HaydnWVN
  • Do you have some HP 3305s that ARE working normally, or are they all experiencing the problem? – Paul Ackerman Jul 06 '12 at 11:37
  • I've seen MTU problems cause errant surfing behavior but this would likely only occur over a WAN link - not on the same subnet. Still it is the only time I have seen such behavior so I thought I'd mention it in case you're using a smaller MTU and the new boxes didn't get the memo. – Paul Ackerman Jul 06 '12 at 11:52
  • I have 2 that are working, 4 that are not... The 2 that are working were purchased a month prior to the others which were all from the same order/date... All of them I setup the exact same way?! – HaydnWVN Jul 06 '12 at 11:53
  • I'd see ping/DNS problems with anything MTU related wouldn't I? – HaydnWVN Jul 06 '12 at 11:54
  • interesting. Did you clone the machines in some way (disk dup, ghost, etc). If so, you might need a different driver on the NIC. In any case, I would probably start by updating the driver and see if it remains. – Paul Ackerman Jul 06 '12 at 11:54
  • RE: MTU. You can test for MTU issues with ping -f, which sets the Don't Fragment flag, and -l size to find the largest payload you can send through. With the standard 1500-byte MTU, the IP header takes 20 bytes and ICMP uses 8, so you should be able to send 1472 bytes. If you get a response that says "Packet needs to be fragmented but DF set", you have an MTU issue. – Paul Ackerman Jul 06 '12 at 11:56
  • you wouldn't necessarily see DNS problems unless you were doing zone transfers or something large. – Paul Ackerman Jul 06 '12 at 11:59
  • Thanks Paul, I'll try it! I [found this](http://www.elifulkerson.com/projects/mturoute.php) which I'll probably run... More information = more power! Machines were not cloned (by me), but were supplied by the same OEM who probably did. But they have the same NICs anyway, with updated/latest drivers - both working and 'broken' PCs. – HaydnWVN Jul 06 '12 at 12:01
  • what do you mean by "seems to time out (or symptoms of timing out, but never actually does)"? What is the error you end up getting? – Paul Ackerman Jul 06 '12 at 13:54
  • Opening the google cached copy of the webpage works fine, copying and opening the direct address of an image from the webpage works, but not an actual page. Using a web proxy opens the site fine. Opening the webpage in safe mode/administrator/IE safe mode & trusted zones makes no difference... – HaydnWVN Jul 06 '12 at 14:58
  • Nothing shows up regarding MTU, no difference between a working PC and a non-working. – HaydnWVN Jul 06 '12 at 15:35
  • Pages are not loading at all - see the info added to the question. – HaydnWVN Jul 06 '12 at 16:06
  • If you have 2 HP 3305's that are working and 4 that are not, it might be interesting to see what happens if you switch the harddrive of a working PC with the one of a not working PC. – ZEDA-NL Jul 10 '12 at 08:32
  • I've used live CDs for both Windows XP and Ubuntu on a 'broken' PC and the website works... Makes me think that it's not hardware! Just something within the Win7 environment that's causing it. – HaydnWVN Jul 10 '12 at 09:33
  • Does the 6 second gap between ClientConnected and ClientBeginRequest happen on all failed requests? – stark Jul 10 '12 at 21:47
  • I'll check again tomorrow, does the 6 seconds signify anything? – HaydnWVN Jul 11 '12 at 15:59
  • Different timeouts, not always 6 seconds. – HaydnWVN Jul 17 '12 at 08:06
  • Have you added the URL to trusted sites to see if it will load? – Paul Ackerman Jul 17 '12 at 11:18
  • Yup already tried that, also lowered security level for the zone etc etc – HaydnWVN Jul 18 '12 at 10:42
  • I'm going to throw a bit of insider information into the mix here. I'm one of the devs in the online team at the RHS (the team that looks after the main website). About once every couple of months or so we get a request like this and try to fix it, but the user generally stops responding before we find out what's going on. I'm 99% sure it's a Windows 7 issue, but other than that we're stumped. – Piers Karsenbarg Aug 06 '12 at 13:58
  • how about a safe boot with networking then surf? – tony roth Aug 06 '12 at 22:46
  • Same symptoms. LiveCD is ok though so it can't be the hardware? @PiersKarsenbarg what information would you like me to provide or what to try? – HaydnWVN Aug 07 '12 at 09:01
  • so you safe booted win7 and it still failed, which browser did you use when safe booted? – tony roth Aug 07 '12 at 13:08
  • IE9, tried Chrome and Firefox but not in safe mode. – HaydnWVN Aug 07 '12 at 13:10
  • are these workstations part of a domain? – tony roth Aug 07 '12 at 17:00
  • Are you certain there is no IP based rate limiting rules/filtering on the web server side of things? For example, I've seen this be an issue with email services on Linux hosts. By default, there is a Max connection per IP setting in various IMAP servers that can cause issues for larger offices. – jeffatrackaid Aug 07 '12 at 19:16
  • There might be, but explain to me - how it is always these few PC's that do not display while every other can without a problem? – HaydnWVN Aug 08 '12 at 08:49
  • OK, even on a safe-booted Win7 machine the OEM can install filter drivers/AV etc. that survive a safe boot. Go into Add/Remove Programs while booted normally: is there any AV software loaded? – tony roth Aug 08 '12 at 13:32
  • There isn't, I installed and configured the AV, setup GP's and installed everything but the OS and Office on these machines. – HaydnWVN Aug 08 '12 at 13:39
  • When you say GPs, are you referring to group policies? – tony roth Aug 08 '12 at 15:09
  • Yes, the only ones I have setup are locally though (this is a workgroup) are to do with Windows Update Scheduling – HaydnWVN Aug 08 '12 at 15:35
  • do a telnet www.rhs.org.uk 80 what happens? – tony roth Aug 08 '12 at 22:31
  • [After installing telnet](http://technet.microsoft.com/en-us/library/cc771275(v=ws.10).aspx) on both machines (working & non-working) the connection opens and displays "`Press any key to continue...`", if you do so it closes the connection (ie both exactly the same). – HaydnWVN Aug 10 '12 at 09:40
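
The MTU check described in the comments above comes down to simple arithmetic. As a rough sketch (the function names are made up for illustration, and it only builds the command rather than running it), the payload size for the Windows `ping -f -l` test could be worked out like this:

```python
from typing import List

IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu: int = 1500) -> int:
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def windows_ping_command(host: str, mtu: int = 1500) -> List[str]:
    """Build the Windows ping call: -f sets Don't Fragment, -l sets the payload size."""
    return ["ping", "-f", "-l", str(max_ping_payload(mtu)), host]
```

With the default 1500-byte MTU this gives a 1472-byte payload, matching the figure quoted in the comment. Running the resulting command on a working and a non-working PC and comparing the output is the actual test.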

7 Answers

1

If you want to know the difference in the HTTP GET request, download the ZAP (Zed Attack Proxy) from OWASP or some other proxy that will allow you to inspect each packet before it is sent to the server. This will answer the question of "what is the difference between the 2 requests".
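
Fiddler reports 567 bytes sent from the failing PC against 493 from the working one, so the two requests are evidently not byte-identical. If you can copy the raw request from each machine, a small script can list the headers that differ. This is only a sketch: `parse_headers` and `diff_headers` are names I have made up, and it assumes plain-text requests like the one shown in the question.

```python
from typing import Dict, Tuple


def parse_headers(raw_request: str) -> Dict[str, str]:
    """Map lower-cased header names to values, skipping the request line."""
    headers: Dict[str, str] = {}
    for line in raw_request.strip().splitlines()[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def diff_headers(a: str, b: str) -> Dict[str, Tuple[str, str]]:
    """Return headers whose values differ; a missing header is shown as ''."""
    ha, hb = parse_headers(a), parse_headers(b)
    return {
        name: (ha.get(name, ""), hb.get(name, ""))
        for name in sorted(set(ha) | set(hb))
        if ha.get(name) != hb.get(name)
    }
```

Calling `diff_headers(working_request, failing_request)` returns an empty dict when the headers match. Anything it does return (an extra cookie, a different User-Agent, and so on) is a candidate for the 74-byte difference.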

If the requests are the same try another NIC.

Most likely your NIC is on-board. Try installing a PCI NIC with appropriate drivers and see if you can get there. Sounds like hardware/driver issue at this point.

Paul Ackerman
  • I can't figure out how to use ZAP to monitor HTTP GET requests. I don't have a lot of time to 'fix' this as my work-around is quite a popular 'fix' at the moment! – HaydnWVN Jul 30 '12 at 10:36
1

I've never used Fiddler before, but the fact that "ServerBeginResponse" is unset in the failure scenario implies one of three things:

  1. The server hasn't received the full request from the workstation (i.e. the HTTP GET hasn't completed).
  2. The server received the request but didn't reply due to an error or other problem on the server.
  3. The server replied, but the reply packet didn't make it back.
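
A browser-independent check on an affected workstation can at least confirm whether any reply bytes arrive at all. It cannot tell the three cases apart, but a non-zero result would rule all of them out. The following is a minimal sketch using a plain socket; the function names and the 10 second timeout are my own assumptions, not anything taken from Fiddler.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET; 'Connection: close' asks the server to close after replying."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(host: str, port: int = 80, timeout: float = 10.0) -> int:
    """Send one GET and count the reply bytes received before the server closes or the timeout hits."""
    received = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host))
        try:
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break  # server closed the connection
                received += len(chunk)
        except socket.timeout:
            pass  # nothing (more) arrived within the timeout window
    return received
```

Running `probe("www.rhs.org.uk")` on a working and a failing PC and comparing the counts goes one step further than the telnet test, which only proves that the TCP connection opens.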

I know this is a hosted server; do you have access to the server logs, or the ability to run a sniffer (e.g. Wireshark) on it to capture data while you're testing? If so, watch the server log files for any errors, and run the sniffer until you get a failure at the workstation, then check whether the server received the full request and tried to respond.

After that, check the Kaspersky firewall logs to see if it dropped any packets. Is it possible to set up a sniffer in front of the firewall and see if the response from the server is making it back that far? If it makes it to the firewall and Kaspersky doesn't report dropping anything, it's probably safe to assume it made it through.

During these tests, I'd suggest running WireShark on one of the machines that fails. It will show the out-bound connections, plus it should also show any responses the NIC receives. If it is a NIC issue, the sniffer trace should show the packet being received and from there you can determine if that warrants a NIC and/or driver update.

Since you are unable to attach a sniffer to the outside of your firewall, you'll need to work with your ISP to have them set up monitoring for packets that leave your router but never receive a response.

Once the ISP has confirmed or refuted your hypothesis about where the packets are going, there are two options: Option 1: The packet makes it to the firewall but does NOT go out to the ISP during a failed web connect attempt. Option 2: The packet makes it through the firewall onto the ISP network, but the response never comes.

For option 1, the easiest fix might be to replace and/or re-install the firewall, if possible. If it is an ISP-provided device, you'll want to have them save the current config but apply a very basic configuration on the new system to ensure it's not a configuration-related problem.

Option 2 would be nice because it puts the problem on them to fix, but if they don't have the time to look into it then you're stuck with their answer. In this case, it could be that it leaves their network and goes out to their Internet provider - that gets into a whole other can of worms trying to track down where a packet died.

dan_linder
  • Wireshark shows the GET request leaving the machine, but the machine never gets anything in reply. Odd thing is that it doesn't time out either?! I have no access to the webserver, it's not mine! It's just a website that some users occasionally need access to! – HaydnWVN Aug 06 '12 at 16:16
  • So, if you put a WireShark sniffer outside your firewall do you see the GET request going out? If not, then your problem is internal and just needs to be tracked down. If it goes out, then you'll need to work with your ISP and hope they will work with you. – dan_linder Aug 07 '12 at 17:01
  • I can't monitor the outgoing data outside the firewall as it's built into our router. We only have the 1 connection. Why would I need to work with my ISP? All GET requests (successful and unsuccessful) are over the same connection, on the same external facing IP to the same destination IP. This isn't a routing issue. – HaydnWVN Aug 08 '12 at 13:49
  • I agree that it doesn't look like a routing issue, but it's quite possible that it's a packet dropping issue. By using a packet sniffer on the outside of the firewall you can show that the GET requests that ultimately fail did leave the network. You would then have some data points to prove that it's not a problem on your network. – dan_linder Aug 09 '12 at 20:37
  • If this were the case wouldn't we see odd behaviour with other sites/e-mails going missing and corrupt data? This website is the only one with an issue. – HaydnWVN Aug 10 '12 at 09:07
  • I would expect odd problems to appear elsewhere too - that's what has me at a loss for a specific "check X" sort of answer. Have you tried any of the data collection steps? Even sniffing the inside port on the firewall will confirm that it's not completely due to an internal networking issue. – dan_linder Aug 14 '12 at 03:36
  • If you see the GET requests reaching your firewall, but you can't monitor the outside connection on the firewall yourself, you'll need to work with them - they should be able to setup some monitoring on their end to capture the traffic that's making it to their side. – dan_linder Aug 14 '12 at 03:37
  • Ok thanks dan, care to post your additional steps into your answer and I'll upvote it! I'll let you know how I get on when I have time to try it! – HaydnWVN Aug 14 '12 at 10:05
  • HaydnWVN - just updated the post with the additional troubleshooting notes. – dan_linder Aug 14 '12 at 23:28
  • So, any update on this? – dan_linder Aug 24 '12 at 23:50
  • Have worked with [Piers](http://serverfault.com/users/49576/piers-karsenbarg) and we are no closer to finding the issue... BUT! The website has begun working literally over the last few days. No changes to the website. No changes to the machines. Have some more info I'll add into the question, but I'll keep it open as we haven't found the actual *cause* yet. – HaydnWVN Aug 29 '12 at 15:29
0

Can you confirm whether the NICs in the working and non-working machines are the same make/model? Could you also confirm that the IPv6 configuration is the same on all machines (on an internal LAN I would disable IPv6 altogether)? As a last check, make sure there is nothing in the hosts file (C:\Windows\System32\drivers\etc\hosts) that might stop network access.
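
On Windows 7 the hosts file normally lives at `C:\Windows\System32\drivers\etc\hosts`. Checking it by eye is easy enough on one PC, but across several machines a small helper may be quicker. This is a sketch only; `hosts_entries_for` is a made-up name, and it works on the file's text so you can feed it content copied from each machine.

```python
from typing import List, Tuple

HOSTS_PATH = r"C:\Windows\System32\drivers\etc\hosts"  # usual Windows 7 location


def hosts_entries_for(hosts_text: str, hostname: str) -> List[Tuple[str, str]]:
    """Return (ip, name) pairs in the hosts-file text that match hostname, ignoring comments."""
    matches: List[Tuple[str, str]] = []
    wanted = hostname.lower()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            if name.lower() == wanted:
                matches.append((ip, name))
    return matches
```

For example, `hosts_entries_for(open(HOSTS_PATH).read(), "www.rhs.org.uk")` should return an empty list on a machine whose hosts file does not override the site.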

The fact that you have ruled out the browser and the hardware (using a live CD) leads me to think it must be network adapter related.

If all this fails - definitely swap hard disks and see if the problem follows the hard disk or the nic.

PJ42
  • Machines are exactly the same. IP6 has no settings/is disabled (not sure which). Hosts file is empty. – HaydnWVN Jul 10 '12 at 16:05
  • Did you try swapping the disks? – PJ42 Jul 11 '12 at 08:56
  • Not yet, these are live machines and using an online (free) web proxy (which i've done as a 'fix') may end up being my solution. It's only 1 website, albeit a regularly visited one. – HaydnWVN Jul 11 '12 at 16:00
  • Should have asked this before - do you use an internet proxy? If so can you confirm that the same user accounts and being used? – PJ42 Jul 12 '12 at 13:59
  • As mentioned in my original post - no proxy. – HaydnWVN Jul 13 '12 at 09:03
0

I would compare the netmasks and gateway addresses on the problematic systems with those on the working systems.

I have seen the problem before and this was the cause -- a wrong (but still somewhat working) gateway address.

jftuga
  • All are the same (ie correct). We only have 1 gateway here and use 255.255.0.0 as subnet mask. Your answer doesn't explain why it should only happen on this 1 website (all others are ok) without any errors. – HaydnWVN Jul 10 '12 at 16:06
  • Did you check the IP settings of the web server? – jftuga Jul 10 '12 at 17:04
  • The web server/host site is totally out of my control. Both machines' DNS results for the address resolve to the same destination IP. – HaydnWVN Jul 11 '12 at 15:58
0

Start with the basics - you've got two different series of machines that likely have two different series of NIC's. Are both sides set for autonegotiation and, if so, are they agreeing on the appropriate speed? Try hard-coding both sides as an experiment to see if it improves at all (..or if it's hard-coded on either side currently then let both sides negotiate).

rnxrx
  • It's not a hardware issue. Not sure what it is, but it's definitely not that. – Piers Karsenbarg Aug 06 '12 at 14:24
  • So you've confirmed that the counters on the switch port aren't showing any kind of errors? – rnxrx Aug 06 '12 at 14:28
  • Hard-coding? Counters on the switch? Elaborate please! – HaydnWVN Aug 06 '12 at 16:14
  • My comments have to do with the Ethernet switch you're using. If you're able to look at the counters for various errors you may find that the machines in question are seeing these counters rising. This would be indicative of an issue with the configuration of the network interfaces. – rnxrx Aug 06 '12 at 17:35
  • It's a basic unmanaged switch with no interface for me to check anything like that. What kind of network configuration? And why would it only be with 1 site and nothing else? – HaydnWVN Aug 07 '12 at 08:50
  • It could end up that this particular site is pushing more data at a particular time, or something to that effect. My point overall is that this kind of thing can be a ghost in the machine. Is there any other physical commonality between the affected machines? Same switch (or group of ports)? New patch cables with this batch of machines? You may have swapped all of this stuff already, but if you haven't then try swapping one of the problem machines directly with one of the working ones (i.e. reuse its connection + patch) to eliminate this as an area to investigate, if nothing else. – rnxrx Aug 07 '12 at 19:45
  • Have tried the 'machine swap' scenario, no change. What do you mean by "`Ghost in the Machine`"? How would you troubleshoot it? – HaydnWVN Aug 08 '12 at 13:47
0

There's a big gap between

ClientConnected:    17:02:33.720

...and...

ClientBeginRequest: 17:02:39.118

Either you are losing packets or the client-side security software is broken. It's trivial to test the former with Wireshark, and even if you don't see packet loss (retransmits) you can determine the directionality of the injected latency.
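
To put a number on that gap, and to compare it across several captures, the Fiddler timestamps can be subtracted directly. A minimal sketch, assuming both timestamps fall on the same day (`gap_seconds` is a name chosen for illustration):

```python
from datetime import datetime


def gap_seconds(start: str, end: str) -> float:
    """Seconds between two Fiddler-style HH:MM:SS.mmm timestamps taken on the same day."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
```

For the failing request, `gap_seconds("17:02:33.720", "17:02:39.118")` comes to about 5.4 seconds, while the working request shows no gap at all between ClientConnected and ClientBeginRequest.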

symcbean
  • Will test further, any ideas why this would only be happening with 1 website and nothing else? – HaydnWVN Aug 07 '12 at 08:49
  • No lost packets from any machines (affected and unaffected). The gap between the Connect and BeginRequest can be anything between 3-9 seconds on either type of PC. What do you mean by "`the client side security software is broken`"? As I can control/fix that! Any idea what may be broken? – HaydnWVN Aug 08 '12 at 13:45
  • If you can get the same results with different browsers on affected machines, then this would demonstrate that it's your firewall / security software causing the problem - try un-installing it (or better yet a fresh install without the software). – symcbean Aug 08 '12 at 23:55
0

As of this morning this issue is 'fixed'.

I have worked (via email) with Piers Karsenbarg on several different avenues of resolution, all to no avail. Nothing has been changed on the website and nothing has been changed on the machines - except some Windows Updates. Can't thank Piers enough for getting involved with the problem and spending lots of his quality time trying to resolve it!

Piers linked me to this, which matches all the symptoms (but none of the causes) on the machines in question (i.e. they have no Type 1 fonts). It is still possible that a Windows Update (or some Adobe update) fixed the issue, perhaps by replacing or removing the Type 1 fonts. Further information can be found here and here.

HaydnWVN