Very strange home router problem

Question

For a long time now I've had a very strange problem with my wi-fi network at home. I have a BT Voyager 2100 ADSL modem and an iMac, ageing PowerBook and a PC that connect to it wirelessly. The problem is that I can never access a small number of certain websites because they always time out.

There's nothing apparent that connects these websites in any way. Some examples that I've come across are www.adobe.com, www.microsoft.com, www.portsmouthguildhall.co.uk (a local venue) and subtraction.com (a blog). I can ping some of the sites without problems; there are no timeouts. In fact, I used to be able to access subtraction.com and can still get its RSS feed. I just can't view the site in a web browser any more. This is a very isolated problem—for the majority of my Internet use everything works fine.

It's clearly not a problem with the individual computers because they all have this problem, so it must be a problem downstream with my router or even ISP. I've upgraded the router to the latest firmware and tried resetting it, but it didn't fix the problem.

How can I even diagnose where the problem is? I'm at a loss as to where to start! Are there any UNIX networking commands that I can use (I have Mac OS X)?

Thanks for any help.

EDIT: Following Alnitak's suggestion, I tried a traceroute and ping with adobe.com. As you can see, the traceroute never gets there:

$ traceroute adobe.com
traceroute to adobe.com (192.150.18.117), 64 hops max, 40 byte packets
 1  voyager.home (192.168.1.1)  1.975 ms  1.505 ms  1.574 ms
 2  lo0-plusnet.ptn-ag2.plus.net (195.166.128.53)  28.476 ms  47.139 ms  28.036 ms
 3  ge0-0-0-204.ptn-gw02.plus.net (84.92.3.93)  28.520 ms  37.297 ms  33.186 ms
 4  te2-2.pte-gw2.plus.net (212.159.1.106)  35.670 ms  36.262 ms  34.995 ms
 5  80.239.193.141 (80.239.193.141)  33.932 ms  28.600 ms  28.764 ms
 6  ldn-bb1-link.telia.net (80.91.248.90)  29.649 ms  28.149 ms  30.857 ms
 7  ldn-b5-link.telia.net (80.91.249.178)  27.991 ms  28.014 ms  28.490 ms
 8  verio-129583-ldn-b5.telia.net (213.248.100.50)  28.468 ms  29.286 ms  31.702 ms
 9  ae-1.r23.londen03.uk.bb.gin.ntt.net (129.250.5.237)  30.871 ms  29.295 ms ae-1.r22.londen03.uk.bb.gin.ntt.net (129.250.5.233)  28.614 ms
10  ae-0.r22.londen03.uk.bb.gin.ntt.net (129.250.4.85)  29.732 ms as-0.r20.nycmny01.us.bb.gin.ntt.net (129.250.3.254)  108.909 ms ae-0.r22.londen03.uk.bb.gin.ntt.net (129.250.4.85)  28.505 ms
11  ae-0.r21.nycmny01.us.bb.gin.ntt.net (129.250.2.26)  109.164 ms as-0.r20.nycmny01.us.bb.gin.ntt.net (129.250.3.254)  104.860 ms ae-0.r21.nycmny01.us.bb.gin.ntt.net (129.250.2.26)  111.253 ms
12  as-0.r20.asbnva02.us.bb.gin.ntt.net (129.250.2.9)  104.777 ms ae-0.r21.nycmny01.us.bb.gin.ntt.net (129.250.2.26)  109.973 ms as-0.r20.asbnva02.us.bb.gin.ntt.net (129.250.2.9)  108.774 ms
13  as-0.r20.asbnva02.us.bb.gin.ntt.net (129.250.2.9)  103.691 ms ae-3.r21.asbnva01.us.bb.gin.ntt.net (129.250.2.128)  104.958 ms as-0.r20.asbnva02.us.bb.gin.ntt.net (129.250.2.9)  104.455 ms
14  as-3.r20.snjsca04.us.bb.gin.ntt.net (129.250.2.167)  197.595 ms ae-3.r21.asbnva01.us.bb.gin.ntt.net (129.250.2.128)  105.027 ms  106.565 ms
15  * as-3.r20.snjsca04.us.bb.gin.ntt.net (129.250.2.167)  179.946 ms *
16  * te-5-3.r02.snjsca04.us.ce.gin.ntt.net (128.241.219.86)  176.374 ms *
17  * * te-5-3.r02.snjsca04.us.ce.gin.ntt.net (128.241.219.86)  189.724 ms
18  * * *
19  * * *
20  * * *
^C

Now trying a ping from hop 14 onwards. As you can see, the last ping has 20% packet loss:

$ ping -s 1492 as-3.r20.snjsca04.us.bb.gin.ntt.net
PING as-3.r20.snjsca04.us.bb.gin.ntt.net (129.250.2.167): 1492 data bytes
1500 bytes from 129.250.2.167: icmp_seq=0 ttl=55 time=214.555 ms
1500 bytes from 129.250.2.167: icmp_seq=1 ttl=55 time=215.339 ms
1500 bytes from 129.250.2.167: icmp_seq=2 ttl=55 time=221.211 ms
1500 bytes from 129.250.2.167: icmp_seq=3 ttl=55 time=224.296 ms
^C
--- as-3.r20.snjsca04.us.bb.gin.ntt.net ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 214.555/218.850/224.296/4.062 ms

$ ping -s 1492 as-3.r20.snjsca04.us.bb.gin.ntt.net
PING as-3.r20.snjsca04.us.bb.gin.ntt.net (129.250.2.167): 1492 data bytes
1500 bytes from 129.250.2.167: icmp_seq=0 ttl=55 time=299.852 ms
1500 bytes from 129.250.2.167: icmp_seq=1 ttl=55 time=326.598 ms
1500 bytes from 129.250.2.167: icmp_seq=2 ttl=55 time=243.278 ms
1500 bytes from 129.250.2.167: icmp_seq=3 ttl=55 time=214.610 ms
1500 bytes from 129.250.2.167: icmp_seq=4 ttl=55 time=232.900 ms
^C
--- as-3.r20.snjsca04.us.bb.gin.ntt.net ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 214.610/263.448/326.598/42.517 ms

$ ping -s 1492 te-5-3.r02.snjsca04.us.ce.gin.ntt.net
PING te-5-3.r02.snjsca04.us.ce.gin.ntt.net (128.241.219.86): 1492 data bytes
1500 bytes from 128.241.219.86: icmp_seq=0 ttl=245 time=349.851 ms
1500 bytes from 128.241.219.86: icmp_seq=1 ttl=245 time=270.748 ms
1500 bytes from 128.241.219.86: icmp_seq=2 ttl=245 time=334.406 ms
1500 bytes from 128.241.219.86: icmp_seq=3 ttl=245 time=220.046 ms
^C
--- te-5-3.r02.snjsca04.us.ce.gin.ntt.net ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 220.046/293.763/349.851/51.869 ms

$ ping -s 1492 te-5-3.r02.snjsca04.us.ce.gin.ntt.net
PING te-5-3.r02.snjsca04.us.ce.gin.ntt.net (128.241.219.86): 1492 data bytes
1500 bytes from 128.241.219.86: icmp_seq=0 ttl=245 time=472.908 ms
1500 bytes from 128.241.219.86: icmp_seq=1 ttl=245 time=228.290 ms
1500 bytes from 128.241.219.86: icmp_seq=2 ttl=245 time=231.048 ms
1500 bytes from 128.241.219.86: icmp_seq=3 ttl=245 time=229.906 ms
^C
--- te-5-3.r02.snjsca04.us.ce.gin.ntt.net ping statistics ---
5 packets transmitted, 4 packets received, 20% packet loss
round-trip min/avg/max/stddev = 228.290/290.538/472.908/105.296 ms

It could be a problem with the router, but it could also be a strange network topology or routing issue with your network provider. If you know anyone else using the same provider, especially if they live close to you, see if they have similar issues. — Eddie, May 01 '09 at 14:47
my own tests against www.adobe.com (adobe.com doesn't answer) show that it won't answer pings longer than 1430 bytes long. try setting your network card's MTU to 1438 and see if your problem disappears. — Alnitak, May 02 '09 at 14:48
Just tried setting both the router's MTU and the network card on my Mac to 1300...no good. — John Topley, May 02 '09 at 15:11
Just tried an MTU of 1200 - no good. I have a static IP address. Could the sites be blocking that for some reason? — John Topley, May 03 '09 at 07:56
that last ping hasn't got 20% packet loss, you just happen to have aborted the ping during the round-trip of the last packet — Alnitak, May 05 '09 at 19:49
tbh, it still smells like an MTU problem, but I'm at a loss as to the reason. — Alnitak, May 05 '09 at 19:53
John, if possible, please run 'sudo tcpdump -i en1 port 80' at the same time as you try to access one of those sites, and then post the results somewhere like pastebin.com — Alnitak, May 05 '09 at 19:59
Here you go: http://pastebin.com/m67b4f381 - thanks Alnitak! — John Topley, May 05 '09 at 20:21
ok, clearly shows there's a problem, but not the cause. Please add "-s 0" to the flags so that tcpdump can catch the whole response packet. Also, what's your MTU set to now? The outgoing MSS is still 1460, which is too high, but I can't find any details on how to change it on Mac OSX — Alnitak, May 05 '09 at 21:39
Unfortunately I can't run that command (even with sudo). I get: "tcpdump: BIOCSETIF: -s: Device not configured". There are 3 MTU settings I can configure, all set to the defaults for now. The OS X Ethernet config in Network preferences is 1500. Then in my router config the LAN side IP settings also has an MTU of 1500. Finally the router's Internet connection has an MTU of 1400. — John Topley, May 06 '09 at 07:54
John, that's a zero, not an "Oh": tcpdump -i en1 -s 0 port 80 or icmp — Alnitak, May 06 '09 at 11:27
I was using a zero, I just got the parameter order wrong. Anyway, here's the output: http://pastebin.com/m43086d12 (the Google stuff is because I have Google Notifier polling in the background) — John Topley, May 06 '09 at 18:58
It looks like the start of the TCP connection was missed in the tcpdump :( The "bad hdr length" messages are AFAIK the reason why it's not working, but I can't be sure why they're happening. — Alnitak, May 08 '09 at 06:38
One more thing to try is adding "-vvx" to the tcpdump options to get a raw packet trace: tcpdump -i en1 -s 0 -vvx port 80 or icmp — cmeerw, May 08 '09 at 21:48
OK, here's the verbose tcpdump output: http://pastebin.com/m320a9520 — John Topley, May 09 '09 at 09:24

Alnitak · Accepted Answer · 2009-05-09T12:10:51.010

This sounds like an MTU problem.

There's likely something between you and those sites that doesn't support the typical 1500 byte MTU, and on top of that probably a firewall blocking the ICMP packets that are used for "Path MTU Discovery", so your end can't tell that the normal MTU can't be used.

Try a traceroute, and then for each hop in turn, try sending a large ping packet (1492 bytes) and see if any of those hops refuse to return the packet.
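When choosing the ping size for this test, it helps to remember that the number given to `ping -s` is only the ICMP payload. A rough sketch of the arithmetic, assuming plain IPv4 headers with no options (the function names are made up for illustration):

```python
# Sizes assume IPv4 without options; these constants are the standard minimums.
IP_HEADER = 20    # bytes in a minimal IPv4 header
ICMP_HEADER = 8   # bytes in an ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest `ping -s` value that still fits in one unfragmented packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def packet_size(payload: int) -> int:
    """Total IPv4 packet size that `ping -s payload` puts on the wire."""
    return payload + ICMP_HEADER + IP_HEADER


if __name__ == "__main__":
    for mtu in (1500, 1492, 1400):
        print(f"MTU {mtu}: largest unfragmented ping payload is {max_ping_payload(mtu)}")
    print(f"ping -s 1492 sends {packet_size(1492)}-byte IP packets")
```

By this arithmetic a 1492-byte payload is already larger than a 1500-byte MTU allows, so it would be fragmented on the way out unless your ping has a "don't fragment" option (check `man ping`). A payload of 1472 is the largest that exactly fills a 1500-byte packet.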

EDIT - your tcpdump output shows that your end is still trying to initiate TCP's "three-way handshake" because the SYN bit is sent in the packets from your end. However the packets coming back from Adobe appear to be truncated or malformed. That's pretty weird, because there shouldn't be any payload in the packets, just the far end's SYN response. I'd need to see a full dump (including the -X option) of just those first 4 or so packets to know more.

EDIT2 - based on your detailed tcpdumps I believe that your router is corrupting the TCP response from some sites. The best way to test this is to borrow another brand of router.

Can we slap upside the head all the netadmins that still cling to the false belief that all ICMP should be blocked? :-) I can't believe we're STILL dealing with this all these years later. C'mon, PPPoE has been out forever. I can understand not "getting it" before then, since the problem never really came up, but really, nowadays everyone should know better. — Brian Knoblauch, May 05 '09 at 20:02

score 5 · Answer 2 · answered May 01 '09 at 14:38

Plug one of your computers directly into your internet connection and let it get all its network settings from your ISP. If you can't access the sites then it's an ISP issue; if you can, then it's a router issue and you can go from there.

Jared

I like the sound of this plan. Unfortunately I can't try it right now because I don't have the right sort of cable or connector. – John Topley May 01 '09 at 17:25
I need to buy an RJ-11 to RJ-45 adaptor. – John Topley May 01 '09 at 17:29
It's worth noting your router is your modem as well. Do you find the same symptoms when you are hardwired to the modem as well? – Chealion May 01 '09 at 19:27
@John - that's impossible, he's got an ADSL router, so the presentation from the ISP isn't ethernet. – Alnitak May 02 '09 at 11:03
He could try to DMZ his machine from the router. – Manuel Ferreria May 02 '09 at 14:42

score 3 · Answer 3 · answered May 01 '09 at 14:34

You can try a traceroute, and see how far your packets are getting. If they're stopping at your router, it's probably a problem there. If they go farther, you might want to get in touch with your ISP.

Reading your question again, you say you can ping the servers successfully, so you might not see anything abnormal on the traceroute...

score 3 · Answer 4 · answered May 08 '09 at 01:18

I definitely agree that the basic symptoms of this problem sound like a path MTU problem. There are other possibilities, but that is the most likely place to start.

Given the prominence of the sites you mention, and presumably the extended period over which this has been occurring, it seems unlikely that the problem lies within the ISP's network. That said, given the traceroute result shown in the question, the path depth and total latency don't reflect very well on your ISP. Generally speaking, any decent ISP should get you to any major/prominent web property (within the USA) in something [well] under 120ms, but I digress.

Using traceroute and ping to diagnose the problem, as others have mentioned, is very helpful, but it is far from a definitive tool given the possibility (even likelihood) of ICMP blocking or filtering in various locations. Because of this, except in the hands of a skilled analyst, it is pretty hard to tell the difference between a specific problem and firewalls messing with ICMP.

The best way to rule out an MTU problem is to start by reducing the MTU of the Ethernet interface in one of the computers that is having the problem. See the procedure located here for Mac systems, since you mentioned you have a Mac.

Lower your interface MTU as the procedure describes, in steps of, say, 100 bytes at a time from 1400 down to 500 bytes, checking functionality at each step. If the problem suddenly goes away at one of the steps, then you definitely have a path MTU problem. If dropping down to a minimum of 500 doesn't solve it, then it is not a path MTU problem and you can move on to investigating other possibilities (after you switch your MTU back to where it started, which was probably 1500 bytes).
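The step-down procedure can be sketched as a simple loop. This is only an illustration: `works` is a hypothetical callback standing in for "set the interface MTU to this value and try loading one of the failing sites", which you would do by hand or script yourself.

```python
from typing import Callable, Optional


def find_working_mtu(
    works: Callable[[int], bool],
    start: int = 1400,
    stop: int = 500,
    step: int = 100,
) -> Optional[int]:
    """Try MTUs from `start` down to `stop` and return the first that works.

    `works` is a hypothetical probe: it should apply the MTU and report
    whether the problem sites became reachable.
    """
    for mtu in range(start, stop - 1, -step):
        if works(mtu):
            return mtu
    return None  # nothing helped, so it is probably not a path MTU problem


if __name__ == "__main__":
    # Pretend the path only passes packets of 1250 bytes or fewer.
    pretend_path_limit = 1250
    print(find_working_mtu(lambda mtu: mtu <= pretend_path_limit))
```

If the function returns a value, that MTU is at or below the real path limit, and you can narrow it down further with smaller steps. If it returns `None`, move on to other theories.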

So I should try reducing the MTU on my Mac and leave the two router MTU settings the same? — John Topley, May 09 '09 at 09:25
@John Topley - The router MTU settings should not affect this experiment as long as it (or they) are larger than the Mac's settings. (I'm assuming the router is set to something around 1400 or larger). In other words, the MTU setting at the source prevails as long as it is smaller. — Tall Jeff, May 09 '09 at 11:01

score 3 · Answer 5 · answered May 10 '09 at 09:50

I've fixed the problem now and in the end the fix was deliciously simple. I logged a support call with my ISP (PlusNet) and they sent me a link to a forum post explaining that this problem is a bug in my router's firmware. The fix was simply to set the router's Internet connection MTU to 1500 (the default is 1400) so that it matches the router's LAN side MTU.

Thanks to everyone who offered help and advice. I'm going to accept Alnitak's answer simply because he/she stuck with me on this and kept coming back with more advice and things to try.

glad it's fixed, and was indeed an MTU problem. There must be something very odd in that firmware if it can't even complete the three way handshake (which has very small packets) in these circumstances. — Alnitak, May 10 '09 at 14:53

score 2 · Answer 6 · answered May 08 '09 at 02:10

You did not mention whether you are going through a proxy server. It might be interesting to see whether your ISP is transparently proxying you, a practice I consider very evil but which I think is quite common. Maybe you could try http://tracetcp.sourceforge.net/usage_proxy.html and do a TCP trace to the hosts that are not working; that could be interesting.

In the meantime going through a proxy server should allow you to access the sites so you at least have a workaround.

Have you tried contacting your ISP about this issue?

To me your traceroute and ping results are totally normal. The lack of replies at the end is normal; the last hop shown is simply the last one that sends ICMP "max hops reached" replies. tracepath is a utility for diagnosing MTU problems, which may help you.

I'm not aware that I'm going through a proxy server. I haven't contacted my ISP yet - wanted to get advice here first. I'll try your suggestion. — John Topley, May 08 '09 at 09:11

score 1 · Answer 7 · answered May 05 '09 at 19:52

I agree that this sounds like a failure of Path MTU Discovery.

The solution to this problem for me (on linux) was to enable advanced router support in the kernel and the TCPMSS target support in the netfilter/core netfilter section of the kernel config. And then to tell iptables to force maximum segment size down:

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

An alternative might be to pick a very small mtu (and possibly work upwards from there), and while that might bring its own problems, it should make these sites reachable.
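For anyone wondering what clamping the MSS actually does, here is a minimal sketch of the arithmetic, assuming IPv4 and TCP headers without options (20 bytes each). The function names are illustrative and not part of any real tool.

```python
# Assumes a 20-byte IPv4 header plus a 20-byte TCP header, with no options.
TCP_IP_OVERHEAD = 40


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP segment payload that fits in one packet of size `mtu`."""
    return mtu - TCP_IP_OVERHEAD


def clamp_mss(advertised_mss: int, path_mtu: int) -> int:
    """Lower an advertised MSS so full segments fit within `path_mtu`."""
    return min(advertised_mss, mss_for_mtu(path_mtu))


if __name__ == "__main__":
    print(mss_for_mtu(1500))      # MSS that matches a 1500-byte MTU
    print(clamp_mss(1460, 1400))  # same MSS clamped for a 1400-byte link
```

This is why an MSS of 1460 in a SYN packet suggests the sender believes the MTU is 1500. If a link on the path only carries 1400 bytes, a clamped value of 1360 would keep full-sized segments from needing fragmentation.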

score 1 · Answer 8 · answered May 09 '09 at 11:47

I have now sent a similar TCP connection request packet to www.adobe.com from my local machine (the only difference being the source IP address) and compared the response packet I get with the one in your latest tcpdump.

I have found 3 differences in the IP/TCP headers:

the "Differentiated Services" field in the IP is set to 0x80 in your case and 0x00 in my case - I am pretty sure this is caused by PlusNet's traffic prioritisation.
the 4 bytes at offset 0x20 are "0000 5012" in your case and "5012 0000" in my case - these are the data offset, flags and window size fields in the TCP header. It looks like something is swapping these 2-byte words in your case, and this is definitely what results in an invalid TCP packet.
the connection response has a TCP MSS option (with value 1460) added in your case, but there are no TCP options in my case

My guess would be that your router tries to be clever by adding a MSS TCP option, but in some cases messes up the TCP header. Does your router have any "MSS clamping" settings - if so, I would try disabling those settings. Otherwise I would suggest asking PlusNet support (showing them the tcpdump output).
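To see why the swapped words matter, the two 4-byte sequences quoted above can be decoded as the TCP "data offset/flags" and "window" fields. This is a small sketch using only the bytes quoted in this answer; the helper name is made up.

```python
import struct


def decode_offset_flags_window(four_bytes: bytes) -> dict:
    """Decode bytes 12-15 of a TCP header: data offset, flags, window."""
    offset_flags, window = struct.unpack("!HH", four_bytes)
    data_offset_words = offset_flags >> 12   # header length in 32-bit words
    flags = offset_flags & 0x01FF            # low 9 bits hold the flags
    return {
        "header_bytes": data_offset_words * 4,
        "syn": bool(flags & 0x02),
        "ack": bool(flags & 0x10),
        "window": window,
    }


if __name__ == "__main__":
    expected = decode_offset_flags_window(bytes.fromhex("50120000"))
    swapped = decode_offset_flags_window(bytes.fromhex("00005012"))
    print("expected:", expected)  # 20-byte header, SYN and ACK set, window 0
    print("swapped: ", swapped)   # 0-byte header, no flags, window 20498
```

With the words swapped, the header length field reads as zero, which is below the 20-byte minimum for a TCP header. That would be consistent with the "bad hdr length" messages in the earlier tcpdump output, and the packet no longer looks like a SYN/ACK at all.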

There are no configurable MSS clamping settings in the router config. I've logged a support call with PlusNet, so we'll see what they say... — John Topley, May 10 '09 at 09:14

score 0 · Answer 9 · answered May 07 '09 at 20:17

I had a similar problem with my router locking up when accessing certain streaming audio/video resources. Updating the WMP network settings resolved that particular issue; not sure if it might be relevant in your case.

An̲̳̳drew

Hafthor · Answer 10 · 2009-05-11T16:55:30.490

-1

Gonna go out on a limb and say it is a subnet mask problem, either with your local LAN (should be 255.255.255.0) or with your WAN-side.

I suggested this because if the subnet mask were incorrectly set to something like 255.254.255.0, you could end up with strange results - for big sites (with multiple A records) seemingly random reachability.
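As a rough illustration of how a mask like that misbehaves, the sketch below checks whether a mask is contiguous and whether two addresses look "local" to each other under it. The addresses are made up for the example and the helper names are hypothetical.

```python
import ipaddress


def to_int(addr: str) -> int:
    """Convert a dotted-quad IPv4 address or mask to an integer."""
    return int(ipaddress.IPv4Address(addr))


def is_contiguous(mask: str) -> bool:
    """True if the mask is a run of 1 bits followed only by 0 bits."""
    inverted = ~to_int(mask) & 0xFFFFFFFF
    return (inverted & (inverted + 1)) == 0


def same_subnet(a: str, b: str, mask: str) -> bool:
    """True if a host using `mask` would treat `a` and `b` as on one subnet."""
    m = to_int(mask)
    return (to_int(a) & m) == (to_int(b) & m)


if __name__ == "__main__":
    local, remote = "192.168.1.10", "192.169.1.77"  # illustrative addresses
    for mask in ("255.255.255.0", "255.254.255.0"):
        print(mask, is_contiguous(mask), same_subnet(local, remote, mask))
```

With the bad mask, the remote address is wrongly treated as being on the local network, so the host would try to reach it directly instead of sending it to the router. Only remote addresses that happen to collide this way would be affected, which is how reachability could appear random.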

edited May 11 '09 at 16:55

answered May 08 '09 at 23:22

His ability to ping these same sites kills a lot of theories like this (including my own). – gbarry May 08 '09 at 23:37
Mostly - you can end up with different behavior between the browser and ping because the browser caches DNS itself. – Hafthor May 11 '09 at 16:59
