An apartment complex has fiber internet and has been experiencing latency problems for the past month.

Tenants frequently experience timeouts and corrupted webpages. The current workaround is to refresh the webpage half a dozen times until it loads correctly.

The symptoms are:

  • Inconsistent, happening a couple of times per day per tenant
  • Requesting a new DHCP lease on a laptop does not solve the issue
  • Affects both Mac and Windows machines. Update: ONLY AFFECTS MAC USERS!
  • Affects both wireless and wired connections
  • Is not a DNS issue, because we have tried the ISP's DNS servers and Google's DNS servers with no improvement
  • iTunes is heavily affected by this. The iTunes Store frequently times out (iPad, iPhone, Mac)

What other diagnostic tools could be used to identify the problem? The ISP says everything looks fine.

A traceroute shows huge latency (several seconds) on hop 9.

traceroute google.com

traceroute: Warning: google.com has multiple addresses; using 74.125.224.168
traceroute to google.com (74.125.224.168), 64 hops max, 52 byte packets
 1  10.90.4.1 (10.90.4.1)  3.086 ms  0.738 ms  0.683 ms
 2  69.169.148.1.provo.static.broadweavenetworks.net (69.169.148.1)  0.907 ms  1.135 ms  0.893 ms
 3  10.8.201.41 (10.8.201.41)  1.040 ms  1.552 ms  11.494 ms
 4  97.75.190.142 (97.75.190.142)  1.343 ms  1.347 ms  0.946 ms
 5  97.75.190.137 (97.75.190.137)  1.290 ms  1.609 ms  1.202 ms
 6  97.75.191.66 (97.75.191.66)  2.463 ms  2.146 ms  2.161 ms
 7  97.75.191.54 (97.75.191.54)  2.406 ms  2.281 ms  2.616 ms
 8  te-9-3.car1.saltlakecity1.level3.net (4.53.40.105)  3.014 ms  2.330 ms  2.241 ms
 9  * * *
10  ae-61-61.csw1.losangeles1.level3.net (4.69.137.2)  15.805 ms
    ae-91-91.csw4.losangeles1.level3.net (4.69.137.14)  15.441 ms  15.160 ms
11  * ae-1-60.edge1.losangeles9.level3.net (4.69.144.10)  17.204 ms  15.983 ms
12  google-inc.edge1.losangeles9.level3.net (4.53.228.6)  92.445 ms  82.679 ms  107.813 ms
13  64.233.174.238 (64.233.174.238)  21.234 ms  21.016 ms  21.321 ms
14  72.14.236.11 (72.14.236.11)  21.577 ms  21.630 ms  21.568 ms
15  lax02s01-in-f8.1e100.net (74.125.224.168)  20.798 ms  20.687 ms  20.666 ms

Affects most webpages (google, apple.com, facebook.com, etc.)

(Hops 9, 17 and 18 all take a long time.)

traceroute beachbody.com
traceroute to beachbody.com (66.208.81.68), 64 hops max, 52 byte packets
 1  10.90.4.1 (10.90.4.1)  1.038 ms  0.830 ms  0.767 ms
 2  69.169.148.1.provo.static.broadweavenetworks.net (69.169.148.1)  0.988 ms  0.934 ms  0.928 ms
 3  10.8.201.41 (10.8.201.41)  1.357 ms  1.375 ms  1.500 ms
 4  10.8.101.5 (10.8.101.5)  1.405 ms  1.579 ms  1.115 ms
 5  eth_3-3_prv02-rt02.veracitynetworks.com (97.75.190.166)  10.601 ms  1.563 ms  1.754 ms
 6  97.75.191.66 (97.75.191.66)  2.857 ms  13.554 ms  2.833 ms
 7  97.75.191.54 (97.75.191.54)  2.760 ms  2.394 ms  4.350 ms
 8  te-9-3.car1.saltlakecity1.level3.net (4.53.40.105)  2.352 ms  2.311 ms  2.340 ms
 9  * * *
10  ae-61-61.csw1.losangeles1.level3.net (4.69.137.2)  29.086 ms
    ae-71-71.csw2.losangeles1.level3.net (4.69.137.6)  28.958 ms
    ae-91-91.csw4.losangeles1.level3.net (4.69.137.14)  28.863 ms
11  ae-82-82.ebr2.losangeles1.level3.net (4.69.137.25)  28.075 ms
    ae-72-72.ebr2.losangeles1.level3.net (4.69.137.21)  28.508 ms
    ae-62-62.ebr2.losangeles1.level3.net (4.69.137.17)  29.029 ms
12  ae-6-6.ebr2.sanjose5.level3.net (4.69.148.202)  28.672 ms  28.586 ms  28.223 ms
13  ae-2-2.ebr2.sanjose1.level3.net (4.69.148.142)  28.426 ms  28.341 ms  29.611 ms
14  ae-4-4.car2.sacramento1.level3.net (4.69.132.157)  28.834 ms  29.236 ms  29.231 ms
15  ragingwire.car2.sacramento1.level3.net (4.53.202.22)  29.339 ms  29.406 ms  29.584 ms
16  resisp-74-221-224-49.smf.ragingwire.net (74.221.224.49)  26.096 ms  25.930 ms  26.575 ms
17  * 204.212.188.26 (204.212.188.26)  28.459 ms !X *
18  204.212.188.26 (204.212.188.26)  25.650 ms !X *  26.197 ms !X  


Update 1
Here is a traceroute with the same laptop, but different network location (sanitized).

beachbody.com fails 95% of the time at location 1. beachbody.com succeeds 100% of the time at location 2.

traceroute beachbody.com
traceroute to beachbody.com (66.208.81.68), 64 hops max, 52 byte packets
 1  foo.acme (y.y.y.y)  1.716 ms  13.343 ms  6.139 ms
 2  x.x.x.x (x.x.x.x)  74.524 ms  158.532 ms  6.721 ms
 3  tg9-2.cr01.slkcutxd.integra.net (209.63.98.37)  33.225 ms  24.794 ms  24.587 ms
 4  * be4.sc01.sntdcabl.integra.net (209.63.82.166)  32.474 ms  36.895 ms
 5  be1.br02.plalca01.integra.net (209.63.100.118)  24.120 ms  22.298 ms  22.176 ms
 6  peer-02.palo.twtelecom.net (198.32.175.111)  21.401 ms  22.576 ms  21.492 ms
 7  oak1-ar1-xe-0-1-0-0.us.twtelecom.net (206.222.120.214)  23.042 ms  22.441 ms  48.562 ms
 8  74.202.6.2 (74.202.6.2)  29.358 ms  32.253 ms  30.283 ms
 9  204.212.188.26 (204.212.188.26)  25.949 ms !X  30.199 ms !X *  


Update 2
Further investigation reveals that this only affects Mac users!
A second phone call with Veracity confirms that an unusually high percentage of Mac users have been reporting problems with iTunes. Level 3 techs have no idea what is causing this.

Update 3
Captured the event in Wireshark on 2 computers at the same time

Mac (has issue)
http://cl.ly/0o1D2r0K1s2s
Filter = "ip.dst==e3570.b.akamaiedge.net"

Windows (problem doesn't affect Windows PCs)
http://cl.ly/3v3e1s2M1W27
Filter = "ip.dst==e3570.b.akamaiedge.net"
Ctrl + F "beachbody"

I don't know why the source/destination shows up as e3570.b.akamaiedge.net and not "beachbody.com" or 66.208.81.68 (the Beachbody website IP).

spuder
  • It affects pretty much every webpage. I added a second example to the question. – spuder May 27 '13 at 05:57
  • Among other things, I see RFC 1918 addresses in the public Internet here. Your ISP's network people are obviously completely clueless, since this _alone_ is sufficient to break your network connectivity (by screwing up path MTU discovery). – Michael Hampton May 27 '13 at 06:04
  • I can't remember where I saw an excellent post about network troubleshooting. The fault in that case was an ISP switch or router. I'll try to Google it. – Guntis May 27 '13 at 06:11
  • Cannot find that post ... – Guntis May 27 '13 at 09:36
  • The latency jump is geographic (from Utah to CA), so that's normal. However, it definitely looks like an ISP issue, as the routing seems to be completely messed up when they do that transfer. – Nathan C May 28 '13 at 11:50
  • While I do not have the answer to your networking troubles, I do have some good news for you. I see that you're in Provo, and have Veracity as your ISP (as do I). Veracity has been notoriously bad at setting up apartment networks, though their direct connections for individual residences aren't bad. You won't be getting any good customer service out of them, since they're losing the iProvo network in a month to Google. That being said, soon enough it will be replaced with Google Fiber, and from what I've heard, their customer service is great. That, and most likely the apartment networking h – LandonWO May 27 '13 at 06:14
  • I think the next step is going to be to run Wireshark and look for anything out of the ordinary while one of these Macs tries to access the network. – Michael Hampton May 29 '13 at 02:32
  • Also look for MTU problems with `traceroute www.beachbody.com 1500`. – Michael Hampton May 29 '13 at 02:37
  • +1 @MichaelHampton - This SCREAMS of an MTU problem. Sniffing the wire is the answer (as it so often is). – Evan Anderson May 29 '13 at 02:48
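
The MTU check suggested in the comments above can be narrowed down by searching for the largest ICMP payload that still gets through with the don't-fragment bit set. The following is only a minimal sketch in Python, not something from the original thread. It assumes a hypothetical `probe(payload_size)` callable that returns True when such a ping is answered (for example by wrapping `ping -D -s SIZE host` on macOS or `ping -f -l SIZE host` on Windows). The function names and the search bounds are illustrative, and the 28 bytes of overhead assume an IPv4 header without options.

```python
IPV4_HEADER = 20  # bytes, assuming an IPv4 header with no options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def find_path_mtu(probe, low=548, high=1472):
    """Binary-search the largest payload for which probe(payload) is True.

    probe is a hypothetical callable supplied by the caller; it should send
    one don't-fragment ping with the given payload size and report success.
    Returns the implied path MTU, or None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid       # mid fits, so look for something larger
        else:
            high = mid - 1  # mid is too big, so look lower
    return low + IPV4_HEADER + ICMP_HEADER
```

The search assumes that every payload smaller than a working one also works. On a link that drops packets at random, as described in the question, each probe would probably need a few retries before it is counted as a failure.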

3 Answers

From your Wireshark capture, two things are obviously wrong:

  1. All of the IP packets you send have an invalid checksum of 0. This may be an artifact of how the OS captures the packets, so we'll ignore that for now...

  2. This is probably causing you a lot of grief: It appears your ISP is responding to some (but not all) of your requests with ICMP Time Exceeded responses, which has the effect of severing your connection. For instance, see your SYN packet in line 324 and your ISP's response from 97.75.190.142 in line 327. Since your packets have a TTL of 64 set in them, this strongly suggests your ISP has a routing loop somewhere in their network.

Send a copy of this pcap file to your ISP's network people. They should be able to figure out what in their network is broken.
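
To illustrate why a Time Exceeded reply from a nearby router points to a loop, here is a small Python sketch that walks a packet through a toy forwarding table. It is an illustration under stated assumptions, not a model of the ISP's real network: the router names (`r1`, `r2`, `r3`), the `forward` function, and both tables are hypothetical.

```python
def forward(next_hop, source, destination, ttl=64):
    """Walk one packet through a toy forwarding table.

    next_hop maps each node to the node it forwards to.
    Returns ("delivered", hop_count) or ("time_exceeded", router_name).
    """
    node = source
    hops = 0
    while True:
        node = next_hop[node]  # hand the packet to the next node
        hops += 1
        if node == destination:
            return ("delivered", hops)
        ttl -= 1               # every router decrements the TTL
        if ttl == 0:
            # The router drops the packet and reports ICMP Time Exceeded.
            return ("time_exceeded", node)


# Hypothetical healthy path: three routers, then the server.
healthy = {"host": "r1", "r1": "r2", "r2": "r3", "r3": "server"}

# Hypothetical broken path: r3 sends the packet back to r2.
looped = {"host": "r1", "r1": "r2", "r2": "r3", "r3": "r2"}

print(forward(healthy, "host", "server"))
print(forward(looped, "host", "server"))
```

On the healthy table the packet arrives with most of its TTL unused. On the looped table it bounces between `r2` and `r3` until the TTL reaches zero, and one of those two routers, only a few hops from the sender, reports Time Exceeded. That matches the pattern in the capture, where the reply comes from the ISP's own address space.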

Michael Hampton
  • Thanks! I did notice the invalid checksum error, but I shrugged it off after reading this page: http://wiki.wireshark.org/TCP_Checksum_Verification The TTL of 64 does look like it might be the culprit. I'll look into that. – spuder May 29 '13 at 05:05
  • Since there's almost nothing on planet Earth more than 20 hops away, getting a Time Exceeded - from your own ISP - is extremely suspicious. This is about as close as you're likely to get to a smoking gun proving it's their fault. – Michael Hampton May 29 '13 at 05:10
  • Thanks, I've passed this information on to the ISP, they don't seem too enthused about looking into it. – spuder Jun 06 '13 at 16:48

I had problems with random slowdowns and dropped connections at my complex recently. The best way for me to prove to them that there were issues was to use a low-level tool:

  1. Make sure you connect a wired connection directly to the wall, leaving out any routers and other devices you can. If you can do this with multiple machines, even better.
  2. Run a continuous ping and watch for large variance in response times or, worse, timeouts (indicating packets being dropped).

ping -t -w 1000 google.com

  3. Take a screenshot or send them the output if there are breaks in the stream. You want to see low variance, only a few ms difference in response times, and very few, if any, drops. Run this for a long time, more than a few minutes. For example:

C:\>ping -t -w 1000 google.com

Pinging google.com [74.125.140.102] with 32 bytes of data:
Reply from 74.125.140.102: bytes=32 time=19ms TTL=48
Reply from 74.125.140.102: bytes=32 time=17ms TTL=48
Reply from 74.125.140.102: bytes=32 time=21ms TTL=48
Reply from 74.125.140.102: bytes=32 time=16ms TTL=48
Reply from 74.125.140.102: bytes=32 time=17ms TTL=48
Reply from 74.125.140.102: bytes=32 time=29ms TTL=48
Reply from 74.125.140.102: bytes=32 time=20ms TTL=48
Reply from 74.125.140.102: bytes=32 time=45ms TTL=48
Reply from 74.125.140.102: bytes=32 time=16ms TTL=48
Reply from 74.125.140.102: bytes=32 time=19ms TTL=48
Reply from 74.125.140.102: bytes=32 time=15ms TTL=48
Reply from 74.125.140.102: bytes=32 time=15ms TTL=48

  4. If you can show there is a problem, keep calling them. It may take a while to get people to notice.
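
If the ping output is saved to a file, a short script can turn it into numbers that are easier to hand to the ISP than a screenshot. The following is a rough Python sketch, assuming Windows-style output lines like the ones above ("Reply from ... time=19ms" and "Request timed out."). The function name `summarize_ping` and the returned field names are made up for this example.

```python
import re
import statistics

# Matches "time=19ms" and "time<1ms" as printed by Windows ping.
REPLY = re.compile(r"time[=<](\d+)\s*ms", re.IGNORECASE)


def summarize_ping(lines):
    """Count replies and timeouts and summarize the round-trip times."""
    times = []
    lost = 0
    for line in lines:
        match = REPLY.search(line)
        if match:
            times.append(int(match.group(1)))
        elif "timed out" in line.lower():
            lost += 1
    sent = len(times) + lost
    return {
        "sent": sent,
        "lost": lost,
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "min_ms": min(times) if times else None,
        "max_ms": max(times) if times else None,
        "mean_ms": statistics.mean(times) if times else None,
        # Population standard deviation as a simple measure of variance.
        "stdev_ms": statistics.pstdev(times) if times else None,
    }
```

Calling `summarize_ping(open("ping.txt"))` on a saved log (the file name is just an example) returns the loss percentage and the spread of response times for the whole run.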

Hope that helps.

Noah Stahl

FYI - ping is the tool to check latency. It is processed in the data plane and is a true indication of lag for data packets. traceroute and tracert are processed in the control plane, so their response times are not an indication of network latency and can be inflated by high CPU utilization on the responding router. traceroute and tracert should only be used to show path selection.

Security_Pete