
I'm troubleshooting a customer who requires the ability to send 5000 pings from the router to their remote site over a satellite link with zero timeouts, yet they keep experiencing one to five packets lost per test.

Under ordinary circumstances, I'd be willing to chalk up such a low loss rate to the cost of using a satellite link, but the drops only show up when pinging from the router to the remote site. To clarify, here are the network devices involved:

Outbound Traffic

  1. 192.1.1.51: Router (Hub)
  2. 192.1.1.52: TX Switch (Hub)
  3. 192.1.1.50: Encapsulator (Hub)
  4. 172.1.1.1: Remote Site (Remote)

Return Traffic

  1. 172.1.1.1: Remote Site (Remote)
  2. 192.1.1.28: Channel Unit (Hub)
  3. 192.1.1.53: RX Switch (Hub)
  4. 192.1.1.51: Router (Hub)

When pinging from the Router to the remote site, the losses show up. When pinging from a Sun server attached to the TX switch (bypassing the router), the 5000 pings complete without a single loss. This verifies the entire satellite path, and all equipment except for the router.

Then I tried sending 5000 pings from the router to all of the other devices aside from the remote site...and I got back all 5000 almost instantaneously with no drops, so the connection from the router to everything else in the path is verified good.

The router in question is a Cisco 7206VXR, and the CPU utilization never appears to go above 50%. The highest process is only at 20%, so I'm not confident that it's simply a matter of the router dropping ICMP packets due to lower priority, particularly given that the router will send 5000 packets to local devices with no issues.

I also looked into the possibility of a null route, but the only possible culprit is an essential route for remote access, according to the customer, and I can't post their running config here to get a second opinion.

Any suggestions would be greatly appreciated. I have very little networking experience, and I'm beating my head against the wall to reconcile these seemingly contradictory symptoms.

Liesmith
  • How long are you waiting between pings? This could be a buffer issue? – Edwin Aug 13 '13 at 01:42
  • On the router, the pings are sent as soon as a response is received. So, when I'm pinging local devices, 5000 pings will complete in a second or so. When I'm pinging the remote site, the latency is about 500-600ms, so there's about a half-second between pings. The default timeout is 2 seconds, but I tried raising that to 2 minutes with no appreciable difference in results. On the Sun server, Solaris waits one second between pings, regardless of response time. – Liesmith Aug 13 '13 at 16:04
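To put rough numbers on the pacing difference described in the comment above, here is a minimal sketch. The function name is hypothetical, and the 0.55 s figure is only the midpoint of the quoted 500-600 ms latency, assumed for illustration; it is not a measurement from the thread.

```python
def run_duration_seconds(count: int, seconds_per_ping: float) -> float:
    """Total time for a sequential ping test, ignoring timeouts and retries."""
    return count * seconds_per_ping


if __name__ == "__main__":
    count = 5000

    # Router: the next ping goes out as soon as the reply arrives,
    # so the spacing is roughly one round trip (assumed 0.55 s here).
    router = run_duration_seconds(count, 0.55)

    # Sun server: Solaris waits a fixed one second between pings.
    solaris = run_duration_seconds(count, 1.0)

    print(f"router test:  about {router / 60:.0f} minutes")
    print(f"solaris test: about {solaris / 60:.0f} minutes")
```

Under those assumptions the two tests are not paced identically: the router sends roughly twice as many pings per minute as the Sun server, which may matter if buffering is involved.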

2 Answers


Datagrams are a best-effort service. If you have a requirement that data be reliably delivered, you cannot use datagrams. It really is that simple. The entire design of the system, end to end, is not meant to meet this requirement. You can't just impose it on the system as a whole at the end, like putting a cherry on a sundae.
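To illustrate how demanding "zero loss in 5000 pings" is, here is a minimal sketch. It assumes each packet is lost independently with the same probability, which is a simplification (real loss is often bursty), and the function names are hypothetical.

```python
def loss_rate(lost: int, sent: int) -> float:
    """Observed fraction of packets lost."""
    return lost / sent


def prob_clean_run(per_packet_loss: float, sent: int) -> float:
    """Chance that `sent` packets all arrive, assuming independent losses."""
    return (1.0 - per_packet_loss) ** sent


if __name__ == "__main__":
    sent = 5000
    # The question reports one to five packets lost per 5000-ping test.
    for lost in (1, 5):
        p = loss_rate(lost, sent)
        clean = prob_clean_run(p, sent)
        print(f"{lost} lost of {sent}: loss rate {p:.2%}, "
              f"chance of a clean run {clean:.1%}")
```

Under that assumption, even a per-packet loss rate of 0.02% (one in 5000) leaves only about a 37% chance that a given 5000-ping test completes cleanly, so a pass/fail test of this kind is very sensitive to tiny loss rates.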

David Schwartz
  • Thanks, the customer now states that the router doesn't have any significant traffic load, so no packets should be getting dropped. I'll try to come up with a different test for their traffic, thanks for the assistance. – Liesmith Aug 13 '13 at 16:16
  • @Liesmith: The customer may not know of any reason why packets should be getting dropped, but it doesn't follow that there isn't one. If there were an actual requirement that zero packets be dropped, each component of the system would have to have been chosen to ensure it can meet that requirement. – David Schwartz Aug 13 '13 at 18:20
  • I was able to ping from one remote terminal to another remote terminal, 5000 100-byte packets, no losses. Each packet is traversing the satellite four times, and the router twice over the course of its path, so I think it's just a matter of how the Cisco router treats ICMP traffic directed at the router itself. When I tried pinging from the remote terminal to the router, and vice-versa, simultaneously, the packet loss doubled on each test, so the amount of traffic (however slight) seems to have a large bearing on how many ICMP packets are dropped. – Liesmith Aug 13 '13 at 22:42
  • @Liesmith That makes sense. – David Schwartz Aug 13 '13 at 23:07
  • Now, the customer is stating that they've set up two PCs on this network, and pinged through the router, and they are seeing the same symptoms: a few packets dropped out of 5000. They said they also tried TCP (but wouldn't elaborate), and saw the same thing, so now these symptoms aren't making any sense to me. – Liesmith Aug 15 '13 at 22:48
  • Someone who understands the hardware and protocols used end-to-end could probably figure it out. It may be corruption on the satellite link, if it doesn't use error correction and/or retransmission. – David Schwartz Aug 15 '13 at 23:57
  • My first thought is always the satellite link, but I can ping 5000 times from a server hooked directly to the Tx switch; so the signal is passing over the satellite, but skipping the router. For ICMP, there shouldn't be any re-transmission. For their supposed TCP test, there should be re-transmission because it passes through an accelerator (not listed in my original question, because non-TCP traffic bypasses it). – Liesmith Aug 16 '13 at 08:53

It turns out the problem was that CEF was enabled globally on the hub router, but explicitly disabled ("no ip route-cache cef") on the interface which connects to the hub LAN. Once the explicit disable statements were removed, the packet loss vanished.

I don't understand why that worked, given that there was no packet loss between the hub devices and the hub router, but I can't argue with the results.

Hopefully, this can help anyone else who is stuck trying to isolate a very minor packet loss.
Thanks again to everyone who offered advice on this issue.

Liesmith