
I have an issue with private network traffic not being masqueraded in very specific circumstances.

The setup is a group of VMware guests on the 10.1.0.0/18 network.

The problematic host is 10.1.4.20 with netmask 255.255.192.0 (/18), and the only gateway it is configured to use is 10.1.63.254. The gateway server, 37.59.245.59, should be masquerading all outbound traffic and forwarding it through 37.59.245.62, but for some reason 10.1.4.20 occasionally ends up with 37.59.245.62 in its routing cache.

ip -s route show cache 199.16.156.40
199.16.156.40 from 10.1.4.20 via 37.59.245.62 dev eth0
    cache  used 149 age 17sec ipid 0x9e49
199.16.156.40 via 37.59.245.62 dev eth0  src 10.1.4.20
    cache  used 119 age 11sec ipid 0x9e49

ip route flush cache 199.16.156.40

ping api.twitter.com
PING api.twitter.com (199.16.156.40) 56(84) bytes of data.
64 bytes from 199.16.156.40: icmp_req=1 ttl=247 time=93.4 ms

ip -s route show cache 199.16.156.40
199.16.156.40 from 10.1.4.20 via 10.1.63.254 dev eth0
    cache  age 3sec
199.16.156.40 via 10.1.63.254 dev eth0  src 10.1.4.20
    cache  used 2 age 2sec

The question is, why am I seeing a public IP address in my routing cache on a private network?
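For reference, a quick diagnostic sketch for reproducing this: ask the kernel which next hop it will actually use, flush, and retest (these are the same ip commands used above, nothing assumed beyond that).

# Ask the kernel which next hop it will actually use right now:
ip route get 199.16.156.40

# If that reports "via 37.59.245.62", the entry was installed at runtime,
# since the app server's routing table (shown below) only knows
# 10.1.63.254. Flush and retest:
ip route flush cache 199.16.156.40
ip route get 199.16.156.40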

Network information for the app server (without lo):

ifconfig

eth0      Link encap:Ethernet  HWaddr 00:50:56:a4:48:20
          inet addr:10.1.4.20  Bcast:10.1.63.255  Mask:255.255.192.0
          inet6 addr: fe80::250:56ff:fea4:4820/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1523222895 errors:0 dropped:407 overruns:0 frame:0
          TX packets:1444207934 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1524116772058 (1.5 TB)  TX bytes:565691877505 (565.6 GB)

Network information for the VPN gateway (also without lo):

 eth0      Link encap:Ethernet  HWaddr 00:50:56:a4:56:e9
           inet addr:37.59.245.59  Bcast:37.59.245.63  Mask:255.255.255.192
           inet6 addr: fe80::250:56ff:fea4:56e9/64 Scope:Link
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
           RX packets:7030472688 errors:0 dropped:1802 overruns:0 frame:0
           TX packets:6959026084 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000
           RX bytes:7777330931859 (7.7 TB)  TX bytes:7482143729162 (7.4 TB)

 eth0:0    Link encap:Ethernet  HWaddr 00:50:56:a4:56:e9
           inet addr:10.1.63.254  Bcast:10.1.63.255  Mask:255.255.192.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

 eth0:1    Link encap:Ethernet  HWaddr 00:50:56:a4:56:e9
           inet addr:10.1.127.254  Bcast:10.1.127.255  Mask:255.255.192.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

 tun0      Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
           inet addr:10.8.1.1  P-t-P:10.8.1.2  Mask:255.255.255.255
           UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1500  Metric:1
           RX packets:477047415 errors:0 dropped:0 overruns:0 frame:0
           TX packets:833650386 errors:0 dropped:101834 overruns:0 carrier:0
           collisions:0 txqueuelen:100
           RX bytes:89948688258 (89.9 GB)  TX bytes:1050533566879 (1.0 TB)

eth0 leads to the outside world, and tun0 to an OpenVPN network of VMs on which the app server sits.

ip r for the VPN gateway:

default via 37.59.245.62 dev eth0  metric 100
10.1.0.0/18 dev eth0  proto kernel  scope link  src 10.1.63.254
10.1.64.0/18 dev eth0  proto kernel  scope link  src 10.1.127.254
10.8.1.0/24 via 10.8.1.2 dev tun0
10.8.1.2 dev tun0  proto kernel  scope link  src 10.8.1.1
10.9.0.0/28 via 10.8.1.2 dev tun0
37.59.245.0/26 dev eth0  proto kernel  scope link  src 37.59.245.59

ip r on the app server:

default via 10.1.63.254 dev eth0  metric 100
10.1.0.0/18 dev eth0  proto kernel  scope link  src 10.1.4.20
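Note that 37.59.245.62 appears nowhere in this table, so any occurrence of it on the app server can only live in the route cache. A quick check (a sketch):

# The table itself has no route via 37.59.245.62:
ip route | grep 37.59.245.62 || echo "not in the routing table"

# So any match here must have been installed at runtime (e.g. by an
# ICMP message), not by static configuration:
ip route show cache | grep 37.59.245.62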

Firewall rules (NAT table) on the gateway:

Chain PREROUTING (policy ACCEPT 380M packets, 400G bytes) 
pkts bytes target prot opt in out source destination 

Chain INPUT (policy ACCEPT 127M packets, 9401M bytes) 
pkts bytes target prot opt in out source destination 

Chain OUTPUT (policy ACCEPT 1876K packets, 137M bytes) 
pkts bytes target prot opt in out source destination 

Chain POSTROUTING (policy ACCEPT 223M packets, 389G bytes) 
pkts bytes target prot opt in out source destination 

32M 1921M MASQUERADE all -- * eth0 10.1.0.0/17 0.0.0.0/0
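For reference, the exact original command isn't shown here, but a rule with those counters would typically have been created along these lines (a sketch):

# NAT table, POSTROUTING chain: masquerade everything sourced from
# 10.1.0.0/17 that leaves through eth0.
iptables -t nat -A POSTROUTING -s 10.1.0.0/17 -o eth0 -j MASQUERADE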
greg0ire
  • It might be useful to add the --no-dns (or just -n) flag to your mtr. I suspect that one of the routes in the middle is your 37.59.245.62. – Dan Pritts Mar 07 '14 at 15:29
  • Looks like you have a lot of flapping routes. – NickW Mar 07 '14 at 15:29
  • @DanPritts: updated my question – greg0ire Mar 07 '14 at 15:33
  • @NickW: how do you see that? – greg0ire Mar 07 '14 at 15:39
  • Not you specifically, but the route used by your machine when you ping changes often. Either the network admin is trying some failing form of load balancing, or the routes offered to get to Twitter are changing quite often. – NickW Mar 07 '14 at 15:45
  • My guess is that 37.59.245.62 and 10.1.63.254 are both on your LAN/LANs, or possibly they are the same device. So, as NickW suggests, something is wrong on your local network. Why it would only show up with Twitter, I don't know - or have you not tried other places? – Dan Pritts Mar 07 '14 at 15:50
  • @DanPritts: The app server requests other services (Facebook, Google Plus, jQuery CDNs) with no problem. – greg0ire Mar 07 '14 at 16:10
  • Where is that app server VM hosted? Please show ip r on all involved hosts, including the host of the app server VM. (And ip a ls for that one as well.) – ch2500 Mar 09 '14 at 00:40
  • ip a ls gives the same output as ip a (checked with a diff). I edited my question with the ip r output of both machines I have access to. What kind of answer do you expect to your first question? – greg0ire Mar 09 '14 at 12:53

3 Answers


Unfortunately, most of what you're seeing is due to routing issues between external routers. They obtain and update their routing information dynamically to help route traffic around problematic areas, but when those routes change often (normally due to availability) it is called route flapping. That is being reflected down to you; normally, end users don't see any of this.
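To watch for that kind of flapping yourself, the mtr run suggested in the comments is a reasonable check (a sketch; assumes mtr is installed):

# Trace the path repeatedly, numerically (-n, no DNS lookups); hops that
# change address or drop packets between cycles indicate unstable routes.
mtr -n --report --report-cycles 60 api.twitter.com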

You could attempt to disable your route cache, as explained here (note the caveats; it's not something that seems to offer much upside), but I think you'd be better off just talking to the local network admin(s), as it seems it's their routing that is really unstable.
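A minimal sketch of both the one-off flush and the disable, assuming a RHEL 6-era (pre-3.6) kernel; the rt_cache_rebuild_count knob is specific to those kernels, and the route cache was removed entirely in kernel 3.6:

# One-off: throw away every cached route.
ip route flush cache

# More drastic: effectively disable the route cache; -1 tells these
# kernels to stop rebuilding it.
sysctl -w net.ipv4.rt_cache_rebuild_count=-1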

I am, of course, assuming that you are not the one responsible for network administration.

NickW
  • Does the `37.59.245.62` that shows up in the route cache indicate that `37.59.245.62` is having problems? I'm not the one mainly responsible for network administration (I'm a developer), but I have the power to make changes to it. – greg0ire Mar 07 '14 at 16:22
  • No, basically it means that you are receiving dynamic updates to your routes, and those routes may be becoming unavailable (for many reasons), leaving you with a bad cached route – NickW Mar 07 '14 at 16:37
  • 37.59.245.62 could be having issues, but I wouldn't point a finger at it immediately; it could just have issues with external routes changing frequently. What you really need is to be connected to a router that will keep you and your routing cache away from those updates. That, or the network admins need to ensure that routes with higher priorities are not unstable. – NickW Mar 07 '14 at 16:41
  • Regarding your second comment: `37.59.245.62` is not available to the app server; never has been, never will be. How can this entry even exist? That's what I don't understand... As [you explained to me before](http://serverfault.com/a/578182/50341), that would be because of ICMP messages. – greg0ire Mar 10 '14 at 10:42
  • Well, it could be ICMP, but now that I see a bit more information about your network, it looks like you are receiving dynamic routing updates directly on your machine, which is usually not a good idea when the network is as unstable as yours (because of the routing cache). Proper routers have methods for invalidating caches when they receive updates; I've not used Linux in those sorts of situations, so I can't say whether that is the case here. I'd personally ask your network admins to keep those updates to themselves. – NickW Mar 10 '14 at 10:54
  • What's strange is that on other machines of the VPN, `37.59.245.62` does not get into the cache. In fact, the only entries related to Twitter in those machines' caches point to the gateway, as they should. – greg0ire Mar 10 '14 at 10:57
  • That's why I think the server is receiving dynamic updates: somewhere, a route arrives that says `199.16.156.X/YY` is available via `37.59.245.62`, with a low enough priority to become the route of choice; then it goes away, but your server still has it cached. Like I said, this would be better handled further up the chain: explain the problem and let the network admins sort it out; they're the ones who should set up filters to prevent this sort of problem. – NickW Mar 10 '14 at 11:03
  • What I don't understand is the "`199.16.156.X/YY` is available via `37.59.245.62`" part. From the app server's point of view, this is just wrong, because `37.59.245.62` is not reachable from it (not directly). So the problem is not that it stays in the cache, but that it gets into the cache in the first place. – greg0ire Mar 10 '14 at 12:12
  • Well, it is the gateway for the VPN server (`default via 37.59.245.62`). With the additional information you've added to the question, I'm wondering if the VPN server is the one responsible for the change... does your VM host have a different routing table? – NickW Mar 10 '14 at 12:32
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/13492/discussion-between-greg0ire-and-nickw) – greg0ire Mar 10 '14 at 12:50
  • I'm going on holiday, and I think you deserve the bounty the most. Here it is. – greg0ire Mar 14 '14 at 19:49
  • problem solved today :) – greg0ire Apr 16 '14 at 17:24

Have someone, or yourself, take a look at the router/L3 device at 10.1.4.20. It looks like it might be receiving bad routes from an upstream peer that are then withdrawn and re-advertised.
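One way to confirm that theory from the affected host (a diagnostic sketch; assumes tcpdump is available):

# Print routing-table and cache changes the moment they are installed;
# an injected route or redirect for 199.16.156.x will show up here.
ip monitor route

# In parallel, capture any ICMP redirects arriving on the wire:
tcpdump -ni eth0 'icmp[icmptype] = icmp-redirect'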

Couradical
  • I have access to 10.1.4.20; it is the gateway of my VPN. What do you want me to do? – greg0ire Mar 07 '14 at 15:56
  • Oops, 10.1.4.20 is not the gateway; it is the app server. – greg0ire Mar 07 '14 at 15:59
  • Is this server multihomed? E.g., are there multiple NICs, and are they all plugged in? What about the network upstream? I'm taking a closer look at the traces, and it looks like you're receiving routes from 10.1.63.254 and 37.59.245.62. Are those both on the same network? If so, you should have routes directing traffic. – Couradical Mar 07 '14 at 18:32
  • PS: what does your network config look like? IP/netmask/gateway would help. – Couradical Mar 07 '14 at 18:35
  • I added `ifconfig` output for both the app server and the gateway. There is one NIC for the app server, but many NICs for the gateway. How can I know whether 10.1.63.254 and 37.59.245.62 are on the same network? Sorry for the n00bery; I'm a developer, not a sysadmin. – greg0ire Mar 07 '14 at 20:48
  • There has been a big edit to this question, maybe you want to have a look at it... – greg0ire Mar 11 '14 at 15:55

I asked this somewhere else, and it turns out the solution was to turn off ICMP redirects: the gateway was sending ICMP redirect messages pointing the app server at 37.59.245.62, and those redirects are what installed the public address in its routing cache.
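For anyone finding this later, the change boils down to a few sysctls; a minimal sketch (which host needs which knob is my reading of the topology, so adjust to taste):

# On the app server: refuse ICMP redirects, so a foreign next hop can
# never be installed in the routing cache.
sysctl -w net.ipv4.conf.all.accept_redirects=0
sysctl -w net.ipv4.conf.all.secure_redirects=0

# On the gateway: stop emitting redirects in the first place.
sysctl -w net.ipv4.conf.all.send_redirects=0

# Persist by adding the same keys to /etc/sysctl.conf and reloading
# with "sysctl -p".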

greg0ire