I'm running SLES 12 SP3 in a production environment. There are several interfaces configured on two separate networks. Things work fine for a while (days at a time), and then, for no apparent reason, one or two of the gateways get dropped from the route table. Nothing in the logs (/var/log/messages) indicates why. The ARP table still shows entries from the interfaces in question to the router IP addresses.

I know the Linux kernel does route table garbage collection, and there are tunables related to this. But the route table is nowhere near full: there are fewer than 50 entries in it. Are there other events that cause the Linux kernel to remove a gateway from the route table? Are there other places on the system I should look for clues as to why the route to the router was removed?

Thanks in advance.

jetson23
  • Do you use any routing daemons or DHCP clients? – eckes Jun 10 '19 at 07:05
  • DHCP is not in play, all IP addresses on the host in question are static. I am not running any routing daemons that I know of. – jetson23 Jun 12 '19 at 11:47
  • @eckes - Just curious whether you were implying anything specific by mentioning DHCP (or a routing daemon)? I am trying to determine whether DHCP is definitely disabled on my system. I'm also suspicious of wicked and nanny doing things unexpectedly. – jetson23 Jun 21 '19 at 02:24
  • No, it's just that the only cause I can imagine is something in user space, and DHCP clients and routing daemons are the only ones that touch the routing table by default. – eckes Jun 21 '19 at 02:26
  • Thanks @eckes. Can you give some examples of routing daemons that might do this? Are there standard routing daemons that come with SLES that I can look for on my system? – jetson23 Jun 21 '19 at 13:19
  • Maybe quagga/zebra, systemd-networkd, bird, exabgp. Not sure whether keepalived or vrrpd could also mess with the routing table. – eckes Jun 21 '19 at 15:10

1 Answer

Linux dropped the routing cache for IPv4 (only) in kernel 3.6. This is described, for example, in David Miller's "routing cache is dead, now what?". IPv4 lookups now rely solely on the LPC-trie for performance. So, as far as I understand, there is no route garbage collection for IPv4 on SLES 12, which should have at least kernel 3.12, if not higher.

You could keep the command `ip -ts monitor` running and log its output for later analysis to find out what's going on, especially around the time the route disappeared. For example, perhaps an address also disappeared and then reappeared, leaving the route lost?
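
A minimal sketch of such logging, assuming one background monitor per object type (link, address, route), each appending to its own file. The function name `start_ip_monitors`, the variable `LOG_DIR`, and the default log path are hypothetical choices for illustration, not anything SLES provides.

```shell
#!/bin/sh
# Sketch: record timestamped netlink events so a disappearing route can be
# correlated with link or address changes afterwards.
# LOG_DIR and start_ip_monitors are hypothetical names; adjust to taste.
LOG_DIR="${LOG_DIR:-/var/log/ip-monitor}"

start_ip_monitors() {
    mkdir -p "$LOG_DIR" || return 1
    for obj in link address route; do
        # -ts prefixes every event with a short timestamp.
        # One monitor per object type, each appending to its own log file.
        ip -ts monitor "$obj" >>"$LOG_DIR/$obj.log" 2>&1 &
    done
}
```

Calling `start_ip_monitors` from a script launched with `nohup`, or from a service unit, should keep the monitors alive after logout. The log files grow without bound, so some form of rotation is worth considering.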

A.B
  • But if several interfaces use the same *IP LAN* on the same Ethernet LAN, expect trouble anyway (ARP flux, etc.) unless you configure the system for it. That shouldn't make routes disappear from the routing table, though. – A.B Jun 07 '19 at 21:59
  • Thanks for your response @A.B. I've been playing with "ip -ts monitor" and that may help me figure out what exactly is going on at the time, since there's no real indication of if/when this is going to happen. Could you explain more about your second comment? The ARP table looks ok after the fact. Both routers have an entry in the ARP cache for their appropriate interfaces. I was trying to paste the ARP table but the response is too long. I guess I'm wondering what the issues might be? Thanks. – jetson23 Jun 08 '19 at 20:20
  • About "ARP flux" (which you can also search for on the internet): you have to check on the peer systems (including routers) that the IP of each of your system's interfaces has its own ARP entry with that interface's MAC. Usually a single card's MAC is present for all IPs, which forces all incoming traffic through a single card. Sometimes you even have to disable rp_filter because of this. That's not a problem unless you rely on bandwidth or redundancy. Of course there are settings to make it behave as intended, but it's not trivial (there are questions on SF or UL SE about it, with two categories of answers for the fix). – A.B Jun 09 '19 at 11:00
  • Thanks @A.B. I did some reading on ARP flux, and as it turns out, according to the sources I read, we had already set our sysctl parameters to prevent this issue (arp_announce = 2 and arp_ignore = 1). So I don't think that's what's getting us. Going back to "ip -ts monitor": from the output I see that routes on our system become "STALE", and then are "PROBE"d and set to "REACHABLE", which I assume means they responded to the probe. Would a route be removed if the router didn't respond? What happens if there's heavy traffic and the response packet happens to get dropped? – jetson23 Jun 10 '19 at 19:48
  • Run these instead: `ip -ts monitor link`, `ip -ts monitor address`, `ip -ts monitor route`. STALE, REACHABLE, etc. are neighbour (ARP) states, as shown by `ip neighbour`, and thus are not about routes. – A.B Jun 10 '19 at 21:11
  • And for a specific interface `ip -ts monitor route dev INTERFACE_NAME`. – SebMa Jun 23 '22 at 09:41
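
Once output from the monitors suggested above has been captured to a file, the removal events can be picked out afterwards. The sketch below assumes the log holds raw `ip -ts monitor` output, where `ip` prefixes removal events with the word `Deleted` and `-ts` puts a bracketed timestamp in front of each line. The function name `deleted_events` is hypothetical.

```shell
#!/bin/sh
# Sketch: read `ip -ts monitor` output on stdin and print only removal events.
# deleted_events is a hypothetical helper name.
deleted_events() {
    # Match "Deleted " at the start of a line or right after the "] " that
    # closes the timestamp added by -ts.
    grep -E '(^|\] )Deleted '
}
```

For example, `deleted_events < route.log` (with `route.log` being wherever the route monitor output was saved) would list the route removals with their timestamps, which can then be compared against the address and link logs for the same moment.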