
I am operating a router based on a Linux server running Debian stable (Buster). It uses Quagga to speak BGP4 to four peers (one of which sends the entire Internet routing table for IPv4 and IPv6, the others send significantly fewer routes).

About once or twice per day, the server loses IPv6 connectivity for about five minutes.

When this happens, the server appears unable to send any packets to IPv6 addresses. This seems to affect all addresses and interfaces: the main Ethernet adapter that connects to the Internet as well as the special "Ethernet-over-USB" interface that connects to the built-in management adapter (Lenovo XClarity controller). It can, however, ping ::1 as well as any of its own addresses (link-local and routable ones).

Also, "ip -6 neigh ls" doesn't show anything as "REACHABLE", only "STALE" or "DELAY". Nevertheless, tcpdump on the router itself doesn't appear to show any neighbor solicitation packets getting out. When I try to reach another machine on the same LAN, tcpdump on the target does not show any neighbor solicitation packets being received either.
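To make the neighbor-cache observation easier to log over time, the output of "ip -6 neigh ls" could be summarized per state. The following is only a minimal sketch: the function names are hypothetical, and it assumes the state is the last token on each line, as in the output I see on this system.

```python
from collections import Counter

# Neighbor (NUD) state names as printed by iproute2; assumed to be the
# last whitespace-separated token on each line of "ip -6 neigh" output.
NUD_STATES = {
    "PERMANENT", "NOARP", "REACHABLE", "STALE", "NONE",
    "INCOMPLETE", "DELAY", "PROBE", "FAILED",
}


def count_neigh_states(text):
    """Count neighbor entries per state in captured 'ip -6 neigh' output."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[-1] in NUD_STATES:
            counts[fields[-1]] += 1
    return counts


def looks_stuck(counts):
    """Heuristic (hypothetical): entries exist, but none is REACHABLE."""
    return sum(counts.values()) > 0 and counts["REACHABLE"] == 0


if __name__ == "__main__":
    # Made-up sample lines using documentation addresses, not real output.
    sample = (
        "fe80::1 dev eth0 lladdr 00:11:22:33:44:55 router STALE\n"
        "2001:db8::2 dev eth0 lladdr 00:11:22:33:44:66 DELAY\n"
    )
    counts = count_neigh_states(sample)
    print(dict(counts), "stuck:", looks_stuck(counts))
```

Running this once a minute against fresh output and logging the result with a timestamp would show whether the "no REACHABLE entries" condition starts before, with, or after the outage.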

This state lasts for about five minutes, after which everything returns back to normal, without any manual intervention.

IPv4 connectivity doesn't appear to be affected by this.

I tried to analyze this further by running diagnostic tools (ping, vmstat, perf record), saving their output, and correlating it with the times of the outages. Here's what I can say so far:

  • There does not appear to be any excessive amount of network traffic when the problem happens

  • There does not appear to be any kind of RAM or CPU usage spike

  • Normal operation of the Internet causes incremental routing table changes every once in a while, which Quagga applies; these do not appear to be correlated with the outages, which also happen after periods of relatively little change
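The correlation described above needs the start and end of each outage. One way to derive them from a timestamped ping log is sketched below. The input format (pairs of timestamp and success flag) and the threshold of three consecutive failures are assumptions for illustration, not what my actual scripts do.

```python
from datetime import datetime, timedelta


def outage_windows(samples, min_failures=3):
    """Return (first_failure, last_failure) for each run of failed probes.

    samples: iterable of (timestamp, ok) pairs in chronological order.
    Runs shorter than min_failures are treated as ordinary packet loss.
    """
    windows = []
    run = []
    for ts, ok in samples:
        if ok:
            if len(run) >= min_failures:
                windows.append((run[0], run[-1]))
            run = []
        else:
            run.append(ts)
    # A run that is still open at the end of the log also counts.
    if len(run) >= min_failures:
        windows.append((run[0], run[-1]))
    return windows


if __name__ == "__main__":
    # Synthetic data: one probe per minute, failures at minutes 3 to 7.
    t0 = datetime(2020, 6, 19, 12, 0, 0)
    samples = [(t0 + timedelta(minutes=i), not (3 <= i <= 7)) for i in range(12)]
    for start, end in outage_windows(samples):
        print("outage from", start, "to", end)
```

The resulting windows can then be compared against the timestamps in the vmstat and perf output.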

At any point in time, perf shows fib6_walk_continue as one of the top symbols; typically around 5% overhead. However, pretty much exactly at the time when the IPv6 connectivity stops, the following symbols get to the top:

  • fib6_walk_continue (around 30%)

  • native_queued_spin_lock_slowpath (around 10%)

  • fib6_age (around 10%)

Initially they all appear to belong to the "swapper" cmd. After about a minute, quagga notices that it can't reach the peers anymore and starts deleting IPv6 routes; when this happens, the same three symbols appear in the perf output as belonging to zebra.

Pretty much exactly when the normal perf output returns (with intel_idle at the top), connectivity comes back.
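To spot this shift automatically, the combined overhead of the three symbols could be extracted from saved perf reports. This is a rough sketch only: it assumes the default text layout of "perf report --stdio" (overhead, command, shared object, symbol), does not handle command names containing spaces, and uses hypothetical function names.

```python
import re

# Matches lines such as:
#   "    30.00%  swapper  [kernel.kallsyms]  [k] fib6_walk_continue"
PERF_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(\S+)")

# The three symbols that rise to the top during an outage.
WATCHED = {"fib6_walk_continue", "fib6_age", "native_queued_spin_lock_slowpath"}


def parse_perf_report(text):
    """Return a list of (overhead_percent, command, symbol) tuples."""
    rows = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip header and comment lines
        m = PERF_LINE.match(line)
        if m:
            rows.append((float(m.group(1)), m.group(2), m.group(5)))
    return rows


def watched_overhead(rows, watched=WATCHED):
    """Sum the overhead of the watched symbols across all commands."""
    return sum(pct for pct, _comm, sym in rows if sym in watched)


if __name__ == "__main__":
    # Made-up report using the rough percentages from the description above.
    report = (
        "# Overhead  Command  Shared Object      Symbol\n"
        "    30.00%  swapper  [kernel.kallsyms]  [k] fib6_walk_continue\n"
        "    10.00%  swapper  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath\n"
        "    10.00%  zebra    [kernel.kallsyms]  [k] fib6_age\n"
        "     2.00%  swapper  [kernel.kallsyms]  [k] intel_idle\n"
    )
    print(watched_overhead(parse_perf_report(report)))
```

A threshold somewhere between the normal level (around 5%) and the outage level could then serve as a trigger for collecting additional data while the problem is still ongoing.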

Has anyone seen something like this before?

Software: Debian Buster with the latest packages, specifically linux-image-4.19.0-9-amd64 and quagga-core as well as quagga-bgpd 1.2.4-3

Hardware: Lenovo SR550 with 6-core "Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz" and 32 GB RAM

Edit: This question was closed by ServerFault with the comment "Questions seeking installation, configuration or diagnostic help must include the desired end state, the specific problem or error, sufficient information about the configuration and environment to reproduce it, and attempted solutions."

I believe "the specific problem or error, sufficient information about the configuration and environment to reproduce it" have already been provided in the original description.

"the desired end state" would be that the machine doesn't lose IPv6 connectivity.

For "attempted solutions": For now, I've reverted the Linux kernel to the version from Debian 9 "Stretch" (linux-image-4.9.0-13-amd64 version 4.9.228-1), while keeping the rest of the packages at their current versions in Debian 10 "Buster".

So far, the symptoms have stopped.

If this persists for a few weeks, I will assume the behaviour is a Linux kernel bug introduced somewhere between Linux 4.9 (from Stretch) and 4.19 (from Buster), and check whether it has already been fixed in a later kernel version.

ftc
  • Is it the same five minutes every day? Is something else going on at that time? cron jobs, etc? – Michael Hampton Jun 19 '20 at 19:54
  • @Michael: no, “about once per day” is just a rough estimate of the frequency (more often than once a week, less often than once per hour). I cannot find anything that might cause it; there definitely is no correlation to cron jobs or anything else I can see. – ftc Jun 20 '20 at 20:21
  • Can you get the outputs of `ip -6 a` (are there IPv6 addresses yet?) and `ip -6 r` during such a phase? – Hauke Laging Jun 21 '20 at 00:06
  • "ip -6 a" gives me the same list no matter whether I can reach anything via IPv6 or not. "ip -6 r" gives me a long list of routes (>80'000 lines) which fluctuates over time as Quagga adapts to the current state of the Internet. I've checked that there are NO changes relevant to link-local routes nor to static routes to hosts on the same Ethernet as the router; any and all differences in the output of "ip -6 r" are routes on the Internet marked with "proto zebra" as coming from Quagga. – ftc Jun 22 '20 at 10:44
  • Thank you very much for the idea about getting the routing table! I'll start a job that counts the number of changed routes over time and correlate that with the time of the outage; maybe the connectivity loss happens when Quagga updates a (relatively high) number of routes? – ftc Jun 22 '20 at 10:52
  • Negative - the outages do NOT happen after any significant number of routing table changes. Typically, they even happen during very quiet phases. – ftc Jun 26 '20 at 10:29
  • For logging purposes, you should keep a background command `ip -ts monitor > /path/to/ipmonitor.log` running. This logs every network change as it happens, including neighbour caches, routes etc., so it could be quite verbose. It might help later in figuring out what's happening. – A.B Jun 26 '20 at 10:45
  • If you respond to comments then you should address the author with @. If you do not do that then the author is not notified. Thus I didn't notice your responses for days. You got notified without such addressing because the comments belong to your question. – Hauke Laging Jun 26 '20 at 21:05
  • @HaukeLaging - thanks, I didn't know that. – ftc Jun 27 '20 at 12:28
  • @A.B - thank you very much; I started this now and will analyze it the next time the problem happens. – ftc Jun 27 '20 at 12:28
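
Once such a monitor log exists, the events around an outage could be pulled out with something like the sketch below. It assumes that each event line written by `ip -ts monitor` starts with a bracketed ISO timestamp with microseconds, such as `[2020-06-27T12:28:01.123456]`; lines without that prefix (for example continuation lines) are skipped. The function name and the one-minute margin are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix of "ip -ts monitor": "[YYYY-MM-DDTHH:MM:SS.ffffff] "
TS_PREFIX = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{6})?)\]\s*(.*)$"
)


def events_near(log_text, start, end, margin=timedelta(minutes=1)):
    """Return (timestamp, event) pairs logged within the outage window.

    The window is widened by `margin` on both sides so that events
    shortly before and after the outage are included.
    """
    hits = []
    for line in log_text.splitlines():
        m = TS_PREFIX.match(line)
        if not m:
            continue  # no timestamp prefix: not the start of an event
        ts = datetime.fromisoformat(m.group(1))
        if start - margin <= ts <= end + margin:
            hits.append((ts, m.group(2)))
    return hits


if __name__ == "__main__":
    # Made-up log lines using documentation addresses.
    log = (
        "[2020-06-27T12:00:00.000001] 2001:db8:1::/48 via fe80::1 dev eth0 proto zebra\n"
        "[2020-06-27T12:29:30.500000] fe80::1 dev eth0 lladdr 00:11:22:33:44:55 STALE\n"
        "[2020-06-27T12:31:00.000000] Deleted 2001:db8:2::/48 via fe80::1 dev eth0 proto zebra\n"
    )
    start = datetime(2020, 6, 27, 12, 30, 0)
    end = datetime(2020, 6, 27, 12, 35, 0)
    for ts, event in events_near(log, start, end):
        print(ts, event)
```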
