I am operating a router based on a Linux server running Debian stable (Buster). It uses Quagga to speak BGP4 to four peers (one of which sends the entire Internet routing table for IPv4 and IPv6, the others send significantly fewer routes).
About once or twice per day, the server loses IPv6 connectivity for about five minutes.
When this happens, it appears the server cannot send any packets to IPv6 addresses. This seems to affect all addresses and interfaces: the main Ethernet adapter that connects to the Internet as well as the special "Ethernet-over-USB" interface that connects to the built-in management adapter (Lenovo XClarity controller). It can, however, ping ::1 as well as any of its own addresses (link-local and routable ones).
Also, "ip -6 neigh ls" doesn't show anything as "REACHABLE", only "STALE" or "DELAY". Nevertheless, tcpdump on the router itself doesn't appear to show any neighbor solicitation packets getting out. When I try to reach another machine on the same LAN, tcpdump on the target does not show any neighbor solicitation packets being received either.
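To make the neighbor-table observation above easier to log over time, here is a minimal Python sketch that counts entries per state. It assumes the state is the last field of each line printed by "ip -6 neigh show"; the sample text, the addresses in it, and the function names are made up for illustration and are not taken from the affected router.

```python
import subprocess
from collections import Counter

# Hypothetical sample in the shape of `ip -6 neigh show` output.
# Addresses and MACs are documentation placeholders, not real data.
SAMPLE = """\
fe80::1 dev eth0 lladdr 00:00:5e:00:53:01 router STALE
2001:db8::10 dev eth0 lladdr 00:00:5e:00:53:02 DELAY
2001:db8::20 dev eth0 lladdr 00:00:5e:00:53:03 REACHABLE
"""


def neighbor_states(text):
    """Count neighbor entries by state (assumed to be the last field)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if fields:
            counts[fields[-1]] += 1
    return counts


def looks_stuck(text):
    """True if the table has entries but none of them is REACHABLE."""
    counts = neighbor_states(text)
    return sum(counts.values()) > 0 and counts["REACHABLE"] == 0


def current_neighbor_table():
    """Read the live table; needs the iproute2 `ip` tool on the host."""
    result = subprocess.run(
        ["ip", "-6", "neigh", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(dict(neighbor_states(SAMPLE)))
    print(looks_stuck(SAMPLE))
```

Calling looks_stuck(current_neighbor_table()) from a timestamped loop would give a record of when the table contains no REACHABLE entries, which can then be lined up with the other measurements.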
This state lasts for about five minutes, after which everything returns back to normal, without any manual intervention.
IPv4 connectivity doesn't appear to be affected by this.
I tried to analyze this some more by running analysis tools (ping, vmstat, perf record), saving their output, and correlating it with the time of the outages. Here's what I can say so far:
There does not appear to be any excessive amount of network traffic when the problem happens.
There does not appear to be any kind of RAM or CPU usage spike.
Normal operation of the Internet causes incremental routing table changes every once in a while, which Quagga applies. These do not appear to be correlated with the outages; outages also happen after periods of relatively little change.
At any point in time, perf shows fib6_walk_continue as one of the top symbols; typically around 5% overhead. However, pretty much exactly at the time when the IPv6 connectivity stops, the following symbols get to the top:
fib6_walk_continue (around 30%)
native_queued_spin_lock_slowpath (around 10%)
fib6_age (around 10%)
Initially they all appear to belong to the "swapper" command. After about a minute, Quagga notices that it can't reach the peers anymore and starts deleting IPv6 routes; when this happens, the same three symbols appear in the perf output as belonging to zebra.
Pretty much exactly when the normal perf output returns (with intel_idle at the top), connectivity comes back.
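The time correlation described above can be sketched roughly as follows in Python. This is only an illustration under assumptions: it takes already-parsed probe results as (timestamp, success) pairs, and the function names, the 10-second slack, and the sample timestamps are all hypothetical rather than my actual measurements.

```python
from datetime import datetime, timedelta


def outage_windows(samples):
    """Group consecutive failed probes into (first_failure, last_failure).

    samples: iterable of (datetime, bool) pairs in chronological order,
    where the bool says whether the probe (e.g. one ping) succeeded.
    """
    windows = []
    start = last = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            windows.append((start, last))
            start = last = None
    if start is not None:
        # The log ended while probes were still failing.
        windows.append((start, last))
    return windows


def falls_in(window, ts, slack=timedelta(seconds=10)):
    """True if another tool's timestamp lies within the window, plus slack."""
    start, end = window
    return start - slack <= ts <= end + slack


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 12, 0, 0)  # made-up start time
    results = [True, False, False, False, True]
    samples = [(t0 + timedelta(minutes=i), ok) for i, ok in enumerate(results)]
    for start, end in outage_windows(samples):
        print(start.isoformat(), end.isoformat())
```

Timestamps from the vmstat or perf logs can then be passed to falls_in to check whether a given sample was taken during an outage window.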
Has anyone seen something like this before?
Software: Debian Buster with the latest packages, specifically linux-image-4.19.0-9-amd64 and quagga-core as well as quagga-bgpd 1.2.4-3
Hardware: Lenovo SR550 with 6-core "Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz" and 32 GB RAM
Edit: This question was closed by ServerFault with the comment "Questions seeking installation, configuration or diagnostic help must include the desired end state, the specific problem or error, sufficient information about the configuration and environment to reproduce it, and attempted solutions."
I believe "the specific problem or error, sufficient information about the configuration and environment to reproduce it" were already provided in the original description.
"the desired end state" would be that the machine doesn't lose IPv6 connectivity.
For "attempted solutions": For now, I've reverted the Linux kernel to the version from Debian 9 "Stretch" (linux-image-4.9.0-13-amd64 version 4.9.228-1), while keeping the rest of the packages at their current versions in Debian 10 "Buster".
So far, the symptoms have stopped.
If this holds for a few weeks, I will assume that the behaviour is a Linux kernel bug introduced somewhere between Linux 4.9 (from Stretch) and 4.19 (from Buster), and check whether it has already been fixed in a later kernel version.