This is a problem I've been somewhat ignoring for a few years now.

I have a Debian stable server running Linux 3.16.0-4-amd64. Anywhere from a few minutes to a few hours after booting, the server loses outbound network connectivity and stops responding to SSH and ping. Open SSH connections hang. I have 3 KVM-based virtual machines running on that host, though, and they can literally run for years with no connectivity problems whatsoever. I can also reboot them.

/etc/network/interfaces:

auto lo
iface lo inet loopback

iface eth0 inet manual

auto br0
iface br0 inet static
    address xxx.xxx.xxx.6
    netmask 255.255.255.0
    network xxx.xxx.xxx.0
    broadcast xxx.xxx.xxx.255
    gateway xxx.xxx.xxx.1
    bridge_ports eth0
    bridge_stp off
    bridge_maxwait 0
    bridge_fd 0

The journal doesn't show anything interesting. The only network-related message in it is the following, and it usually comes 10–15 minutes after booting, but potentially hours before the disconnect:

kernel: br0: Multicast hash table maximum of 512 reached, disabling snooping: eth0
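
For context, this message comes from the bridge's multicast (IGMP) snooping code, and 512 is the kernel's default limit for that table. A minimal sketch for reading the relevant settings, assuming the bridge is called br0 as in the config above and that the kernel exposes the multicast_snooping and hash_max attributes under /sys/class/net/br0/bridge. The function name bridge_mcast_settings and its optional second argument (an alternative sysfs root, handy for dry runs) are made up for this illustration:

```shell
#!/bin/sh
# Print the multicast snooping settings of a bridge.
# $1: bridge name (default: br0)
# $2: sysfs network root (default: /sys/class/net); hypothetical knob for dry runs
bridge_mcast_settings() {
    bridge="${1:-br0}"
    sysfs_root="${2:-/sys/class/net}"
    dir="$sysfs_root/$bridge/bridge"

    for attr in multicast_snooping hash_max; do
        if [ -r "$dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dir/$attr")"
        else
            # Attribute missing: not a bridge, or the kernel lacks snooping support
            printf '%s=unavailable\n' "$attr"
        fi
    done
}
```

Calling bridge_mcast_settings br0 on the host prints one name=value line per attribute. A multicast_snooping value of 0 after the kernel message would be consistent with snooping having been switched off.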

The output of route -n and ip addr doesn't change when the disconnect happens. ping 8.8.8.8 says:

From OWN_IPv4 icmp_seq=1 Destination Host Unreachable
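
A Destination Host Unreachable reply that comes from the host's own address usually means the kernel could not resolve the next hop at layer 2, so the neighbor (ARP) table is worth looking at while the problem is occurring. A rough sketch, assuming iputils ping and iproute2 ip are installed; check_gateway is a hypothetical helper name, and the gateway address is passed in as an argument rather than hard-coded:

```shell
#!/bin/sh
# Ping the gateway once; if that fails, print the matching neighbor (ARP) entry.
# $1: gateway IPv4 address
check_gateway() {
    gateway="$1"
    if ping -c 1 -W 2 "$gateway" >/dev/null 2>&1; then
        echo "gateway $gateway: reachable"
    else
        echo "gateway $gateway: unreachable"
        # A FAILED or INCOMPLETE entry here points at ARP resolution
        ip neigh show to "$gateway"
    fi
}
```

Since SSH stops working during the outage, something like this would have to run from a local console or from a cron job that logs to a file.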

Disabling IPv6 (which I don't currently use) didn't help.

Edit: This happens regardless of whether the virtual machines are running or not. I just found it curious that they have connectivity while the host has none, that's why I mentioned them. There shouldn't be any traffic except the occasional SSH scanning.

Adrian Heine
  • It would be good to know what network load these virtual machines are maintaining on eth0, along with ingress/egress bandwidth limits. For example, I once received a case where an FTP server stopped responding to connection requests, and we discovered that it was sending so much outbound traffic through established connections that there was literally no bandwidth left to acknowledge retransmits or complete handshakes properly (safety tip: don't run an FTP server on an asymmetric DSL connection!) – George Erhard Feb 08 '17 at 23:52

1 Answer

The machine has not lost network connectivity for 16 hours straight, so I am pretty sure it is ›fixed‹. What I did was boot with an /etc/network/interfaces file that does not define the bridge:

auto lo
iface lo inet loopback

auto eth0
allow-hotplug eth0
iface eth0 inet static
    address xxx.xxx.xxx.6
    netmask 255.255.255.0
    network xxx.xxx.xxx.0
    broadcast xxx.xxx.xxx.255
    gateway xxx.xxx.xxx.1
    # dns-* options are implemented by the resolvconf package, if installed
    dns-nameservers SOME_IP SOME_OTHER_IP

After two hours (just to be sure), I copied over the /etc/network/interfaces from the question and ran:

ip address flush eth0 scope global && ifup br0
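
To double-check that eth0 really ended up as a port of br0 after those two commands, the bridge's brif directory in sysfs can be inspected. A small sketch; bridge_has_port and its optional third argument (an alternative sysfs root) are names invented here:

```shell
#!/bin/sh
# Succeed if the given interface is a port of the given bridge.
# $1: bridge name, $2: port name
# $3: sysfs network root (default: /sys/class/net); hypothetical knob for dry runs
bridge_has_port() {
    bridge="$1"
    port="$2"
    sysfs_root="${3:-/sys/class/net}"
    # The kernel lists each enslaved interface as an entry under <bridge>/brif/
    [ -e "$sysfs_root/$bridge/brif/$port" ]
}
```

For example, bridge_has_port br0 eth0 && echo "eth0 is attached to br0" prints the message only if the bridge picked up the interface.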

After four minutes, the multicast hash table filled up again, but I did not care about that. After another two hours, I booted up the virtual machines.

So, apparently, booting with the bridge makes the system lose connectivity after a varying amount of time, whereas adding the bridge after booting seems to work. I have no idea why that is the case, though.

Adrian Heine