MAC address timing out of the Linux bridge

Question

We stumbled upon a weird behaviour in the Linux bridge, or alternatively in the image of the VMs. Everything works according to specification, but it should be tunable somehow, and I would like to know how to set it up properly.

Problem: the MAC address of the gateway ages out of the bridge's MAC address table (on the Linux hypervisor, but it could be any other box) because too few frames carry the gateway's virtual MAC as their source. Packets arriving from other subnets are sent from the gateway's hardware MAC address, so the virtual MAC address is only ever seen in ARP traffic (a request or a response from the gateway). ARP from the Linux virtual machines is therefore the only thing that refreshes the MAC entry, but the Linux machines never issue an ARP request while the entry is in use. The gateway is set to send ARP requests only for 'existing' IPs, which lowers the number of packets seen further, and the requests it sends at 75% of the timeout value are unicast rather than broadcast. As a result, the bridge floods frames addressed to the aged-out MAC to all interfaces, which increases the Rx packet count on the virtual machines. The problem occurs less often on hypervisors with a higher number of virtual machines, and is less of a problem on hypervisors with fewer virtual machines.

I would like to know whether there is a systematic way to prevent this from recurring, e.g. tuning the MAC address timeout (which I don't believe is a good idea) or forcing periodic unicast ARP requests from the VMs.
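To make the second idea concrete, here is a minimal sketch of what "forcing a unicast ARP request" from a VM could look like. It only assumes standard Ethernet/ARP framing; the function names, interface name, and addresses are hypothetical placeholders, and I have not verified that the gateway's reply actually carries the virtual MAC as its source in this setup. Sending requires Linux and raw-socket privileges.

```python
import socket
import struct


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_unicast_arp_request(src_mac: str, src_ip: str,
                              gw_mac: str, gw_ip: str) -> bytes:
    """Build an Ethernet frame holding an ARP request sent straight to gw_mac."""
    # Ethernet header: destination is the gateway MAC (unicast, not broadcast).
    eth_header = (mac_to_bytes(gw_mac) + mac_to_bytes(src_mac)
                  + struct.pack("!H", 0x0806))  # EtherType 0x0806 = ARP
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                          # hardware type: Ethernet
        0x0800,                     # protocol type: IPv4
        6, 4,                       # hardware / protocol address lengths
        1,                          # opcode 1 = request
        mac_to_bytes(src_mac),      # sender hardware address
        socket.inet_aton(src_ip),   # sender protocol address
        b"\x00" * 6,                # target hardware address left zeroed
        socket.inet_aton(gw_ip),    # target protocol address
    )
    return eth_header + arp_payload


def send_frame(ifname: str, frame: bytes) -> None:
    """Send a raw frame on ifname (Linux only, needs CAP_NET_RAW)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        sock.send(frame)


# Hypothetical usage, e.g. from a periodic job inside the VM:
# frame = build_unicast_arp_request("52:54:00:12:34:56", "192.0.2.10",
#                                   "00:00:5e:00:01:01", "192.0.2.1")
# send_frame("eth0", frame)
```

The bridge learns from source addresses, so the point of the request is the reply it provokes from the gateway; the request itself only refreshes the VM's own entry.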

Tune your MAC address table timeout to be slightly longer than the host ARP table timeout. — Ron Maupin, Jul 15 '20 at 12:46
Thank you, I have tried that, but it just suppresses the issue, right? When I extended the MAC address timeout to match the physical switches in the datacenter (30 min), the problem still recurred. Even with the timeout set to 1 hour the problem could be observed from time to time. Setting it to 5 hours looked fine, but I think 'my' value should not differ that much from that of the physical switches. — ILikeMatDotH, Jul 15 '20 at 13:09
It all depends on the activity. That is how switches work. We use something like 14,500 seconds for the MAC address tables in all our switches. — Ron Maupin, Jul 15 '20 at 13:27
Well, I thought that such a large timeout would waste resources. Thank you very much. Have you ever had a problem with it (something like CAM overflow)? — ILikeMatDotH, Jul 15 '20 at 13:53
How many hosts are in the broadcast domain? Switches can usually handle several thousand, but you should not have a broadcast domain that large. — Ron Maupin, Jul 15 '20 at 14:15
The broadcast domain is usually a /24 subnet (254 assignable addresses), but there are multiple virtual bridges for multiple networks, meaning this could be as much as ten times larger. — ILikeMatDotH, Jul 16 '20 at 05:41
The only time I ever heard of a switch running out of space was a question where the user had something like 16,000 hosts. — Ron Maupin, Jul 16 '20 at 13:01
I found the question to which I was referring: https://networkengineering.stackexchange.com/q/39868/8499 — Ron Maupin, Jul 18 '20 at 18:02
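
The suggestion in the comments (keep the MAC table timeout slightly longer than the ARP timeout) can be sketched roughly as follows. The helper names, the bridge name `br0`, and the 14,400 s ARP timeout used in the example are assumptions for illustration, not values from this setup; the sketch also assumes the bridge's sysfs `ageing_time` attribute is expressed in hundredths of a second (the 300 s default reads back as 30000). Writing the value needs root.

```python
def bridge_ageing_centiseconds(arp_timeout_s: int, margin_s: int = 100) -> int:
    """Return an ageing time slightly longer than the ARP timeout.

    The result is in centiseconds, the unit assumed here for the
    bridge's sysfs ageing_time attribute.
    """
    if arp_timeout_s <= 0 or margin_s <= 0:
        raise ValueError("timeout and margin must be positive")
    return (arp_timeout_s + margin_s) * 100


def sysfs_ageing_path(bridge: str) -> str:
    """Path of the ageing_time attribute for a given bridge device."""
    return f"/sys/class/net/{bridge}/bridge/ageing_time"


def apply_ageing_time(bridge: str, arp_timeout_s: int) -> None:
    """Write the computed ageing time to the bridge (requires root)."""
    value = bridge_ageing_centiseconds(arp_timeout_s)
    with open(sysfs_ageing_path(bridge), "w") as attr:
        attr.write(str(value))


# Hypothetical usage: an assumed ARP timeout of 14,400 s plus the default
# 100 s margin gives 14,500 s, the figure mentioned in the comments.
# apply_ageing_time("br0", 14400)
```

A fixed margin in seconds is used here rather than a multiplier so that the result stays close to the ARP timeout; whether that margin is enough depends on how reliably the ARP refresh actually produces a frame with the gateway's virtual MAC.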

0 Answers