KVM QEMU VMs with static IP lose network connectivity

Question

I have a KVM/QEMU setup with both host and guest (VM) running Ubuntu 18.04 LTS with bridged networking. VMs configured with a static IP lose network connectivity at random (there is no pattern). VMs configured with DHCP work fine.

Here is my host network config,

network:
    version: 2
    ethernets:
        eno1:
            dhcp4: no
            dhcp6: no
        eno4:
            dhcp4: true
        eno5np0:
            dhcp4: true
        eno6np1:
            dhcp4: true
        ens2f0np0:
            dhcp4: true
        ens2f1np1:
            dhcp4: true
    bridges:
        br0:
            interfaces: [eno1]
            dhcp4: no
            addresses:
            - 10.2.0.92/24
            gateway4: 10.2.1.252
            nameservers:
                addresses:
                - 8.8.8.8

Here is my vm (guest) network config with static IP,

network:
    version: 2
    ethernets:
            ens3:
                    dhcp4: no
                    addresses:
                    - 10.2.0.210/23
                    gateway4: 10.2.1.252
                    nameservers:
                        addresses:
                        - 8.8.8.8

Here is my vm (guest) network config with DHCP,

network:
    version: 2
    ethernets:
            ens3:
                    dhcp4: true

VMs with a static IP go into a kind of idle state. Whenever I try to SSH in or access a service on one, the connection takes a long time or times out before it eventually connects,

$ nc -z -v -w5 10.2.0.210 22
nc: connect to 10.2.0.210 port 22 (tcp) timed out: Operation now in progress

Trying again works, apparently because the first attempt moved the VM from the idle state to a working state,

$ nc -z -v -w5 10.2.0.210 22
Connection to 10.2.0.210 22 port [tcp/ssh] succeeded!

There is no issue with VMs that use DHCP. They connect fine at any time,

$ nc -z -v -w5 10.2.0.184 22
Connection to 10.2.0.184 22 port [tcp/ssh] succeeded!

I have checked several related links, but they didn't help.

Is there an issue in the KVM configuration? It is not only SSH: no service exposed by the VMs is accessible either. I have verified that the VMs are in the running state when I query virsh.

Have you thought to check the kvm development bug tracker? This may be an issue they are aware of and if not, they may have a better inkling of what's causing the problem. — Rowan Hawkins, Mar 26 '20 at 20:33
One rather basic issue I see is that your gateway on br0 is not within the address scope: you define a /24 instead of a /23. That doesn't explain why it works sometimes, though. — Rowan Hawkins, Mar 26 '20 at 20:39
DHCP also hands out the IP with a /23. I reused the DHCP-assigned IP as the static IP. — jaks, Mar 29 '20 at 09:01
You missed what I was pointing out. The host can have a different mask and network than the guest, I get that. What isn't set correctly is that the host's gateway isn't within the network assigned to the host. You are having a strange networking issue and you have this wrong network setting. It would make sense to fix this to ensure that it isn't impacting networking from the host in general. — Rowan Hawkins, Mar 30 '20 at 08:04
Sounds like the bridge is having problems with its MAC address forwarding table. When a DHCP client acquires an address it does a broadcast, and then the MAC is known. Do these events correlate with VMs going up or down? Or maybe there is link-layer communication from the switch the bridge is connected to. Probably just sending out a couple of bytes with a ping from the VM with the static address would also build its entry in the forwarding table. — Gerrit, Mar 30 '20 at 21:29
@Gerrit the VM was working fine for 6 hours and after that it became unreachable. After that, connecting to the VM via the console and pinging Google for some time brings the VM back onto the network. — jaks, Apr 02 '20 at 03:41
@RowanHawkins I think you are right. Now I changed my network settings to 10.2.0.0/16 CIDR. Static IPs are set with /16 prefix length and haven't faced issue for 1 day. I will monitor this for a couple of days — jaks, Apr 02 '20 at 03:42
@jaks all you need is a /23. A /16 describes all addresses from 10.2.0.0 to 10.2.255.255, so it massively increases your broadcast domain. 10.2.0.92/23 would encompass 10.2.0.0 to 10.2.1.255, just like the addresses of your VMs. The issue isn't overlapping spaces, but that with the /24 the host's gateway is not reachable by the host. If you were to ping guest -> host, the packet would leave the guest, hit the router, and then hit the host, but the reply path back from the host would fail because the host's gateway is not within 10.2.0.0 to 10.2.0.255. — Rowan Hawkins, Apr 02 '20 at 09:23

score 0 · Accepted Answer · answered Apr 02 '20 at 10:06

One rather basic issue I see is that your gateway on br0 is not within the address scope. You define a /24 instead of a /23.

  br0:
            interfaces: [eno1]
            dhcp4: no
            addresses:
            - 10.2.0.92/24    # <-- /24 is the problem
            gateway4: 10.2.1.252
            nameservers:
                addresses:
                - 8.8.8.8

You just need to change /24 to /23.

10.2.0.92/23 would encompass 10.2.0.0 to 10.2.1.255, just like the addresses of your VMs. The issue isn't overlapping spaces, but that with the /24 the host's gateway is not reachable by the host.
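The address ranges above can be checked with a few lines of Python. This is only a quick sanity check using the standard `ipaddress` module and the addresses from the question; the helper name `summarize` is made up for illustration.

```python
import ipaddress

# Addresses taken from the netplan configs in the question.
host_24 = ipaddress.ip_interface("10.2.0.92/24")   # host as currently configured
host_23 = ipaddress.ip_interface("10.2.0.92/23")   # host with the suggested prefix
gateway = ipaddress.ip_address("10.2.1.252")

def summarize(iface):
    """Return (first address, last address, is the gateway inside this network?)."""
    net = iface.network
    return (str(net.network_address), str(net.broadcast_address), gateway in net)

print(summarize(host_24))  # ('10.2.0.0', '10.2.0.255', False)
print(summarize(host_23))  # ('10.2.0.0', '10.2.1.255', True)
```

With the /24 the gateway 10.2.1.252 falls outside the host's network; with the /23 it falls inside.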

If you were to ping from guest -> host... The packet would leave the guest and get broadcast, because the host is within the netmask of the guest's current network. The host would receive the packet. The host would send a reply. Because a destination of x.x.1.x is not on the host's current network of x.x.0.x, the packet would normally be routed to the gateway. Oh wait, the gateway isn't on x.x.0.x either. The packet would go nowhere.
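The reply-path reasoning above can be sketched as a simplified next-hop decision. This is a toy model under stated assumptions, not how the Linux kernel or netplan actually installs or evaluates routes; the function name `next_hop` and the address 10.2.1.50 are hypothetical, chosen only to stand in for "some machine in x.x.1.x".

```python
import ipaddress

def next_hop(local_iface, gateway, destination):
    """Toy model: decide how a host would try to deliver a packet.

    Assumes a single interface and a single default gateway.
    """
    iface = ipaddress.ip_interface(local_iface)
    gw = ipaddress.ip_address(gateway)
    dst = ipaddress.ip_address(destination)
    if dst in iface.network:
        return "on-link"        # deliver directly, resolving the MAC via ARP
    if gw in iface.network:
        return "via gateway"    # hand the packet to the default gateway
    return "unreachable"        # the gateway itself is not on the local network

GATEWAY = "10.2.1.252"
for prefix in ("10.2.0.92/24", "10.2.0.92/23"):
    for dst in ("10.2.0.210", "10.2.1.50", "8.8.8.8"):  # 10.2.1.50 is hypothetical
        print(prefix, "->", dst, ":", next_hop(prefix, GATEWAY, dst))
```

In this model, 10.2.0.210 itself still counts as on-link for the /24 host, so the failure shows up for addresses in 10.2.1.x and for anything that has to go through the gateway. With the /23, those destinations resolve to "on-link" and "via gateway" respectively.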

Remember, packets are not smart. They only go exactly where you tell them.

A part of this that I didn't cover above is ARP. @Gerrit's comment above addresses that as well. When packets are sent within the collision domain, they travel by MAC address, not IP. When 10.2.0.210/23 sends a packet to 10.2.0.92, it first sends a broadcast asking who has 10.2.0.92, and the outbound packets are then sent directly to 10.2.0.92. I'm not sure whether the guest will get the reply or not. It may, since the ARP reply has the requester's MAC in it. The guest will then add that information to its own ARP table.

On the reply, though, the host won't have the MAC address of a guest in 10.2.1.x, because for the host such a guest lies outside its /24 collision domain. The reply would normally go to the gateway to get routed, but it can't do that either, because the host's gateway is also not in the local network. Gateways that are routers can't route packets back down the wire they came in on; the packet would need to traverse the device. Even if it could reach the gateway, it would probably get dropped.

What I find more interesting is that it works sometimes. Netmasks, broadcast domains, and collision domains only affect the sender of packets, not the receivers. Possibly, because the guest is a virtual machine on the host, some packets go through the virtual switch and are seen that way.

For the past two days there have been no issues after changing all the network settings to 10.150.0.0/16. — jaks, Apr 03 '20 at 11:04
