
I'm working on setting up a server with KVM/QEMU, with all of the servers running Linux. We are going to use this server for web development, git, a VoIP PBX, etc. (We were using XenServer and Windows Server 2016, but I'm a Linux fan.) I've come across an issue where virtual machines seemingly at random lose their network connection, or go to sleep, or something like that. I can't seem to pin down what the problem is.

I've looked through a lot of forums and posts, even here on Server Fault, but nothing quite fits what I'm trying to do. I'll attach an image below of our network setup. We have 2 locations and a VPN between them, with firewalls. The machine in question is a Dell PowerEdge R710. I've successfully installed Ubuntu 18.10 and KVM/QEMU on it as the host OS (18.10 because of an issue with Virtual Machine Manager not showing all network connections in 18.04). I use Virtual Machine Manager (virt-manager) to manage installing and monitoring new VMs from my laptop (Dev Computer 1) over ssh.

I have 6 guest VMs, all running either Ubuntu 18.04 or Debian 9 (our VoIP PBX), and they all work great except for the occasional network hiccup. All of them, including the host itself, are connected through a bonded bridge on the host machine: all 4 NICs are bonded, and the bond is used as the interface for the bridge. I'm using netplan for the network configuration; the host's config YAML is posted below. The guest VMs all use static IP configurations that simply set an IP on the default "ens3" interface through netplan, but I can post those too if it will help.
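
For reference, a stripped-down sketch of one of those guest configs looks roughly like this (the address here is a placeholder, not one of the real ones):

network:
    version: 2
    renderer: networkd
    ethernets:
        ens3:
            dhcp4: false
            addresses: [192.168.5.30/24]    # placeholder, not a real guest address
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]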

Some interesting things I've noticed:

  1. I can always ssh into the host machine; it never seems to lose its connection.
  2. When one of the 6 machines loses its network connection, I can still ssh into it from the host machine, but it will sometimes hang for a bit while reestablishing the connection.
  3. If I ssh into the offending VM from the host and ping the gateway (firewall), it snaps out of it and we can connect to it again.
  4. Occasionally the guest VMs are unable to see each other, but if I ssh into whichever one can't see the other and run a ping, it usually starts working after a few "Destination Host Unreachable" messages.

I can get any other command outputs or logs that would be necessary to further diagnose this, and I'd really appreciate it if anyone who knows more about this could take a look. I'm a huge Linux fan and want this to work the way I know it can, but these random disconnects are not making this solution look very good. Thanks to anyone who takes the time to read this!

(Image: Network Map)

Host machine netplan configuration:

network:
    version: 2
    renderer: networkd
    ethernets:
        eno1:
            dhcp4: false
            dhcp6: false
        eno2:
            dhcp4: false
            dhcp6: false
        eno3:
            dhcp4: false
            dhcp6: false
        eno4:
            dhcp4: false
            dhcp6: false
    bonds:
        bond0:
            interfaces:
                - eno1
                - eno2
                - eno3
                - eno4
            addresses: [192.168.5.20/24]
            dhcp4: false
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]
    bridges:
        br0:
            addresses: [192.168.5.21/24]
            dhcp4: false
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]
            interfaces:
                - bond0
– ZhangXector

1 Answer


I have an almost identical configuration currently in production: Ubuntu 18.04 + KVM/QEMU on an R710, and I have not experienced this issue.

While it's possible the cause is the difference in Ubuntu versions (you're on 18.10) or an actual hardware issue on your end, the only notable difference I see in this configuration is the bond, which I am not using. My bridge configuration looks like the one below:

    bridges:
        br0:
            dhcp4: yes
            interfaces:
                - eno1

It's only using eno1 because that's the only interface with a cable running to it. It may be worthwhile, purely for troubleshooting purposes, to attempt a similar configuration to see if it resolves the issue.

If the bond does turn out to be the issue, the thing that stands out to me as potentially flawed in your configuration is the redundant parameters on your bond and bridge. To my understanding, parameters like the addresses, gateway, and nameservers should be inherited from the underlying interface, so try setting all of them on either the bridge or the bond, but not both.
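
Roughly, what I'm picturing is the fragment below (an untested sketch that reuses your bridge address and keeps your ethernets stanzas as they are): the bond carries no addressing of its own, and everything lives on the bridge.

    bonds:
        bond0:
            interfaces:
                - eno1
                - eno2
                - eno3
                - eno4
            dhcp4: false        # no addresses, gateway, or nameservers on the bond
    bridges:
        br0:
            interfaces:
                - bond0
            dhcp4: false
            addresses: [192.168.5.21/24]
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]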

Lastly, since it appears we are on near-identical hardware, consider running some sort of test on the VM host to confirm that the network card itself is not bad.

Hope this helps!

  • Thanks for your suggestions! Sorry I haven't gotten around to testing this yet. I will have to make a trip to where the machine is located to do these kinds of tests, because I might inadvertently break my ssh connection to it. I will try these things as soon as I get the chance. – ZhangXector Apr 29 '19 at 16:40
  • Finally got around to trying this. I can't post the new config here because it's too long. Essentially, we removed the bond completely and set the bridge to use the eno1 interface alone. We haven't seen any issues running this way since changing it a couple of weeks ago. Our main issue now is that the other interfaces aren't being utilized at all, so we can only go as fast as that one interface allows. I would hope there's a way to use all 4 without losing connection intermittently. This does point to the bond being the issue, though. – ZhangXector Mar 09 '20 at 18:10
  • Was your connection to the host over the bonded set? If not, maybe the switch was doing something odd because it didn't recognize the bond on its end. Since there were 4 links in the original bond, you could now test with 2 of the links bonded without worrying about taking your systems offline irreversibly (see the sketch after these comments). The other thing it could be is the card being unable to handle the load of the full bond. I've seen some really strange behavior from fake Intel NIC cards, and I'm not the only one. The fakes have cheap driver chips on them that claim to be Pulse or Delta. – Rowan Hawkins Apr 02 '20 at 10:26
  • For reference, the fakes are many of the Intel i350 and i340 cards on eBay and Amazon. You need to make sure that there is an Intel hologram, or, if it's a third-party card from Dell/IBM/HP, that the driver chips have the logo molded into the package case, not printed on. There are websites with pictures comparing the differences between the cards, real and fake. – Rowan Hawkins Apr 02 '20 at 10:37
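
For illustration, a minimal netplan sketch of the reduced two-link test bond suggested in the comments might look like the fragment below (the ethernets stanzas stay as in the original config). The active-backup mode and monitor interval are assumptions chosen for a safe test; an 802.3ad/LACP mode would additionally require a matching bond configured on the switch.

    bonds:
        bond0:
            interfaces:
                - eno1
                - eno2
            dhcp4: false
            parameters:
                mode: active-backup         # assumed mode; 802.3ad needs LACP configured on the switch
                mii-monitor-interval: 100
    bridges:
        br0:
            interfaces:
                - bond0
            dhcp4: false
            addresses: [192.168.5.21/24]    # addressing only on the bridge, as suggested in the answer
            gateway4: 192.168.5.1
            nameservers:
                addresses: [192.168.1.6,1.1.1.1]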