
I'm trying to connect two hypervisors across the internet. Both run KVM for virtualization, and I manually created bridges for the VM networking, which currently works.

I would like to use WireGuard to connect the VM bridges on these two hypervisors, so that VMs can be migrated from one hypervisor to the other without any routing/networking changes inside the VM itself.

Before starting on that, though, I realized that I have no idea how to set up WireGuard in such a way that the bridges on both hypervisors are visible to the other hypervisor, but the hypervisors themselves do not show up on the VPN, so that the VMs cannot connect to/attack the hypervisors directly.

Is WireGuard able to do this? Or is it impossible because WireGuard works higher up the OSI stack rather than at the Ethernet frame level?

Can anyone say whether WireGuard can be used to bridge two networks 'blindly', or whether the hosts would always show up on the network as well? Would some other VPN solution allow this? Or would I be best off using the firewall to lock down the hosts I don't want to be accessed?

Thanks in advance!

Alex
  • WireGuard works at layer 3 (IP), not at layer 2 (Ethernet). So if you intend to link *Ethernet* bridges, WireGuard alone can't do this ([mailing list message](https://lists.zx2c4.com/pipermail/wireguard/2018-January/002336.html) with an authoritative answer from WireGuard's author in the next message). You'd need an extra encapsulation layer. Probably possible with a gretap interface. I didn't really try to understand the actual problem described. – A.B Sep 08 '19 at 20:09
  • @A.B Ahh yes, that's what I feared would be the case... GRETAP looks interesting though! I'll definitely give that a closer look. Also thanks for that mailing list link, it's nice to see people have already tackled the problem :) If you want, you can also post this as an answer, because it more or less answers my question. – Alex Sep 09 '19 at 15:52
  • I made a mockup (with namespaces rather than hypervisors) and found that gretap has trouble handling big packets and fragmentation. I used this to overcome it: [GRE bridging, IPsec and NFQUEUE](https://backreference.org/2013/07/23/gre-bridging-ipsec-and-nfqueue/). Actually nftables and payload mangling can do it instead of nfqueue+userland. In my "implementation" the hypervisor provides a bridge to multiple VMs, but this bridge doesn't communicate elsewhere (except with the other bridge through gretap+WireGuard): the hypervisors stay invisible, even though they handle the bridged traffic between them. – A.B Sep 10 '19 at 22:07
  • Hey @A.B, that looks really nice! Even some custom code and whatnot, looks pretty extensive. I'm currently looking at Flannel, which also seems able to provide overlay networks, such as those used by Kubernetes/Docker, but which should in theory also work with KVM, and can use simple VPNs like WireGuard as a backend. I'm going to experiment a little and try out which setup works best. Feel free to also put your comment in an answer, because it does look like a valid answer that might work for some people :) – Alex Sep 13 '19 at 23:49

1 Answer


WireGuard works at layer 3 (routed IP packets) while a bridge works at layer 2 (switched Ethernet frames), so WireGuard alone can't do this. This question was already asked on WireGuard's mailing list and answered by WireGuard's author: Bridging wg and normal interfaces?

So an additional encapsulation layer is needed before traffic goes through WireGuard. There are several choices available (among them VXLAN, which is also available on Linux).
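For illustration, here is a minimal sketch of the VXLAN alternative, assuming the same wg0 tunnel addresses and bridge0 device that the script below sets up (the interface name vx0, the VNI 42 and the standard destination port 4789 are arbitrary choices, not an existing configuration):

# on hypervisor 1: a point-to-point VXLAN riding on the WireGuard tunnel,
# enslaved to the existing VM bridge (swap local/remote on hypervisor 2)
ip link add vx0 type vxlan id 42 local 10.100.0.1 remote 10.100.0.2 dstport 4789
ip link set vx0 master bridge0
ip link set vx0 up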

I made an example using GRETAP and namespaces (rather than actual hypervisors and VMs). I found that, to avoid forcing local VM traffic down to a smaller packet size than necessary, fragmentation handling is needed, but GRETAP has some limitations there, as described in GRE bridging, IPsec and NFQUEUE. I chose to work around them using nftables' payload mangling rather than NFQUEUE and userspace code.

Also note that RFC 6864 explains that fragmented traffic should usually be limited to about 6.4 Mbps because of the limitations of the IP ID field, which is slow nowadays, but sending it through a tunnel with strong integrity checks relaxes that limit.

Here the (fake) VMs are linked with a bridge and have no other connectivity. They cannot see anything other than themselves: in particular, they can't see the (fake) hypervisors linking the two bridges with gretap+WireGuard. Just run this bash script, which creates and configures the namespaces. Tested with nftables 0.9.2 and kernel 5.2.x.

#!/bin/bash

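# Mock topology: vm11,vm12 -- bridge0@hyp1 -- gretap over WireGuard -- bridge0@hyp2 -- vm21,vm22.
# hyp1 and hyp2 reach each other only through the "router" namespace, which stands in for the internet.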
if ip netns id | grep -qv '^ *$' ; then
    printf 'ERROR: leave netns "%s" first\n' $(ip netns id) >&2
    exit 1
fi

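# one namespace per fake host: two VMs and one hypervisor per site, plus a router in between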
hosts='vm11 vm12 hyp1 router hyp2 vm21 vm22'

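# (re)create the namespaces; disable IPv6 and answer broadcast pings for the demo below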
for ns in $hosts; do
    ip netns del $ns 2>/dev/null || :
    ip netns add $ns
    ip netns exec $ns sysctl -q -w net.ipv6.conf.default.disable_ipv6=1
    ip netns exec $ns sysctl -q -w net.ipv4.icmp_echo_ignore_broadcasts=0
done

for ns in $hosts; do
    ip -n $ns link set lo up
done

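# link_ns NS1 NS2 IF1 IF2: veth pair between two namespaces, both ends brought up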
link_ns () {
    ip -n $1 link add name "$3" type veth peer netns $2 name "$4"
    ip -n $1 link set dev "$3" up
    ip -n $2 link set dev "$4" up
}

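# per site: a bridge on the hypervisor, two VMs plugged into it, and an uplink to the router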
for h in 1 2; do
    ip -n hyp$h link add bridge0 address 02:00:00:00:00:0$h type bridge
    ip -n hyp$h link set bridge0 up
    for n in 1 2; do
        link_ns vm$h$n hyp$h eth0 port$n
        ip -n hyp$h link set dev port$n master bridge0
        ip -n vm$h$n address add 10.0.$h.$n/16 dev eth0
    done
    link_ns hyp$h router eth0 site$h
done

ip -n router address add 192.0.2.1/24 dev site1
ip -n router address add 198.51.100.1/24 dev site2

ip -n hyp1 address add 192.0.2.100/24 dev eth0
ip -n hyp1 route add default via 192.0.2.1

ip -n hyp2 address add 198.51.100.200/24 dev eth0
ip -n hyp2 route add default via 198.51.100.1

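# one WireGuard key pair per hypervisor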
privkey1=$(wg genkey)
privkey2=$(wg genkey)

pubkey1=$(printf '%s' "$privkey1" | wg pubkey)
pubkey2=$(printf '%s' "$privkey2" | wg pubkey)

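# wg0 carries the tunnel between the hypervisors (10.100.0.1 <-> 10.100.0.2); the VMs never see it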
for h in 1 2; do
    ip -n hyp$h link add name wg0 type wireguard
    ip -n hyp$h address add 10.100.0.$h/24 dev wg0
    ip -n hyp$h link set dev wg0 up
done

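# each peer points at the other's public endpoint; allowed-ips restricts the tunnel to the two wg0 addresses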
ip netns exec hyp1 wg set wg0 private-key <(printf '%s' "$privkey1") listen-port 11111 peer "$pubkey2" endpoint 198.51.100.200:22222 allowed-ips 10.100.0.2
ip netns exec hyp2 wg set wg0 private-key <(printf '%s' "$privkey2") listen-port 22222 peer "$pubkey1" endpoint 192.0.2.100:11111 allowed-ips 10.100.0.1

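# gretap over wg0 bridges the two bridge0 devices at layer 2; the nft rule clears the
# DF bit on outgoing GRE packets so they can still be fragmented (see the linked article)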
for h in 1 2; do
    ip -n hyp$h link add name gt0 mtu 1500 type gretap remote 10.100.0.$((3-$h)) local 10.100.0.$h nopmtud
    ip -n hyp$h link set gt0 master bridge0
    ip -n hyp$h link set gt0 up
    ip netns exec hyp$h nft -f /dev/stdin << EOF
table ip filter {
    chain output {
        type filter hook output priority 0; policy accept;
        ip protocol gre ip saddr 10.100.0.$h ip daddr 10.100.0.$((3-$h)) ip frag-off set ip frag-off & 0xbfff counter
    }
}
EOF
done

Example (note the greater round-trip times from the remote side):

# ip netns exec vm11 ping -b 10.0.255.255
WARNING: pinging broadcast address
PING 10.0.255.255 (10.0.255.255) 56(84) bytes of data.
64 bytes from 10.0.1.1: icmp_seq=1 ttl=64 time=0.048 ms
64 bytes from 10.0.1.2: icmp_seq=1 ttl=64 time=0.194 ms (DUP!)
64 bytes from 10.0.2.2: icmp_seq=1 ttl=64 time=0.646 ms (DUP!)
64 bytes from 10.0.2.1: icmp_seq=1 ttl=64 time=0.685 ms (DUP!)
64 bytes from 10.0.1.1: icmp_seq=2 ttl=64 time=0.059 ms
64 bytes from 10.0.1.2: icmp_seq=2 ttl=64 time=0.154 ms (DUP!)
64 bytes from 10.0.2.2: icmp_seq=2 ttl=64 time=0.476 ms (DUP!)
64 bytes from 10.0.2.1: icmp_seq=2 ttl=64 time=0.490 ms (DUP!)
^C
--- 10.0.255.255 ping statistics ---
2 packets transmitted, 2 received, +6 duplicates, 0% packet loss, time 1050ms
rtt min/avg/max/mdev = 0.048/0.344/0.685/0.243 ms
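
As a quick check of the "invisible hypervisor" property (my own addition, not part of the original demo): the VMs only have an on-link route for 10.0.0.0/16 and the bridges carry no IP address, so the hypervisors' tunnel addresses are simply unreachable from inside a VM:

# from a (fake) VM, the hypervisor's WireGuard address cannot even be routed to
ip netns exec vm11 ping -c 1 -W 1 10.100.0.1   # expect: Network is unreachable
ip netns exec vm11 ip route                    # shows only 10.0.0.0/16 dev eth0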
A.B