I am trying to understand the vxlan driver code in the Linux kernel. The kernel version is 3.16.0-29-generic.

Looking at vxlan.c, it appears that a vxlan dev is created per VNI, tied to the netns the netdevice belongs to, and a UDP socket is created per dev.

I am a bit perplexed by this, though, because outside the global netns you cannot really attach a vxlan device to a physical device (ethX): the physical device has to belong to the same netns as the vxlan device.

For example: If I create a vxlan link in the global netns, it works as expected:

ip link add vxlan0 type vxlan id 10 dev eth0
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.10.100.51/24 scope global lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:22:4d:99:32:6b brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.25/24 brd 192.168.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::222:4dff:fe99:326b/64 scope link 
       valid_lft forever preferred_lft forever
15: vxlan0: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN group default 
    link/ether fe:9c:49:26:ba:63 brd ff:ff:ff:ff:ff:ff

If I try to do the same thing in a network namespace, it won't work:

ip netns exec test0 ip link add vxlan1 type vxlan id 20 dev eth0
ip netns exec test0 ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

The problem here is that it does not like "dev eth0" because the code checks to see if eth0 is in the same netns as the link being added.

If I create the same device without eth0 it works fine:

ip netns exec test0 ip link add vxlan1 type vxlan id 20 
ip netns exec test0 ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: vxlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default 
    link/ether 46:7a:5b:87:7d:2f brd ff:ff:ff:ff:ff:ff

If you cannot attach a carrier to the vxlan device, how can you really tx/rx packets to/from outside the host?

Does it mean that realistically you can only use the vxlan driver with the global netns, or alternatively that you "have" to use it with a bridge?

vxlan packets have a VNI associated with them. You should be able to use it to send packets directly to a dev in a non-global netns, similar to what is possible with macvlans.
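For comparison, this is roughly the macvlan workflow I have in mind (device and namespace names are only illustrative):

# create a macvlan on top of eth0 in the global netns, then hand it to a netns
ip link add link eth0 name macvlan0 type macvlan mode bridge
ip link set macvlan0 netns test0
ip netns exec test0 ip link set macvlan0 up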

Am I missing something?

  • I was under the impression that physical devices can belong only to the global netns; it appears I was mistaken. I tried to add eth0 to a netns and it succeeded. At least, I think so, because I lost connectivity! – NetCubist Feb 05 '15 at 08:43

3 Answers

I think you can have a look at Software Defined Networking using VXLAN by Thomas Richter (presented at LinuxCon 2013).

You can turn on l2miss and l3miss on the vxlan device that sits in the non-global netns and set the ARP and FDB entries manually.

The following example shows how to achieve this.

function setup_overlay() {
    # container with no network; its netns is what we will wire up
    docker run -d --net=none --name=test-overlay ubuntu sleep 321339
    sleep 3
    pid=`docker inspect -f '{{.State.Pid}}' test-overlay`

    # bridge in a dedicated "overlay" netns
    ip netns add overlay
    ip netns exec overlay ip link add dev br0 type bridge

    # create the vxlan device in the global netns, then move it into "overlay"
    ip link add dev vxlan212 type vxlan id 42 l2miss l3miss proxy learning dstport 4789
    ip link set vxlan212 netns overlay
    ip netns exec overlay ip link set dev vxlan212 name vxlan1
    ip netns exec overlay brctl addif br0 vxlan1

    # veth pair: one end on br0 in "overlay", the other end inside the container
    ip link add dev vetha1 mtu 1450 type veth peer name vetha2 mtu 1450
    ip link set dev vetha1 netns overlay
    ip netns exec overlay ip link set dev vetha1 name veth2
    ip netns exec overlay brctl addif br0 veth2
    ip netns exec overlay ip addr add dev br0 $bridge_gateway_cidr
    ip netns exec overlay ip link set vxlan1 up
    ip netns exec overlay ip link set veth2 up
    ip netns exec overlay ip link set br0 up

    # expose the container's netns to "ip netns" and move the other veth end into it
    ln -sfn /proc/$pid/ns/net /var/run/netns/$pid
    ip link set dev vetha2 netns $pid
    ip netns exec $pid ip link set dev vetha2 name eth1 address $container1_mac_addr
    ip netns exec $pid ip addr add dev eth1 $container1_ip_cidr
    ip netns exec $pid ip link set dev eth1 up

    # static ARP and FDB entries for the peer container (no multicast learning)
    ip netns exec overlay ip neighbor add $container2_ip lladdr $container2_mac_addr dev vxlan1 nud permanent
    ip netns exec overlay bridge fdb add $container2_mac_addr dev vxlan1 self dst $container2_host_ip vni 42 port 4789
}

# setup overlay on host1
bridge_gateway_cidr='10.0.0.1/24'
container1_ip_cidr='10.0.0.2/24'
container1_mac_addr='02:42:0a:00:00:02'
container2_ip='10.0.0.3'
container2_mac_addr='02:42:0a:00:00:03'
container2_host_ip='192.168.10.22'
setup_overlay

# setup overlay on host2
bridge_gateway_cidr='10.0.0.1/24'
container1_ip_cidr='10.0.0.3/24'
container1_mac_addr='02:42:0a:00:00:03'
container2_ip='10.0.0.2'
container2_mac_addr='02:42:0a:00:00:02'
container2_host_ip='192.168.10.21'
setup_overlay

The above script sets up an overlay network between two Docker containers on two hosts. The vxlan device connects to the bridge br0 in the overlay netns, and br0 connects to the container netns through a veth pair.

Now check the newly created overlay network.

# ping container2 on host1
ip netns exec $pid ping -c 10 10.0.0.3
## successful output
root@docker-1:/home/vagrant# ip netns exec $pid ping -c 10 10.0.0.3
PING 10.0.0.3 (10.0.0.3) 56(84) bytes of data.
64 bytes from 10.0.0.3: icmp_seq=1 ttl=64 time=0.879 ms
64 bytes from 10.0.0.3: icmp_seq=2 ttl=64 time=0.558 ms
64 bytes from 10.0.0.3: icmp_seq=3 ttl=64 time=0.576 ms
64 bytes from 10.0.0.3: icmp_seq=4 ttl=64 time=0.614 ms
64 bytes from 10.0.0.3: icmp_seq=5 ttl=64 time=0.521 ms
64 bytes from 10.0.0.3: icmp_seq=6 ttl=64 time=0.389 ms
64 bytes from 10.0.0.3: icmp_seq=7 ttl=64 time=0.551 ms
64 bytes from 10.0.0.3: icmp_seq=8 ttl=64 time=0.565 ms
64 bytes from 10.0.0.3: icmp_seq=9 ttl=64 time=0.488 ms
64 bytes from 10.0.0.3: icmp_seq=10 ttl=64 time=0.531 ms

--- 10.0.0.3 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9008ms
rtt min/avg/max/mdev = 0.389/0.567/0.879/0.119 ms

## tcpdump sample on host1
root@docker-1:/home/vagrant# tcpdump -vv -n -s 0 -e -i eth1
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 262144 bytes
12:09:35.589244 08:00:27:00:4a:3a > 08:00:27:82:e5:ca, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 59751, offset 0, flags [none], proto UDP (17), length 134)
    192.168.0.11.42791 > 192.168.0.12.4789: [no cksum] VXLAN, flags [I] (0x08), vni 42
02:42:0a:00:00:02 > 02:42:0a:00:00:03, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 49924, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.2 > 10.0.0.3: ICMP echo request, id 1908, seq 129, length 64
12:09:35.589559 08:00:27:82:e5:ca > 08:00:27:00:4a:3a, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 38389, offset 0, flags [none], proto UDP (17), length 134)
    192.168.0.12.56727 > 192.168.0.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 42
02:42:0a:00:00:03 > 02:42:0a:00:00:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 19444, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.2: ICMP echo reply, id 1908, seq 129, length 64
12:09:36.590840 08:00:27:00:4a:3a > 08:00:27:82:e5:ca, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 59879, offset 0, flags [none], proto UDP (17), length 134)
    192.168.0.11.42791 > 192.168.0.12.4789: [no cksum] VXLAN, flags [I] (0x08), vni 42
02:42:0a:00:00:02 > 02:42:0a:00:00:03, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 49951, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.2 > 10.0.0.3: ICMP echo request, id 1908, seq 130, length 64
12:09:36.591328 08:00:27:82:e5:ca > 08:00:27:00:4a:3a, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 38437, offset 0, flags [none], proto UDP (17), length 134)
    192.168.0.12.56727 > 192.168.0.11.4789: [no cksum] VXLAN, flags [I] (0x08), vni 42
02:42:0a:00:00:03 > 02:42:0a:00:00:02, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 19687, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.0.3 > 10.0.0.2: ICMP echo reply, id 1908, seq 130, length 64

Clean up on each host

ip netns del overlay
ip netns del $pid
docker rm -v -f test-overlay

To explain why the vxlan device keeps working after it is moved into a non-global netns that has no physical interface:

Note that we first create the vxlan device in the global netns and then move it into the overlay netns. This is needed because the vxlan driver keeps a reference to the source netns when the device is created. See the following code in drivers/net/vxlan.c:

static int vxlan_dev_configure(struct net *src_net, struct net_device *dev,
                               struct vxlan_config *conf)
{
    // ...
    vxlan->net = src_net;
    // ...
}

and the vxlan driver creates the UDP socket in that source netns, which is why the encapsulated packets can still reach the physical interfaces of the host:

vxlan_sock_add(vxlan->net, vxlan->cfg.dst_port, vxlan->cfg.no_share, vxlan->flags);

– chenchun

It turns out you can move physical devices into a non-global netns, so the question is moot. I would still rather see one vxlan device in the global netns dispatching packets to the appropriate netns based on VNI, similar to how this is achieved with macvlans, though.
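For example, something along these lines should work (an untested sketch, reusing the eth0 and test0 names from the question; note that the host loses eth0 until it is moved back):

# move the physical device into the namespace first
ip link set eth0 netns test0
# now the vxlan device and its lower device are in the same netns
ip netns exec test0 ip link add vxlan1 type vxlan id 20 dev eth0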

– NetCubist

Kernel 4.3 includes patches that add a new way of working with VXLAN, using a single VXLAN netdev together with routing rules that carry the tunnel information.

According to the patches, you will be able to create routing rules that match on the tunnel info, such as:

ip rule add from all tunnel-id 100 lookup 100
ip rule add from all tunnel-id 200 lookup 200

And add encapsulation headers with rules such as:

ip route add 40.1.1.1/32 encap vxlan id 10 dst 50.1.1.2 dev vxlan0
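In that model the single device itself would be created in metadata mode; a sketch, assuming an iproute2 recent enough to know the external flag:

# one VXLAN netdev in collect-metadata ("external") mode for all VNIs
ip link add vxlan0 type vxlan external dstport 4789
ip link set vxlan0 up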
– haggai_e