Kubernetes + Calico on Oracle Cloud VMs

Question

[ Disclaimer: this question was originally posted on ServerFault. However, since the official K8s documentation states "ask your questions on StackOverflow", I am also adding it here ]

I am trying to deploy a test Kubernetes cluster on Oracle Cloud, using OCI VM instances - however, I'm having issues with pod networking.

The networking plugin is Calico - it seems to be installed properly, but no traffic gets across the tunnels from one host to another. For example, here I am trying to access nginx running on another node:

root@kube-01-01:~# kubectl get pod -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP              NODE              NOMINATED NODE
nginx-dbddb74b8-th9ns   1/1     Running   0          38s   192.168.181.1   kube-01-06   <none>
root@kube-01-01:~# curl 192.168.181.1
[ ... timeout... ]

Using tcpdump, I see the IP-in-IP (protocol 4) packets leaving the first host, but they never seem to make it to the second one (although all other packets, including BGP traffic, make it through just fine).

root@kube-01-01:~# tcpdump -i ens3 proto 4 &
[1] 16642
root@kube-01-01:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

root@kube-01-01:~# curl 192.168.181.1
09:31:56.262451 IP kube-01-01 > kube-01-06: IP 192.168.21.64.52268 > 192.168.181.1.http: Flags [S], seq 3982790418, win 28000, options [mss 1400,sackOK,TS val 9661065 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
09:31:57.259756 IP kube-01-01 > kube-01-06: IP 192.168.21.64.52268 > 192.168.181.1.http: Flags [S], seq 3982790418, win 28000, options [mss 1400,sackOK,TS val 9661315 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
09:31:59.263752 IP kube-01-01 > kube-01-06: IP 192.168.21.64.52268 > 192.168.181.1.http: Flags [S], seq 3982790418, win 28000, options [mss 1400,sackOK,TS val 9661816 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)

root@kube-01-06:~# tcpdump -i ens3 proto 4 &
[1] 12773
root@kube-01-06:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

What I have checked so far:

The Calico routing mesh comes up just fine. I can see the BGP traffic on the packet capture, and I can see all nodes as "up" using calicoctl

root@kube-01-01:~# ./calicoctl node status Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 10.13.23.123 | node-to-node mesh | up    | 09:12:50 | Established |
| 10.13.23.124 | node-to-node mesh | up    | 09:12:49 | Established |
| 10.13.23.126 | node-to-node mesh | up    | 09:12:50 | Established |
| 10.13.23.129 | node-to-node mesh | up    | 09:12:50 | Established |
| 10.13.23.127 | node-to-node mesh | up    | 09:12:50 | Established |
| 10.13.23.128 | node-to-node mesh | up    | 09:12:50 | Established |
| 10.13.23.130 | node-to-node mesh | up    | 09:12:52 | Established |
+--------------+-------------------+-------+----------+-------------+

The security rules for the subnet allow all traffic. All the nodes are in the same subnet, and I have a stateless rule permitting all traffic from other nodes within the subnet (I have also tried adding a rule permitting IP-in-IP traffic explicitly - same result).
The source/destination check is disabled on all the vNICs on the K8s nodes.

Other things I have noticed:

I can get calico to work if I disable IP in IP encapsulation for same-subnet traffic, and use regular routing inside the subnet (as described here for AWS)
Other networking plugins (such as weave) seem to work correctly.

So my question here is - what is happening to the IP-in-IP encapsulated traffic? Is there anything else I can check to figure out what is going on?

And yes, I know that I could have used managed Kubernetes engine directly, but where is the fun (and the learning opportunity) in that? :D

Edited to address Rico's answer below:

1) I'm also not getting any pod-to-pod traffic to flow through (no communication between pods on different hosts). But I was unable to capture that traffic, so I used node-to-pod as an example.

2) I'm also getting a similar result if I hit a NodePort svc on another node than the one the pod is running on - I see the outgoing IP-in-IP packets from the first node, but they never show up on the second node (the one actually running the pod):

root@kube-01-01:~# tcpdump -i ens3 proto 4 &
[1] 6499
root@kube-01-01:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
root@kube-01-01:~# curl 127.0.0.1:32137
20:24:08.460069 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19444115 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
20:24:09.459768 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19444365 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
20:24:11.463750 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19444866 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
20:24:15.471769 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19445868 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)

Nothing on the second node ( kube-01-06, the one that is actually running the nginx pod ):

root@kubespray-01-06:~# tcpdump -i ens3 proto 4
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

I used 127.0.0.1 for ease of demonstration - of course, the exact same thing happens when I hit that NodePort from an outside host:

20:25:17.653417 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56630 > 192.168.181.1.http: Flags [S], seq 980178400, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:17.654371 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56631 > 192.168.181.1.http: Flags [S], seq 3932412963, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:17.667227 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56632 > 192.168.181.1.http: Flags [S], seq 2017119223, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:20.653656 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56630 > 192.168.181.1.http: Flags [S], seq 980178400, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:20.654577 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56631 > 192.168.181.1.http: Flags [S], seq 3932412963, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:20.668595 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56632 > 192.168.181.1.http: Flags [S], seq 2017119223, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)

3) As far as I can tell (please correct me if I'm wrong here), the nodes are aware of routes to pod networks, and pod-to-node traffic is also encapsulated IP-in-IP (notice the protocol 4 packets in the first capture above)

root@kube-01-01:~# kubectl get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP               NODE              NOMINATED NODE
alpine-9d85bf65c-2wx74   1/1     Running   1          23m   192.168.82.194   kube-01-08   <none>
nginx-dbddb74b8-th9ns    1/1     Running   0          10h   192.168.181.1    kube-01-06   <none>

root@kube-01-01:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
<snip>
192.168.181.0   10.13.23.127    255.255.255.192 UG    0      0        0 tunl0

score 1 · Answer 1 · answered Nov 11 '18 at 20:47

1

Maybe it is a MTU issue:

Typically the MTU for your workload interfaces should match the network MTU. If you need IP-in-IP then the MTU size for both the workload and tunnel interfaces should be 20 bytes less than the network MTU for your network. This is due to the extra 20 byte header that the tunnel will add to each packet.

Read more here.

answered Nov 11 '18 at 20:47

Ali Javadi

31
1
3

1

The MTU for the tunl0 interfaces is set at 1440, with 1500 MTU on the physical interfaces: `tunl0 UP RUNNING NOARP MTU:1440 ` . Also, the size of the default nginx web page is 612 bytes, well below the MTU limit... – Bogd Nov 12 '18 at 08:50

score 1 · Answer 2 · answered Aug 04 '21 at 12:12

After a long time and a lot of testing, my belief is that this was caused by IP-in-IP (ipip, or IP protocol 4) traffic being blocked by the Oracle cloud networking layer.

Even though I was unable to find this documented anywhere, it is something that is common for cloud providers (Azure, for example, does the same thing - disallows IP-in-IP and unknown IP traffic).

So the possible workarounds here should be the same ones as the ones listed in the Calico documentation for Azure:

Disabling IP-in-IP for same-subnet traffic (as I mentioned in the question)
Switching Calico to VXLAN encapsulation
Using Calico for policy only, and flannel for encapsulation (VXLAN)

score 0 · Answer 3 · answered Nov 11 '18 at 19:33

Are you having issues connecting from Pod to Pod?

The short answer here would seem that he PodCidr packets are getting encapsulated when they are communicating to another pod either on the same node or another node.

Note:

By default, Calico’s IPIP encapsulation applies to all container-to-container traffic.

So you will be able to connect to a pod on another node if you are inside the pod. For example, if you connect with kubectl exec -it <pod-name>.

This is the reason you can't connect to a pod/container from root@kube-01-01:~# since your node/host doesn't know anything about the PodCidr. It sends 192.168.x.x packets through the default node/host route, however, your physical network is not 192.168.x.x so they get lost since there's no other node/host that physically understands that.

The way you would connect to a nginx service would be through a Kubernetes Service, this is different from the network overlay and allows you connect to pods outside of the PodCidr. Note that these service rules are managed by the kube-proxy and are generally iptables rules. Also, with iptables you can explicitly do things like if you want to talk to IP A.A.A.A you need to go through a physical interface (i.e. tun0) or you have to through IP B.B.B.B.

Hope it helps!

I just used node-to-pod communication because it was easier to demo. But the exact same thing happens if I try pod-to-pod communication, or if I hit a NodePort svc going to a pod on another node. I see the IP-in-IP traffic leaving the first node, but it never seems to get to the second one. I will edit the question to clarify this. — Bogd, Nov 11 '18 at 20:10
Also, there is one thing I do not understand here - according to the docs (https://kubernetes.io/docs/concepts/cluster-administration/networking/), "all nodes can communicate with all containers (and vice-versa) without NAT". So nodes _should_ be able to talk to pods via the network overlay. And on bare metal (also using calico), nodes were able to talk to pods. It's just this cloud deployment that I cannot get to work... — Bogd, Nov 11 '18 at 20:37

Kubernetes + Calico on Oracle Cloud VMs

3 Answers3