[ Disclaimer: this question was originally posted on ServerFault. However, since the official K8s documentation states "ask your questions on StackOverflow", I am also adding it here ]
I am trying to deploy a test Kubernetes cluster on Oracle Cloud, using OCI VM instances - however, I'm having issues with pod networking.
The networking plugin is Calico - it seems to be installed properly, but no traffic gets across the tunnels from one host to another. For example, here I am trying to access an nginx pod running on another node:
root@kube-01-01:~# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
nginx-dbddb74b8-th9ns 1/1 Running 0 38s 192.168.181.1 kube-01-06 <none>
root@kube-01-01:~# curl 192.168.181.1
[ ... timeout... ]
Using tcpdump, I see the IP-in-IP (protocol 4) packets leaving the first host, but they never seem to make it to the second one (although all other packets, including BGP traffic, make it through just fine).
root@kube-01-01:~# tcpdump -i ens3 proto 4 &
[1] 16642
root@kube-01-01:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
root@kube-01-01:~# curl 192.168.181.1
09:31:56.262451 IP kube-01-01 > kube-01-06: IP 192.168.21.64.52268 > 192.168.181.1.http: Flags [S], seq 3982790418, win 28000, options [mss 1400,sackOK,TS val 9661065 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
09:31:57.259756 IP kube-01-01 > kube-01-06: IP 192.168.21.64.52268 > 192.168.181.1.http: Flags [S], seq 3982790418, win 28000, options [mss 1400,sackOK,TS val 9661315 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
09:31:59.263752 IP kube-01-01 > kube-01-06: IP 192.168.21.64.52268 > 192.168.181.1.http: Flags [S], seq 3982790418, win 28000, options [mss 1400,sackOK,TS val 9661816 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
root@kube-01-06:~# tcpdump -i ens3 proto 4 &
[1] 12773
root@kube-01-06:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
What I have checked so far:
- The Calico routing mesh comes up just fine. I can see the BGP traffic on the packet capture, and I can see all nodes as "up" using calicoctl
root@kube-01-01:~# ./calicoctl node status
Calico process is running.
IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+-------------------+-------+----------+-------------+
| 10.13.23.123 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.124 | node-to-node mesh | up | 09:12:49 | Established |
| 10.13.23.126 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.129 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.127 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.128 | node-to-node mesh | up | 09:12:50 | Established |
| 10.13.23.130 | node-to-node mesh | up | 09:12:52 | Established |
+--------------+-------------------+-------+----------+-------------+
- The security rules for the subnet allow all traffic. All the nodes are in the same subnet, and I have a stateless rule permitting all traffic from other nodes within the subnet (I have also tried adding a rule explicitly permitting IP-in-IP traffic, roughly as sketched after this list - same result).
- The source/destination check is disabled on all the vNICs on the K8s nodes.
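For the record, the explicit IP-in-IP rule mentioned above is, in OCI API/JSON terms, roughly the following (a sketch only - protocol "4" is the IANA protocol number for IP-in-IP, and the source CIDR is just an assumption about my node subnet):
{
  "isStateless": true,
  "protocol": "4",
  "source": "10.13.23.0/24"
}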
Other things I have noticed:
- I can get Calico to work if I disable IP-in-IP encapsulation for same-subnet traffic and use regular routing inside the subnet (as described here for AWS) - see the IPPool sketch after this list
- Other networking plugins (such as Weave) seem to work correctly.
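For completeness, the change from the first bullet above is roughly the following IPPool update (a sketch only, assuming calicoctl v3 - the pool name and CIDR are assumptions based on my setup; ./calicoctl get ippool -o wide shows the actual values):
root@kube-01-01:~# cat <<EOF | ./calicoctl apply -f -
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-pool
spec:
  cidr: 192.168.0.0/16
  ipipMode: CrossSubnet
  natOutgoing: true
EOF
With ipipMode: CrossSubnet, Calico only encapsulates traffic that has to cross a subnet boundary and uses plain routed packets inside the subnet, which is why it sidesteps whatever is eating the protocol 4 traffic.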
So my question here is - what is happening to the IP-in-IP encapsulated traffic? Is there anything else I can check to figure out what is going on?
And yes, I know that I could have used a managed Kubernetes engine directly, but where is the fun (and the learning opportunity) in that? :D
Edited to address Rico's answer below:
1) I'm not getting any pod-to-pod traffic through either (no communication between pods on different hosts), but I was unable to capture that traffic, so I used node-to-pod as an example.
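To illustrate what I mean, a quick check from the alpine pod listed in point 3 below (which runs on a different node, kube-01-08) behaves the same way - the wget invocation here is just an example using busybox-style flags:
root@kube-01-01:~# kubectl exec -it alpine-9d85bf65c-2wx74 -- wget -qO- -T 5 http://192.168.181.1
[ ... timeout ... ]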
2) I'm also getting a similar result if I hit a NodePort service on a node other than the one the pod is running on - I see the outgoing IP-in-IP packets from the first node, but they never show up on the second node (the one actually running the pod):
root@kube-01-01:~# tcpdump -i ens3 proto 4 &
[1] 6499
root@kube-01-01:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
root@kube-01-01:~# curl 127.0.0.1:32137
20:24:08.460069 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19444115 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
20:24:09.459768 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19444365 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
20:24:11.463750 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19444866 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
20:24:15.471769 IP kube-01-01 > kube-01-06: IP 192.168.21.64.40866 > 192.168.181.1.http: Flags [S], seq 3175451438, win 43690, options [mss 65495,sackOK,TS val 19445868 ecr 0,nop,wscale 7], length 0 (ipip-proto-4)
Nothing on the second node (kube-01-06, the one actually running the nginx pod):
root@kubespray-01-06:~# tcpdump -i ens3 proto 4
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
I used 127.0.0.1 for ease of demonstration - of course, the exact same thing happens when I hit that NodePort from an outside host:
20:25:17.653417 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56630 > 192.168.181.1.http: Flags [S], seq 980178400, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:17.654371 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56631 > 192.168.181.1.http: Flags [S], seq 3932412963, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:17.667227 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56632 > 192.168.181.1.http: Flags [S], seq 2017119223, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:20.653656 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56630 > 192.168.181.1.http: Flags [S], seq 980178400, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:20.654577 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56631 > 192.168.181.1.http: Flags [S], seq 3932412963, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
20:25:20.668595 IP kube-01-01 > kube-01-06: IP 192.168.21.64.56632 > 192.168.181.1.http: Flags [S], seq 2017119223, win 64240, options [mss 1440,nop,wscale 8,nop,nop,sackOK], length 0 (ipip-proto-4)
3) As far as I can tell (please correct me if I'm wrong here), the nodes are aware of the routes to the pod networks, and node-to-pod traffic is also encapsulated IP-in-IP (notice the protocol 4 packets in the first capture above):
root@kube-01-01:~# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
alpine-9d85bf65c-2wx74 1/1 Running 1 23m 192.168.82.194 kube-01-08 <none>
nginx-dbddb74b8-th9ns 1/1 Running 0 10h 192.168.181.1 kube-01-06 <none>
root@kube-01-01:~# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
<snip>
192.168.181.0 10.13.23.127 255.255.255.192 UG 0 0 0 tunl0
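The tunl0 device in that route is Calico's IPIP tunnel interface, which can be double-checked on the node with:
root@kube-01-01:~# ip -d link show tunl0
(output omitted - the detail line identifies it as an ipip tunnel)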