
I'm trying to dive into the K8s networking model and I think I have a pretty good understanding of it so far, but there is one thing that I can't get my head around. In the Cluster Networking guide, the following is mentioned:

Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):

  • all containers can communicate with all other containers without NAT
  • all nodes can communicate with all containers (and vice-versa) without NAT
  • the IP that a container sees itself as is the same IP that others see it as

The second bullet point specifies that cross-node container communication should be possible without NAT. This, however, doesn't seem to hold when kube-proxy runs in iptables mode. Here is a dump of the iptables rules from one of my nodes:

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
KUBE-POSTROUTING  all  --  anywhere             anywhere             /* kubernetes postrouting rules */

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination         
MASQUERADE  all  --  anywhere             anywhere             /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000

/* sample target pod chain being marked for MASQ */
Chain KUBE-SEP-2BKJZA32HM354D5U (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  xx.yyy.zzz.109       anywhere             /* kube-system/heapster: */
DNAT       tcp  --  anywhere             anywhere             /* kube-system/heapster: */ tcp to:xx.yyy.zzz.109:8082

Chain KUBE-MARK-MASQ (156 references)
target     prot opt source               destination         
MARK       all  --  anywhere             anywhere             MARK or 0x4000

Looks like K8s is changing the source IP of marked outbound packets to the node's IP (for a ClusterIP service). And they even explicitly mention this in Source IP for Services with Type=ClusterIP:

Packets sent to ClusterIP from within the cluster are never source NAT’d if you’re running kube-proxy in iptables mode, which is the default since Kubernetes 1.2. If the client pod and server pod are in the same node, the client_address is the client pod’s IP address. However, if the client pod and server pod are in different nodes, the client_address is the client pod’s node flannel IP address.

This starts by saying packets within the cluster are never SNAT'd, but then proceeds to say packets sent to pods on other nodes are in fact SNAT'd. I'm confused about this - am I misinterpreting the "all nodes can communicate with all containers (and vice-versa) without NAT" requirement somehow?
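
For anyone who wants to reproduce what the docs describe, here is a rough sketch based on the Kubernetes "Using Source IP" tutorial (the image name and the kubectl run behavior are assumptions and may differ by version):

# Run an echo server that reports the caller's address, and expose it as a ClusterIP service
# (echoserver:1.4 listens on 8080 and echoes back client_address).
kubectl run source-ip-app --image=k8s.gcr.io/echoserver:1.4
kubectl expose deployment source-ip-app --name=clusterip --port=80 --target-port=8080

# From a pod scheduled on a *different* node, hit the ClusterIP and check client_address.
kubectl run busybox -it --image=busybox --restart=Never --rm -- sh
# inside the busybox pod, replace <cluster-ip> with the service's ClusterIP:
wget -qO - http://<cluster-ip> | grep client_address

If the reported client_address is the node's IP rather than the calling pod's IP, the packet was SNAT'd on the way over.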

– Arian Motamedi

1 Answer


If you read point 2 of the Cluster Networking guide:

Pod-to-Pod communications: this is the primary focus of this document.

This still applies to all the containers and pods running in your cluster, because all of them are in the PodCidr:

  • all containers can communicate with all other containers without NAT
  • all nodes can communicate with all containers (and vice-versa) without NAT
  • the IP that a container sees itself as is the same IP that others see it as

Basically, all pods have unique IP addresses, are in the same address space, and can talk to each other at the IP layer.
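
A quick way to sanity-check this is to compare the pod IPs in your cluster against the podCIDR (a rough sketch; the jsonpath query is just one way of pulling the per-node podCIDR, and it may come back empty on clusters, like EKS, where the CNI hands out IPs straight from the VPC subnet):

# List every pod with its IP and the node it runs on.
kubectl get pods --all-namespaces -o wide

# Show each node's assigned podCIDR (may be empty, see note above).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'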

Also, if you look at the routes on one of your Kubernetes nodes you'll see something like this for Calico, where the podCidr is 192.168.0.0/16:

default via 172.0.0.1 dev ens5 proto dhcp src 172.0.1.10 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.31.0.0/20 dev ens5 proto kernel scope link src 172.0.1.10
172.31.0.1 dev ens5 proto dhcp scope link src 172.0.1.10 metric 100
blackhole 192.168.0.0/24 proto bird
192.168.0.42 dev calixxxxxxxxxxx scope link
192.168.0.43 dev calixxxxxxxxxxx scope link
192.168.4.0/24 via 172.0.1.6 dev tunl0 proto bird onlink
192.168.7.0/24 via 172.0.1.55 dev tunl0 proto bird onlink
192.168.8.0/24 via 172.0.1.191 dev tunl0 proto bird onlink
192.168.9.0/24 via 172.0.1.196 dev tunl0 proto bird onlink
192.168.11.0/24 via 172.0.1.147 dev tunl0 proto bird onlink

You can see that packets destined for 192.168.x.x addresses are forwarded directly to a tunnel interface connected to the other nodes, so there's no NATing there.
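
If you want to confirm which path (and which source address) the kernel would pick for a given pod IP, `ip route get` is a quick check (a sketch; 192.168.4.10 is just a made-up pod IP inside one of the remote /24s above):

# Ask the routing table which route would carry traffic to a pod on another node.
ip route get 192.168.4.10
# roughly: 192.168.4.10 via 172.0.1.6 dev tunl0 src 172.0.1.10 ...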

Now, when you are connecting from outside the PodCidr, your packets are definitely NATed, say through services or from an external host. You will also definitely see iptables rules like this:

# Completed on Sat Oct 27 00:22:39 2018
# Generated by iptables-save v1.6.1 on Sat Oct 27 00:22:39 2018
*nat
:PREROUTING ACCEPT [65:5998]
:INPUT ACCEPT [1:60]
:OUTPUT ACCEPT [28:1757]
:POSTROUTING ACCEPT [61:5004]
:DOCKER - [0:0]
:KUBE-MARK-DROP - [0:0]
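
If you want to see which component is adding the masquerade rules on a node (kube-proxy vs. the CNI plugin), something like this helps (a sketch; the grep patterns are just the chain names from the dumps above):

# Dump the NAT table in rule form and look at what gets marked for masquerading.
sudo iptables -t nat -S | grep -E 'KUBE-MARK-MASQ|MASQUERADE|SNAT'

# kube-proxy's clusterCIDR setting affects which traffic it marks for masquerading.
kubectl -n kube-system describe configmap kube-proxy | grep clusterCIDR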
– Rico
  • thanks, but your explanation still doesn't match what I'm seeing in my cluster setup. We're using EKS with CNI (not Calico), and the iptables rules I posted seem to be masquerading all outbound packets, meaning requests from node A to node B are SNATed, even though Pod A can call Pod B's IP directly without needing NAT... – Arian Motamedi Oct 29 '18 at 14:30
  • What does your `ip route` look like from one of your nodes? Also, what's the podCidr? – Rico Oct 29 '18 at 15:07
  • `default via 11.212.103.1 dev eth0` `11.212.103.0/25 dev eth0 proto kernel scope link src 11.212.103.74` `11.212.103.18 dev eni4a2a03f68cd scope link` `11.212.103.20 dev eni0503b781c46 scope link` `11.212.103.21 dev eni8f2af04efb8 scope link` `11.212.103.23 dev eni8dd2db3a03a scope link` `+ a few more` `169.254.169.254 dev eth0` `172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown` – Arian Motamedi Oct 29 '18 at 16:23
  • formatting is messed up, I can message it to you if you'd like. Also how do I determine the podCidr? – Arian Motamedi Oct 29 '18 at 16:24
  • With this: `$ kubectl -n kube-system describe configmap kube-proxy | grep clusterCIDR` – Rico Oct 29 '18 at 16:29
  • Odd... it may be that the AWS CNI is adding some funky iptables rules. You can also check any of the files under `/etc/cni/net.d` on any of your nodes – Rico Oct 29 '18 at 17:08
  • this is running in aws, is the cluster cidr the same as vpc/subnet cidr? – Arian Motamedi Oct 29 '18 at 17:08
  • It's not supposed to. – Rico Oct 29 '18 at 17:12
  • hmm ok. the only file under `/etc/cni/net.d` doesn't seem to have any info on this either: # cat /etc/cni/net.d/aws.conf { "type": "aws-cni", "name": "aws-cni", "vethPrefix": "eni" } – Arian Motamedi Oct 29 '18 at 19:28
  • yeah, it seems like the AWS CNI uses veths that are actual ENIs in the AWS cloud – Rico Oct 29 '18 at 19:45
  • So then is it safe to say/assume it's the AWS CNI that's adding those explicit SNAT rules for outbound packets and not k8s? – Arian Motamedi Oct 29 '18 at 21:48