
While learning about the Kubernetes CNI, I heard that some plugins use BGP or VXLAN under the hood.

On the internet, the Border Gateway Protocol (BGP) manages how packets are routed between edge routers.

An autonomous system (AS) is a set of networks and routers managed by a single enterprise or service provider, for example Facebook or Google.

Autonomous systems (AS) communicate with peers and form a mesh.


But I still can't figure out how a CNI plugin takes advantage of BGP.

Imagine there is a Kubernetes cluster, which is composed of 10 nodes. Calico is the chosen CNI plugin.

  • Who plays the Autonomous System (AS) role? Is each node an AS?

  • How are packets forwarded from one node to another? Is iptables still required?

Ryan Lyu

1 Answer


The CNI plugin is responsible for allocating IP addresses (IPAM) and ensuring that packets get where they need to get.

For Calico specifically, you can get a lot of information from the architecture page as well as the Calico network design memoirs.

Whenever a new Pod is created, the IPAM plugin allocates an IP address from the global pool and the Kubernetes scheduler assigns the Pod to a Node. The Calico CNI plugin (like any other) configures the networking stack to accept connections to the Pod IP and routes them to the processes inside. This is done with iptables, with the help of a process called Felix.
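To make the allocation step concrete, here is a minimal Python sketch of an IPAM that hands out addresses from one global pool. It is not Calico's implementation: the class `SimpleIpam`, the pool `10.244.0.0/16`, and the pod and node names are all hypothetical, and real IPAM plugins are considerably more involved.

```python
import ipaddress


class SimpleIpam:
    """Toy IPAM: hands out one address per Pod from a single cluster-wide pool."""

    def __init__(self, cidr):
        # Iterator over the usable host addresses of the pool (hypothetical CIDR).
        self._free = iter(ipaddress.ip_network(cidr).hosts())
        self.assignments = {}  # pod name -> (pod IP, node name)

    def allocate(self, pod, node):
        # Take the next unused address; raises StopIteration if the pool is empty.
        ip = next(self._free)
        self.assignments[pod] = (ip, node)
        return ip


ipam = SimpleIpam("10.244.0.0/16")
ip_a = ipam.allocate("web-1", "node-a")
ip_b = ipam.allocate("web-2", "node-b")
print(ip_a, ip_b)
```

The `assignments` table is the piece of state that matters for routing: it records which node hosts which Pod IP.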

Each Node also runs a BIRD (BGP) daemon that watches for these configuration events: "IP 10.x.y.z is hosted on node A". These configuration events are turned into BGP updates and sent to the other nodes over the open BGP sessions. When the other nodes receive these BGP updates, they program the node's route table (with simple ip route commands) so that the node knows how to reach the Pod. In this model, yes, every node is an AS.
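The flow described above can be simulated in a few lines of Python. This is only a sketch of the idea, not of BIRD or the BGP wire protocol: the `Node` class, its methods, and all addresses are made up for illustration, and a "BGP session" is reduced to a direct method call.

```python
class Node:
    """Toy node: keeps a route table and tells its peers about local Pods."""

    def __init__(self, name, address):
        self.name = name
        self.address = address  # the node's own IP, advertised as next hop
        self.peers = []         # stand-in for open BGP sessions
        self.routes = {}        # pod IP -> next hop

    def host_pod(self, pod_ip):
        # Configuration event: "pod_ip is hosted on this node".
        self.routes[pod_ip] = "local"
        for peer in self.peers:
            peer.receive_update(pod_ip, self.address)

    def receive_update(self, pod_ip, next_hop):
        # A real node would program the kernel here, roughly
        # `ip route add <pod_ip> via <next_hop>`.
        self.routes[pod_ip] = next_hop


def full_mesh(nodes):
    # Every node peers with every other node.
    for node in nodes:
        node.peers = [p for p in nodes if p is not node]


nodes = [Node("a", "192.168.0.1"), Node("b", "192.168.0.2"), Node("c", "192.168.0.3")]
full_mesh(nodes)
nodes[0].host_pod("10.244.0.5")

for node in nodes:
    print(node.name, node.routes)
```

After the update, nodes b and c both hold a route for `10.244.0.5` via `192.168.0.1`, which is all they need to forward a packet towards the Pod.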

What I just described is the "AS per compute server" model: it is suitable for small deployments in environments where nodes are not necessarily on the same L2 network. The problem is that each node needs to maintain a BGP session with every other node, so the total number of sessions grows as O(N^2).

For larger deployments, a compromise is therefore to run one AS per rack of compute servers ("AS per rack"). Each top-of-rack switch then runs BGP to communicate routes to other racks, while the switch internally knows how to route packets.
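A quick back-of-the-envelope calculation shows why the full mesh stops scaling. The functions below are illustrative only. The per-rack count rests on an assumption that is not stated above: each node holds one session to its top-of-rack switch, and the switches are fully meshed with each other.

```python
def full_mesh_sessions(n):
    # Every pair of nodes shares one BGP session: n * (n - 1) / 2.
    return n * (n - 1) // 2


def per_rack_sessions(racks, nodes_per_rack):
    # Assumption: each node peers only with its top-of-rack switch,
    # and the top-of-rack switches form a full mesh among themselves.
    node_to_switch = racks * nodes_per_rack
    switch_mesh = full_mesh_sessions(racks)
    return node_to_switch + switch_mesh


print(full_mesh_sessions(10))     # 45
print(full_mesh_sessions(1000))   # 499500
print(per_rack_sessions(25, 40))  # 1300
```

For the 10-node cluster in the question the full mesh is only 45 sessions, which is why that model is fine for small deployments. At 1000 nodes it is roughly half a million, while the per-rack layout under the stated assumption stays in the low thousands.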

Botje