Deciphering the Kubernetes Networking Maze: Navigating Load Balancing, BGP, IPVS and Beyond

Hossein Yousefi · Published in ITNEXT · 7 min read · Mar 8, 2024


In the Kubernetes world, every day you hear IPVS vs. iptables, PureLB vs. MetalLB, overlay vs. underlay, NodePort vs. LoadBalancer, and a lot more, and it’s HARD to put all of that together from different sources. That is what I have done here.

Do you know the answers to these questions?
How are all the networking aspects managed?
How is PureLB connected to the CNI?
How is the ClusterIP Service connected to IPVS?
Why can’t I see an open port with netstat when I use a NodePort?

If yes, I suggest you continue; if not, continue anyway, you will be surprised.

BIG picture of everything

It’s hard to put 20 websites and articles together and understand Kubernetes networking, but I did exactly that, and I want to make it easier for you.

I’m going to talk about how these subjects connect to Kubernetes and how they are integrated:
load balancing, IPVS, iptables, BGP, bridges, CNI, PureLB, Endpoints, Services (svc), overlay, underlay, IPIP, kube-proxy, and ingress controllers.

Let’s do it fast and step by step:

1- The relation between the CNI, the LB controller and kube-proxy

CNI: It configures Kubernetes networking by creating and configuring a network interface for each container. The kubelet calls the CNI plugin to set up network interfaces and assign IP addresses to them.

CNI works under 2 models:
* Encapsulated (overlay)
* Unencapsulated (underlay)

  • Encapsulated (overlay): Represents technologies like VXLAN and IPsec. It’s Layer 2 over Layer 3 networking that can span multiple Kubernetes nodes. The Layer 2 segment is isolated, so there is no need for route distribution. An outer IP header is generated that encapsulates the original packet. This model provides a bridge that connects workers and pods, and the element that manages the communication is the container runtime (CRI).
  • Unencapsulated (underlay): Provides an L3 network to route packets between containers. Workers are required to manage route distribution, using BGP to distribute the routing information of the pods. This model amounts to extending a network router across the workers.

It is possible for a CNI plugin to use both architectures.
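
If you want to check which model a node is actually using, a rough way is to look at its interfaces and routes. This is purely illustrative: interface names such as tunl0, flannel.1, or vxlan.calico depend on your CNI and its configuration.

```bash
# Overlay: look for encapsulation interfaces (IPIP or VXLAN).
ip -d link show tunl0          # Calico IPIP tunnel, if enabled
ip -d link show type vxlan     # e.g. flannel.1 or vxlan.calico

# Underlay: look for plain routes to other nodes' pod CIDRs,
# typically learned via BGP (Calico's BIRD marks them with "proto bird").
ip route | grep -E 'bird|bgp'
```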

LB-controller: MetalLB, PureLB, and others provide the functionality of the LoadBalancer Service type in Kubernetes.
When a LoadBalancer (LB) Service is created, the assigned external IP is added as a secondary address under the primary interface. This allows the BIRD BGP router to pick up the IP and add routes, addresses, and other configuration.

When a new IP is assigned:

In the case of an overlay network, IPVS or iptables rules are used.

In the case of an underlay network, a routing-table entry is added as well.
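
As an illustration only (the addresses and the interface name eth0 are made up, and the exact behaviour depends on the LB controller), after a LoadBalancer Service gets an external IP you would expect to see it as a secondary address on the host interface:

```bash
ip addr show eth0 | grep -w inet
#   inet 10.0.0.11/24 ...      <- primary node address
#   inet 192.0.2.10/32 ...     <- external LB IP added by the LB controller
```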

Kube-proxy: Maintains network rules in iptables, IPVS, and so on.
It adds NAT rules, forwarding rules, and more.

To give you a simple example:
When you create an svc, kube-proxy adds rules to iptables.
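
A minimal sketch of that, assuming a cluster where kube-proxy runs in iptables mode (the Deployment and Service names here are just examples):

```bash
# Create a Deployment and expose it as a ClusterIP Service.
kubectl create deployment web --image=nginx
kubectl expose deployment web --port=80

# On any node (as root): kube-proxy has added a matching rule,
# tagged with the Service name, to the KUBE-SERVICES chain.
iptables -t nat -L KUBE-SERVICES -n | grep "default/web"
```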

Good to know: Netfilter can be replaced by eBPF, and iptables can be replaced by IPVS.
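
To see which mode your kube-proxy is actually running in, you can check its configuration; the ConfigMap name below assumes a kubeadm-style cluster, so it may differ in yours:

```bash
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"
# mode: "ipvs"    (empty or "iptables" means the iptables backend)
```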

Summary for this part:

  • Kube-proxy: maintains rules in iptables, IPVS, and more.
  • CNI: Provides a common interface to the underlying networks, routes traffic to desired destinations, and performs other related functions.
  • LB-controller: Provides load balancing functionality and updates the host interface to add secondary IP addresses.

2-POD to POD / Container to Container — single node (IP addr based)

The implementation of the Custom Bridge (CBR), veth (virtual Ethernet) pairs, eth interfaces, and the rest of the networking setup is handled by container runtimes such as containerd, CRI-O, and Mirantis. Most container runtimes (CRIs) use Container Network Interface (CNI) plugins for this purpose, including options like Calico, Flannel, and Cilium.

All containers inside a pod share the same network because they are inside the same network namespace.

The “pause” container is responsible for networking and inter-process communication (IPC) in Kubernetes.

A veth pair will be created under the CBR for each pod, and L2 forwarding will be performed in this bridge. For instance, packets from pod1 to pod2 go through the CBR, and no NAT occurs in this case.
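
A quick way to see this on a running cluster (the pod and container names below are hypothetical, and the containers need the `ip` tool installed):

```bash
# Both containers in the pod report the same eth0 and the same IP,
# because they share one network namespace (held by the pause container).
kubectl exec mypod -c app     -- ip addr show eth0
kubectl exec mypod -c sidecar -- ip addr show eth0

# On the node, each pod appears as one end of a veth pair attached to the
# bridge (the bridge name, e.g. cbr0 or cni0, or per-pod cali* links,
# depends on the CNI).
ip link show type veth
```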

3- POD to POD / Container to Container — multi node (IP addr based)

How is a pod’s IP address routable across the nodes’ network?
- Both nodes are on the same network (they can see each other).
- CNI creates a route for each pod on each node.

Since the CBR in node-1 doesn’t have the MAC address of pod4, packets go out through the interface specified in the routing table. It can be a tunnel, another interface, eth0, etc.; it really depends on the setup.

Each node in Kubernetes has its own pod CIDR, which allows traffic to be routed to the correct node.
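
For example (illustrative only: the 10.244.0.0/16 range is just a common default, and the exact route shapes depend on the CNI):

```bash
# Each node's pod CIDR, as recorded by Kubernetes.
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD-CIDR:.spec.podCIDR

# On node-1, there is a route for node-2's pod CIDR: via a tunnel in an
# overlay setup, or via node-2's own address in an underlay setup.
ip route | grep 10.244.
```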

4- POD to POD / Container to Container — multi node (Service IP addr based)

When it comes to Services, iptables/IPVS plays a vital role. In Netfilter, the Service IP address is rewritten to the IP address of one of the backing pods, chosen by the load-balancing algorithm. Kube-proxy is responsible for updating the Netfilter rules and mapping pod IP addresses to Services.

When a node receives a packet whose destination is a Service IP, the Netfilter rules match the Service and the packet is routed to the destination pod IP address.

BUT HOW?!

Services update EndpointSlices with pod IP addresses by matching the pod labels specified in the Service selector. When the selector matches the labels of the pods, the relevant information (IP addresses, ports, protocols, etc.) is fetched and injected into the EndpointSlice associated with the Service.

Remember: Services, Endpoints, NodePorts, and LoadBalancers are nothing more than rules in iptables/IPVS.
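
You can follow that chain on a live cluster. This sketch assumes the `web` Service from the earlier example and kube-proxy running in IPVS mode:

```bash
# The EndpointSlice backing the Service, selected by the standard
# kubernetes.io/service-name label.
kubectl get endpointslices -l kubernetes.io/service-name=web -o wide

# On a node: the Service's ClusterIP is an IPVS virtual server whose
# real servers are exactly the pod IPs from that EndpointSlice.
ipvsadm -Ln | grep -A 3 "$(kubectl get svc web -o jsonpath='{.spec.clusterIP}')"
```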

What happens when we call a Service by its name? Until now we have only covered IP-based routing.

Kubernetes runs a DNS server, typically CoreDNS (or the older kube-dns).
The DNS pods are exposed as a Kubernetes Service with a static ClusterIP that is written into each container’s resolver configuration at start-up.
Now it’s easy: names are resolved inside containers by the Kubernetes DNS server.
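
A small illustration (the pod name is hypothetical, and the image must ship a resolver tool such as nslookup):

```bash
# The resolver inside every pod points at the DNS Service's static ClusterIP.
kubectl exec mypod -- cat /etc/resolv.conf

# Service names resolve to their ClusterIPs via the cluster DNS server.
kubectl exec mypod -- nslookup web.default.svc.cluster.local
```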

Are you ready for the BIG picture? It’s time.

This is the implementation of the CNI (Calico), IPVS, PureLB, IPIP (overlay), and the ingress controller all together in one picture, with their roles.

Calico: All the networking is done by Calico, both overlay (IPIP) and underlay (BGP).

  • As mentioned before, all pod IP addresses are assigned under the CBR, which in this case is the kube-ipvs bridge interface.
  • Each pod has its own virtual interface.
  • tunl0 (IPIP) is a virtual interface that connects nodes to each other in the overlay architecture; all pod traffic passes through this tunnel.
  • PureLB is the LB controller, and it also manages the host network: it creates kube-lb0 as a virtual interface and adds the LB IPs to the host’s primary interface as secondary IP addresses.
  • PureLB is also compatible with routing protocols such as BGP, OSPF, and others. Since we already have a BIRD BGP router deployed by Calico, PureLB detects it and won’t deploy another BGP router.
  • BGP picks up all the IP addresses assigned to these interfaces and defines routes for them in the routing table.

Up to this point: when a packet from pod-1 wants to reach pod-5 on another node, kube-ipvs cannot answer for what it doesn’t know, so the next level is the routing table, which is updated by BGP. The packet will be routed to the desired destination via tunl0, because we have an overlay network architecture. If you call a Service, the IPVS rules come into the game.
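
On such a node you can inspect most of these pieces directly. The interface names below are the ones Calico, kube-proxy (IPVS mode), and PureLB typically create, so treat this as an illustration rather than a guaranteed layout:

```bash
ip addr show kube-ipvs0   # virtual interface created by kube-proxy in IPVS mode
ip addr show tunl0        # Calico IPIP tunnel endpoint (overlay path between nodes)
ip addr show kube-lb0     # PureLB virtual interface for LoadBalancer addresses
ip route show proto bird  # routes installed by the BIRD BGP router
```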

Now we know that Endpoints, Services, NodePorts, and LoadBalancers are just rules in IPVS. With that in mind:

We have a LoadBalancer type service for our ingress controller, which means the ingress is accessible from outside. When we call this IP, packets are directed to the appropriate node, and then IPVS forwards them to the NodePort (NAT) to route them across nodes and find the correct one.

After that, the NodePort is associated with a ClusterIP, which knows the IP addresses of the ingress controller pods. This setup is beneficial because as soon as the ingress controller receives the packet, it routes it, based on the defined rules, to the desired Service and then to the destination pod.
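
A sketch of how to look that chain up for a typical ingress-nginx installation (the namespace and Service name are the common defaults and may differ in your cluster):

```bash
# External (LoadBalancer) IP, NodePort, and ClusterIP of the ingress controller Service.
kubectl -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='LB: {.status.loadBalancer.ingress[0].ip}  NodePort: {.spec.ports[0].nodePort}  ClusterIP: {.spec.clusterIP}{"\n"}'

# Path of a request: external IP -> node -> IPVS/NodePort -> ClusterIP
# -> ingress controller pod -> (Ingress rules) -> backend Service -> pod.
```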

The goal of this article was not to provide exhaustive explanations for each component. Instead, it aimed to consolidate information for those already familiar with each concept, offering a comprehensive overview in one place.

I used several sources, which I mention here for further information:

https://medium.com/thermokline/comparing-k8s-load-balancers-2f5c76ea8f31

https://medium.com/@seifeddinerajhi/kube-proxy-and-cni-the-hidden-components-of-kubernetes-networking-eb30000bf87a

https://docs.tigera.io/calico/latest/networking/

https://purelb.gitlab.io/docs/how_it_works/overview/


As a Platform Engineer who likes sharing knowledge, I believe we shouldn’t have to struggle with problems that other people have already experienced before.