Kubernetes networking deep dive: Did you make the right choice?

Kubernetes networking design can be intimidating, especially when you are the one making cluster-level network decisions. In this post, we will discuss how these choices affect cluster routing and load balancing, focusing on KubeProxy modes (iptables vs IPVS) and network solutions.

Eric Liu
ITNEXT


Updates:

This is my first Kubernetes blog on Medium, and I am thrilled to see it recognized by the official Kubernetes account. So encouraging! A shout-out to Yves Sinkgraven and Kiarash Irandoust for inviting me to this great community, ITNEXT.

The main purpose of this blog is to help Kubernetes users get comfortable with the major K8S network components, common usage patterns, and the corresponding troubleshooting tools. This will give you a solid foundation for designing your next cluster, or for analyzing network issues in an existing cluster and suggesting improvements.

First question: KubeProxy is a critical and required component in every K8S cluster, so which mode is the right one for you, iptables or IPVS?

Next, how do you choose the best L2/L3 network solution? Kube-router, Calico, Flannel, or something else?

Finally, after deploying the cluster and getting the network up and running, what tools can I use to verify the expected routing and load-balancing behavior?

To answer these questions, we will walk through the three network combinations below, which cover the most common scenarios.

Network Plugin and KubeProxy Mode Combinations

  • Cluster A: Calico (ipip cross-subnet) + KubeProxy (IPVS mode)
  • Cluster B: Calico (ipip always) + KubeProxy (iptables mode)
  • Cluster C: Kube-router + KubeProxy (iptables mode)

Cluster A: Calico (ipip cross-subnet) + KubeProxy (IPVS mode)

In this cluster, Calico is the network plugin and KubeProxy runs in IPVS mode.

IP-in-IP encapsulation uses the cross-subnet mode, meaning it is applied only to traffic that crosses subnet boundaries. This gives better performance in AWS multi-AZ deployments, where nodes in the same subnet can route pod traffic natively.
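For reference, the cross-subnet behavior is controlled on the Calico IPPool. The snippet below is a hedged sketch of how it is typically configured; the pool name and CIDR are placeholders, not values from this cluster.

calicoctl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.0.0.0/16
  ipipMode: CrossSubnet   # encapsulate only when traffic crosses a subnet boundary
  natOutgoing: true
EOF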

Let's take a look at the worker node's routing table.

Calico cross-subnet mode routing table
  • tunl0 is used for cross-subnet communication between nodes.
  • The VM's eth0 is used for intra-subnet node communication.
  • Notice the 10.0.7.0/24 blackhole route: this subnet is used by the local pods on the worker node, which communicate through the cali- interfaces (an illustrative sketch of this table follows below).
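Since the actual routing table in the original post is embedded as a gist, here is a hedged sketch of what ip route typically returns on such a node. Apart from the 10.0.7.0/24 pod block mentioned above, every address and interface name below is a placeholder.

ip route

default via 172.20.32.1 dev eth0
blackhole 10.0.7.0/24 proto bird                            # this node's local pod block
10.0.7.5 dev cali1a2b3c4d5e6 scope link                     # /32 route to a local pod via its cali- interface
10.0.12.0/24 via 172.20.64.11 dev tunl0 proto bird onlink   # remote node in another subnet (IP-in-IP)
10.0.19.0/24 via 172.20.32.9 dev eth0 proto bird            # remote node in the same subnet (no encapsulation)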

For all three interface types in the routing table, we can also find the corresponding entries in the ip addr result below.

Worker node Network interfaces
Command: ip addr

Calico cross-subnet mode network interface

From the result above, we can see that the IPVS proxier creates a dummy interface, kube-ipvs0, and binds the service IP addresses to it.

Notice that each service IP bound to the kube-ipvs0 interface has a corresponding record in the IPVS load-balancing table, as shown in the ipvsadm output below.
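A quick way to list the service IPs bound to that dummy interface on the node (a hedged one-liner, assuming iproute2 is available on the host):

ip -br addr show dev kube-ipvs0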

IPVS-Based Load Balancing
IPVS (IP Virtual Server) is built on top of Netfilter and implements transport-layer load balancing inside the Linux kernel.
Kube-proxy can configure IPVS to handle the translation of virtual service IPs to pod IPs.
In the snippet below, we can see the service cluster IPs being load balanced across the corresponding pod IPs.

ipvsadm -ln
  • 10.1.0.1:443 is the default kubernetes service, which fronts the kube-apiserver; its backend IPs are the master node IPs.
  • 10.1.0.10:53 is the CoreDNS service; its backend IPs point to the two CoreDNS pods (a hedged sketch of this output follows below).
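The real output is embedded as a gist in the original post; the following is a hedged sketch of what ipvsadm -ln typically shows for these two services, with placeholder backend addresses.

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.1.0.1:443 rr
  -> 172.20.32.10:6443            Masq    1      0          0
  -> 172.20.64.11:6443            Masq    1      0          0
TCP  10.1.0.10:53 rr
  -> 10.0.7.5:53                  Masq    1      0          0
  -> 10.0.12.8:53                 Masq    1      0          0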

IPVS supports many more load-balancing algorithms than iptables mode, which simply spreads traffic across backends with equal-probability random rules. The IPVS scheduling algorithms are implemented as kernel modules, and ten of them ship with the Linux Virtual Server.
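The scheduler is selected in the kube-proxy configuration. The sketch below assumes a kubeadm-style cluster where kube-proxy reads its KubeProxyConfiguration from the kube-proxy ConfigMap; the least-connection scheduler shown is just one of the available options.

kubectl -n kube-system edit configmap kube-proxy
# then, inside the config.conf key, set:
#   mode: "ipvs"
#   ipvs:
#     scheduler: "lc"   # least connection; other values include rr, sh, dh, sed, nq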

Cluster B: Calico (ipip always) + KubeProxy (iptables mode)

In this cluster, the IP-in-IP mode is set to Always: Calico routes all traffic originating from a Calico-enabled node to all Calico-networked containers and nodes over IP-in-IP, regardless of subnet boundaries.
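To confirm which mode a pool is actually running in, you can query the IP pools (a hedged example, assuming calicoctl is configured against the cluster's datastore):

calicoctl get ippool -o wide   # the IPIPMODE column shows Always, CrossSubnet, or Never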

Notice in the routing table below

  • The VM's eth0 is not used for the Calico network.
  • Only tunl0 is used for inter-node traffic.
  • For the pods on the VM, the cali- interfaces are still being used.
Calico ipip always mode routing table

Network interfaces
The interface settings match the routing table: only eth0, tunl0, and cali- interfaces are used. No kube-ipvs0 is involved, since Kube-proxy is running in iptables mode.

How about K8S service and pod load balancing?
Let's take a look at the iptables output for a K8S ingress controller.

KUBE-SVC-3C2I2DNJ4VWZY5PF refers to an ingress controller service with cluster IP 10.1.60.159.

  • The service IP is load balanced across 3 pods, each selected with equal probability (kube-proxy's iptables mode uses random statistic rules rather than strict round-robin).
  • The first record, with tcp dpt:30998, is the NodePort used by the AWS ELB, which points to the ingress controller service.

KUBE-SEP-XXXXXXXXXX chains refer to the individual ingress controller pods, which belong to a ReplicaSet of size 3 (chain inspection commands follow the list below).

  • KUBE-SEP-P6JNEFEXMECE2WS6 with a pod IP 10.0.11.27
  • KUBE-SEP-DX25GZBAXCASAQMI with a pod IP 10.0.35.21
  • KUBE-SEP-HI43CU4ZL6YUQHDB with a pod IP 10.0.11.26
iptables for k8s svc and pod
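To reproduce this view on a node, a hedged set of commands follows. The chain names are the ones quoted above and will differ on your cluster; locate yours with the first command.

sudo iptables -t nat -S | grep ingress                  # find the KUBE-SVC-... chain for the service
sudo iptables -t nat -L KUBE-SVC-3C2I2DNJ4VWZY5PF -n    # per-service chain: one rule per endpoint
sudo iptables -t nat -L KUBE-SEP-P6JNEFEXMECE2WS6 -n    # per-endpoint chain: DNAT to the pod IP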

Cluster C: Kube-router + KubeProxy (iptables mode)

Similar to Calico cross-subnet mode, kube-router uses eth0 for intra-subnet traffic and tunneling for inter-subnet traffic between nodes.

For the pods on the node, the kube-bridge interface carries container traffic before it reaches the eth0 or tun- interfaces.

Network interfaces

As seen from the network interface output, there are two interesting interface types (inspection commands follow the list):

  • kube-bridge, which sits between the VM's eth0 and the pods' veth interfaces
  • a veth pair created between each pod and the VM
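To see the bridge membership and the veth pairs yourself, here is a hedged set of commands (brctl requires bridge-utils; the iproute2 variants work without it):

brctl show kube-bridge            # veth interfaces attached to the bridge
ip link show master kube-bridge   # same view using iproute2 only
ip -d link show type veth         # details of every veth endpoint on the node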

More useful tools

crictl: CLI for the kubelet CRI.
For K8S network troubleshooting on the worker node, crictl is more K8S-friendly than docker.
crictl ps does not show the unrelated pause containers or the extremely long container names that clutter docker ps output. Besides, it shows the ID of the pod each container belongs to.

crictl ps

When debugging Cluster B, to confirm that the cali- interfaces and the 10.0.104.0/24 blackhole route are used by local pods, it is very convenient to use crictl to look up the local pod IPs.
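A hedged example of that workflow; the pod ID is a placeholder you copy from the crictl pods output.

crictl pods                                  # list local pod sandboxes and their IDs
crictl ps                                    # list running containers and the pods they belong to
crictl inspectp <pod-id> | grep -i '"ip"'    # the pod sandbox status includes its IP address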

netshoot: a Kubernetes network troubleshooting swiss-army container.
In several situations, installing missing tools on the host is not an option when you are trying to understand what is happening in your network infrastructure:

  • When following immutable infrastructure practices, no tools can be installed on the VM.
  • When you do not have enough permissions to install tools.
  • When the VM disk is not writable.

The netshoot container ships with a set of powerful network debugging tools that can be used to troubleshoot Docker and Kubernetes networking issues:

apache2-utils 
bash
bird
bridge-utils
busybox-extras
calicoctl
conntrack-tools
curl
dhcping
drill
ethtool
file
fping
iftop
iperf
iproute2
iptables
iptraf-ng
iputils
ipvsadm
libc6-compat
liboping
mtr
net-snmp-tools
netcat-openbsd
ngrep
nmap
nmap-nping
py-crypto
py2-virtualenv
python2
scapy
socat
strace
tcpdump
tcptraceroute
util-linux
vim

In the example below, the container runs in privileged mode and in the host's network namespace, which gives you almost all the access you need from inside the container.

docker run -it --privileged --net host nicolaka/netshoot
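On a Kubernetes node you can get a similar effect without docker. This is a hedged sketch using kubectl run with a hostNetwork override (the pod name is arbitrary, and it does not enable privileged mode):

kubectl run tmp-netshoot --rm -it --image=nicolaka/netshoot \
  --overrides='{"apiVersion": "v1", "spec": {"hostNetwork": true}}' -- bash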

That's it! This should give you a good foundation for exploring other network solutions and troubleshooting tools. I hope it is helpful!
