Kubernetes Networking: Behind the scenes

Jul 17, 2018


One of the things I love most about Kelsey Hightower’s Kubernetes The Hard Way guide, other than that it just works (even on AWS!), is that it keeps networking clean and simple, which makes it a perfect opportunity to understand, for example, what the role of the Container Network Interface (CNI) is. Having said that, Kubernetes networking is not very intuitive, especially for newcomers… and do not forget: “there is no such thing as container networking”.

While there are very good resources around this topic (links here), I couldn’t find a single example that connects all of the dots with the command outputs that network engineers love and hate, showing what is actually happening behind the scenes. So, I decided to curate this information from a number of different sources to hopefully help you better understand how things are tied together. This is important not only for verification purposes, but also to ease troubleshooting. You can follow along with this example in your own Kubernetes The Hard Way cluster, as all of the IP addressing and settings are taken from it (May 2018 commits, before Nabla Containers).

Let’s start from the end; we have three controller and three worker nodes.

You might notice there are also at least three different private network subnets! Bear with me, we will explore them all. Keep in mind that while we refer to very specific IP prefixes, these are just the ones chosen for the Kubernetes The Hard Way guide, so they only have local significance and you can choose any other RFC 1918 address block for your environment. I will write a separate blog post for IPv6.

Node network

This is the internal network all your nodes are part of, specified with the flag --private-network-ip in GCP or the option --private-ip-address in AWS when provisioning the compute resources.

Provisioning controller nodes in GCP

Provisioning controller nodes in AWS

Each of your instances will then have two IP addresses; a private one from the node network (controllers:${i}/24, workers:${i}/24) and a public IP address assigned by your Cloud provider, which we will discuss later on when we get to NodePorts.


$ gcloud compute instances list
NAME          ZONE        MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP     STATUS
controller-0 us-west1-c n1-standard-1 35.231.XXX.XXX RUNNING
worker-1 us-west1-c n1-standard-1 35.231.XX.XXX RUNNING


$ aws ec2 describe-instances --query 'Reservations[].Instances[].[Tags[?Key==`Name`].Value[],PrivateIpAddress,PublicIpAddress]' --output text | sed '$!N;s/\n/ /'
34.228.XX.XXX controller-0
34.173.XXX.XX worker-1

All nodes should be able to ping each other if the security policies are correct (…and if ping is actually installed in the host).

Pod network

This is the network where pods live. Each worker node runs a subnet of this network. In our setup POD_CIDR=10.200.${i}.0/24 for worker-${i}.
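
To make the addressing convention concrete, here is a minimal sketch that prints the pod subnet for each of the three workers. The helper name pod_cidr is mine, not something from the guide.

```shell
# Print the pod subnet of worker-N, following the
# POD_CIDR=10.200.${i}.0/24 convention used in this cluster.
# (pod_cidr is a made-up helper name for illustration.)
pod_cidr() {
    printf '10.200.%s.0/24\n' "$1"
}

# Three worker nodes: worker-0, worker-1, worker-2.
for i in 0 1 2; do
    printf 'worker-%s %s\n' "$i" "$(pod_cidr "$i")"
done
```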

To understand how this is set up, we need to take a step back and review the Kubernetes networking model, which requires that:

  • All containers can communicate with all other containers without NAT
  • All nodes can communicate with all containers (and vice-versa) without NAT
  • The IP that a container sees itself as is the same IP that others see it as

Considering there can be multiple ways to meet these requirements, Kubernetes will typically hand off the network setup to a CNI plugin.

A CNI plugin is responsible for inserting a network interface into the container network namespace (e.g. one end of a veth pair) and making any necessary changes on the host (e.g. attaching the other end of the veth into a bridge). It should then assign the IP to the interface and setup the routes consistent with the IP Address Management section by invoking appropriate IPAM plugin. [CNI Plugin Overview]

Network namespace

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. [Namespaces man page]

Linux provides seven different namespaces (Cgroup, IPC, Network, Mount, PID, User and UTS). Network namespaces (CLONE_NEWNET) determine the network resources that are available to a process, “each network namespace has its own network devices, IP addresses, IP routing tables, /proc/net directory, port numbers, and so on”. [Namespaces in operation]

Virtual Ethernet (Veth) devices

A virtual network (veth) device pair provides a pipe-like abstraction that can be used to create tunnels between network namespaces, and can be used to create a bridge to a physical network device in another namespace. When a namespace is freed, the veth devices that it contains are destroyed. [Network namespace man page]
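
If you want to see what a namespace plus a veth pair look like without any CNI plugin involved, the following is a rough sketch of the manual steps. It only prints the commands by default. The namespace name, interface name, bridge name and address are all placeholders I made up, and the bridge is assumed to exist already. This is an illustration of the idea, not what any particular plugin runs.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 and run as root on a scratch machine to apply them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a namespace and wire it to an existing bridge with a veth pair.
# All four arguments are hypothetical values chosen by the caller.
connect_namespace() (
    ns=$1 host_if=$2 bridge=$3 pod_ip=$4
    run ip netns add "$ns"
    # One end stays in the root namespace, the other becomes eth0 in $ns.
    run ip link add "$host_if" type veth peer name eth0 netns "$ns"
    run ip link set "$host_if" master "$bridge"
    run ip link set "$host_if" up
    run ip netns exec "$ns" ip addr add "$pod_ip" dev eth0
    run ip netns exec "$ns" ip link set eth0 up
)

connect_namespace demo-ns veth-demo br-demo 10.200.0.99/24
```

Deleting the namespace afterwards (ip netns delete demo-ns) also removes the veth pair, in line with the man page quote above.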

Let’s bring this down to earth and see how all this is applied to our cluster. First of all, Network plugins in Kubernetes come in a few flavors; CNI plugins being one of them (why not CNM?). The Kubelet in each node will tell the container runtime what Network plugin to use. The Container Network Interface (CNI) sits in the middle between the container runtime and the network implementation. Only the CNI-plugin configures the network.

The CNI plugin is selected by passing Kubelet the --network-plugin=cni command-line option. Kubelet reads a file from --cni-conf-dir (default /etc/cni/net.d) and uses the CNI configuration from that file to set up each pod’s network. [Network Plugin Requirements]

The actual CNI plugin binaries are located in --cni-bin-dir (default /opt/cni/bin)

Notice our kubelet.service execution parameters include network-plugin=cni.

ExecStart=/usr/local/bin/kubelet \\
--config=/var/lib/kubelet/kubelet-config.yaml \\
--network-plugin=cni \\

Kubernetes first creates the network namespace for the pod before invoking any plugins. This is done by creating a pause container that “serves as the ‘parent container’ for all of the containers in your pod” [The Almighty Pause Container]. Kubernetes then invokes the CNI-plugin to join the pause container to a network. All containers in the pod use the pause network namespace (netns).

{
    "cniVersion": "0.3.1",
    "name": "bridge",
    "type": "bridge",
    "bridge": "cnio0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "ranges": [
          [{"subnet": "${POD_CIDR}"}]
        ],
        "routes": [{"dst": ""}]
    }
}

Our CNI config indicates we use the bridge plugin to configure an L2 Linux software bridge in the root namespace with the name cnio0 (the default name is cni0) that acts as a gateway (“isGateway”: true).

It will also set up a veth pair to attach the pod to the bridge just created.

To allocate L3 info such as IP addresses, an IPAM-plugin (ipam) is called. The type is host-local in this case, “which stores the state locally on the host filesystem, therefore ensuring uniqueness of IP addresses on a single host” [host-local plugin]. The IPAM-plugin returns this info to the previous plugin (bridge), so any routes provided in the config can be configured (“routes”: [{“dst”: “”}]). If no gw is provided, it will be derived from the subnet. A default route is also configured in the pod network namespace pointing to the bridge (which is configured with the first IP of the pod subnet).
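
As a quick sanity check of the “first IP of the pod subnet” rule, this small sketch computes that address from a CIDR. The function name first_ip is my own, and it only illustrates the arithmetic. It is not how the IPAM plugin is implemented.

```shell
# Return the first address after the network address of an IPv4 subnet
# given as a.b.c.d/len (first_ip is a hypothetical helper name).
first_ip() (
    cidr=$1
    addr=${cidr%/*}
    len=${cidr#*/}
    IFS=.
    set -- $addr
    n=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    gw=$(( (n & mask) + 1 ))
    printf '%d.%d.%d.%d\n' \
        "$(( (gw >> 24) & 255 ))" "$(( (gw >> 16) & 255 ))" \
        "$(( (gw >> 8) & 255 ))" "$(( gw & 255 ))"
)

# Bridge (gateway) address expected on worker-0 and worker-1.
first_ip 10.200.0.0/24
first_ip 10.200.1.0/24
```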

Last, but not least, we also requested to masquerade (“ipMasq”: true) traffic originating from the pod network. We don’t really need NAT here, but that’s the config in Kubernetes The Hard Way. So, for the sake of completeness, I should mention the entries in iptables that the bridge plugin configured for this particular example: all packets from the pod whose destination isn’t in the range will be NAT’ed, which is somewhat at odds with “all containers can communicate with all other containers without NAT”. Well, we will prove shortly that you don’t need NAT.
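
The masquerade decision described above boils down to a subnet membership test. Here is a minimal sketch of that logic, assuming the “range” is the worker’s own pod subnet. The helper names and the sample addresses are made up for illustration, and the real rules live in iptables, not in a script like this.

```shell
# Convert a dotted IPv4 address to an integer.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeeds when address $1 falls inside subnet $2 (a.b.c.d/len).
in_subnet() (
    len=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "${2%/*}") & mask )) ]
)

# Report what would happen to a packet leaving a pod:
# $1 is the destination address, $2 the local pod subnet.
masq_decision() {
    if in_subnet "$1" "$2"; then echo "no NAT"; else echo "masquerade"; fi
}

masq_decision 10.200.0.7 10.200.0.0/24   # pod on the same worker
masq_decision 10.200.1.5 10.200.0.0/24   # pod on another worker
```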

Pod routing

We are now ready to configure pods. We are going to take a look at all the Network namespaces in one of the worker nodes and analyze one of them after creating an nginx deployment as described here. We will use lsns with the option -t to select the type of namespace (net).

ubuntu@worker-0:~$ sudo lsns -t net
4026532089 net 113 1 root /sbin/init
4026532280 net 2 8046 root /pause
4026532352 net 4 16455 root /pause
4026532426 net 3 27255 root /pause

We can find out the inode number of these with the option -i in ls.

ubuntu@worker-0:~$ ls -1i /var/run/netns
4026532352 cni-1d85bb0c-7c61-fd9f-2adc-f6e98f7a58af
4026532280 cni-7cec0838-f50c-416a-3b45-628a4237c55c
4026532426 cni-912bcc63-712d-1c84-89a7-9e10510808a0

Optionally, you can also list all the Network namespaces with ip netns.

ubuntu@worker-0:~$ ip netns
cni-912bcc63-712d-1c84-89a7-9e10510808a0 (id: 2)
cni-1d85bb0c-7c61-fd9f-2adc-f6e98f7a58af (id: 1)
cni-7cec0838-f50c-416a-3b45-628a4237c55c (id: 0)

In order to see all the processes running in the network namespace cni-912bcc63-712d-1c84-89a7-9e10510808a0 (4026532426), you could do something like:

ubuntu@worker-0:~$ sudo ls -l /proc/[1-9]*/ns/net | grep 4026532426  | cut -f3 -d"/" | xargs ps -p
27255 ? Ss 0:00 /pause
27331 ? Ss 0:00 nginx: master process nginx -g daemon off;
27355 ? S 0:00 nginx: worker process

This indicates we are running nginx in a pod along with pause. The pause container and the rest of the containers in the pod share the net and ipc namespaces. Let’s keep the pause PID 27255 handy.
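
The one-liner above can also be written as a small function, which I find easier to reuse. This is just a sketch: pids_in_netns is a name I made up, and it relies on /proc and readlink being available (run it with sudo to see processes owned by other users).

```shell
# Print the PIDs whose network namespace has the given inode number.
# (pids_in_netns is a hypothetical helper, not a standard tool.)
pids_in_netns() (
    for link in /proc/[1-9]*/ns/net; do
        target=$(readlink "$link" 2>/dev/null) || continue
        if [ "$target" = "net:[$1]" ]; then
            pid=${link#/proc/}
            echo "${pid%%/*}"
        fi
    done
)

# Example on the worker node: sudo sh -c '...; pids_in_netns 4026532426'
```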

Let’s now see what kubectl can tell us about this pod:

$ kubectl get pods -o wide | grep nginx
nginx-65899c769f-wxdx6 1/1 Running 0 5d worker-0

Some more details:

$ kubectl describe pods nginx-65899c769f-wxdx6
Name: nginx-65899c769f-wxdx6
Namespace: default
Node: worker-0/
Start Time: Thu, 05 Jul 2018 14:20:06 -0400
Labels: pod-template-hash=2145573259
Annotations: <none>
Status: Running
Controlled By: ReplicaSet/nginx-65899c769f
Container ID: containerd://4c0bd2e2e5c0b17c637af83376879c38f2fb11852921b12413c54ba49d6983c7
Image: nginx

We have the pod name nginx-65899c769f-wxdx6 and the ID of one of the containers in it (nginx), but nothing about pause yet. It’s time to dig deeper on the worker node to connect all the dots. Keep in mind Kubernetes The Hard Way doesn’t use Docker, so we will use the Containerd CLI ctr to explore the container details.

ubuntu@worker-0:~$ sudo ctr namespaces ls

With the Containerd namespace (k8s.io), we can get the container IDs for nginx:

ubuntu@worker-0:~$ sudo ctr -n k8s.io containers ls | grep nginx
4c0bd2e2e5c0b17c637af83376879c38f2fb11852921b12413c54ba49d6983c7 docker.io/library/nginx:latest io.containerd.runtime.v1.linux

And pause:

ubuntu@worker-0:~$ sudo ctr -n k8s.io containers ls | grep pause
0866803b612f2f55e7b6b83836bde09bd6530246239b7bde1e49c04c7038e43a k8s.gcr.io/pause:3.1 io.containerd.runtime.v1.linux
21640aea0210b320fd637c22ff93b7e21473178de0073b05de83f3b116fc8834 k8s.gcr.io/pause:3.1 io.containerd.runtime.v1.linux
d19b1b1c92f7cc90764d4f385e8935d121bca66ba8982bae65baff1bc2841da6 k8s.gcr.io/pause:3.1 io.containerd.runtime.v1.linux

The container ID for nginx ending in 983c7 matches what we got with kubectl. Let’s see if we can find out which pause container belongs to the nginx pod.

ubuntu@worker-0:~$ sudo ctr -n k8s.io task ls
d19b1b1c92f7cc90764d4f385e8935d121bca66ba8982bae65baff1bc2841da6 27255 RUNNING
4c0bd2e2e5c0b17c637af83376879c38f2fb11852921b12413c54ba49d6983c7 27331 RUNNING

Do you remember the PIDs 27255 and 27331 running in the network namespace cni-912bcc63-712d-1c84-89a7-9e10510808a0?

ubuntu@worker-0:~$ sudo ctr -n k8s.io containers info d19b1b1c92f7cc90764d4f385e8935d121bca66ba8982bae65baff1bc2841da6
"ID": "d19b1b1c92f7cc90764d4f385e8935d121bca66ba8982bae65baff1bc2841da6",
"Labels": {
"io.cri-containerd.kind": "sandbox",
"io.kubernetes.pod.name": "nginx-65899c769f-wxdx6",
"io.kubernetes.pod.namespace": "default",
"io.kubernetes.pod.uid": "0b35e956-8080-11e8-8aa9-0a12b8818382",
"pod-template-hash": "2145573259",
"run": "nginx"
"Image": "k8s.gcr.io/pause:3.1",


ubuntu@worker-0:~$ sudo ctr -n k8s.io containers info 4c0bd2e2e5c0b17c637af83376879c38f2fb11852921b12413c54ba49d6983c7
"ID": "4c0bd2e2e5c0b17c637af83376879c38f2fb11852921b12413c54ba49d6983c7",
"Labels": {
"io.cri-containerd.kind": "container",
"io.kubernetes.container.name": "nginx",
"io.kubernetes.pod.name": "nginx-65899c769f-wxdx6",
"io.kubernetes.pod.namespace": "default",
"io.kubernetes.pod.uid": "0b35e956-8080-11e8-8aa9-0a12b8818382"
"Image": "docker.io/library/nginx:latest",

We now know exactly which containers are running in this pod (nginx-65899c769f-wxdx6) and network namespace (cni-912bcc63-712d-1c84-89a7-9e10510808a0):

  • nginx (ID: 4c0bd2e2e5c0b17c637af83376879c38f2fb11852921b12413c54ba49d6983c7)
  • pause (ID: d19b1b1c92f7cc90764d4f385e8935d121bca66ba8982bae65baff1bc2841da6)

So, how is this pod (nginx-65899c769f-wxdx6) actually connected to the network? Let’s take the pause PID 27255 we got before to run commands in its network namespace (cni-912bcc63-712d-1c84-89a7-9e10510808a0).

ubuntu@worker-0:~$ sudo ip netns identify 27255

We will use nsenter for this purpose with option -t to specify the target pid and also provide -n without a file in order to enter the network namespace of the target process (27255). Let’s see what ip link show,

ubuntu@worker-0:~$ sudo nsenter -t 27255 -n ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 0a:58:0a:c8:00:04 brd ff:ff:ff:ff:ff:ff link-netnsid 0

and ifconfig eth0 say:

ubuntu@worker-0:~$ sudo nsenter -t 27255 -n ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet netmask broadcast
inet6 fe80::2097:51ff:fe39:ec21 prefixlen 64 scopeid 0x20<link>
ether 0a:58:0a:c8:00:04 txqueuelen 0 (Ethernet)
RX packets 540 bytes 42247 (42.2 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 177 bytes 16530 (16.5 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
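
A side note on that MAC address: as far as I know, the version of the bridge plugin used here derives the interface MAC from the assigned IPv4 address, using a fixed 0a:58 prefix followed by the four address bytes. Treat that as my assumption, not something the guide states. Under that assumption, the following sketch decodes the address back from the MAC (mac_to_ip is a made-up helper name):

```shell
# Decode the last four bytes of a MAC address as a dotted IPv4 address.
# Assumes the MAC has the form 0a:58:<four IPv4 bytes in hex>.
mac_to_ip() (
    IFS=:
    set -- $1
    printf '%d.%d.%d.%d\n' "0x$3" "0x$4" "0x$5" "0x$6"
)

mac_to_ip 0a:58:0a:c8:00:04   # prints 10.200.0.4
```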

We confirm that the IP address we got before from kubectl get pod is configured on the pod’s eth0 interface. This interface is part of a veth pair: one end is in the pod and the other is in the root namespace. To find out what interface is on the other end, we use ethtool.

ubuntu@worker-0:~$ sudo ip netns exec cni-912bcc63-712d-1c84-89a7-9e10510808a0 ethtool -S eth0
NIC statistics:
peer_ifindex: 7

This tells us the peer ifindex is 7. We can now check what that is in the root namespace. We can do this with ip link:

ubuntu@worker-0:~$ ip link | grep '^7:'
7: veth71f7d238@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cnio0 state UP mode DEFAULT group default

To double-check:

ubuntu@worker-0:~$ sudo cat /sys/class/net/veth71f7d238/ifindex

Cool, the virtual link is clear now. We can see what else is connected to our Linux bridge with brctl:

ubuntu@worker-0:~$ brctl show cnio0
bridge name bridge id STP enabled interfaces
cnio0 8000.0a580ac80001 no veth71f7d238
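
The ifindex lookup we did with grep can be wrapped in a small filter if you need to repeat it for several pods. This is only a sketch: ifname_by_index is a name I invented, and it simply parses the output of ip link fed on stdin.

```shell
# Read `ip link` (or `ip -o link`) output on stdin and print the name of
# the interface whose index is $1, without the "@ifN" peer suffix.
# (ifname_by_index is a hypothetical helper name.)
ifname_by_index() {
    awk -v idx="$1" -F': ' '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}

# Demonstration with the line seen above; on the worker you would run:
#   ip -o link | ifname_by_index 7
ifname_by_index 7 <<'EOF'
7: veth71f7d238@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cnio0 state UP mode DEFAULT group default
EOF
```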

So we have this:

Validate routing

How do we actually forward traffic? Let’s look at the routing table in the pod’s network namespace:

ubuntu@worker-0:~$ sudo ip netns exec cni-912bcc63-712d-1c84-89a7-9e10510808a0 ip route show
default via dev eth0
dev eth0 proto kernel scope link src

So, we know how to get to the root namespace at least (via the default route). Let’s check the host’s route table now:

ubuntu@worker-0:~$ ip route list
default via dev eth0 proto dhcp src metric 100
dev cnio0 proto kernel scope link src
dev eth0 proto kernel scope link src
dev eth0 proto dhcp scope link src metric 100

We know how to forward packets to the VPC router (your VPC has an implicit router, which normally has the second address in the primary IP range for the subnet). Now, does the VPC router know how to reach each pod network? No, it doesn’t, so you’d expect either the CNI-plugin to install routes there or to do it manually yourself (as in the guide). I haven’t checked yet, but the AWS CNI-plugin probably handles this for us in AWS. Keep in mind there are tons of CNI-plugins out there; this example represents the simplest network setup.
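
To picture what the VPC router needs, here is a minimal sketch that prints one route per worker: the worker’s pod subnet pointing at that worker’s node address. The function name is mine and the node addresses in the example are documentation placeholders, not the ones from the cluster. Creating the routes themselves is done with your cloud provider’s tooling, as the guide shows.

```shell
# Print one route per worker: pod subnet -> worker node address.
# Arguments are the node addresses of worker-0, worker-1, ... in order.
# (pod_routes is a hypothetical helper name.)
pod_routes() (
    i=0
    for node_ip in "$@"; do
        echo "10.200.${i}.0/24 via ${node_ip}"
        i=$((i + 1))
    done
)

# Placeholder node addresses, for illustration only.
pod_routes 192.0.2.20 192.0.2.21 192.0.2.22
```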

NAT Deep dive

Let’s create two identical busybox containers with a Replication Controller using kubectl create -f busybox.yaml.

We get:

$ kubectl get pods -o wide
busybox0-g6pww 1/1 Running 0 4s worker-1
busybox0-rw89s 1/1 Running 0 4s worker-0

Pings from one container to another should be successful:

$ kubectl exec -it busybox0-rw89s -- ping -c 2
PING ( 56 data bytes
64 bytes from seq=0 ttl=62 time=0.528 ms
64 bytes from seq=1 ttl=62 time=0.440 ms
--- ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.440/0.484/0.528 ms

To understand the traffic flow you can either capture packets with tcpdump or use conntrack.

ubuntu@worker-0:~$ sudo conntrack -L | grep
icmp 1 29 src= dst= type=8 code=0 id=1280 src= dst= type=0 code=0 id=1280 mark=0 use=1

The pod’s source IP address 10.200.0.21 is translated to the node IP.

ubuntu@worker-1:~$ sudo conntrack -L | grep
icmp 1 28 src= dst= type=8 code=0 id=1280 src= dst= type=0 code=0 id=1280 mark=0 use=1

You can see the counters increasing in iptables as follows:

ubuntu@worker-0:~$ sudo iptables -t nat -Z POSTROUTING -L -v
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
5 324 CNI-be726a77f15ea47ff32947a3 all -- any any anywhere /* name: "bridge" id: "631cab5de5565cc432a3beca0e2aece0cef9285482b11f3eb0b46c134e457854" */
Zeroing chain `POSTROUTING'

On the other hand, if we removed “ipMasq”: true from the CNI-plugin config, we would see the following (we don’t recommend changing this config on a running cluster; this is only for educational purposes):

$ kubectl get pods -o wide
busybox0-2btxn 1/1 Running 0 16s worker-0
busybox0-dhpx8 1/1 Running 0 16s worker-1

Ping should still work:

$  kubectl exec -it busybox0-2btxn -- ping -c 2
PING ( 56 data bytes
64 bytes from seq=0 ttl=62 time=0.515 ms
64 bytes from seq=1 ttl=62 time=0.427 ms
--- ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.427/0.471/0.515 ms

Without NAT in this case:

ubuntu@worker-0:~$ sudo conntrack -L | grep
icmp 1 29 src= dst= type=8 code=0 id=1792 src= dst= type=0 code=0 id=1792 mark=0 use=1

So, we just verified “all containers can communicate with all other containers without NAT”.

ubuntu@worker-1:~$ sudo conntrack -L | grep
icmp 1 27 src= dst= type=8 code=0 id=1792 src= dst= type=0 code=0 id=1792 mark=0 use=1
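
If reading conntrack entries by eye gets tedious, the comparison can be automated. In each entry, the first src/dst pair is the original direction and the second pair is the expected reply; without NAT, the reply is the exact mirror of the original. The sketch below applies that rule. The function name is made up, and the sample entries use placeholder addresses (only 10.200.0.21 comes from the example above).

```shell
# Read conntrack entries on stdin and print, per entry, whether the
# source and/or destination address was rewritten.
# (nat_state is a hypothetical helper name.)
nat_state() {
    awk '{
        s = 0; d = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^src=/) src[++s] = substr($i, 5)
            if ($i ~ /^dst=/) dst[++d] = substr($i, 5)
        }
        if (s < 2 || d < 2) next
        state = ""
        # Reply goes to someone other than the original sender: source NAT.
        if (dst[2] != src[1]) state = "snat"
        # Reply comes from someone other than the original target: destination NAT.
        if (src[2] != dst[1]) state = (state == "" ? "dnat" : state "+dnat")
        print (state == "" ? "none" : state)
    }'
}

# Placeholder entries: the first is masqueraded, the second is not.
nat_state <<'EOF'
icmp 1 29 src=10.200.0.21 dst=10.200.1.5 type=8 code=0 id=1280 src=10.200.1.5 dst=192.0.2.20 type=0 code=0 id=1280 mark=0 use=1
icmp 1 29 src=10.200.0.21 dst=10.200.1.5 type=8 code=0 id=1792 src=10.200.1.5 dst=10.200.0.21 type=0 code=0 id=1792 mark=0 use=1
EOF
```

On a worker you would feed it real data with something like sudo conntrack -L | nat_state.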

Cluster network

You probably noticed in the busybox example that the IP addresses allocated for a busybox pod were different in each case. What if we wanted to make these containers available so that other pods could reach them? You could take their current pod IP addresses, but these will eventually change. For this reason, you want to configure a Service resource that will proxy requests to a set of ephemeral pods.

“A Service in Kubernetes is an abstraction which defines a logical set of Pods and a policy by which to access them” [Kubernetes Services]

There are different ways to expose a service; the default type is ClusterIP, which will set up an IP address out of the cluster CIDR (only reachable from within the cluster). One example is the DNS Cluster Add-on configured in Kubernetes The Hard Way.

kubectl reveals the Service keeps track of the endpoints, and it will do the translation for you.

$ kubectl -n kube-system describe services
Selector: k8s-app=kube-dns
Type: ClusterIP
Port: dns 53/UDP
TargetPort: 53/UDP
Port: dns-tcp 53/TCP
TargetPort: 53/TCP

How exactly?… iptables again. Let’s go through the rules that were created for this example. You can list them all with the iptables-save command.

As packets are produced by a process (OUTPUT) or just arrived on the network interface (PREROUTING), they are inspected by the following iptables chains:

-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES

The following targets match TCP packets destined to port 53 and translate the destination address to port 53.

-A KUBE-SERVICES -d -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-SEP-32LPCMGYG6ODGN3H
-A KUBE-SEP-32LPCMGYG6ODGN3H -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination

The following targets match UDP packets destined to port 53 and translate the destination address to port 53.

-A KUBE-SERVICES -d -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SVC-TCOU7JCQXEZGVUNU -m comment --comment "kube-system/kube-dns:dns" -j KUBE-SEP-LRUTK6XRXU43VLIG
-A KUBE-SEP-LRUTK6XRXU43VLIG -p udp -m comment --comment "kube-system/kube-dns:dns" -m udp -j DNAT --to-destination
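
To follow these jumps without reading the whole iptables-save dump, you can walk the chains with a few lines of awk. This is a rough sketch under simplifying assumptions: trace_chain is a made-up name, and it only follows the first jump of each chain, which is enough for a Service with a single endpoint like this one. POD_IP in the sample is a placeholder.

```shell
# Read iptables-save output on stdin and print the chain path that
# starts at chain $1, following the first "-j" target of each chain.
# (trace_chain is a hypothetical helper name.)
trace_chain() {
    awk -v chain="$1" '
        $1 == "-A" {
            for (i = 3; i <= NF; i++)
                if ($i == "-j" && !($2 in jump)) jump[$2] = $(i + 1)
        }
        END {
            path = chain
            hops = 0
            # Stop at a terminal target such as DNAT, or after 16 hops.
            while ((chain in jump) && hops++ < 16) {
                chain = jump[chain]
                path = path " -> " chain
            }
            print path
        }'
}

# Demonstration with the TCP rules above (POD_IP is a placeholder).
trace_chain KUBE-SVC-ERIFXISQEP7F7OF4 <<'EOF'
-A KUBE-SVC-ERIFXISQEP7F7OF4 -m comment --comment "kube-system/kube-dns:dns-tcp" -j KUBE-SEP-32LPCMGYG6ODGN3H
-A KUBE-SEP-32LPCMGYG6ODGN3H -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp" -m tcp -j DNAT --to-destination POD_IP:53
EOF
```

On a worker node, something like sudo iptables-save -t nat | trace_chain KUBE-SVC-ERIFXISQEP7F7OF4 should give the same path.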

There are other types of Services in Kubernetes; NodePort in particular is also covered in Kubernetes The Hard Way. See Smoke Test: Services.

kubectl expose deployment nginx --port 80 --type NodePort

NodePort exposes the service on each Node’s IP at a static port (the NodePort). You can access the NodePort service from outside the cluster. You can check the port allocated with kubectl (31088 in this example).

$ kubectl describe services nginx
Type: NodePort
Port: <unset> 80/TCP
TargetPort: 80/TCP
NodePort: <unset> 31088/TCP

The pod is now reachable from the Internet at http://${EXTERNAL_IP}:31088/, where EXTERNAL_IP is the public IP address of any of your worker instances. I used worker-0’s public IP address in this example. The request is received by the node on its private IP (the Cloud provider handles the public-facing NAT); however, the service is actually running on another node (worker-1, as you can tell by the endpoint’s IP address 10.200.1.18).

ubuntu@worker-0:~$ sudo conntrack -L | grep 31088
tcp 6 86397 ESTABLISHED src=173.38.XXX.XXX dst= sport=30303 dport=31088 src= dst= sport=80 dport=30303 [ASSURED] mark=0 use=1

So the packet is forwarded from worker-0 to worker-1, where it reaches its destination.

ubuntu@worker-1:~$ sudo conntrack -L | grep 80
tcp 6 86392 ESTABLISHED src= dst= sport=14802 dport=80 src= dst= sport=80 dport=14802 [ASSURED] mark=0 use=1

Is it ideal? Probably not, but it works. The iptables rules programmed in this case are:

-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx:" -m tcp --dport 31088 -j KUBE-SVC-4N57TFCL4MD7ZTDA
-A KUBE-SVC-4N57TFCL4MD7ZTDA -m comment --comment "default/nginx:" -j KUBE-SEP-UGTFMET44DQG7H7H
-A KUBE-SEP-UGTFMET44DQG7H7H -p tcp -m comment --comment "default/nginx:" -m tcp -j DNAT --to-destination

In other words, the destination address of packets with destination port 31088 is translated to the endpoint’s address. The port is also translated from 31088 to 80.

We didn’t cover the Service type LoadBalancer that exposes the service externally using a cloud provider’s load balancer, as this post is long enough already.


While this might seem like a lot, we are only scratching the surface. I’m planning to cover IPv6, IPVS, eBPF and a couple of interesting actual CNI-plugins next.

I hope this has been informative. Please let me know if you think I got something wrong or if you spot any typos.
