A Hands-on Kubernetes Network Troubleshooting Journey

Pradipta Banerjee
Published in ITNEXT · 14 min read · Oct 20, 2023


While developing the Kata/remote-hypervisor (aka peer-pods) approach, I encountered an issue where the Kubernetes pod IP was unreachable from the worker node. In this blog, I describe the Kubernetes network troubleshooting journey in the hope that it'll be helpful to my readers.

The Kata remote hypervisor (peer-pods) approach enables the creation of Kata VMs in any infrastructure environment by utilising the environment's native infrastructure management APIs: for example, the AWS or Microsoft Azure APIs when creating Kata VMs on AWS or Azure, respectively. The cloud-api-adaptor sub-project of the CNCF Confidential Containers project implements the Kata remote hypervisor.

As shown in the diagram below, in the peer-pods approach, the pod (Kata) VM runs external to the Kubernetes (K8s) worker node, and the pod IP is reachable from the worker node using a VXLAN tunnel. Using a tunnel ensures that the pod networking continues to work as-is without any changes to the CNI networking.

When using Kata containers, a Kubernetes pod runs inside a VM; consequently, we refer to the VM running the pod as the Kata VM or the pod VM.
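For reference, a VXLAN tunnel endpoint of the kind used here can be created with iproute2 along the following lines. This is only an illustrative sketch using the VNI and peer addresses that appear later in this post; the actual tunnel is set up programmatically by cloud-api-adaptor, so the real commands may differ.

ip link add vxlan1 type vxlan id 555005 remote 192.168.10.201 dstport 4789
ip link set vxlan1 up

A matching device on the pod VM points back at the worker node (remote 192.168.10.163), forming the two ends of the tunnel.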

Problem

The pod IP 10.132.2.46, which is on the pod VM (IP: 192.168.10.201), is not reachable from the worker node VM (IP: 192.168.10.163).

Following are the details of the VMs in my environment: the worker node VM and the pod (Kata) VM. OVN-Kubernetes was the Kubernetes CNI in use.

+===========================+================+================+
| VM Name                   | IP Address     | Remarks        |
+===========================+================+================+
| ocp-412-ovn-worker-1      | 192.168.10.163 | Worker Node VM |
+---------------------------+----------------+----------------+
| podvm-nginx-priv-8b726648 | 192.168.10.201 | Pod VM         |
+---------------------------+----------------+----------------+

The easiest solution would have been to get help from networking experts. However, in my case, the experts were not available to work directly on the problem due to other pressing issues. Further, the peer-pods network topology was new and involved multiple networking stacks (the Kubernetes CNI, Kata networking and the VXLAN tunnel), making root-cause analysis difficult and time-consuming.

So, taking the situation as an opportunity to improve my Kubernetes networking skills, I started working on it independently, with some guidance from core Linux networking experts.

In subsequent sections, I’ll walk you through my approach to debugging and finding the root cause of the issue. I hope this walkthrough will be of some help when troubleshooting Kubernetes networking issues.

Troubleshooting — Phase 1

At a high level, the approach I took consisted of the following two steps:

  1. Understanding the network topology
  2. Identifying problematic pieces from the topology

Let’s ping the IP: 10.132.2.46 from the worker node VM and trace the flow across the networking stack:

[root@ocp-412-worker-1 core]# ping 10.132.2.46

Linux refers to the route table to determine where to send this packet.

[root@ocp-412-worker-1 core]# ip route get 10.132.2.46
10.132.2.46 dev ovn-k8s-mp0 src 10.132.2.2 uid 0

So, the route to the pod IP is via the device ovn-k8s-mp0.

Let’s get the worker node network details and retrieve information on the ovn-k8s-mp0 device.

[root@ocp-412-ovn-worker-1 core]# ip r
default via 192.168.10.1 dev br-ex proto dhcp src 192.168.10.163 metric 48
10.132.0.0/14 via 10.132.2.1 dev ovn-k8s-mp0
10.132.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.132.2.2
169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2
169.254.169.1 dev br-ex src 192.168.10.163 mtu 1400
169.254.169.3 via 10.132.2.1 dev ovn-k8s-mp0
172.30.0.0/16 via 169.254.169.4 dev br-ex mtu 1400
192.168.10.0/24 dev br-ex proto kernel scope link src 192.168.10.163 metric 48


[root@ocp-412-ovn-worker-1 core]# ip a

[snip]

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
link/ether 52:54:00:f9:70:58 brd ff:ff:ff:ff:ff:ff
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 32:7c:7a:20:6e:5a brd ff:ff:ff:ff:ff:ff
4: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
link/ether 3a:9c:a8:4e:15:0c brd ff:ff:ff:ff:ff:ff
inet6 fe80::389c:a8ff:fe4e:150c/64 scope link
valid_lft forever preferred_lft forever
5: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN group default qlen 1000
link/ether d2:b6:67:15:ef:06 brd ff:ff:ff:ff:ff:ff
6: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether ee:cb:ed:8e:f9:e0 brd ff:ff:ff:ff:ff:ff
inet 10.132.2.2/23 brd 10.132.3.255 scope global ovn-k8s-mp0
valid_lft forever preferred_lft forever
inet6 fe80::eccb:edff:fe8e:f9e0/64 scope link
valid_lft forever preferred_lft forever
8: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 52:54:00:f9:70:58 brd ff:ff:ff:ff:ff:ff
inet 192.168.10.163/24 brd 192.168.10.255 scope global dynamic noprefixroute br-ex
valid_lft 2266sec preferred_lft 2266sec
inet 169.254.169.2/29 brd 169.254.169.7 scope global br-ex
valid_lft forever preferred_lft forever
inet6 fe80::17f3:957b:5e8d:a4a6/64 scope link noprefixroute
valid_lft forever preferred_lft forever

[snip]

As shown in the above output, the ovn-k8s-mp0 interface has the IP 10.132.2.2/23.

Let’s get the device details for the ovn-k8s-mp0 interface.

As shown in the output below, this interface is an OVS entity.

[root@ocp-412-ovn-worker-1 core]# ip -d li sh dev ovn-k8s-mp0
6: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether ee:cb:ed:8e:f9:e0 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
openvswitch addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Is ovn-k8s-mp0 an OVS bridge?

From the command output below, it's clear that ovn-k8s-mp0 is not an OVS bridge. The only two bridges on the worker node are br-ex and br-int.

[root@ocp-412-ovn-worker-1 core]# ovs-vsctl list-br
br-ex
br-int

So ovn-k8s-mp0 must be an OVS port. We'll need to find out which OVS bridge owns this port.

From the command output below, it’s clear that ovn-k8s-mp0 is not an OVS port for the bridge br-ex.

[root@ocp-412-ovn-worker-1 core]# ovs-ofctl dump-ports br-ex ovn-k8s-mp0
ovs-ofctl: br-ex: unknown port `ovn-k8s-mp0`

From the command output below, it’s clear that ovn-k8s-mp0 is an OVS port for the bridge br-int.

[root@ocp-412-ovn-worker-1 core]# ovs-ofctl dump-ports br-int ovn-k8s-mp0
OFPST_PORT reply (xid=0x4): 1 ports
port "ovn-k8s-mp0": rx pkts=1798208, bytes=665641420, drop=2, errs=0, frame=0, over=0, crc=0tx pkts=2614471, bytes=1357528110, drop=0, errs=0, coll=0

To summarise, ovn-k8s-mp0 is a port on the br-int OVS bridge, and it holds the IP address 10.132.2.2/23 seen earlier.
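As an aside, ovs-vsctl can answer the port-to-bridge question directly, assuming the standard ovs-vsctl CLI is available on the node:

ovs-vsctl port-to-br ovn-k8s-mp0

In this environment, it should print br-int, matching what we found above.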

Now, let’s get the network config details for the pod.

The pod network namespace must be known to determine the pod network details. The command shown below finds the pod network namespace from its IP.

[root@ocp-412-ovn-worker-1 core]# POD_IP=10.132.2.46; for ns in $(ip netns ls | cut -f 1 -d " "); do ip netns exec $ns ip a | grep -q $POD_IP; status=$?; [ $status -eq 0 ] && echo "pod namespace: $ns" ; done

pod namespace: c16c7a01-1bc5-474a-9eb6-15474b5fbf04

Once the pod network namespace is known, one can find the network config details for the pod, as shown below.

[root@ocp-412-ovn-worker-1 core]# NS=c16c7a01-1bc5-474a-9eb6-15474b5fbf04
[root@ocp-412-ovn-worker-1 core]# ip netns exec $NS ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if4256: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default qlen 1000
link/ether 0a:58:0a:84:02:2e brd ff:ff:ff:ff:ff:ff link-netns 59e250f6-0491-4ff4-bb22-baa3bca249f6
inet 10.132.2.46/23 brd 10.132.3.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::858:aff:fe84:22e/64 scope link
valid_lft forever preferred_lft forever
4257: vxlan1@if4257: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether ca:40:81:86:fa:73 brd ff:ff:ff:ff:ff:ff link-netns 59e250f6-0491-4ff4-bb22-baa3bca249f6
inet6 fe80::c840:81ff:fe86:fa73/64 scope link
valid_lft forever preferred_lft forever


[root@ocp-412-ovn-worker-1 core]# ip netns exec $NS ip r
default via 10.132.2.1 dev eth0
10.132.2.0/23 dev eth0 proto kernel scope link src 10.132.2.46

So eth0@if4256 is the primary network interface for the pod.

Let’s get the details for the eth0 device.

As can be seen from the output below, the eth0 device in the pod network namespace is a veth device.

[root@ocp-412-ovn-worker-1 core]# ip netns exec $NS ip -d li sh dev eth0
link/ether 0a:58:0a:84:02:2e brd ff:ff:ff:ff:ff:ff link-netns 59e250f6-0491-4ff4-bb22-baa3bca249f6
veth addrgenmode eui64 numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536

It’s known that veth devices work in pairs; one end is in the init (or the root) namespace, and the other is in the (pod) network namespace.

Let’s find the corresponding veth device pair for the pod in the init namespace.

[root@ocp-412-ovn-worker-1 core]# ip a | grep -A1 ^4256
4256: 8b7266486ea2861@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP group default
link/ether de:fb:3e:87:0f:d6 brd ff:ff:ff:ff:ff:ff link-netns c16c7a01-1bc5-474a-9eb6-15474b5fbf04

So, 8b7266486ea2861@if2 is the veth device endpoint for the pod in the init namespace. This veth pair connects the init and the pod network namespace.
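As an aside, if ethtool is available, the peer index of a veth device can also be read directly from inside the pod network namespace:

ip netns exec $NS ethtool -S eth0 | grep peer_ifindex

Here it should report peer_ifindex: 4256, the interface index we just matched in the init namespace.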

Let’s find out the details of the veth device endpoint.

[root@ocp-412-ovn-worker-1 core]# ip -d li sh dev 8b7266486ea2861
4256: 8b7266486ea2861@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP mode DEFAULT group default
link/ether de:fb:3e:87:0f:d6 brd ff:ff:ff:ff:ff:ff link-netns c16c7a01-1bc5-474a-9eb6-15474b5fbf04 promiscuity 1 minmtu 68 maxmtu 65535
veth
openvswitch_slave addrgenmode eui64 numtxqueues 4 numrxqueues 4 gso_max_size 65536 gso_max_segs 65535

So 8b7266486ea2861@if2 is an OVS entity as well. Dumping the OVS switch configuration will show which OVS bridge owns this port.

As shown in the output below, the bridge br-int owns the port.

Note that the ovs-vsctl command used here is an alternative to the earlier ovs-ofctl dump-ports <bridge> <port> command; this is to show that different commands can help explore the network topology.

[root@ocp-412-ovn-worker-1 core]# ovs-vsctl show

[snip]

    Bridge br-int
        fail_mode: secure
        datapath_type: system

[snip]

        Port "8b7266486ea2861"
            Interface "8b7266486ea2861"

[snip]

So br-int owns the port holding the pod's veth endpoint in the init namespace. Recall that it also owns the ovn-k8s-mp0 port.

Let’s also get the vxlan details of the pod.

As shown in the output below, the remote endpoint of the vxlan tunnel is the IP 192.168.10.201, which is the IP of the pod VM.

[root@ocp-412-ovn-worker-1 core]# ip netns exec $NS ip -d li sh dev vxlan1
4257: vxlan1@if4257: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether ca:40:81:86:fa:73 brd ff:ff:ff:ff:ff:ff link-netns 59e250f6-0491-4ff4-bb22-baa3bca249f6 promiscuity 0 minmtu 68 maxmtu 65535
vxlan id 555005 remote 192.168.10.201 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

A question that comes to mind is how the packet is sent to the vxlan1 interface from the eth0 interface.

This is made possible by Linux Traffic Control (TC) rules set up in the pod network namespace to redirect traffic between eth0 and vxlan1. This is known from the design of Kata containers. Regardless, I think it's good practice to check for TC configurations when troubleshooting network issues.

The output below shows the TC filters configured for the devices in the pod network namespace in my environment.

[root@ocp-412-ovn-worker-1 core]# ip netns exec $NS tc filter show dev eth0 root
filter parent ffff: protocol all pref 49152 u32 chain 0
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid not_in_hw
match 00000000/00000000 at 0
action order 1: mirred (Egress Redirect to device vxlan1) stolen
index 1 ref 1 bind 1

[root@ocp-412-ovn-worker-1 core]# ip netns exec $NS tc filter show dev vxlan1 root
filter parent ffff: protocol all pref 49152 u32 chain 0
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800: ht divisor 1
filter parent ffff: protocol all pref 49152 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 terminal flowid not_in_hw
match 00000000/00000000 at 0
action order 1: mirred (Egress Redirect to device eth0) stolen
index 2 ref 1 bind 1

The egress-redirect actions splice the two devices together: packets received on eth0 are sent out of vxlan1, and packets received on vxlan1 are sent out of eth0.
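For reference, a cross-redirect like this can be created with tc roughly as follows. This is a minimal sketch run against the pod network namespace; the actual configuration is applied programmatically by the Kata/cloud-api-adaptor code, so the exact commands may differ.

ip netns exec $NS tc qdisc add dev eth0 handle ffff: ingress
ip netns exec $NS tc qdisc add dev vxlan1 handle ffff: ingress
ip netns exec $NS tc filter add dev eth0 parent ffff: protocol all u32 match u32 0 0 action mirred egress redirect dev vxlan1
ip netns exec $NS tc filter add dev vxlan1 parent ffff: protocol all u32 match u32 0 0 action mirred egress redirect dev eth0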

With all these details, it’s possible to create a diagram of the worker node’s network topology for reference and further analysis. The topology is depicted in the following diagram.

Now, let’s switch our focus to the pod VM.

Note that the pod VM uses a fixed pod network namespace named podns by design.

The following output shows the network configuration of the pod VM.

ubuntu@podvm-nginx-priv-8b726648:/home/ubuntu# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:e1:58:67 brd ff:ff:ff:ff:ff:ff
inet 192.168.10.201/24 brd 192.168.10.255 scope global dynamic ens2
valid_lft 2902sec preferred_lft 2902sec
inet6 fe80::5054:ff:fee1:5867/64 scope link
valid_lft forever preferred_lft forever

root@podvm-nginx-priv-8b726648:/home/ubuntu# ip r
default via 192.168.10.1 dev ens2 proto dhcp src 192.168.10.201 metric 100
192.168.10.0/24 dev ens2 proto kernel scope link src 192.168.10.201
192.168.10.1 dev ens2 proto dhcp scope link src 192.168.10.201 metric 100

root@podvm-nginx-priv-8b726648:/home/ubuntu# iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT

The following output shows the network configuration inside the podns network namespace.

root@podvm-nginx-priv-8b726648:/home/ubuntu# ip netns exec podns ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: vxlan0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 7e:e5:f7:e6:f5:1a brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.132.2.46/23 brd 10.132.3.255 scope global vxlan0
valid_lft forever preferred_lft forever
inet6 fe80::7ce5:f7ff:fee6:f51a/64 scope link
valid_lft forever preferred_lft forever

root@podvm-nginx-priv-8b726648:/home/ubuntu# ip netns exec podns ip r
default via 10.132.2.1 dev vxlan0
10.132.2.0/23 dev vxlan0 proto kernel scope link src 10.132.2.46

root@podvm-nginx-36590ccc:/home/ubuntu# ip netns exec podns ip -d li sh vxlan0
3: vxlan0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 7e:e5:f7:e6:f5:1a brd ff:ff:ff:ff:ff:ff link-netnsid 0 promiscuity 0 minmtu 68 maxmtu 65535
vxlan id 555005 remote 192.168.10.163 srcport 0 0 dstport 4789 nolearning ttl auto ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

root@podvm-nginx-priv-8b726648:/home/ubuntu# ip netns exec podns iptables -S
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT

The vxlan tunnel setup looks okay. It shows the remote endpoint IP 192.168.10.163, which is the IP of the worker node VM.

Also, no firewall rule exists in the pod VM.

However, unlike on the worker node, there is no veth pair in the pod VM. So how does communication happen between the init and the podns network namespaces without a veth pair? Note that the physical device is in the init (root) namespace, while the vxlan device is in the podns network namespace.

Thanks to Stefano Brivio for pointing out the Linux kernel commit which makes this happen.

commit f01ec1c017dead42092997a2b8684fcab4cbf126
Author: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu Apr 24 10:02:49 2014 +0200
vxlan: add x-netns support

This patch allows to switch the netns when packet is encapsulated or
decapsulated.
The vxlan socket is openned into the i/o netns, ie into the netns where
encapsulated packets are received. The socket lookup is done into this netns to
find the corresponding vxlan tunnel. After decapsulation, the packet is
injecting into the corresponding interface which may stand to another netns.

When one of the two netns is removed, the tunnel is destroyed.

Configuration example:
ip netns add netns1
ip netns exec netns1 ip link set lo up
ip link add vxlan10 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
ip link set vxlan10 netns netns1
ip netns exec netns1 ip addr add 192.168.0.249/24 broadcast 192.168.0.255 dev vxlan10
ip netns exec netns1 ip link set vxlan10 up

There is also a StackOverflow thread explaining this.

These details give us a good overview of the network topology of the pod VM, which is depicted in the following diagram.

Let’s run tcpdump on the vxlan0 interface to see if the ICMP requests are received from the worker node.

As shown in the output below, the ICMP requests are received, but there is no response.

root@podvm-nginx-priv-8b726648:/home/ubuntu# ip netns exec podns tcpdump -i vxlan0 -s0 -n -vv
tcpdump: listening on vxlan0, link-type EN10MB (Ethernet), capture size 262144 bytes

[snip]

10.132.2.2 > 10.132.2.46: ICMP echo request, id 20, seq 1, length 64
10:34:17.389643 IP (tos 0x0, ttl 64, id 27606, offset 0, flags [DF], proto ICMP (1), length 84)
10.132.2.2 > 10.132.2.46: ICMP echo request, id 20, seq 2, length 64
10:34:18.413682 IP (tos 0x0, ttl 64, id 27631, offset 0, flags [DF], proto ICMP (1), length 84)
10.132.2.2 > 10.132.2.46: ICMP echo request, id 20, seq 3, length 64
10:34:19.002837 IP (tos 0x0, ttl 1, id 28098, offset 0, flags [DF], proto UDP (17), length 69)

[snip]
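As a side note, had the requests not shown up here, a useful next check would have been to confirm on the worker node that the encapsulated packets actually leave towards the pod VM, for example with something like the following (a sketch; the capture interface depends on the environment, ens3 being the worker node's uplink here):

tcpdump -i ens3 -nn 'udp port 4789 and host 192.168.10.201'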

Let’s take stock of the situation now.

After this exercise, you have a decent understanding of the network topology of the worker node and the pod VM, and there is no indication of any setup issue with the tunnel. You also see that the ICMP packets are received by the pod VM and that no software firewall is blocking them. So what's next?

Continue reading to find out what’s next :-)

Troubleshooting — Phase 2

I used Wireshark to analyse the tcpdump capture from a working (regular Kata) setup. The Wireshark GUI makes it easy to understand network traces captured via tcpdump.

No ARP requests or responses were observed in the traces. However, the ARP table on the worker node does get populated, and the entry for the pod IP carries the MAC of the eth0 device (in the pod network namespace on the worker node), not that of the vxlan0 device of the pod VM (in the podns namespace).
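The entry can be listed on the worker node with, for example, arp -an or ip neigh show (the exact command was not part of the original capture):

arp -an | grep 10.132.2.46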

? (10.132.2.46) at 0a:58:0a:84:02:2e [ether] on ovn-k8s-mp0

0a:58:0a:84:02:2e is the MAC of the eth0 interface in the pod network namespace on the worker node, whereas 7e:e5:f7:e6:f5:1a is the MAC of the vxlan0 interface in the podns namespace on the pod VM.

This is the cause of the pod IP not being reachable from the worker node. The ICMP requests leave the worker node with the eth0 MAC as the destination, and after VXLAN decapsulation they arrive at vxlan0 on the pod VM, whose MAC is different. The kernel therefore treats them as not destined for this host and drops them without replying (tcpdump still shows them because it puts the interface in promiscuous mode). The ARP entry should instead resolve to the MAC of the vxlan0 device in the podns namespace on the pod VM (i.e., 7e:e5:f7:e6:f5:1a).

In hindsight, I should have started by looking at the ARP table entry. Next time I encounter a similar issue, I surely will ;)

Solution

A neat trick from Stefano Brivio solved the issue: giving the vxlan0 interface of the pod VM the same MAC address as the pod's eth0 interface on the worker node resolves the connectivity problem.

ip netns exec podns ip link set vxlan0 down
ip netns exec podns ip link set dev vxlan0 address 0a:58:0a:84:02:2e
ip netns exec podns ip link set vxlan0 up
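After this change, the original check from the worker node can be repeated to confirm that the pod IP now responds:

ping -c 3 10.132.2.46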

The final network topology looks like this.

Conclusion

Debugging network issues in a Kubernetes cluster is non-trivial. However, with a well-defined approach, some help from experts, and publicly available materials, it's possible to root-cause and fix issues. And it's fun to learn along the way.

I hope this is useful.

Following is a list of reference materials that were extremely helpful in my troubleshooting exercise.
