Inspecting and Understanding Kubernetes (k8s) Service Network

Harinderjit Singh · Published in ITNEXT · Aug 3, 2022

Services bring stability

A Kubernetes Service object creates a stable network endpoint that sits in front of a set of Pods and load-balances traffic across them.

You always put a Service in front of a set of Pods that do the same job (run the same container images). For example, you could put a Service in front of your web front-end Pods and another in front of your authentication Pods. You never put a Service in front of Pods that do different jobs (run different container images).

Clients talk to the Service and the Service load balances traffic to the Pods.

Fig 1

In the diagram above, the Pods at the bottom can come and go as scaling, updates, failures, and other events occur, and the Service keeps track of them. However, the name, IP, and port of the Service will never change.

Anatomy of a Kubernetes Service

It’s useful to think of a Kubernetes Service as having a front-end and a back-end:

  • Front-end: a name, IP, and port that never change
  • Back-end: the Pods that match a label selector

The front-end is stable and reliable. This means the name, IP, and port number are guaranteed to never change for the entire life of the Service. The stable nature of the Service front-end also means that you do not need to worry about stale entries on clients that cache DNS results for longer than the standards recommend.

The back-end is highly dynamic and will load-balance traffic to all Pods in the cluster that match the set of labels the Service is configured to look for.

Fig 2
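To make the front-end/back-end split concrete, here is a minimal Service manifest (a sketch; the name nginx-svc and the selector app=nginx match the demo Service created later in this article):

#on cloudshell with kubectl access
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc        # front-end: stable name registered in cluster DNS
spec:
  selector:
    app: nginx           # back-end: any Pod carrying this label becomes an endpoint
  ports:
  - port: 80             # front-end: stable port on the ClusterIP
    targetPort: 80       # back-end: container port the traffic is sent to
EOF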

Load-balancing in this situation is simple L4 round-robin load-balancing. This works at the “connection” level where all requests over the same connection go to the same Pod. This means two things:

  1. Multiple requests from the same browser will tend to hit the same Pod, because browsers send all requests over a single connection that is kept open using keepalives. Tools like curl open a new connection for each request and may therefore hit different Pods.
  2. Load-balancing is not aware of application layer (L7) concepts such as HTTP headers and cookie-based session affinity.
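You can see the connection-level behavior for yourself (a sketch; it assumes the nginx-svc Service created later in this article and uses the curlimages/curl image as an arbitrary client):

kubectl run client --rm -it --image=curlimages/curl --restart=Never --command -- sh -c \
  'for i in 1 2 3 4 5; do curl -s -o /dev/null -w "%{remote_ip}\n" http://nginx-svc; done'
# each curl invocation opens a new connection, so the printed Pod IPs may differ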

Recapping the intro

Applications run in containers, which in turn run inside of Pods. All Pods in your Kubernetes cluster have their own IP address and are attached to the same flat Pod network. This means all Pods can talk directly to all other Pods. However, Pods are unreliable and come and go as scaling operations, rolling updates, rollbacks, failures, and other events occur. Fortunately, Kubernetes provides a stable networking endpoint called a Service that sits in front of a collection of similar Pods and presents a stable name, IP, and port. Clients connect to the Service and the Service load balances the traffic to Pods.

Service registration and discovery

When a new Service is created, it is allocated a virtual IP address called a ClusterIP. This is automatically registered against the name of the Service in the cluster’s internal DNS, and the relevant Endpoints objects (or EndpointSlices) are created to hold the list of healthy Pods that the Service will load-balance to.
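You can observe both pieces of this registration (a sketch; it assumes the nginx-svc Service and nginx-deployment Deployment created later in this article, and a glibc-based image so that getent is available):

kubectl get endpoints nginx-svc                                        # the healthy Pod IP:port list
kubectl exec deployment/nginx-deployment -- getent hosts nginx-svc     # the name resolves to the ClusterIP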

At the same time, all Nodes in the cluster are configured with the iptables/IPVS rules that listen for traffic to this ClusterIP and redirect it to real Pod IPs. The flow is summarized in the image below, though the ordering of some events might be slightly different.

Fig 3

When a Pod needs to connect to another Pod, it does this via a Service. It sends a query to the cluster DNS to resolve the name of the Service to its ClusterIP and then sends traffic to the ClusterIP. This ClusterIP is on a special network called the Service network. However, there are no routes to the Service network, so the Pod sends traffic to its default gateway. This gets forwarded to an interface on the Node the Pod is running on, and eventually to the default gateway of the Node. As part of this operation, the Node’s kernel traps on the address and rewrites the destination IP field in the packet header (using iptables/IPVS) so that it now goes to the IP of a healthy Pod.

This is summarized in the image below.

Fig 4

Test Configuration

We went through a lot of theory to understand the Service network. Let's inspect an actual Service network in a Kubernetes cluster.

Provision a 3-node GKE cluster for this purpose. Something like the following creates it (a sketch; the cluster name and zone are illustrative):
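gcloud container clusters create svc-net-demo \
    --num-nodes=3 \
    --zone=us-central1-a

The Pod and Service network config: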

Fig 5

Then connect to the cluster using kubectl in Cloud Shell. Authorize and review the cluster config.

kubectl get pods -A
kubectl get nodes
kubectl get node -o custom-columns=NAME:'{.metadata.name}',\
PrivateIP:'{.status.addresses[?(@.type == "InternalIP")].address}'
Fig 6

We will SSH into the GKE worker nodes and review how the iptables/IPVS rules are updated by kube-proxy when Services are created or when we scale the deployment associated with a Service.

First, we have to create a Deployment and a Service to expose the Deployment's Pods.

Let's create a Deployment with 3 replicas:

#on cloudshell with kubectl access
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
kubectl get deployment -o wide
kubectl get pods -o wide
kubectl get pods -o custom-columns=NAME:'{.metadata.name}',\
HOSTIP:'{.status.hostIP}',PODIP:'{.status.podIP}'
Fig 7

Pods are created on all 3 worker nodes, with IPs 192.168.0.6, 192.168.1.6, and 192.168.2.5.

Now let's create a Service of type ClusterIP:

kubectl expose deployment nginx-deployment  --name=nginx-svc  --port=80 --target-port=80 --selector='app=nginx'
kubectl get service
Fig 8

The Service is created with ClusterIP 192.168.251.24.
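A quick sanity check from inside the cluster (a sketch; busybox is an arbitrary client image):

kubectl run tmp --rm -it --image=busybox:1.36 --restart=Never -- \
    wget -qO- http://192.168.251.24
# the Service name (nginx-svc) resolves to the same ClusterIP and works too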

Reviewing Kube-proxy Config

Kube-Proxy runs as a static pod in GKE.

SSH into any of the worker nodes (I connected to the one with IP 10.128.0.4). It doesn't matter whether a Pod from the nginx-deployment Deployment is running on that node.

Fig 9

Reviewing the kube-proxy log, we see that since the proxy mode was not specified in the kube-proxy start command, the default (iptables) is used as the proxy mode.

# grep "proxy mode"  /var/log/kube-proxy.log
W0802 20:09:49.428959 1 server_others.go:565] Unknown proxy mode "", assuming iptables proxy

So we will look into the iptables rules, as that's the default mode used by kube-proxy.
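You can also confirm the active mode from the node itself, assuming kube-proxy's metrics endpoint is listening on its default port 10249:

# on the worker node
curl http://localhost:10249/proxyMode   # prints "iptables" in our case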

Understanding Service Network by Inspecting the iptables rules

Normal ClusterIP Service

Exposes the Service on a cluster-internal IP. Choosing this value makes the Service only reachable from within the cluster. This is the default ServiceType.

On a worker node (the one with IP 10.128.0.4 in my case), try the command below to search for rules related to the Service we created. Let's check the iptables "nat" table and search for "nginx-svc" (the Service name).

iptables -t nat -L | grep -i nginx-svc
Fig 10

We see a lot of information, but it's difficult to make sense of it (especially if you are new to iptables).

From the PREROUTING and OUTPUT chains we can see that all packets entering or leaving Pods hit the KUBE-SERVICES chain as their starting point.

Fig 11.1

Let's first look at the KUBE-SERVICES chain. It is the entry point for Service packets: it matches the destination IP:port and dispatches the packet to the corresponding KUBE-SVC-* chain.

Fig 11.2

Since KUBE-SVC-HL5LMXD5JFHQZ6LN is the next chain, we will inspect that.
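On the node, the chain can be listed directly:

iptables -t nat -L KUBE-SVC-HL5LMXD5JFHQZ6LN -n -v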

Fig 12

In this particular KUBE-SVC-HL5LMXD5JFHQZ6LN chain, we see there are four rules:

  • The first rule says that if traffic originated outside the podCIDR associated with "this" node and is destined for the nginx-svc Service at port 80 (http), add a Netfilter mark to the packet. Packets with this mark will be altered by a rule in the KUBE-POSTROUTING chain to use source network address translation (SNAT), with the node's IP address as their source IP. Consider Fig 12 and Fig 13.
Fig 13
  • KUBE-SVC-* acts as a load balancer, distributing packets to the KUBE-SEP-* chains. The number of KUBE-SEP-* chains equals the number of endpoints behind the Service (i.e. the number of running Pods), three in this case. Which KUBE-SEP-* chain is chosen is determined randomly; we can see this in Fig 12. The KUBE-SEP-* rules are alike, so we will discuss only one. We will cover "statistic mode random probability" later in this article.

KUBE-SVC-HL5LMXD5JFHQZ6LN dispatches packets to KUBE-SEP-7EX3YM24AF6XH4A3 and two other chains at random.

Each KUBE-SEP-* chain represents one Pod (endpoint).

KUBE-SEP-7EX3YM24AF6XH4A3 has two rules:

  1. The first rule adds a Netfilter mark to packets whose source IP is the endpoint's own Pod IP (the hairpin case, where a Pod reaches itself through the Service). Packets with this mark are altered by a rule in the KUBE-POSTROUTING chain to be masqueraded (SNAT), so that they appear to come from the node IP.
  2. The second rule DNATs the packet to the Pod IP of this endpoint at the target port (80 in this case).
Fig 14

Similar rules apply to the other two KUBE-SEP-* chains as well (Fig 15)

Fig 15

If we scale the Deployment from 3 replicas to 4, another KUBE-SEP-* chain will be created and a rule pointing to it will be added to KUBE-SVC-HL5LMXD5JFHQZ6LN.
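For example:

kubectl scale deployment nginx-deployment --replicas=4
# back on the worker node: a fourth KUBE-SEP-* rule now appears
iptables -t nat -L KUBE-SVC-HL5LMXD5JFHQZ6LN -n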

NodePort Service

Exposes the Service on each Node’s IP at a static port (the NodePort). A ClusterIP Service, to which the NodePort Service routes, is automatically created. You can reach the NodePort Service from outside the cluster by requesting <NodeIP>:<NodePort>.

There are 2 types of NodePort services:

  • default service (externalTrafficPolicy: Cluster)
  • externalTrafficPolicy: Local

We will discuss the default NodePort Service (externalTrafficPolicy: Cluster).

To test, I updated the existing Service to type "NodePort" with nodePort 30010.

kubectl edit service nginx-svc
Fig 16
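Equivalently, the same change can be made with a patch (a sketch; the strategic merge patch keys on port 80):

kubectl patch service nginx-svc -p \
    '{"spec":{"type":"NodePort","ports":[{"port":80,"nodePort":30010}]}}'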

From iptables’ perspective, two sets of chains and rules are added, one to KUBE-SERVICES and one to KUBE-NODEPORTS:

Fig 17

In the KUBE-SERVICES chain, if no KUBE-SVC-* rule matches a packet, it falls through to the last rule in the chain, i.e. KUBE-NODEPORTS.

KUBE-NODEPORTS says that all packets arriving on port 30010 go into the KUBE-SVC-HL5LMXD5JFHQZ6LN chain, where they are first marked for SNAT (target KUBE-MARK-MASQ) and then forwarded to the KUBE-SEP-* chains to select a Pod to route to.

Fig 18
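To verify, request the NodePort on a node's IP from a machine that can reach it (the node IPs here are private, so for example another VM in the same VPC):

curl http://10.128.0.4:30010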

UPDATE: The way NodePort routes packets has changed. As of March 2023, packets are no longer sent directly to the KUBE-SVC-* chains; they are forwarded to a KUBE-EXT-* chain, which then forwards to KUBE-SVC-*, and the rest is the same as above.

Here is an illustrative example (a sketch; chain names reused from above, actual output on your cluster will differ):
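# from "iptables-save -t nat" on a recent cluster (illustrative):
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/nginx-svc" -m tcp --dport 30010 -j KUBE-EXT-HL5LMXD5JFHQZ6LN
-A KUBE-EXT-HL5LMXD5JFHQZ6LN -m comment --comment "masquerade traffic for default/nginx-svc external destinations" -j KUBE-MARK-MASQ
-A KUBE-EXT-HL5LMXD5JFHQZ6LN -j KUBE-SVC-HL5LMXD5JFHQZ6LN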

Why an extra chain? In my opinion, KUBE-EXT-* gives us the ability to reuse the chain: the same external-traffic handling can be referenced from several entry points.

LoadBalancer Service

Exposes the Service externally using a cloud provider’s (GCP) load balancer. NodePort and ClusterIP Services, to which the external load balancer routes, are automatically created.

If we change the Service type from NodePort to LoadBalancer, there are no changes at the iptables level. It uses the same iptables chains and just adds an OSI Layer 4 (TCP) load balancer in front of the NodePorts.
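Switching the type is again a one-line change (a sketch):

kubectl patch service nginx-svc -p '{"spec":{"type":"LoadBalancer"}}'
kubectl get service nginx-svc -w   # wait until an EXTERNAL-IP is assigned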

The above is not true for the GKE LoadBalancer Service, though. The GKE load balancer doesn't forward traffic to the nodes' NodePorts. Instead, incoming packets destined for the load balancer IP and Service port are handled by the KUBE-SVC-* chain for that Service (on each node in the instance groups), exactly as happens when traffic arrives at the ClusterIP. In the KUBE-SVC-* chain, packets are first marked for SNAT (target KUBE-MARK-MASQ) and then forwarded to the KUBE-SEP-* chains to select a Pod to route to. This is specific to GKE and the GCP load balancer.

Statistic mode random probability

To load-balance traffic between the available endpoints, iptables includes a clause "statistic mode random probability 0.xxxxx" in each KUBE-SEP-* rule of the KUBE-SVC-* chain.

The iptables engine is deterministic: the first matching rule is always used. Without these clauses, KUBE-SEP-7EX3YM24AF6XH4A3 (Fig 18) would get all the connections, but we want to load-balance between the available endpoints.

To address this issue, iptables includes a module called statistic that skips or accepts a rule based on some statistic conditions. The statistic module supports two different modes:

  • random: the rule is skipped based on a probability
  • nth: the rule is skipped based on a round robin algorithm

Random balancing

Review Fig 18 and notice that three different probabilities are defined, not 0.33 everywhere. The reason is that the rules are evaluated sequentially.

With a probability of 0.33, the first KUBE-SEP-* rule is executed 33% of the time and skipped 67% of the time.

With a probability of 0.5, the second rule is executed 50% of the time and skipped 50% of the time. However, since it sits after the first rule, it is only reached for the remaining 67% of requests. Hence this rule applies to only 50% of that remaining 67%, i.e. about 33% of all requests.

Since only about 33% of the traffic reaches the last rule, it must always be applied.
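Working through the arithmetic (probabilities as in Fig 18):

P(SEP1) = 0.33333                ≈ 1/3 of all connections
P(SEP2) = (1 - 1/3) × 0.5        = 1/3
P(SEP3) = (1 - 1/3 - 1/3) × 1    = 1/3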

If we scale the replicas for this Deployment from 3 to 4, what changes on the Service side at the iptables layer?

Fig 19

More Pods means more endpoint objects, so the number of KUBE-SEP-* rules in the KUBE-SVC-* chain increases too.

Fig 20

Compare Fig 20 with Fig 18: now the first KUBE-SEP-* rule is executed for 25% of all packets; the second for 33% of the remaining 75%, which is also 25% of the total; the third for 50% of the remaining 50%; and the last for the final 25% of the total.
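Again as arithmetic (probabilities as in Fig 20):

P(SEP1) = 0.25
P(SEP2) = (1 - 0.25) × 0.33333   ≈ 0.25
P(SEP3) = (1 - 0.50) × 0.5       = 0.25
P(SEP4) = (1 - 0.75) × 1         = 0.25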

Things we didn't cover

There are certain service configs that we didn't discuss in this article:

  1. External IP Service - If there are external IPs that route to one or more cluster nodes, Kubernetes Services can be exposed on those external IPs. Traffic that ingresses into the cluster with the external IP as the destination IP, on the Service port, will be routed to one of the Service endpoints.
  2. Session affinity Service - Kubernetes supports ClientIP-based session affinity; session affinity makes requests from the same client always get routed back to the same Pod.
  3. No-endpoint Service - A ClusterIP Service uses a "selector" to select backend Pods. If backend Pods are found based on the selector, Kubernetes creates an endpoint object mapping to each Pod's IP:port; otherwise, the Service has no endpoints.
  4. Headless Service - Sometimes you don't need or want load-balancing and a single Service IP. In this case, you can create a "headless" Service by specifying "None" for the cluster IP (.spec.clusterIP); see the sketch after this list.
  5. NodePort Service with externalTrafficPolicy: Local - Using "externalTrafficPolicy: Local" preserves the client source IP and drops packets arriving at a worker node that has no local endpoint.
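For instance, a headless variant of our demo Service would look like this (a sketch; the name nginx-headless is hypothetical):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: nginx-headless   # hypothetical name
spec:
  clusterIP: None        # headless: no ClusterIP and no kube-proxy load-balancing
  selector:
    app: nginx
  ports:
  - port: 80
EOF
# DNS for a headless Service returns the individual Pod IPs instead of a single ClusterIP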

I highly recommend going through my post on a special case when using the GCP firewall with the GKE LoadBalancer Service.

Please read my other articles as well and share your feedback. If you like the content, please like, comment, and subscribe for new articles.
