Kubernetes multi-cluster networking made simple

TL;DR: Kubernetes multi-cluster networking should be simple in order to scale. If you don't want to worry about complex routing, overlay networks, or tunneling, and don't necessarily need to encrypt traffic between clusters (assuming your applications already do so), then running IPv6 transparently should help.

I just came back from KubeCon NA 2018, where I had the chance to attend very inspirational sessions that covered Kubernetes multi-cluster networking from different angles. Among them:

And I can't stop thinking about the benefits IPv6 would bring to multi-cluster networking. After all, as Kelsey said during his keynote:

People ignored the past and got nowhere. And then the light bulb went off and said: hey, maybe it is a new problem, but maybe an old solution is the answer. And that turned out to be the trick.

― Kelsey Hightower

Multi-cluster networking (Private IPv4 address space)

If you are not familiar with Kubernetes networking within a single cluster, you can refer to my previous post, Kubernetes Networking: Behind the scenes, or check these Kubernetes Networking Links.

Now, why would you want to run more than one cluster? I'm not covering that here, as Matt Caulfield did a terrific job addressing the why in his talk, and Thomas Graf presented very compelling use cases in his session: high availability, shared services, evacuation (cluster upgrades), etc.

So, what would it take to enable pod-to-pod communication? For starters, we need to define how to route pod networks beyond the local cluster; this is not something your CNI plugin would typically do for you. You also need to come up with an addressing scheme, because overlapping IP addresses between clusters simply won't work. Are we tunneling the traffic? Running a new overlay network? Encrypting the traffic to carry it over the Internet? The list goes on and on.
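To make the coordination concrete, here is a minimal sketch of what non-overlapping pod networks could look like for two kubeadm-provisioned clusters (the kubeadm API version and all CIDR values below are illustrative assumptions, not a prescription):

# cluster-east.yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
networking:
  podSubnet: "10.100.0.0/16"     # must not overlap with any other cluster's pod CIDR
  serviceSubnet: "10.96.0.0/12"  # service CIDRs stay cluster-local, so overlap is fine
---
# cluster-west.yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
networking:
  podSubnet: "10.101.0.0/16"     # manually coordinated to stay clear of cluster-east
  serviceSubnet: "10.96.0.0/12"

Every new cluster means revisiting this plan, and you still have to teach each cluster's network how to reach the others' pod CIDRs.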

Sounds overly complicated, doesn't it? However, keep in mind this is mostly a consequence of using private IPv4 addresses for our pod networks. Why? Because there simply aren't enough public IPv4 addresses available to cope with the microservices explosion.

Let's look at different topologies that support this.

2 clusters?

1 VPN tunnel: piece of cake. It could be between two cloud providers, or between one of them and your on-prem cluster. You typically don't need this if you have two clusters within the same cloud provider.

Maybe 2 VPN tunnels for redundancy. Still manageable.

3 clusters?

Maybe your distributed application needs an arbiter? If you also build redundant connections, you could probably call this the "Conjoined Triangles of Success". Again, it doesn't have to be three different cloud providers.

3+ ?

You have different design alternatives for your topology: hub and spoke, full mesh, etc., each with its pros and cons.

Hub and spoke

Full Mesh

How much more could this scale without becoming an operational nightmare? A full mesh demands N*(N-1)/2 interconnections, where N is the number of clusters; 10 clusters, for example, already require 45 tunnels.

There are solutions out there that try to address this problem, like SD-WAN. But for the sake of this post, let's see what it would look like if we only used public IPv6 addresses.

Multi-cluster networking (Public IPv6 address space)

Why use IPv6? Because “we could assign a — public — IPv6 address to EVERY ATOM ON THE SURFACE OF THE EARTH, and still have enough addresses left to do another 100+ earths” [SOURCE]

Let's examine how this would look and discuss whether we still need special routing, encryption or tunneling, and redundant connections. What are the drawbacks? To put this in perspective, this is what the topology would look like for a three-cluster setup.

Routing

We hand off inter-region networking to the underlying infrastructure and the Internet. There is no need for special routing, and you don't need to build redundant connections between clusters. It probably scales to any number of clusters… Everything seems to come for free, so what's the catch? Time to talk about security.

Security

I know what you’re thinking… we are exposing our internal infrastructure to the Internet. ⚠️

Well, unless you're setting up something like a Private Cluster, your pods can typically reach the Internet via Network Address Translation (NAT) anyway, and NAT doesn't really block packets. NAT is by no means enough to protect your infrastructure; it does, however, hide your internal addressing at the cost of keeping translation state somewhere.

Let's run the example described here to create an nginx pod in a managed Kubernetes cluster.

kubectl create -f https://k8s.io/examples/application/shell-demo.yaml
kubectl exec -it shell-demo -- /bin/bash
apt-get update
apt-get install iputils-ping iproute2 -y

Now try to ping a public website (apt-get had already reached the Internet by this point):

root@shell-demo:/# ping azure.microsoft.com -c 1
PING l-0007.l-msedge.net (13.107.42.16) 56(84) bytes of data.
64 bytes from 13.107.42.16 (13.107.42.16): icmp_seq=1 ttl=119 time=3.09 ms
--- l-0007.l-msedge.net ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.094/3.094/3.094/0.000 ms

So your pods are connected to the Internet, just behind a layer of abstraction.

On the other hand, if you are concerned about someone reaching your pods from the Internet over IPv6, keep in mind you have multiple layers of security to isolate them. In AWS, for example, you have ACLs for controlling traffic in and out of one or more subnets, so you can block any traffic coming into your subnet that doesn't originate in another one of your subnets. Then your nodes/VMs will have a Security Group, so you can also restrict access at the instance level. Last, but not least, you can apply a NetworkPolicy in Kubernetes to protect yourself at the pod level.

AWS provides a /56 IPv6 block to your VPC; that's equivalent to 256 /64 IPv6 subnets. Your ACLs could, for example, only allow traffic between the /56 blocks assigned to your different VPCs or Kubernetes clusters, so your pods can effectively only be reached by other pods. More granular policies can then be applied with Security Groups and NetworkPolicies.
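For instance, a minimal NetworkPolicy along these lines could admit ingress only from the /56 blocks of your own clusters. This is just a sketch: the name is hypothetical, the CIDRs come from the 2001:db8::/32 documentation range, and it assumes your CNI plugin enforces NetworkPolicy for IPv6 traffic.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-only-our-clusters
  namespace: default
spec:
  podSelector: {}            # applies to every pod in this namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: 2001:db8:aa00::/56   # the /56 of cluster A's VPC (illustrative)
    - ipBlock:
        cidr: 2001:db8:bb00::/56   # the /56 of cluster B's VPC (illustrative)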

As for tunneling, VPN connections, overlay networks, etc., it's the same story as with NAT: they hide your endpoints at the cost of a performance penalty (IPsec processing, packet overhead, etc.). If you can live with having public IPv6 addresses on your pods (plus ACLs), none of these technologies are strictly required. I get that this might not apply to everyone out there; that's fine, it's a trade-off.

Dealing with IPv6 addresses

Yeah, I know. It's hard enough to remember an IPv4 address already, so 128-bit-long IPv6 addresses sound like a non-starter.

Well, while each pod gets its own IP address, those IP addresses cannot be relied upon to be stable over time, as pods are not resurrected when they die. For this reason, Kubernetes has the concept of a Service, which defines a logical set of pods determined by a Label Selector [Kubernetes Service]. Also, in Kubernetes "every Service defined in the cluster is assigned a DNS name" [DNS for Services].

What this means is that your pods will be reached through their Service DNS name (e.g. my-service.my-namespace.svc.cluster.local), so you don't really need to deal with the addresses themselves; it all happens behind the scenes.

In the IPv4/IPv6 dual-stack functionality proposal for Kubernetes clusters, they illustrate how you'd specify the address family for a Kubernetes Service:

endpointFamily: <ipv4|ipv6|dual-stack>       [Default: dual-stack]

Example:

spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
  endpointFamily: ipv6

Current challenges

Kubernetes IPv6 support

The good news is that very good progress has been made in this area: IPv6-only support is alpha in Kubernetes 1.9 [IPv6 support added #508] and should be beta in Kubernetes 1.13 [Comment]. There is also a proposal to "Add IPv4/IPv6 Dual-Stack support and awareness #62822" [PR, KEP].

Why dual-stack? Unfortunately, it's almost 2020 and there are still some critical services that do not support IPv6 today, like some package, source code and container repositories your pods might need to access. 👎

Looking on the bright side, there are already a handful of CNI plugins that can configure dual-stack addresses on a pod. However, it's important to keep in mind that Kubernetes is only aware of one address per pod, as defined in the PodStatus v1 core API. The dual-stack proposal keeps the PodIP field, but adds a new array, PodIPs, that stores additional pod IPs [PR].

type PodStatus struct {
    ...
    HostIP string
    PodIP  string      // the single pod IP Kubernetes tracks today; kept for compatibility
    PodIPs []PodIPInfo // proposed: additional pod IPs, e.g. the other address family
    ...
}

and

type PodIPInfo struct {
    IP         string
    Properties map[string]string
}
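Rendered as part of a pod's status, a dual-stack pod could then look something like this (a hypothetical example with illustrative addresses; the field names follow the proposal):

status:
  hostIP: 192.0.2.10
  podIP: 10.244.1.23                # primary address, kept for compatibility
  podIPs:
  - ip: 10.244.1.23
  - ip: 2001:db8:1234:5678::17      # the additional address family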

What I've done in the past, when playing around with dual-stack clusters, is to make the IPv6 address a function of a hash of the IPv4 address allocated to the pod, so I only need to track one of them and can derive the other automatically. This is probably not recommended, but if you are curious, I did this in my GopherCon lightning talk demo: Container Network Interface and Go.

Last, but not least, kube-proxy will be modified to drive iptables and ip6tables in parallel (if not using eBPF). Note, however, that Service access within a cluster will be done via either all IPv4 service IPs or all IPv6 service IPs. Make sure you check out the non-goals of the proposal.

Cloud Provider IPv6 support

IPv6 support for VMs/instances among cloud service providers is scarce at best. I would like to be able to do IPv6 subnet routing, in order to assign a different IPv6 subnet to each VM instance. AWS is ahead of the pack when it comes to IPv6 support; however, it won't let you break down the /56 it assigns to your VPC into individual /64s for your instances [Discussion Forums].

The AWS support team has been very helpful in walking me through a workaround that uses smaller chunks of a /64 on a VM as a pod network, by means of ENIs. Using subnets smaller than a /64 is not a recommended practice and might not even work with Docker, since "The subnet for Docker containers should at least have a size of /80, so that an IPv6 address can end with the container's MAC address and you prevent NDP neighbor cache invalidation issues in the Docker layer". Still, it at least lets me do some experimentation in the meantime. I will write a separate post describing how this works ➡️ How to run IPv6-enabled Docker containers on AWS.

On the other hand, if you are running a managed Kubernetes cluster, IPv6 might take a bit longer, as the Kubernetes version providers run will only include mature features. For instance, GCP says: "To ensure stability and production quality, normal GKE clusters only enable features that are beta or higher. Alpha features are not enabled on normal clusters because they are not production-ready or upgradeable" [GCP Docs]. So, IPv6 is a no-no in GCP, as they don't even support it on their instances [Issue tracker].

Conclusion

While there is still some work to be done to make this a reality, I'm confident it's worthwhile and will help keep complexity under control. The benefits of IPv6 apply not only to single-cluster but also to multi-cluster Kubernetes setups. Please let me know if you know of a cloud provider with proper IPv6 support.
