Kubernetes Operators… getting down to business (logic)

C:\Dave\Storey
Published in ITNEXT
8 min read · Nov 23, 2019

Last month I wrote an article about the struggles I experienced getting my head around the Operator Framework, which is used to build custom state-managing components in Kubernetes. In this post I’d like to take those concepts and findings further and go into a little more technical detail.

Please note: I am not intending to go into “How to write a custom operator” here; I will do that in a future blog post. This post is intended as a deeper dive into what operators are and how they work.

A Quick Recap

So first let’s cover some basic concepts I introduced in my previous blog post:

  • Operators allow us to define and encapsulate custom units of business logic within the Kubernetes ecosystem.
  • Operators allow us to define, monitor and recover custom components that are usually considered external to Kubernetes and/or containers.
  • Operators allow us to move closer toward our Utopian dream of true DevOps, where our service deployments are nothing more than a configuration file and our on-call alerts/phones never ping because our services self-heal and scale.

Now I know all this seems too good to be true, much like one of those sales pitches from a software company that promises the world and never delivers. However, if you’re willing to do the necessary legwork, Kubernetes can make this a reality.

Where do I begin?

Ok so before we start to jump into examples, there are some simple “fundamentals” we need to cover here.

  • At its heart Kubernetes is nothing more than a monitoring system; essentially an infinite loop, constantly monitoring everything it knows about.
  • The state Kubernetes has been told to maintain can be referred to as Desired State (what we want). Kubernetes constantly compares this against what it perceives to be really running, referred to as Actual State (what we have).
  • When Kubernetes detects a discrepancy between desired and actual states it then calculates which action to take to make these two states match.

Examples of actions Kubernetes could take during Reconciliation include:

  • A new version of an application is deployed; meaning new instances need creating and old ones removing once the new pods are stable.
  • Creating a new pod/instance of a service due to a running pod becoming unhealthy.
  • Removing pods because a user has issued a request to scale down or delete a service.

This list is not exhaustive, but just a few examples to get us started.
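
To make the examples above a little more concrete, here is a minimal Python sketch of the "compare and decide" step. It is purely illustrative: `Action`, `plan_actions` and the pod names are made up for this post and are not Kubernetes APIs, and the real scheduler logic is far more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Action:
    verb: str  # "create" or "delete"
    pod: str   # name of the pod the action applies to


def plan_actions(desired_replicas: int,
                 healthy_pods: List[str],
                 unhealthy_pods: List[str]) -> List[Action]:
    """Decide what to do so the healthy pod count matches the desired count."""
    # Unhealthy pods are always removed.
    actions = [Action("delete", name) for name in unhealthy_pods]
    diff = desired_replicas - len(healthy_pods)
    if diff > 0:
        # Too few healthy pods: create replacements (names are placeholders).
        actions += [Action("create", f"new-pod-{i}") for i in range(diff)]
    elif diff < 0:
        # Too many pods, e.g. after a scale-down request: remove the surplus.
        actions += [Action("delete", name) for name in healthy_pods[diff:]]
    return actions


# One pod has become unhealthy while we want three replicas.
print(plan_actions(3, ["a", "b"], ["c"]))
```

Note that when the two states already match, the function returns an empty list: nothing to do, so back to monitoring.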

It’s time to Reconcile our Differences

So now we are thinking in terms of State; what happens when Kubernetes decides something is out of sync?

Well, when state discrepancies are detected, K8s starts something we call “Reconciliation”. When this happens, K8s is essentially trying to make reality match what it’s been told reality should look like. Remember, when we deploy services, our YAML files are essentially telling Kubernetes “I want the world to look like this”. Once it has achieved this, and the states are in sync, it will go back to monitoring and waiting for state discrepancies to appear.

For the most part Reconciliation is very simple: spinning something new up or tearing something unwanted down. But K8s can do so much more than just start and stop things. So let’s look a bit deeper, shall we?

A Real World Example

Okay, so let’s take an example of an advanced scenario; let’s imagine we want to have Kubernetes deploy and monitor AWS SQS queues. This would allow us to declare new SQS queues as part of our service deployment YAML and then have Kubernetes do all the configuration, maintenance and heavy lifting for us.

Now I hate to state the obvious, but SQS does not run in a container; it runs in the AWS cloud platform, so how on earth can Kubernetes (a container orchestration platform) help us?

Well this is the part where I’m here to tell you:

Not everything we need to manage via Kubernetes needs to run in a container. Sometimes we need to think outside the proverbial box.

I know, right… mind blown!

Instead, what we need is something running in a container that will allow us to interact with SQS.

Operator Framework to the Rescue!

Right, so we need K8s to be able to handle something stateful, but whatever that “something stateful” is, may or may not be running inside the Kubernetes eco-system. Great!

But how do we do this??? Well, luckily there is something called the Kubernetes Operator Framework that can help us. Initially developed by CoreOS, the framework was announced back in 2018 to much fanfare. The framework gives us all the tools and guidance we need to build extensions to Kubernetes called Operators. But what exactly are they?

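My TL;DR definition of Operators:

An Operator can be thought of simply as a RESTful API which runs in a container on K8s. It has Controllers and Actions (same as all RESTful services) that get invoked whenever Reconciliation needs to happen. Strictly speaking, K8s does not call the controller over HTTP; the controller watches the Kubernetes API for changes to the resources it cares about and runs its logic in response. The request/handler picture is still a useful way to think about it.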

But, how does Kubernetes know which controller to call?

When we write our operators we will be creating custom Controllers, which we will then install or register with the K8s control plane. We do this by hooking our controller into something called the Controller Manager.

But even with our controller registered, how can Kubernetes know which controller does what? There is one last piece of the puzzle that we haven’t looked at yet, one key Kubernetes concept that lets the Operator Framework do its thing: CRDs, or Custom Resource Definitions.

CRD not CRUD

A custom resource is an extension of the Kubernetes API that allows us to define our own complex types. On their own, custom resources don’t really do much more than let us store and retrieve structured data from within K8s.

Think back to your YAML files, where you define the things you want in your cluster. You always define the kind of the thing you want; well, that is essentially its resource definition. You’ve probably already encountered a few of them. Think of a Pod, a Service, a Secret or a Deployment; they’re all resource definitions. They just happen to be the defaults that are pre-installed.
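
As a rough illustration of what that looks like for our SQS example, here is a sketch of a CRD and one resource of the new kind. Everything specific in it is an assumption made up for this post: the `example.com` group, the `sqses` plural, the `v1alpha1` version and the two spec fields. The manifests are written as Python dictionaries and printed as JSON rather than YAML; `kubectl` accepts JSON manifests too, but treat this as a sketch rather than something to apply to a real cluster.

```python
import json

GROUP = "example.com"  # hypothetical API group for this example

# The definition: teaches the Kubernetes API about a new kind called "SQS".
sqs_crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": f"sqses.{GROUP}"},  # must be <plural>.<group>
    "spec": {
        "group": GROUP,
        "scope": "Namespaced",
        "names": {"kind": "SQS", "plural": "sqses", "singular": "sqs"},
        "versions": [{
            "name": "v1alpha1",
            "served": True,
            "storage": True,
            "schema": {"openAPIV3Schema": {
                "type": "object",
                "properties": {"spec": {
                    "type": "object",
                    "properties": {
                        "queueName": {"type": "string"},
                        "visibilityTimeoutSeconds": {"type": "integer"},
                    },
                }},
            }},
        }],
    },
}

# An instance: the "I want the world to look like this" part.
orders_queue = {
    "apiVersion": f"{GROUP}/v1alpha1",
    "kind": "SQS",
    "metadata": {"name": "orders", "namespace": "default"},
    "spec": {"queueName": "orders", "visibilityTimeoutSeconds": 30},
}

print(json.dumps(sqs_crd, indent=2))
print(json.dumps(orders_queue, indent=2))
```

On its own this only gets us structured data stored in the cluster. Nothing creates a queue until a controller is watching for resources of this kind.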

So now we know about CRDs, we can see that when we combine/bind a custom resource with a custom controller via the Controller Manager component of K8s, well, that is when we can really start to make some magic happen.

Ok, well let’s take a look at a very simplified flow:

[Figure: Simplified K8s CRD-Controller Reconcile loop flow]

So what is going on in the simplified diagram above:

  • A user requests a new instance of a custom resource (example: SQS) via a YAML file using `kubectl`.
  • This request enters the Kubernetes system via the Kubernetes Control Plane, the brain and backbone of a Kubernetes instance.
  • Kubernetes then performs internal logic, querying its etcd instance to determine if any action is required.
  • Kubernetes compares current state to the desired state.
  • In this case we want a new SQS, and so Kubernetes notifies our Controller, registered as being able to handle a CRD of Kind SQS, saying “hey, we need to reconcile”.
  • Our custom Controller runs just like any other service/app on a pod within a node.
  • When the Controller is called, the logic within the action is executed and the success/failure is reported back to the control plane.
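
The steps above can be sketched as a toy model in Python. To be clear, `ToyControllerManager` and `reconcile_sqs` are hypothetical names invented for this post; this is not the real Kubernetes controller manager or any library API. In a real cluster the controller watches the API server for changes to its kind rather than being called directly, but the "kind maps to controller" idea is the same.

```python
from typing import Callable, Dict

Resource = dict
ReconcileFn = Callable[[Resource], str]


class ToyControllerManager:
    """Toy stand-in for the binding between a kind and its controller."""

    def __init__(self) -> None:
        self._controllers: Dict[str, ReconcileFn] = {}

    def register(self, kind: str, reconcile: ReconcileFn) -> None:
        # Record which function handles resources of this kind.
        self._controllers[kind] = reconcile

    def apply(self, resource: Resource) -> str:
        """Simulate a request arriving: find the controller for the kind and run it."""
        reconcile = self._controllers.get(resource["kind"])
        if reconcile is None:
            return f"no controller registered for kind {resource['kind']}"
        return reconcile(resource)


def reconcile_sqs(resource: Resource) -> str:
    # Real business logic would talk to AWS here; this only reports what it saw.
    return f"reconciled SQS {resource['metadata']['name']}"


manager = ToyControllerManager()
manager.register("SQS", reconcile_sqs)
result = manager.apply({"kind": "SQS", "metadata": {"name": "orders"}})
print(result)
```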

Putting it all together

Right, so we now have a very simplified idea of how all these parts hang together in Kubernetes Land. But let’s have a very quick recap:

  • A CRD (Custom Resource Definition) allows us to request new custom resources via YAML and kubectl.
  • A Controller is essentially a RESTful service that runs within a container on a node and responds to control requests from the K8s control plane.
  • A CRD is bound to a Controller via the Controller Manager inside K8s.
  • When there is a discrepancy between perceived state (what we have) and desired state (what we want), K8s will perform a Reconcile loop to try and put the states back in sync.

And what happens when our controller is invoked?

Well, frankly, whatever we want it to do. This is where the beauty of Operators lies, because once your controller action has been called you can do whatever you want. I don’t want to bog this post down in technical implementation, so to keep things simple, here is a flow diagram to explain how this code could be structured:

Once I have performed whichever operation I need to, and my business logic is satisfied, I can report back to Kubernetes that my Controller has reconciled and then K8s will go back to monitoring things.
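
One possible shape for that logic, as a minimal Python sketch rather than a real implementation: `FakeSQS` is an in-memory stand-in I have invented so the example runs without AWS credentials, and `reconcile` with its return strings is equally hypothetical. A real controller would call the AWS API and report status back through the Kubernetes API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeSQS:
    """In-memory stand-in for AWS SQS: queue name mapped to its settings."""
    queues: Dict[str, dict] = field(default_factory=dict)


def reconcile(desired: Optional[dict], name: str, aws: FakeSQS) -> str:
    """Make the fake SQS match the desired spec and report what was done.

    `desired` is None when the resource has been deleted from the cluster.
    """
    actual = aws.queues.get(name)
    if desired is None:
        if actual is not None:
            del aws.queues[name]  # clean up the external queue
            return "deleted"
        return "in sync"
    if actual is None:
        aws.queues[name] = dict(desired)
        return "created"
    if actual != desired:
        aws.queues[name] = dict(desired)
        return "updated"
    return "in sync"


aws = FakeSQS()
print(reconcile({"visibilityTimeoutSeconds": 30}, "orders", aws))
print(reconcile({"visibilityTimeoutSeconds": 30}, "orders", aws))
```

The design choice worth copying is that the function is safe to run repeatedly: calling it again when nothing has changed does nothing, which is what lets Kubernetes invoke it as often as it likes.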

We’re Sinking! Or Should that be Syncing?!?!

Right, we have our controller, it’s out there in the wild, we can request new resources, change existing resources and delete resources by using YAML or kubectl. Everything is perfect! We are one step closer to our utopian DevOps dream!

But wait… what happens if someone comes along and deletes our SQS queue in AWS?!

Time to panic?!?!

Luckily, thanks to the Operator Framework we can also schedule manual Reconcile loops. Yes you read that right folks; we can tell K8s that we want it to do a reconcile loop outside the realms of state discrepancy. Your use cases may not require this, but it’s perfectly possible to write business logic to say “Hey Kubernetes, we’re all good now, but come back after x seconds and check again would you? -Thanks”.

Remember, Kubernetes knows nothing about the outside world, so if we want to be able to react to external events (SQS deletion etc.), we need to handle these scenarios outside of the normal state change flow. Now, if someone were to delete our SQS queue, it would just reappear, as if by magic, on the next Reconcile loop!
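
Here is a small Python simulation of that "come back after x seconds" idea. In Go's controller-runtime library this is expressed by returning a `Result` with `RequeueAfter` set; the Python `Result`, `reconcile` and `simulate` below are my own hypothetical stand-ins, and the clock is simulated so nothing actually sleeps.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Result:
    # Seconds until the next check; None means "do not reschedule".
    requeue_after: Optional[float] = None


def reconcile(queue_name: str, external_queues: Set[str],
              recheck_seconds: float = 30.0) -> Result:
    """Ensure the queue exists externally, then ask to be called again later."""
    if queue_name not in external_queues:
        external_queues.add(queue_name)  # recreate the missing queue
    return Result(requeue_after=recheck_seconds)


def simulate(queue_name: str, external_queues: Set[str],
             deleted_at: float, until: float) -> List[Tuple[float, str]]:
    """Run reconcile on a simulated clock.

    The queue is removed behind our back once the clock passes `deleted_at`.
    """
    log: List[Tuple[float, str]] = []
    now, deleted = 0.0, False
    while now <= until:
        if not deleted and now >= deleted_at:
            external_queues.discard(queue_name)  # someone deletes it in AWS
            deleted = True
        existed = queue_name in external_queues
        result = reconcile(queue_name, external_queues)
        log.append((now, "ok" if existed else "recreated"))
        if result.requeue_after is None:
            break
        now += result.requeue_after
    return log


history = simulate("orders", {"orders"}, deleted_at=45.0, until=90.0)
print(history)
```

Because nothing inside the cluster changed when the queue was deleted, only the scheduled re-check notices it. The trade-off is that the queue can be missing for up to one full re-check interval.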

Phew… that was a close one.

A Word Of Caution

So you’ve seen here how Operators allow us to encapsulate complex business logic. You’ve also seen how we can maintain external state through reconciliation loops. But remember people:

  • Reconciliation loops need to be non-blocking. It is a poor design decision to implement reconciliation code that blocks for a number of seconds, starving the pod of resources.
  • Think about your upstream dependencies; don’t DDoS them.
  • Think about your control plane. If you are scheduling manual reconcile loops, and you are expecting a lot of components, be conservative, as you could cripple your instance nodes.
  • Design your operators well. Think about backwards compatibility to prevent breaking changes as you roll out new versions.
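
One common way to stay on the right side of the first three points is to back off between retries instead of hammering a failing dependency. The sketch below is only an assumption about how you might pick a requeue delay; `requeue_delay` and its default values are made up for this post and do not come from the Operator Framework.

```python
import random
from typing import Optional


def requeue_delay(failures: int, base: float = 5.0, cap: float = 300.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before the next reconcile after `failures` consecutive errors."""
    rng = rng or random.Random()
    # Double the wait on every failure, but never exceed the cap.
    ceiling = min(cap, base * (2 ** failures))
    # Add jitter so many resources failing together do not all retry
    # against the upstream service at the same instant.
    return rng.uniform(ceiling / 2, ceiling)


for failures in range(5):
    print(failures, round(requeue_delay(failures), 1))
```

Returning a delay like this from the reconcile logic, rather than sleeping inside it, keeps the loop itself non-blocking.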

A Final Note

I’m sure we have all seen that famous Mickey Mouse scene, The Sorcerer’s Apprentice, where Mickey finds a way to automate his hard work, but doesn’t learn how to control it properly, resulting in chaos and out-of-control processes. Well, think about Operators the same way. Yes, they are insanely powerful, BUT think carefully about design and structure, because the last thing you want is your automation getting out of control.

Software engineer & lover of all things code. Too much to learn and so little time. Currently working at Trainline London.