The definitive guide to running EC2 Spot Instances as Kubernetes worker nodes

Ran Sheinberg
Published in ITNEXT · 16 min read · May 20, 2019

[Edit December 2020] While this blog post is still a good read with interesting context and history around Kubernetes, auto scaling and Spot Instances, I highly recommend that you look into running Spot Instances in EKS managed node groups — a capability that was launched during re:Invent 2020. You can read my launch blog post on the AWS containers blog:

In this post I’ll show you how to run your Kubernetes (EKS or other) worker nodes on EC2 Spot Instances and save up to 90% off the EC2 On-Demand price, without compromising the performance or availability of your applications. To have a robust and resilient cluster running on Spot Instances, there are some steps you need to take, and this post is the definitive (and ever evolving — as Kubernetes is) guide to walk you through those steps. I’m going to describe the different best practices you need to implement in your cluster and the tools and configurations for doing so. After reading this post you will have all the information you need to start using Spot Instances as Kubernetes worker nodes.

But before we get started: my name is Ran Sheinberg and I’m a Solutions Architect working for AWS. I specialize in Spot Instances and have talked to hundreds of AWS customers about how to implement Spot Instances in their specific workloads, planned POCs with them, and helped them troubleshoot and solve problems in their production workloads. Many of those conversations were centered around Kubernetes-based workloads and how to run them on Spot. Each customer, AWS colleague, or person on GitHub/Slack that I interacted with on the subject of Kubernetes and Spot has made some contribution to this guide, small or large, without knowing it.

Spot Instances — zero to hero in a couple of paragraphs

You should definitely read this part if you’re new to Spot Instances or if you’re not caught up with the major changes in the pricing model (no bidding, no price spikes) that were introduced in November 2017. The same goes if you’re not familiar with how EC2 Auto Scaling groups have allowed you to mix purchase options and instance types since November 2018.

Spot Instances are spare EC2 compute capacity that you can use at up to a 90% discount compared to the On-Demand price. As opposed to the On-Demand and Reserved Instances purchase options, which have static prices (unless AWS reduces them), Spot Instance prices fluctuate gradually, in very small increments and typically only a few times a day, based on long-term supply and demand in each capacity pool (a combination of instance type and Availability Zone) separately.

Spot pricing history for the last 3 months shows r4.xlarge in US East (N. Virginia) with very small price differences between the Availability Zones (less than ~$0.002); the Spot price only actually changed in some of the Availability Zones during those 3 months
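You don’t need the console to look at this data. A quick sketch with the AWS CLI pulls the same pricing history; the instance type, dates and region here are just examples, so adjust them to your own:

# Instance type, dates and region are examples; adjust to your own
aws ec2 describe-spot-price-history \
    --instance-types r4.xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time 2019-02-20T00:00:00 \
    --end-time 2019-05-20T00:00:00 \
    --region us-east-1 \
    --query 'SpotPriceHistory[*].[Timestamp,AvailabilityZone,SpotPrice]' \
    --output table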

The one absolutely crucial best practice you should take away from this section is flexibility and diversification. Suppose you’re running a stateless, fault-tolerant, distributed workload (examples: a cluster of web/application servers behind an ELB, batch processing that consumes jobs from a queue, or container instances/worker nodes) with On-Demand (with RIs, Savings Plans, or neither) on a specific instance type, because you qualified that instance type or simply started using it successfully. When you start adopting Spot in your applications, you need to diversify your usage across as many capacity pools (instance types in AZs) as possible.
The reason is simple: by using multiple capacity pools you (a) increase the chances of getting Spot capacity, instead of trying to get it from a single instance type in a specific AZ and failing if there’s no spare capacity in that pool; and (b) limit the blast radius of interruptions, because if EC2 needs the capacity back for On-Demand usage, typically not all capacity pools will be interrupted at the same time. Only a small portion of your workload is interrupted, and it is replenished by the Auto Scaling group (or Spot Fleet, we’ll get to that later), avoiding any impact on your availability or the need to fall back to On-Demand and pay more.

For example: if I’m running my application on c5.large today, I can probably also run it on m5.large. In most cases the application will work just fine if the operating system simply sees more memory and a slightly lower CPU clock speed (unless we’re talking about CPU-sensitive workloads or ones that use specific instruction sets like AVX-512). Similarly, you can use r5.large with even more memory, and go back a generation to also use c4.large / m4.large / r4.large. This concept works just fine with the Kubernetes scheduler, but requires some adaptations when we start talking about autoscaling in Kubernetes, a topic we will dive deep into later in this post.

Also take a look at the allocation strategies supported by EC2 Auto Scaling groups. To follow the diversification best practice and spread your nodes/pods across as many capacity pools as possible, customers have typically used the lowest-price allocation strategy with the number of Spot pools set to the number of instance types they selected (or n-1). However, the recommended approach is the capacity-optimized allocation strategy, launched in August 2019. This strategy chooses the Spot Instances that are least likely to be interrupted, as it targets the deepest capacity pools. Read more about the launch of the capacity-optimized allocation strategy here.

So how does AWS make it easy to follow these instance type flexibility best practices? Enter EC2 Auto Scaling groups.

EC2 Auto Scaling groups

If you’re confused about ASGs vs. Spot Fleet or EC2 Fleet for your Kubernetes clusters, don’t be. These tools and APIs have similar traits, but I’ll make it very simple for you. Today, Fleets are more suitable for large-scale jobs that have a beginning and an end: for example, I need 3,000 vCPUs to process my videos or images on S3, to run a nightly Hadoop/Spark job, or to run any other type of batch computing job. While Spot Fleet does have auto scaling capabilities that are very similar to those of ASGs, ASGs are better for workloads that run continuously as part of a service and do not need to reach a finish line. Some of the benefits of ASGs for these types of workloads include: lifecycle hooks (which allow you to easily drain your container instances when there’s a scale-in activity; we’ll touch on that later in the post), protecting an instance from scale-in, attaching or detaching instances to/from an ASG, terminating a specific instance in an ASG, ELB health check integration, and balancing the number of instances across Availability Zones. Also, and most importantly for our topic at hand: community-driven tools are integrated with EC2 ASGs, such as eksctl, the Kubernetes cluster-autoscaler, kops and others.

Here’s an example of creating an EC2 Auto Scaling group from the AWS Management Console. Note that you don’t actually have to do this when using eksctl or kops, because these tools set up the ASGs for you.

Creating a new EC2 Auto Scaling group in the AWS console. I selected 7 instance types that have similar vCPU and memory specifications and will run this cluster in 3 Availability Zones, for a total of 21 capacity pools, increasing the chance that I’ll be able to get my desired Spot capacity, and also increasing the chance of keeping the desired capacity if some pools are interrupted when EC2 needs the capacity back. The Spot allocation strategy is capacity-optimized.
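If you prefer scripting this instead of clicking through the console, here is a rough CLI sketch of a similar Mixed ASG. The group name, launch template name and subnet IDs are placeholders, and the instance types mirror the ones used later in this post:

# Rough sketch; group name, launch template and subnets are placeholders
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name k8s-spot-workers \
    --min-size 3 --max-size 20 --desired-capacity 3 \
    --vpc-zone-identifier "subnet-11111111,subnet-22222222,subnet-33333333" \
    --mixed-instances-policy '{
      "LaunchTemplate": {
        "LaunchTemplateSpecification": {"LaunchTemplateName": "k8s-worker-template", "Version": "$Latest"},
        "Overrides": [
          {"InstanceType": "m5.large"}, {"InstanceType": "m5d.large"},
          {"InstanceType": "m5a.large"}, {"InstanceType": "m5ad.large"},
          {"InstanceType": "m4.large"}, {"InstanceType": "m5n.large"},
          {"InstanceType": "m5dn.large"}
        ]
      },
      "InstancesDistribution": {
        "OnDemandBaseCapacity": 0,
        "OnDemandPercentageAboveBaseCapacity": 0,
        "SpotAllocationStrategy": "capacity-optimized"
      }
    }'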

And finally: how can I know which instances are best for my Spot usage? Use the Spot Instance Advisor tool to check the historical interruption rate (in the last 30 days) of each of the instance types in your region of choice. Our Kubernetes cluster is going to be set up to be fully fault-tolerant to Spot interruptions by catching the Spot 2-minute interruption warning and draining the worker nodes that are going to be terminated, but it’s still a good idea to focus on instance types with lower interruption rates.

Two considerations for Auto Scaling groups around Multi-AZ:

  • If you use Persistent Volumes with EBS, then you’re going to need to run your node group / Auto Scaling group for that application in a single Availability Zone, because EBS volumes are zonal and you don’t want your pod to be scheduled in an AZ where the EBS volume does not exist. If your use case allows for using Amazon Elastic File System (EFS), which spans multiple AZs, then you can ignore this limitation and run your pods that work against the EFS mount in multiple AZs; the same goes if you use Amazon FSx for Lustre.
  • Auto Scaling groups strive to keep the same number of instances in all the AZs they run in. This might cause worker nodes to be terminated when the ASG tries to scale down the number of instances in an AZ. If you run a tool that automatically drains instances upon ASG scale-in activities (which we will touch on later in this post), then you shouldn’t worry about this. Otherwise, you can simply disable this functionality by suspending the AZRebalance process (see the sketch below), but you risk a situation where your capacity becomes unbalanced across AZs. Note that cluster-autoscaler itself picks the instance to terminate, so this scale-in concern is not relevant to cluster-autoscaler’s operations, only to the AZRebalance process.
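If you do decide to suspend AZRebalance as mentioned in the second bullet, it’s a single call; the group name here is a placeholder:

# Suspend only the AZRebalance process on the group (placeholder name)
aws autoscaling suspend-processes \
    --auto-scaling-group-name k8s-spot-workers \
    --scaling-processes AZRebalance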

A few last words about Mixed ASGs: Launch Templates. These are a requirement if you want to run an ASG with multiple purchase options and instance types. Conceptually they are similar to Launch Configurations, in that they let you configure things like the AMI, storage, networking, user data and other settings, and then use the template to launch an instance, an ASG, or a Spot Fleet, or use it in AWS Batch (and possibly more services in the future). Launch Templates are also more advanced because they support versioning, but we won’t dive deep into LTs here. Read the docs if you want to learn more.
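For completeness, here is a minimal sketch of creating a Launch Template with the CLI. The AMI ID, security group, instance profile and user data are placeholders, and the instance type is intentionally left out because the ASG’s instance type overrides supply it:

# Placeholders: AMI, security group, instance profile, user data
aws ec2 create-launch-template \
    --launch-template-name k8s-worker-template \
    --launch-template-data '{
      "ImageId": "ami-0123456789abcdef0",
      "SecurityGroupIds": ["sg-0123456789abcdef0"],
      "IamInstanceProfile": {"Name": "k8s-worker-instance-profile"},
      "UserData": "<base64-encoded worker bootstrap script>"
    }'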

If you want to get hands-on experience with running stateless applications on EC2 Auto Scaling groups (not necessarily with Kubernetes), have a look at https://www.ec2spotworkshops.com

I highly recommend the Spot/Kubernetes deep-dive workshop, which will help you implement the best practices described in this article and pick up more learning points along the way: https://ec2spotworkshops.com/using_ec2_spot_instances_with_eks.html

Let’s start: adding Spot Instances to your Kubernetes cluster

In AWS, it’s widely accepted that a node group translates to an EC2 ASG: it’s how eksctl (the official CLI tool for EKS) and kops provision instances, it’s how the EKS documentation recommends adding instances to your EKS cluster using CloudFormation, and cluster-autoscaler is also integrated with it. This, along with the ASG benefits I described in the previous section, makes ASGs a perfect choice for running and managing our worker nodes.

EKS (via eksctl) and Kops are common and easy ways to launch and manage Kubernetes clusters on AWS. So in this section, I will describe how we add Spot Instances as worker nodes for both these options.

Adding Spot Instances to EKS clusters with eksctl

AWS has a step-by-step guide for this as part of the https://ec2spotworkshops.com site, and it will also work for non-EKS clusters, but I also cover the kops option later in this section. You can find the instructions under the Spot module: https://ec2spotworkshops.com/using_ec2_spot_instances_with_eks/spotworkers.html
You can read through the module for the full details.

If you’re just interested in setting up a Mixed ASG with Spot Instances in your EKS cluster, you can simply use eksctl with a configuration file that contains the required parameters for a Mixed ASG.

eksctl create nodegroup -f eksctl-mixed.yaml
And here is my configuration file, which is similar to the example in the EKS/Spot workshop:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: <your existing cluster name>
  region: <AWS Region where you started your EKS cluster>
nodeGroups:
  - name: spot-node-group-2vcpu-8gb
    minSize: 3
    maxSize: 5
    desiredCapacity: 3
    instancesDistribution:
      instanceTypes: ["m5.large", "m5d.large", "m4.large", "m5a.large", "m5ad.large", "m5n.large", "m5dn.large"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"
    labels:
      lifecycle: Ec2Spot
    iam:
      withAddonPolicies:
        autoScaler: true
  - name: spot-node-group-4vcpu-16gb
    minSize: 3
    maxSize: 5
    desiredCapacity: 3
    instancesDistribution:
      instanceTypes: ["m5.xlarge", "m5d.xlarge", "m4.xlarge", "m5a.xlarge", "m5ad.xlarge", "m5n.xlarge", "m5dn.xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"
    labels:
      lifecycle: Ec2Spot
    iam:
      withAddonPolicies:
        autoScaler: true

What we’re doing here is adding two Spot node groups (ASGs), each running 100% Spot with 7 instance types that have the same vCPU and memory size. That is important for cluster-autoscaler, which assumes homogeneous node groups. The Spot allocation strategy is capacity-optimized, so every time the ASG scales out, an instance is launched from the Spot capacity pools that have the most spare capacity.
To further diversify my usage, I can add more node groups with other sizes, for example all the *.2xlarge instance types. The EKS/Spot workshop module has a more detailed example.
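The lifecycle: Ec2Spot label in the config above is there so you can steer interruption-tolerant workloads onto the Spot nodes with a nodeSelector (or node affinity). A minimal sketch, using a hypothetical deployment name:

# Hypothetical deployment that should only run on Spot-labeled nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateless-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: stateless-web
  template:
    metadata:
      labels:
        app: stateless-web
    spec:
      nodeSelector:
        lifecycle: Ec2Spot        # matches the label set on the Spot node groups
      containers:
      - name: web
        image: nginx:1.17
        resources:
          requests:
            cpu: 250m
            memory: 256Mi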

Within a couple of minutes, eksctl will bring up a new Mixed ASG according to your configuration file, and you’ll have your new Spot Instances running in your EKS cluster.
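To confirm the nodes joined the cluster with the expected label, instance types and zones, something like this should do (the instance-type and zone label keys shown are the beta variants that were current at the time of writing):

# List Spot nodes with their instance type and AZ
kubectl get nodes -l lifecycle=Ec2Spot \
    -L beta.kubernetes.io/instance-type \
    -L failure-domain.beta.kubernetes.io/zone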

Adding Spot Instances to Kubernetes clusters using Kops

Kops also supports adding Mixed ASGs out of the box. So once I had my kops cluster running, I ran kops edit ig --name=<cluster-name> nodes and added my Spot instance types under the mixedInstancesPolicy section just like the GitHub readme file describes, and then ran kops update to apply it.

  mixedInstancesPolicy:
    instances:
      - m5.large
      - m4.large
      - m5a.large
      - m5d.large
      - m5ad.large
      - m5n.large
      - m5dn.large
      - t3.large
    onDemandAboveBase: 0
    spotAllocationStrategy: capacity-optimized

This configuration means that I will have zero On-Demand Instances in my ASG (onDemandAboveBase: 0), and that the ASG will use the capacity-optimized allocation strategy, so in each of the configured Availability Zones in my cluster it will launch instances from whichever of the specified instance types has the most spare Spot capacity.

Autoscaling Spot worker nodes

[To autoscale your applications in the Kubernetes cluster in response to variable load, you will probably use Horizontal Pod Autoscaler, but I’m not going to touch on this subject here simply because it’s well-documented. Instead, the focus of this post is only scaling the underlying hardware (worker nodes) when running on Spot Instances.]

Enter the Kubernetes cluster-autoscaler

Cluster-autoscaler is an open-source tool that ships outside of the core Kubernetes code. You install and run it in your cluster like any other deployment and point it at your ASGs. When it sees pods in Pending state due to a lack of available resources (CPU/memory) in the cluster, it sends an API call to the ASG to increase the desired capacity of the group by the number of instances it calculates are required to fit the pending pods. When there are idle worker nodes in the cluster, it decreases the size of the ASGs.

If you are reading this post after already doing some research on how to use Spot with cluster-autoscaler (CA from here on), then you know that the plot thickens here. Officially, CA only supports homogeneous node groups: in our context, it means that it needs to be pointed at ASGs that run only a single instance type (which is how ASGs worked from when they were launched by AWS in 2010 until late 2018, when the new Mixed ASG type was introduced). This is what the official documentation says, and users/contributors in CA’s GitHub repo are also strongly opinionated that you should not use CA with an ASG that has multiple instance types in it.

The reason for CA’s lack of support for heterogeneous node groups is its decision-making algorithm (i.e. its simulation). It assumes that all nodes in a group have the same hardware specs (CPU/memory), so when it needs to make a scaling decision it knows exactly how much CPU/memory it will be adding to or removing from the cluster.

So what’s the solution? You guessed it (or you just read the previous section about creating your Mixed ASG node group): simply use instance types that are similar in size, from different EC2 instance families and generations. This allows you to follow the Spot best practice of diversifying in order to achieve and maintain your desired scale, and it should not have any impact on the way CA works with the ASG.

Cluster-autoscaler with a Mixed ASG in action

I followed the Autoscaling module on ec2spotworkshops.com to set up CA on my EKS cluster and pointed it at the Mixed ASG with multiple Spot instance types that I created in the previous step. I configured my Mixed ASG name in the --nodes parameter and also changed the image parameter to the latest version (1.14.6, which works with EKS 1.14).

spec:
  containers:
  - command:
    - ./cluster-autoscaler
    - --v=4
    - --stderrthreshold=info
    - --cloud-provider=aws
    - --skip-nodes-with-local-storage=false
    - --nodes=2:15:eksworkshop-eksctl9-mixed-spot-NodeGroup-10AYDZ7OBGUAG
    env:
    - name: AWS_REGION
      value: eu-west-1
    image: k8s.gcr.io/cluster-autoscaler:v1.14.6

I then scaled my deployment to 20 replicas:
kubectl scale --replicas=20 deployment/nginx-to-scaleout

This caused some pods in the cluster to remain in Pending state due to a lack of CPU resources in the cluster.
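You can list the pending pods yourself with a field selector:

kubectl get pods --field-selector=status.phase=Pending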

I then followed the cluster-autoscaler log:
kubectl logs -f deployment/cluster-autoscaler -n kube-system

cluster-autoscaler identifying that there are unschedulable pods in the cluster and increasing the size of the Mixed ASG from 7 to 14 instances in order to schedule the pods.

The result of the CA scaling activity in the EC2 Instances console:

EC2 Instances console filtered by the Auto Scaling group name

My scaled deployment is now running on Spot Instances on a diversified set of instance types across three availability zones, increasing my chances of getting the desired capacity and keeping the desired capacity in case of Spot interruptions.

Here is a quick illustration that demonstrates configuring cluster-autoscaler against multiple ASGs, where each ASG has similarly sized instance types. Because I’m using the capacity-optimized allocation strategy, every time an ASG scales out it will choose the instance type that is least likely to be interrupted, effectively increasing the stability and resilience of my cluster.
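For reference, pointing CA at several similarly sized node groups is just a matter of repeating the --nodes flag. A sketch with placeholder ASG names (use the actual Auto Scaling group names that eksctl or kops created for you):

# Placeholder ASG names; substitute the real group names from your account
spec:
  containers:
  - command:
    - ./cluster-autoscaler
    - --cloud-provider=aws
    - --nodes=2:15:<ASG name of the 2vCPU/8GB node group>
    - --nodes=2:15:<ASG name of the 4vCPU/16GB node group>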

Cluster-autoscaler — the bottom line

A Mixed ASG with multiple, similarly sized instance types works with cluster-autoscaler from version 1.14. Now that you have one or more node groups based on Mixed ASGs, if you want to install and run CA against them, you can follow the Autoscaling module on ec2spotworkshops.com and just point CA to the Mixed ASG (or multiple ASGs) that you created with eksctl, the CloudFormation template for EKS, or kops.

Other tools to run on your cluster

You now have Spot Instances in your cluster running inside a Mixed ASG, which is autoscaled by cluster-autoscaler.

There are a few more recommended tools to run in order to have a robust and resilient cluster:

  1. The AWS Node Termination Handler is an operational DaemonSet built to run on any Kubernetes cluster using AWS EC2 Spot Instances. When a user starts the termination handler, the handler watches the AWS instance metadata service for spot instance interruptions within a customer's account. If a termination notice is received for an instance that’s running on the cluster, the termination handler begins a multi-step cordon and drain process for the node.
    https://github.com/aws/aws-node-termination-handler
  2. Amazon EKS Node Drainer provides a means to gracefully terminate nodes of an EKS cluster when an ASG initiates a “scale-in” event, i.e. if you did not suspend the AZRebalance process in the ASG (covered earlier in the post). Note that cluster-autoscaler does its own draining when scaling down nodes.
    The code provides an AWS Lambda function that integrates as an Amazon EC2 Auto Scaling groups lifecycle hook. When called, the Lambda function calls the Kubernetes API to cordon, and evict all pods from the node being terminated. The Auto Scaling group continues to terminate the EC2 instance once all pods placed on it are evicted.
    https://github.com/aws-samples/amazon-eks-node-drainer
    Note: this tool is built for EKS but can be adapted to use with Kops. Please leave a response if you have a premade tool for Kops for this purpose or have any questions about implementing this.
  3. Kubernetes Descheduler works to increase the efficiency of the cluster by continuously performing evaluations and evicting pods so they can be rescheduled onto under-utilized nodes. This is necessary because the Kubernetes scheduler only makes a one-time decision when a pod needs to be scheduled, but clusters are dynamic, and Descheduler helps keep worker node utilization balanced. This is a necessary tool if you are going to autoscale your cluster based on resource reservations, but it can also be used alongside cluster-autoscaler.
    https://github.com/kubernetes-incubator/descheduler
  4. If you are going to use cluster-autoscaler and are looking to keep some headroom (over-provisioning) in your cluster, to allow for faster scale-out and to let pods from interrupted Spot Instances be rescheduled quickly on other worker nodes, the official recommendation for achieving headroom from the CA FAQ is:
    “Overprovisioning can be configured using deployment running pause pods with very low assigned priority (see Priority Preemption) which keeps resources that can be used by other pods. If there is not enough resources then pause pods are preempted and new pods take their place. Next pause pods become unschedulable and force CA to scale up the cluster.”
    The cluster-overprovisioner tool can help you implement this approach (see the sketch after this list):
    https://github.com/helm/charts/tree/master/stable/cluster-overprovisioner
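To make the over-provisioning approach from item 4 concrete, here is a minimal sketch of the pause-pod pattern described in the CA FAQ. The names, replica count and resource requests are only examples; size them to match the headroom you actually want:

# Minimal over-provisioning sketch; names and sizes are examples
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                         # lower than the default pod priority of 0
globalDefault: false
description: "Low-priority class for headroom pause pods"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: kube-system
spec:
  replicas: 2                     # how many "slots" of headroom to keep
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
        resources:
          requests:               # each pause pod reserves this much capacity
            cpu: "1"
            memory: 1Gi

When real pods need the capacity, the pause pods are preempted and go Pending, which in turn makes CA scale the node groups out, exactly as the FAQ excerpt describes.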

Buy instead of build

AWS customers are running large-scale and mission-critical Kubernetes clusters with Spot Instances using EC2 Auto Scaling groups, but some of the steps required to run a resilient Kubernetes cluster on Spot Instances might not fit all users. If you would like to use a paid, turn-key solution, I strongly recommend Spotinst Ocean. Spotinst is an AWS Advanced Technology Partner that has been focusing on cost-optimizing AWS customers’ workloads with Spot Instances since 2015, and has made huge strides in cost-optimizing Kubernetes workloads with its Ocean service. The service basically manages everything discussed in this post for you, including intelligent selection of Spot Instances according to pod specs as well as price and availability, and keeping automatic, precise headroom in the cluster, all while running a heterogeneous node group to meet the Spot best practice of diversifying across many capacity pools. It also has other useful features, such as letting you prioritize the desired instance types for your cluster, a pod right-sizing recommendation tool, and reporting the costs of your running deployments for showback/chargeback purposes. To get started, you can visit the Ocean module on eksworkshop.com

Summary and takeaways

I hope you found this post useful and that it will help you adopt Spot Instances in your Kubernetes clusters to achieve some significant cost savings. Please feel free to leave a response or contact me on LinkedIn for any feedback, comments, questions, or other things that you’d like to see covered in this or future posts.


I’m a Solutions Architect at AWS. This blog represents my own views.