EKS GPU Cluster from Zero to Hero

Alexei Ledenev · Published in ITNEXT · Jul 20, 2019

Introduction

If you have ever tried to run a GPU workload on a Kubernetes cluster, you know that this task requires non-trivial configuration and comes with a high price tag (GPU instances are quite expensive).

This post shows how to run a GPU workload on a Kubernetes cluster in a cost-effective way, using an Amazon EKS cluster, AWS Auto Scaling, Amazon EC2 Spot Instances, and a few Kubernetes resources and configurations.

EKS Cluster Plan

First, we need to create a Kubernetes cluster that consists of mixed nodes: non-GPU nodes for management and generic Kubernetes workloads, and more expensive GPU-powered nodes to run GPU-intensive tasks, like machine learning, medical analysis, seismic exploration, video transcoding, and others.

These node groups should be able to scale on demand (scale out and scale in) for the generic nodes, and from 0 to the required number and back to 0 for the expensive GPU instances. Moreover, to do this in a cost-effective way, we are going to use Amazon EC2 Spot Instances both for the generic nodes and the GPU nodes.

AWS EC2 Spot Instances

With Amazon EC2 Spot Instances you can save up to 90% compared to the On-Demand price. Previously, Spot instances were terminated in ascending order of bids, and market prices fluctuated frequently as a result. In the current model, Spot prices are more predictable, updated less frequently, and determined by Amazon EC2 spare capacity, not bid prices. The EC2 service can reclaim Spot instances when there is not enough capacity for a specific instance type in a specific Availability Zone. Spot instances receive a 2-minute alert when they are about to be reclaimed and can use this time for a graceful shutdown and state change.
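
If you want your workload to react to that notice itself, one common approach is to poll the EC2 instance metadata service from inside the instance. The snippet below is a minimal sketch of such a check; the metadata path is the standard Spot interruption endpoint, while the loop, sleep interval, and log message are purely illustrative:

# poll the EC2 instance metadata service for a Spot interruption notice
# (the endpoint returns HTTP 404 until a notice is issued)
while true; do
  if curl -s -f http://169.254.169.254/latest/meta-data/spot/instance-action > /dev/null; then
    echo "Spot interruption notice received - starting graceful shutdown"
    break
  fi
  sleep 5
done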

The Workflow

1. Create EKS Cluster

It is possible to create an Amazon EKS cluster using the AWS CLI, CloudFormation or Terraform, the AWS CDK, or eksctl.

eksctl CLI tool

In this post, eksctl (a CLI tool for creating clusters on EKS) is used. It is possible to pass all parameters to the tool as CLI flags or in a configuration file. Using a configuration file makes the process more repeatable and automation friendly.

eksctl can create or update an EKS cluster and the additional required AWS resources, using CloudFormation stacks.

Customize your cluster by using a config file. Just run

eksctl create cluster -f cluster.yaml

to apply a cluster.yaml file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: test-cluster
  region: us-west-2
nodeGroups:
  - name: ng
    instanceType: t3.micro
    desiredCapacity: 10

A new EKS cluster with 10 t3.micro On-Demand EC2 worker nodes will be created, and the cluster credentials will be added to the ~/.kube/config file.
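
As a quick sanity check (standard kubectl commands, shown here only for illustration), you can verify that the new worker nodes joined the cluster and that the system pods are running:

kubectl get nodes
kubectl get pods --all-namespaces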

2. Creating node groups

As planned, we are going to create two node groups for Kubernetes worker nodes:

  1. General node group - an autoscaling group with Spot instances to run the Kubernetes system workload and non-GPU workloads
  2. GPU node group - an autoscaling group with GPU-powered Spot Instances that can scale from 0 to the required number of instances and back to 0

Fortunately, eksctl supports adding Kubernetes node groups to an EKS cluster, and these groups can be composed of Spot-only instances or a mixture of Spot and On-Demand instances.
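
For example, if the cluster already exists, the node groups defined in the same configuration file can be added with eksctl as well; a minimal sketch, assuming the full configuration lives in a cluster.yaml file:

# create only the node groups defined in the config file, against an existing cluster
eksctl create nodegroup --config-file=cluster.yaml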

2.1 General node group

The eksctl configuration file defines an EKS cluster in us-west-2 across three Availability Zones, and the first, general-purpose autoscaling (from 2 to 20 nodes) node group running on diversified Spot instances.

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: gaia-kube
  region: us-west-2
availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]

nodeGroups:
  # spot workers NG - multi AZ, scale from 2
  - name: spot-ng
    ami: auto
    instanceType: mixed
    desiredCapacity: 2
    minSize: 2
    maxSize: 20
    volumeSize: 100
    volumeType: gp2
    volumeEncrypted: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      withAddonPolicies:
        autoScaler: true
        ebs: true
        albIngress: true
        cloudWatch: true
    instancesDistribution:
      onDemandPercentageAboveBaseCapacity: 0
      instanceTypes:
        - m4.2xlarge
        - m4.4xlarge
        - m5.2xlarge
        - m5.4xlarge
        - m5a.2xlarge
        - m5a.4xlarge
        - c4.2xlarge
        - c4.4xlarge
        - c5.2xlarge
        - c5.4xlarge
      spotInstancePools: 15
    tags:
      k8s.io/cluster-autoscaler/enabled: 'true'
    labels:
      lifecycle: Ec2Spot
    privateNetworking: true
    availabilityZones: ["us-west-2a", "us-west-2b", "us-west-2c"]
  # next: GPU node groups ...

Now it is time to explain some parameters used in the above configuration file.

  • ami: auto - eksctl automatically discovers the latest EKS-Optimized AMI for worker nodes, based on the specified AWS region, EKS version, and instance type. See the Amazon EKS-Optimized AMI chapter in the User Guide
  • instanceType: mixed - specifies that the actual instance type will be one of the instance types defined in the instancesDistribution section
  • iam - contains a list of predefined and in-place IAM policies; eksctl creates a new IAM role with the specified policies and attaches this role to every EKS worker node. Several IAM policies must be attached to every EKS worker node; read the Amazon EKS Worker Node IAM Role section in the User Guide and the eksctl IAM policies documentation
  • instancesDistribution - specifies a mixed instance policy for EC2 Auto Scaling Groups; read the AWS MixedInstancesPolicy documentation
  • spotInstancePools - specifies the number of Spot instance pools to use; read more
  • tags - AWS tags added to EKS worker nodes; the Kubernetes Cluster Autoscaler uses the k8s.io/cluster-autoscaler/enabled tag for auto-discovery of node groups
  • privateNetworking: true - all EKS worker nodes will be placed into private subnets

Spot Instance Pools

When you are using Spot instances as worker nodes, you need to diversify usage across as many Spot Instance pools as possible. A Spot Instance pool is a set of unused EC2 instances with the same instance type (for example, m5.large), operating system, Availability Zone, and network platform.

eksctl currently supports a single Spot provisioning model: the lowestPrice allocation strategy. This strategy allows the creation of a fleet of Spot Instances that is both cheap and diversified. The ASG automatically deploys the cheapest combination of instance types and Availability Zones based on the current Spot price, across the number of Spot pools that you specify. This combination helps avoid the most expensive Spot Instances.

Spot instance diversification also increases worker node availability: typically, not all Spot Instance pools are interrupted at the same time, so only a small portion of your workload is interrupted, and the EC2 Auto Scaling group replaces interrupted instances from other Spot Instance pools.

Please also consider using a dedicated volume for datasets, training progress (checkpoints), and logs. This volume should be persistent and not be affected by instance termination.
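
A minimal sketch of such a volume as a Kubernetes PersistentVolumeClaim, assuming the default EBS-backed gp2 storage class that EKS provides (the claim name and size are illustrative):

# persistent storage for datasets, checkpoints and logs,
# decoupled from the (interruptible) Spot worker nodes
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 200Gi

Keep in mind that an EBS-backed volume is bound to a single Availability Zone; if the data must be reachable from nodes in several AZs, a shared file system such as EFS or FSx for Lustre is a better fit (the efs/fsx addon policies in the GPU node group below exist for exactly that purpose).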

2.2 GPU-powered node group

The next part of our eksctl configuration file defines the first GPU autoscaling (from 0 to 10 nodes) node group running on diversified GPU-powered Spot instances.

When using GPU-powered Spot instances, it's recommended to create a GPU node group per Availability Zone and to configure the Kubernetes Cluster Autoscaler to avoid automatic ASG rebalancing.

Why is this important? Some GPU-powered EC2 Spot Instances have a relatively high frequency of interruption (>20% for some GPU instance types; check the Spot Instance Advisor), and using multiple AZs while disabling automatic Cluster Autoscaler balancing helps minimize GPU workload interruptions.
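
A minimal sketch of the relevant Cluster Autoscaler settings, assuming the standard cluster-autoscaler Deployment on AWS (only the flags shown here matter for this setup; the rest of the manifest is omitted):

# fragment of the cluster-autoscaler container spec
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  # discover node groups by the ASG tag set in the eksctl config
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
  # keep the per-AZ GPU node groups independent: do not rebalance similar groups
  - --balance-similar-node-groups=false
  - --skip-nodes-with-system-pods=false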

# ... EKS cluster and General node group ...

  # spot GPU NG - west-2a AZ, scale from 0
  - name: gpu-spot-ng-a
    ami: auto
    instanceType: mixed
    desiredCapacity: 0
    minSize: 0
    maxSize: 10
    volumeSize: 100
    volumeType: gp2
    volumeEncrypted: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      withAddonPolicies:
        autoScaler: true
        ebs: true
        fsx: true
        efs: true
        albIngress: true
        cloudWatch: true
    instancesDistribution:
      onDemandPercentageAboveBaseCapacity: 0
      instanceTypes:
        - p3.2xlarge
        - p3.8xlarge
        - p3.16xlarge
        - p2.xlarge
        - p2.8xlarge
        - p2.16xlarge
      spotInstancePools: 5
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true'
      k8s.io/cluster-autoscaler/enabled: 'true'
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: 'true'
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
    privateNetworking: true
    availabilityZones: ["us-west-2a"]
  # create additional node groups for `us-west-2b` and `us-west-2c` availability zones ...

Now it is time to explain some parameters used to configure the GPU-powered node group.

  • ami: auto - eksctl automatically discovers the latest EKS-Optimized AMI with GPU support for worker nodes, based on the specified AWS region, EKS version, and instance type. See the Amazon EKS-Optimized AMI with GPU support section in the User Guide
  • iam: withAddonPolicies - if a planned workload requires access to AWS storage services, it is important to include the additional IAM policies (auto-generated by eksctl)
  • tags - AWS tags added to EKS worker nodes:
  • k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true - Kubernetes node taint template
  • k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true' - Kubernetes node label used by the Cluster Autoscaler to scale the ASG from/to 0
  • nvidia.com/gpu: "true:NoSchedule" - Kubernetes GPU node taint; helps avoid placing non-GPU workloads on expensive GPU nodes

EKS Optimized AMI image with GPU support

In addition to the standard Amazon EKS-optimized AMI configuration, the GPU AMI includes the following:

  • NVIDIA drivers
  • The nvidia-docker2 package
  • The nvidia-container-runtime (as the default runtime)

Scaling a node group to/from 0

Starting with Kubernetes Cluster Autoscaler 0.6.1, it is possible to scale a node group to/from 0, assuming that all scale-up and scale-down conditions are met.

If you are using nodeSelector, you need to tag the ASG with node-template keys under the k8s.io/cluster-autoscaler/node-template/label/ prefix, and under the k8s.io/cluster-autoscaler/node-template/taint/ prefix if you are using taints.

3. Scheduling GPU workload

3.1 Schedule based on GPU resources

The NVIDIA device plugin for Kubernetes exposes the number of GPUs on each node of your cluster. Once the plugin is installed, it's possible to use the nvidia.com/gpu Kubernetes resource on GPU nodes and in Kubernetes workloads.

Run these commands to apply the NVIDIA Kubernetes device plugin as a DaemonSet running only on AWS GPU-powered worker nodes, using tolerations and nodeAffinity:

kubectl create -f kubernetes/nvidia-device-plugin.yaml
kubectl get daemonset -n kube-system

using the nvidia-device-plugin.yaml Kubernetes resource file:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset-1.12
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvidia/k8s-device-plugin:1.11
          name: nvidia-device-plugin-ctr
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: beta.kubernetes.io/instance-type
                    operator: In
                    values:
                      - p3.2xlarge
                      - p3.8xlarge
                      - p3.16xlarge
                      - p3dn.24xlarge
                      - p2.xlarge
                      - p2.8xlarge
                      - p2.16xlarge
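
Once the DaemonSet pods are running, every GPU node should advertise the nvidia.com/gpu resource in its allocatable capacity. A quick way to check (the custom-columns expression is a sketch and may need adjusting for your kubectl version):

# show allocatable GPUs per node (non-GPU nodes show <none>)
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"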

3.2 Taints and Tolerations

Kubernetes taints allow a node to repel a set of pods. Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints. Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.

See Kubernetes Taints and Tolerations documentation for more details.

To run a GPU workload on GPU-powered Spot instance nodes that carry the nvidia.com/gpu: "true:NoSchedule" taint, the workload must include both a matching toleration and a nodeSelector configuration.

Kubernetes Deployment with 10 pod replicas, each with a nvidia.com/gpu: 1 limit:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vector-add
  labels:
    app: cuda-vector-add
spec:
  replicas: 10
  selector:
    matchLabels:
      app: cuda-vector-add
  template:
    metadata:
      name: cuda-vector-add
      labels:
        app: cuda-vector-add
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: cuda-vector-add
          image: "k8s.gcr.io/cuda-vector-add:v0.1"
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU

Deploy the cuda-vector-add Deployment and watch how new GPU-powered nodes are added to the EKS cluster.

# list Kubernetes nodes before running GPU workload
kubectl get nodes --output="custom-columns=NAME:.metadata.name,TYPE:.metadata.labels.beta\.kubernetes\.io\/instance-type"
NAME                                            TYPE
ip-192-168-151-104.us-west-2.compute.internal   c4.4xlarge
ip-192-168-171-140.us-west-2.compute.internal   c4.4xlarge

# deploy GPU workload on EKS cluster with tolerations for nvidia/gpu=true
kubectl create -f kubernetes/examples/vector/vector-add-dpl.yaml

# list Kubernetes nodes after several minutes to see new GPU nodes added to the cluster
kubectl get nodes --output="custom-columns=NAME:.metadata.name,TYPE:.metadata.labels.beta\.kubernetes\.io\/instance-type"
NAME                                            TYPE
ip-192-168-101-60.us-west-2.compute.internal    p2.16xlarge
ip-192-168-139-227.us-west-2.compute.internal   p2.16xlarge
ip-192-168-151-104.us-west-2.compute.internal   c4.4xlarge
ip-192-168-171-140.us-west-2.compute.internal   c4.4xlarge
ip-192-168-179-248.us-west-2.compute.internal   p2.16xlarge

As you can see, three new GPU-powered nodes (p2.16xlarge), across three AZs, have been added to the cluster. When you delete the GPU workload, the Cluster Autoscaler scales the GPU node groups back down to 0 after 10 minutes.
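
To trigger the scale-down, simply delete the GPU workload (a standard kubectl command, using the same manifest as above) and let the Cluster Autoscaler remove the idle GPU nodes:

# remove the GPU workload; GPU node groups scale back to 0 after the cool-down period
kubectl delete -f kubernetes/examples/vector/vector-add-dpl.yaml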

Summary

Follow this tutorial to create an EKS (Kubernetes) cluster with a GPU-powered node group running on Spot instances and scalable from/to 0 nodes.

References

- EKS Spot Cluster GitHub repository with code for this blog
- The definitive guide to running EC2 Spot Instances as Kubernetes worker nodes by Ran Sheinberg
- Kubernetes Cluster Autoscaler
- Taints and Tolerations Kubernetes documentation

Disclaimer

It does not matter where I work; all opinions are my own.

Draft published at https://alexei-led.github.io.
