Using overlay mounts with Kubernetes

Amartey Pearson
Published in ITNEXT · Apr 18, 2019

Have you ever wanted to give a Kubernetes pod read/write access to a persistent volume without having the pod’s changes persisted in the backing storage? This is different from simply mounting the volume read-only: the pod, for one reason or another, still needs to be able to write to the filesystem.

Volume changes in pods are ephemeral

My Use Case

The specific use case I ran across had to do with providing a standard set of base conda environments while letting the pod customize them. For those not familiar, conda is a package manager often used to distribute scientific software packages. Each environment is a self-contained set of packages, similar to a Python virtual environment. In this use case we provide a stock set of environments, but should the pod (user) decide we’re missing a package they really want, they can simply install it ephemerally within their pod without it becoming visible to other users of the stock environments. That said, this solution applies to many other use cases, and you don’t need to know anything about conda to continue reading.

Core Idea

The standard Linux overlay filesystem does exactly this. It lets you take a common (typically read-only) filesystem and mount it so that the user can interact with all of its contents in read/write mode, with every change captured in a separate writable layer and the original filesystem left untouched.
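
For reference, this is what a plain overlay mount looks like on a Linux host (the paths here are purely illustrative): the lower directory stays read-only, writes land in the upper directory, and the merged view appears at the mount point.

# illustrative paths only
mkdir -p /tmp/{upper,work,merged}
mount -t overlay -o lowerdir=/opt/readonly-base,upperdir=/tmp/upper,workdir=/tmp/work overlay /tmp/merged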

Simple Solution

The simplest thing to do would be to create the overlay mount within the container itself: the container would access the common volume read-only and overlay it in place. Unfortunately, this doesn’t meet my requirements, because it forces the end-user’s container to have escalated privileges (SYS_ADMIN) and knowledge of the overlay mount. In my case the end-user’s container runs arbitrary end-user code and so needs to be tightly locked down.
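
To make that work, the user-facing container would need roughly the following in its spec, granting CAP_SYS_ADMIN, which is exactly what we don’t want to hand to arbitrary user code:

securityContext:
  capabilities:
    add: ["SYS_ADMIN"]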

Full Solution

Sidecar / Setup Container: You need a new sidecar container running in the pod alongside the runtime container. This sidecar is responsible for:

  • Creating an overlay mount that overlays the read-only original conda environment with an emptyDir ephemeral mount.
  • Propagating the mount so the runtime container has access to the overlay.
  • Installing additional packages or creating a brand new environment, based on whatever the end-user has defined.

Runtime Container: The runtime container gets access to the same conda-mount volume as the setup container, with all changes visible. In my use case I chose to make the resulting mount read-only to avoid the user further filling up the overlay volume.
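
Side by side, the pod looks roughly like this (volume and mount names are taken from the full example further down):

Pod
├── setup container (privileged)
│     /condaro        <- local-vol: read-only base conda environments
│     /data           <- overlay: emptyDir holding the upper and work directories
│     /opt/anaconda3  <- conda-mount: Bidirectional propagation; the overlay mount point
└── run container (unprivileged, runs arbitrary user code)
      /opt/anaconda3  <- conda-mount: read-only view of the merged overlay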

Details

Volume Mounts

  1. local-vol — this is a hostPath volume to access the base conda environments on the host. Note that this could be any type of persistent volume — I just happen to use hostPath for simplicity.
  2. overlay — this is an emptyDir (ephemeral to the pod, but stored on the host) that contains the overlay’s upper and work directories. This is where all changes to local-vol get stored.
    NOTE: You could back this with memory if you really didn’t want anything persisted to disk (see the snippet after this list), but that eats into host memory, so avoid it.
  3. conda-mount — this is an emptyDir that actually holds zero data. We’re using it as a mount point for the eventual conda environment (/opt/anaconda3). This volume is shared between the two containers in the pod and leverages mount propagation.
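
For completeness, the memory-backed variant mentioned in the note above would be declared like this (a tmpfs on the host, so it eats host memory):

volumes:
- name: overlay
  emptyDir:
    medium: Memory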

Mount Propagation

volumeMounts:
- mountPath: /opt/anaconda3
  name: conda-mount
  mountPropagation: Bidirectional

Any mount changes made to the conda-mount volume in the setup container get propagated to the host, which in turn allows the overlay mount to be seen by the runtime container as well. This is the real magic.

Note that the mount that is propagated is the conda-mount, which contains zero data. It is simply the overlay mount point. The original local-vol content is not affected.

This does require the setup container to run in privileged mode (Kubernetes enforces this for Bidirectional mount propagation).

securityContext:
  privileged: true

Note that the runtime container does not need to run privileged. This is what we want since the runtime container can run arbitrary code.
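
For contrast, the runtime container’s mount of the same volume (excerpted from the full example below) is an ordinary read-only mount with no special privileges or propagation settings:

volumeMounts:
- mountPath: /opt/anaconda3
  name: conda-mount
  readOnly: true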

Setup Container

Tasks: The setup container does the following:

args:
- mkdir -p /data/{upper,work};
  mount -t overlay -o lowerdir=/condaro/,upperdir=/data/upper,workdir=/data/work overlay /opt/anaconda3;
  . /opt/anaconda3/etc/profile.d/conda.sh;
  conda activate dlipy3;
  conda install agate -y;
  conda deactivate;
  touch /opt/anaconda3/setup_complete;
  tail -f /dev/null;
  1. Sets up the overlay mount. It uses the ephemeral overlay volume to host the upper and work directories of the overlay mount (this is where all changes persist).
  2. Makes ephemeral modifications: in this case it activates the dlipy3 conda environment, installs the agate package, and then deactivates the environment (to avoid blocking a future unmount).
  3. Adds a flag file indicating that setup is done. The file lives in the shared overlay mount so the runtime container can wait until the setup container has finished.
  4. Waits, since we’ve got to keep the container running.

preStop hook: When you use mount propagation, you’ve got to unmount things. If you don’t, Kubernetes will leave the pod stuck in the Terminating state because the volume can’t be cleaned up. So we add a preStop hook to make sure the overlay mount is removed.

lifecycle:
  preStop:
    exec:
      command: ["umount", "/opt/anaconda3"]

Runtime Container

The runtime container simply mounts the conda-mount volume (the overlay mount point) and uses its readinessProbe to check for the existence of the setup_complete file in the shared mount. This is important so the runtime container doesn’t start consuming the mount until setup is complete.

readinessProbe:
  exec:
    command:
    - cat
    - /opt/anaconda3/setup_complete
  initialDelaySeconds: 0
  periodSeconds: 1
  failureThreshold: 300

Putting It All Together

This is the complete YAML example. A simple kubectl create -f example.yaml should do the trick. You may wish to modify the hostPath path.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-overlay-mounts
  labels:
    app: test-me
spec:
  selector:
    matchLabels:
      app: test-me
  template:
    metadata:
      labels:
        app: test-me
    spec:
      containers:
      - securityContext:
          privileged: true
        image: ubuntu:18.04
        name: setup
        command: [ "/bin/bash", "-c", "--" ]
        args:
        - mkdir -p /data/{upper,work};
          mount -t overlay -o lowerdir=/condaro/,upperdir=/data/upper,workdir=/data/work overlay /opt/anaconda3;
          . /opt/anaconda3/etc/profile.d/conda.sh;
          conda activate dlipy3;
          conda install agate -y;
          conda deactivate;
          touch /opt/anaconda3/setup_complete;
          tail -f /dev/null;
        volumeMounts:
        - mountPath: /condaro
          name: local-vol
          readOnly: true
        - mountPath: /data
          name: overlay
        - mountPath: /opt/anaconda3
          name: conda-mount
          mountPropagation: Bidirectional
        lifecycle:
          preStop:
            exec:
              command: ["umount", "/opt/anaconda3"]
      - image: ubuntu:18.04
        name: run
        command: [ "tail", "-f", "/dev/null" ]
        volumeMounts:
        - mountPath: /opt/anaconda3
          name: conda-mount
          readOnly: true
        readinessProbe:
          exec:
            command:
            - cat
            - /opt/anaconda3/setup_complete
          initialDelaySeconds: 0
          periodSeconds: 1
          failureThreshold: 300
      volumes:
      - name: local-vol
        hostPath:
          path: /opt/anaconda3/
      - name: overlay
        emptyDir: {}
        # medium: Memory
      - name: conda-mount
        emptyDir: {}

Why Didn’t You Do … ?

Why didn’t you use an initContainer for the setup?
Because somebody has to unmount the overlay. If we don’t, Kubernetes can’t clean up the pod due to a hanging mount, so we need to clean up the mount on pod exit. An initContainer exits before the main containers even start, so it can’t be the one to tear the mount down when the pod terminates.

We could unmount it in the run container, but then the run container would need to be a privileged container — which we do not want.

Why didn’t you use the run container to install the new stuff?
Because I wanted the runtime container to not have any smarts about modifying the underlying conda environments. There’s no requirement here — just a preference to keep things common whether the user requests changes to the conda environments or not. We also run arbitrary code in the runtime container, and want to keep the runtime container as locked down as possible.

Why didn’t you just use an initContainer to copy files into the ephemeral volume?
Because it’s slow. It would achieve the same technical result, but copying a full conda install on every pod start is very expensive, whereas an overlay mount is effectively instantaneous.
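
For comparison, the copy-based approach would look roughly like this (a hypothetical initContainer, shown only to illustrate the cost; every pod start would copy the entire base install into the emptyDir):

initContainers:
- name: copy-conda
  image: ubuntu:18.04
  command: ["/bin/bash", "-c", "cp -a /condaro/. /opt/anaconda3/"]
  volumeMounts:
  - mountPath: /condaro
    name: local-vol
    readOnly: true
  - mountPath: /opt/anaconda3
    name: conda-mount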

Why didn’t you just mount the conda environment read-only and take advantage of conda’s use of ~/.conda?
If your base conda install is read-only, conda will automatically put new environments and package caches into ~/.conda (a quick illustration follows the list below). There are two reasons this wasn’t a complete solution for us:

  1. It doesn’t work for modifying an existing environment (e.g., conda install X). Updating an existing environment in place is much faster than recreating it from scratch with the same content (depending on the environment size).
  2. Conda can’t leverage hard links across filesystems, so when you create a new environment in ~/.conda/ it copies the files instead of hard-linking them, which increases both storage use and creation time.
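
As a quick illustration of that fallback behavior (a hypothetical session; exact paths depend on the user’s home directory), with /opt/anaconda3 mounted read-only:

# Creating a new environment succeeds, but it lands under the user's home
conda create -n scratch -y python=3.7
conda env list        # scratch appears under ~/.conda/envs/ rather than /opt/anaconda3/envs/

# Modifying a stock environment still fails, because /opt/anaconda3 is not writable
conda install -n dlipy3 agate -y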


I work as a Senior Technical Staff Member in the IBM Cloud Infrastructure Austin development lab, where I’m the architect for Bare Metal Servers for VPC.