Breaking Down Containers | Part 0 — System Architecture

Emmanuel Bakare · Published in ITNEXT · Nov 12, 2019

Things to note in this article

1. This is not an absolute beginner article on containers; you've read too many of those. This is more of a past-beginner intro.
2. You're not learning about Docker and what it does in its simplest form here; you're learning about containers and container runtimes.
3. Prior knowledge of Linux helps but isn't strictly necessary.
4. Are you using Windows as a developer in 2019?
5. I cover a lot more about Linux system architecture here, so do take a moment or two to read about something new!
6. There will be code demos in C following other tutorials, nothing complex, but just a heads up!

What are Containers?

Containers are the headline of the current cloud computing era, with the advent of Kubernetes, Docker Compose, Apache Mesos, Consul etc.

To actually understand the skeletal makeup of containers, you need to know a couple of things first:

  1. Linux Kernel User & System Space
  2. Syscalls and Capabilities
  3. Cgroups
  4. Namespaces
  5. EIAF (Everything Is A File), a description of the Unix based Filesystem

These 5 things are important to really understand how containers work and why we need a Linux VM to run containers on Windows and macOS, despite the macOS kernel (XNU) being POSIX compliant and partly derived from BSD.

Why Am I Starting Off Like This?

I'm going in-depth because, to be quite petty, most articles on containers focus on things like Docker, Dockerfiles and infamous commands like docker build -t . . The basic parts will still pop up in later installments, but the underlying layers behind how they work are what I'll be covering in the series to come.

So I'm taking a different approach and heading straight to what makes containers work and the full envelope of their makeup and functionality.

Let's start off by understanding the Linux kernel space and user space to improve our decisions and deepen our general understanding of what containers really do.

Linux Kernel Space

In Linux, we have two spaces where applications generally run: the kernel space and the user space. On a 32-bit system with the default kernel configuration, the user space takes the 0–3GB range of virtual memory whilst the kernel space takes the 3–4GB range; more in-depth details here.

The kernel space is the region of memory where the kernel and its low-level services run. The user space is the environment where our user processes function and execute.

How User Space Interacts With Kernel Space Through System Calls

source: https://redhat.com/en/blog/architecting-containers-part-1-why-understanding-user-space-vs-kernel-space-matters

The two memory spaces are separated by a finely tuned privilege mechanism called protection rings. These rings define how privileged or unprivileged an application needs to be before certain actions can be granted.

Ring Layers for X86 Systems

source: https://en.wikipedia.org/wiki/Protection_ring

These rings are not peculiar to Linux but are a well defined layout in operating systems, although the functional area of each level is assigned based on the CPU architecture the OS is running on. To switch between the user and kernel space, we perform an operation through a system call, simply called a syscall.

This uses defined kernel functions accessible from user space applications to request access to kernel level functionality. The diagram below perfectly explains how this order is defined.

A hierarchical overview of the operating system layers

Whenever an application makes a request to a kernel level function, an interrupt is sent which tells the processor to stop whatever it is doing and attend to that particular request; you can think of it like context switching if that makes it easier to understand. Provided the user space application has the relevant permission, there's a context switch to the kernel space. The user space application then awaits a response while the required program/functionality in the kernel space is executed through the aid of the appropriate interrupt handler.

/* Shorthand from the Linux Journal zero-copy article: mmap() is a C library
 * function that maps a file into the process's address space; its real
 * signature is mmap(addr, len, prot, flags, fd, offset).
 * Memory mapping is a kernel-space resource, so under the hood the library
 * issues the mmap syscall and the Linux kernel services the request. */
tmp_buf = mmap(file, len);
An example of a user space to kernel space operation: the mmap call used for zero-copy

source: https://web.archive.org/web/20190808074654/https://www.linuxjournal.com/article/6345

Next Up Is SysCalls And Capabilities

System calls, aka syscalls, are an API which allows a small part of kernel functionality to be exposed to user level applications. "A small part" is really stressed to inform whoever is reading that syscalls are limited and generic, each serving a purpose. They are not the same across every operating system and differ in both definition and mode of access.

System Calls For Unix And Windows Respectively

source: https://www.tutorialspoint.com/system-calls-in-unix-and-windows

Tracing back to the previous mmap example, it is not listed in the image as that is just a small list; the full list of syscalls in Linux is obtainable here.
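To make this concrete, here is a minimal sketch of mine (not from the original article) showing the same kernel functionality reached two ways: through the glibc wrapper getpid() and through the raw syscall interface.

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* The libc wrapper and the raw syscall number hit the same kernel code. */
    printf("getpid()            -> %d\n", getpid());
    printf("syscall(SYS_getpid) -> %ld\n", syscall(SYS_getpid));
    return 0;
}

Both lines print the same PID; the wrapper is just a thin convenience around the system call.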

Sometimes we have a group of privileged syscalls that we want to bundle together; we do this using a Linux kernel feature called Capabilities. These are predefined sets of privileges which a running program can be granted or limited by.

Capabilities further enhance syscalls by grouping related ones into defined privileges that can be granted or denied at once. This prevents even root-level applications from exploiting restricted kernel functionality that requires reserved permissions.

There are several Linux capabilities; most will be visited in a later article on how they integrate with containers alongside seccomp profiles and, more specifically, LSMs (Linux Security Modules) like AppArmor, SELinux etc., but you can reference the full list in the man page here.
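As a small taste before that article, here is a minimal sketch of mine (an illustration, not from the man page) that asks the kernel whether a given capability is still present in the process's bounding set, using prctl:

#include <stdio.h>
#include <sys/prctl.h>
#include <linux/capability.h>

int main(void)
{
    /* CAP_NET_BIND_SERVICE lets a process bind ports below 1024 without full root. */
    int ret = prctl(PR_CAPBSET_READ, CAP_NET_BIND_SERVICE, 0, 0, 0);

    if (ret == 1)
        printf("CAP_NET_BIND_SERVICE is in the bounding set\n");
    else
        printf("CAP_NET_BIND_SERVICE is not available (ret=%d)\n", ret);
    return 0;
}

Checking a single capability in the bounding set; container runtimes drop whole groups of these before starting your process.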

Cgroups

Control groups, usually referred to as cgroups, are a Linux kernel feature which allow processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored. The kernel's cgroup interface is provided through a pseudo-filesystem called cgroupfs. Grouping is implemented in the core cgroup kernel code, while resource tracking and limits are implemented in a set of per-resource-type subsystems (memory, CPU, and so on).

source: http://man7.org/linux/man-pages/man7/cgroups.7.html

In simple terms, cgroups control what we can use. A list of their functions is shown below:

  • Resource limiting: a group can be configured not to exceed a specified memory limit or use more than the desired amount of processors or be limited to specific peripheral devices.
  • Prioritization: one or more groups may be configured to utilize fewer or more CPUs or disk I/O throughput.
  • Accounting: a group's resource usage is monitored and measured.
  • Control: groups of processes can be frozen or stopped and restarted.

source: https://web.archive.org/web/20190808230154/https://www.linuxjournal.com/content/everything-you-need-know-about-linux-containers-part-i-linux-control-groups-and-process

Cgroups function through the use of subsystems (controllers) which modify the runtime environment of a process. There are several controllers available across two versions, v1 and v2.

In the v1 controller space, we have the following:

  • blkio — this subsystem sets limits on input/output access to and from block devices such as physical drives (disk, solid state, or USB).
  • cpu — this subsystem uses the scheduler to provide cgroup tasks access to the CPU.
  • cpuacct — this subsystem generates automatic reports on CPU resources used by tasks in a cgroup.
  • cpuset — this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
  • devices — this subsystem allows or denies access to devices by tasks in a cgroup.
  • freezer — this subsystem suspends or resumes tasks in a cgroup.
  • memory — this subsystem sets limits on memory use by tasks in a cgroup and generates automatic reports on memory resources used by those tasks.
  • net_cls — this subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular cgroup task.
  • net_prio — this subsystem provides a way to dynamically set the priority of network traffic per network interface.
  • ns — the namespace subsystem.
  • perf_event — this subsystem identifies cgroup membership of tasks and can be used for performance analysis.
  • hugetlb — this supports limiting the use of huge pages by cgroups.
  • pids — this controller permits limiting the number of processes that may be created in a cgroup (and its descendants).
  • rdma — The RDMA (Remote DMA) controller permits limiting the use of RDMA / IB-specific resources per cgroup.

For the v2 controller space, we have some of the functionality from v1, as some controllers were not carried over. Linux systems can use both, but the v2 hierarchy is more condensed, with fewer controllers.

  • io — This is the successor of the version 1 blkio controller.
  • memory — This is the successor of the version 1 memory controller.
  • pids — This is the same as the version 1 pids controller.
  • perf_event — This is the same as the version 1 perf_event controller.
  • rdma — This is the same as the version 1 rdma controller.
  • cpu — This is the successor to the version 1 cpu and cpuacct controllers.

As you'll notice, the v2 controllers mirror the version 1 controllers in terms of functionality. Each controller restricts one or more resources. The libraries and tooling around this will be revisited in upcoming parts.
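Because the cgroup interface is a pseudo-filesystem, limiting a process is just a mkdir and a couple of writes. Below is a minimal sketch of mine, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup, the memory controller enabled for the subtree, root privileges, and a hypothetical group name of demo:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return; }
    fputs(value, f);
    fclose(f);
}

int main(void)
{
    /* "demo" is a hypothetical cgroup; creating the directory creates the group. */
    if (mkdir("/sys/fs/cgroup/demo", 0755) != 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* Cap the group's memory usage at 64 MiB via the v2 memory controller. */
    write_file("/sys/fs/cgroup/demo/memory.max", "67108864");

    /* Move the current process into the group. */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file("/sys/fs/cgroup/demo/cgroup.procs", pid);

    printf("pid %s is now limited to 64 MiB of memory\n", pid);
    return 0;
}

Creating a cgroup, capping it at 64 MiB and moving the current process into it; every child spawned afterwards inherits the limit, which is exactly how runtimes cap a container's memory.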

Now Namespaces

@jpetazzo is my favorite developer on this topic, and this tweet sums up the relationship between cgroups, namespaces and the file-based process filesystem which I'll cover in the next section.

To get a very in-depth understanding of what namespaces and cgroups do in containers, watch this video:

A full breakdown of cgroups, namespaces and containers

Now back to namespaces. Namespaces make the container believe that it exists in a completely isolated environment rather than within the main host system. More specifically, the processes within the container see themselves as the only processes on the system.

You can think of it as being in a box whilst inside another box: you think you own the box, but you're just playing dreamland in another person's box, and that person actually owns both boxes. Thanks to this feature, it's possible to run containers in containers, although there are some issues with that which I'll cover later.

The namespace feature itself is embedded in the Linux kernel's makeup.


Name      Clone Flag       Man Doc                  Function
IPC       CLONE_NEWIPC     ipc_namespaces(7)        System V IPC, POSIX message queues
Network   CLONE_NEWNET     network_namespaces(7)    Network devices, stacks, ports, etc.
Mount     CLONE_NEWNS      mount_namespaces(7)      Mount points
PID       CLONE_NEWPID     pid_namespaces(7)        Process IDs
User      CLONE_NEWUSER    user_namespaces(7)       User and group IDs
UTS       CLONE_NEWUTS     uts_namespaces(7)        Hostname and NIS domain name

source: http://man7.org/linux/man-pages/man7/namespaces.7.html

These namespaces provide different functionality.

IPC — Isolates inter-process communication, a big phrase which means that processes can share messages with each other along a channel or pipe, just like water flowing through a pipe. Without namespaces, it's like our main plumbing: one tank (process) can supply data to other processes using that same pipe. With namespaces, that pipe is distinct and restricted to only the processes within the namespace. In Linux, this is exposed from the host through /dev/shm (shared memory) or /dev/mqueue (message queues).

NOTE: The /dev/mqueue interface is heavily used in queue-based applications; you can literally build your own queue, and it's very easy! Details here, man page here, and verify if you have mqueue support using this guide.
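Here is a quick sketch of mine of the POSIX message queue API that sits behind /dev/mqueue (the queue name /demo is made up; link with -lrt on older glibc):

#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>

int main(void)
{
    /* Create (or open) a queue with default attributes; it appears as /dev/mqueue/demo. */
    mqd_t mq = mq_open("/demo", O_CREAT | O_RDWR, 0644, NULL);
    if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

    mq_send(mq, "hello", 5, 0);

    char buf[8192];             /* must be at least the queue's mq_msgsize (8192 by default) */
    unsigned int prio;
    ssize_t n = mq_receive(mq, buf, sizeof(buf), &prio);
    printf("received %zd bytes: %.5s\n", n, buf);

    mq_close(mq);
    mq_unlink("/demo");         /* removes the /dev/mqueue/demo entry */
    return 0;
}

Sending and receiving one message on a POSIX queue; while it exists you can see it as the file /dev/mqueue/demo.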

Network — Responsible for isolating IP addresses, interfaces, network requests, ports etc.

Mount — Restricts the view of mount points, volumes and external data mounts on the host. The process in the namespace runs within its own view of the filesystem.

PID — Isolates process IDs, giving a clean separation between processes on the host and processes in the namespace. Therefore, the instance of bash on the host is different from the one in the container. This is the peculiar one that allows us to run applications in containers as though they were not on the host itself.

User — Restricts UID (user ID) and GID (group ID) assignments for the container user. This effectively helps keep the host secure, since a privileged user inside the container can be mapped to an unprivileged user on the host.

UTS — Used to set or get hostnames, quite straightforward.

All these namespaces are created to isolate resources using system calls like unshare and clone with the CLONE_NEW* flags listed above.
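As a preview of Part 1, here is a minimal sketch of mine that drops the current process into a new UTS namespace and renames it there, leaving the host's hostname untouched (needs root or CAP_SYS_ADMIN):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char name[64];
    const char *newname = "container-demo";   /* just an illustrative hostname */

    /* Detach from the host's UTS namespace (hostname + NIS domain name). */
    if (unshare(CLONE_NEWUTS) != 0) {
        perror("unshare");
        return 1;
    }

    if (sethostname(newname, strlen(newname)) != 0) {
        perror("sethostname");
        return 1;
    }

    gethostname(name, sizeof(name));
    printf("hostname inside the new UTS namespace: %s\n", name);
    return 0;
}

Run hostname in another terminal afterwards and the host is unchanged; that is the isolation the table above describes.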

Finally The Linux FileSystem

First thing to note in Linux: everything is a file. I kid you not, right down from storage and serial devices, which are all under /dev/*, to the list of filesystems in /proc/filesystems, all the way to even the cgroups running on a host. Most of the interactions between different filesystems are handled by the Virtual File System (VFS) layer, but that is a topic for another day.

List of all the supported filesystems on my vagrant box

Because everything is a file, I can literally cat (a command that prints all the text in a file) /proc/filesystems to see the supported filesystems.
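The same read can of course be done from code; a minimal sketch of mine:

#include <stdio.h>

int main(void)
{
    /* /proc/filesystems is just a readable file listing what the running kernel supports. */
    FILE *f = fopen("/proc/filesystems", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);

    fclose(f);
    return 0;
}

Reading the supported filesystems the same way cat does, because everything is a file.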

So where does this fit into containers?

Containers, as hinted before, are processes. The way we've all learnt about the difference between containers and VMs is that containers share the kernel of the host and some of its resources. The main hint here is the resources: containers use a different root filesystem to start off their own operations, and the actual filesystem that the container (hint: they're processes) uses to start off is the image.

The image is a Linux filesystem, mostly kept as a gzipped tarball until it needs to run, from which it executes using some COW (Copy On Write) storage driver such as OverlayFS, AUFS, device-mapper, Btrfs etc.; there are quite a few.

For those with Docker installed, you can run the commands below to see the internals of a Docker image (not a container; containers are processes, running images etc.).

mkdir rootfs && \
docker export $(docker create ubuntu:18.04) | tar -C rootfs -xvf -

Linux VM on Ubuntu:18.04

The main Linux filesystem is quite peculiar, as you'd notice the vmlinuz and initrd.img files, which I'll come back to in a bit.

The filesystem inside the Ubuntu 18.04 image, the same release as the host

Here we notice we don't have the initrd and vmlinuz files seen on the main filesystem; this is because those two are kernel files.

InitRD — Init Ram Disk

The initial RAM disk (initrd) is an initial root file system that is mounted prior to when the real root file system is available. The initrd is bound to the kernel and loaded as part of the kernel boot procedure. The kernel then mounts this initrd as part of the two-stage boot process to load the modules to make the real file systems available and get at the real root file system.

source: https://developer.ibm.com/articles/l-initrd/

VMLinuz — Virtual Memory LINUx gZip

vmlinuz is the name of the Linux kernel executable. vmlinuz is a compressed Linux kernel, and it is capable of loading the operating system into memory so that the computer becomes usable and application programs can be run.

On Linux, you might come across either vmlinux or vmlinuz. They're the same, but one of them is compressed.

vmlinuz = Virtual Memory LINUx gZip = Compressed Linux kernel Executable

vmlinux = Virtual Memory LINUX = Non-compressed Linux Kernel Executable

Both the vmlinuz and initrd file are used at boot time.

At the head of this kernel image (vmlinuz) is a routine that does some minimal amount of hardware setup and then decompresses the kernel contained within the kernel image and places it into high memory. If an initial RAM disk image (initrd) is present, this routine moves it into memory (or we can say extract the compressed ramdisk image in to the real memory) and notes it for later use. The routine then calls the kernel and the kernel boot begins.

source: https://developer.ibm.com/articles/l-initrd/

Now here's the main reason we don't have these two files in the container's filesystem, or image, if "container filesystem" is too many syllables:

The container uses the host's kernel!

It doesn't need a boot sequence to get a kernel; all requests from applications within the container are made through the host kernel via syscalls, enforced through rings, capabilities, seccomp, LSMs etc., all the way back, like any normal Linux program.

The main idea here is that containers just use an entirely different filesystem, but they share the same Linux kernel. For those of us who know Linux a bit, we know we can chroot into a foreign Linux filesystem and operate in it as though that filesystem had been booted up, provided all the necessary files from the main host are bind-mounted within that folder.

A clean example of how chroot differs from actual containers

If you've gone through certain parts of this, you'll discover containers are just chroot on steroids, packaged with namespaces, cgroups and a lot of other cool features to make application sandboxing as secure as possible on the same host.
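To drive the point home, here is a minimal sketch of mine that chroots into the rootfs directory extracted with docker export above and execs a shell from it (run as root; with no namespaces or cgroups it is emphatically not a real container):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* "./rootfs" is the directory populated by the docker export command above. */
    if (chroot("./rootfs") != 0) { perror("chroot"); return 1; }
    if (chdir("/") != 0)         { perror("chdir");  return 1; }

    /* From here on, /bin/bash is the one inside rootfs, not the host's. */
    execl("/bin/bash", "bash", (char *)NULL);
    perror("execl");
    return 1;
}

chroot plus exec gives you the container's filesystem, but you still see the host's processes, network and hostname, which is what namespaces fix.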

In summary, what’s a container?

A container is a runtime process executed within namespaces, resource-managed by cgroups and guarded by LSMs and various other security features to ensure complete process isolation during runtime. These processes in the container are automated, amongst other things, by container runtimes like Docker, which simplify a lot of the things discussed, but the main underlying layers I've explained all stay the same regardless.

So what’s next?

In the next part (Part 1), I'll be covering namespaces and cgroups more in-depth: what unshare is, and how the various cgroups come together to help with process isolation so we don't see tail in our next pseudo-chroot kind of exec.

Resources

https://www.kernel.org/doc/Documentation/filesystems/proc.txt

https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt
