chroot, cgroups and namespaces — An overview

Published in ITNEXT, May 1, 2018

Introduction

With all the talk about Docker, containers and virtualization, it is becoming more important for programmers to understand the technologies underneath, since that knowledge helps in day-to-day programming. In this article I will try to document the technologies behind Docker-like applications.

root and chroot

In a Unix-like OS, the root directory (/) is the top directory. The root file system sits on the disk partition where the root directory is located, and all other file systems are mounted on top of it. All file system entries branch out of this root. This is the system’s actual root.

But each process has its own idea of what the root directory is. By default it is the actual system root, but we can change this with the chroot() system call. A different root gives the process a separate environment, which can make it easier to run and debug, or lets it use legacy dependencies and libraries.

chroot changes the apparent root directory for the currently running process and its children.

It may appear that by separating a process with chroot() we ensure security, because the process cannot access anything outside its environment. In reality that is not quite true. chroot() simply modifies pathname lookups for a process and its children, so that any name starting with / is resolved relative to the new root. The current working directory is not modified, so if it lies outside the new root, relative paths can still refer to locations outside of it.

So, chroot() does NOT provide a secure sandbox for testing software.

cgroups - Isolate and manage resources

Control groups (cgroups) are a Linux kernel feature that limits, isolates and measures the resource usage of a group of processes. Resource quotas for memory, CPU, network and IO can be set. They became part of the Linux kernel in Linux 2.6.24.

Though Linux is excellent at handling and sharing available resources between processes, sometimes we want finer control. We may want to allocate or guarantee a certain amount of resources to a group of processes. We do this with cgroups, which isolate the resources of an application or group.

Suppose we have an application, A1, whose resource usage we want to isolate from the rest of the system, S. We create a control group and assign resource limits to it, say a 3GB memory limit and 70% of CPU. We then add the application’s process ID to the group, and its resource usage is now constrained. With proportional or soft limits (CPU shares, for example) the application may exceed its allocation while the system is idle and is throttled back to the preset limits only when the system faces a resource crunch; hard limits, such as the memory limit used later in this article, are enforced at all times. This makes even more sense when many VMs run on one machine: put each VM in a cgroup and throttle them individually to a set limit when resource contention happens.

The workflow has four steps:
Define the problem and the limits that solve it
Create a cgroup to handle the allocation
Add applications to the group
Keep monitoring the group (this happens as part of cgroups; we need not handle it explicitly)

To install the cgroup userspace tools on a Debian-based system:

sudo apt-get install cgroup-bin cgroup-lite cgroup-tools cgroupfs-mount libcgroup1

We can now see a /cgroup directory, which is used as the mount point for the cgroup virtual filesystems. The /etc/cgconfig.conf file lists which mounts to expect. Each controller is mounted at /cgroup followed by the controller name, e.g. /cgroup/memory. To mount the requisite controllers, run sudo service cgconfig restart. After this we see directories in /cgroup, each of which can be used to manage one cgroup subsystem.

Creating cgroups

Now that the controllers are mounted on /cgroup, let's cd into any directory in it, say /cgroup/memory, and create a subdirectory there.

mkdir mytest

When we cd into this subdirectory, its contents are almost identical to the parent directory's, except that the release_agent file is missing. We now have a memory subsystem under /cgroup with the mytest cgroup as its child. Let's write a small test program to check our hypothesis about throttling:

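#include <cstdlib>
#include <cstring>
#include <iostream>

int main() {
    const std::size_t kChunk = 1048576;  // 1MB per allocation
    for (int i = 0; i < 50; i++) {
        // The memory is deliberately never freed so that usage keeps growing.
        char* ptr = static_cast<char*>(std::malloc(kChunk));
        if (ptr == NULL) {
            std::cout << "Allocation fails at " << i << "MB\n";
            return 0;
        }
        // Touch every page: memory that is allocated but never written
        // is typically not charged to the cgroup.
        std::memset(ptr, 0, kChunk);
        std::cout << "Allocated " << i + 1 << "MB\n";
    }
    std::cout << "Finished allocation\n";
    return 0;
}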

Compile and run the C++ code above and we see all 50MB allocated and the final line “Finished allocation” printed. This was without any memory quota. Now let’s cd into /cgroup/memory/mytest and set a 2MB limit using two files: mytest/memory.limit_in_bytes for physical memory and mytest/memory.memsw.limit_in_bytes for memory plus swap combined. The second limit makes this a hard limit, so that once the 2MB memory limit is hit, the process won't start consuming swap instead.

echo 2097152 > /cgroup/memory/mytest/memory.limit_in_bytes
echo 2097152 > /cgroup/memory/mytest/memory.memsw.limit_in_bytes

Now we can run the program above attached to this memory cgroup using the cgexec command.

cgexec -g memory:mytest ./<binary_name>

This runs the program in the mytest memory cgroup, and the process is killed when it hits the 2MB limit.

As with memory, we can create cgroups for CPU, disk IO or network IO and throttle applications according to predefined limits.

Linux Namespaces

Linux processes form a single hierarchy, with all processes rooted at init. Usually, privileged processes in this tree can trace or kill other processes. Linux namespaces let us have many hierarchies of processes, each with its own “subtree”, such that processes in one subtree can't access or even know of those in another.

A namespace wraps a global resource so that it appears to the processes in that namespace that they have their own isolated instance of the resource. Let's take the PID namespace as an example. Without namespaces, all processes descend hierarchically from PID 1 (init). If we create a PID namespace and run a process in it, that first process becomes PID 1 in that namespace. In this case, we have wrapped a global system resource (process IDs). The process that creates the namespace remains in the parent namespace, but makes its child the root of a new process tree.

But this isolation is one-way: processes within the new namespace cannot see the parent process, while processes in the parent namespace can see those in the child namespace. The processes within the new namespace therefore have 2 PIDs: one in the new namespace and one in the global namespace.

To support this, the Linux kernel tracks a process’s PIDs using the upid structure instead of a single pid value. The upid structure records a pid and the namespace in which that pid is valid.

struct upid {
    int nr;                     /* moved from struct pid */
    struct pid_namespace *ns;   /* the namespace this value
                                 * is visible in
                                 */
    ...
};

struct pid {
    atomic_t count;
    struct hlist_head tasks[PIDTYPE_MAX];
    struct rcu_head rcu;
    int level;                  /* the number of upids */
    struct upid numbers[0];
};

Types of namespaces

Linux provides the following namespaces:

cgroup: isolates the cgroup root directory (CLONE_NEWCGROUP)
IPC: isolates System V IPC and POSIX message queues (CLONE_NEWIPC)
Network: isolates network devices, ports etc. (CLONE_NEWNET)
Mount: isolates mount points (CLONE_NEWNS)
PID: isolates process IDs (CLONE_NEWPID)
User: isolates user and group IDs (CLONE_NEWUSER)
UTS: isolates hostname and NIS domain name (CLONE_NEWUTS)

For namespace management, Linux provides the following APIs:

clone() - plain old clone() creates a new process. If we pass one or more CLONE_NEW* flags to clone(), a new namespace is created for each flag and the child process is made a member of those namespaces.
setns() - allows a process to join an existing namespace. The namespace is specified by a file descriptor that refers to one of the /proc/[pid]/ns files.
unshare() - moves the calling process to new namespaces created according to the CLONE_NEW* arguments. More than one such flag can be specified.

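Note: A process cannot change its own PID namespace, so a PID namespace takes effect only for newly created processes. With clone(CLONE_NEWPID), the spawned child has PID 1 in the new namespace. unshare(CLONE_NEWPID), available since Linux 3.8, does not move the caller either: only the children it creates afterwards land in the new namespace.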

Network namespace

Let's say we have a new PID namespace, and a process in it is listening on port 80 for incoming requests. Because the network stack is still shared, every other process on the entire system is prevented from listening on that port. This is not very helpful isolation. This is where network namespaces come in.

Network namespaces let inner processes see a different set of network interfaces, including the lo interface! But this is only half the story. When we create new network namespaces, we MUST also set up virtual network interfaces that span namespaces, along with a routing process running in the global namespace to handle and route traffic to the correct namespace.

Like the ones above, the other namespaces each isolate a specific global resource and restrict the inner processes' access to their own sandbox.

Conclusion

We saw a brief overview of chroot, cgroups and namespaces, which give Linux developers the means to isolate processes into their own “containers”. These technologies are the building blocks of the now ubiquitous Docker and Linux containers. I will try to follow up this article with more specific internals of Docker.

References:

cgroups - ArchWiki (wiki.archlinux.org)
Introduction to Linux Control Groups (Cgroups) (sysadmincasts.com)
namespaces(7) - Linux manual page (man7.org)
PID namespaces in the 2.6.24 kernel [LWN.net] (lwn.net)