Container Deep Diving: Part 2
Okay, so here we are in part 2 of the container post series.
At the end of part 1 we were able to identify the problems of just using chroot to achieve process isolation on a machine.
With this post the goal is to have the same functionality, running bash with test-root as the new root directory, but built with the same technologies that containers use.
Once that is running we will address the problems of seeing all the network interfaces as well as still being able to kill arbitrary processes on the machine.
A new root dir
This time around the complexity will increase quite a bit.
While we will ultimately still use chroot, the idea is to execute it inside a pre-secured environment. A container.
So how can this secure environment be created?
Enter kernel namespaces. From NAMESPACES(7):
A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.
....
The following table shows the namespace types available on Linux. The second column of the table shows the flag value that is used to specify the namespace type in various APIs.
The third column identifies the manual page that provides details on the namespace type. The last column is a summary of the resources that are isolated by the namespace type.
Namespace  Flag             Page                    Isolates
Cgroup     CLONE_NEWCGROUP  cgroup_namespaces(7)    Cgroup root directory
IPC        CLONE_NEWIPC     ipc_namespaces(7)       System V IPC, POSIX message queues
Network    CLONE_NEWNET     network_namespaces(7)   Network devices, stacks, ports, etc.
Mount      CLONE_NEWNS      mount_namespaces(7)     Mount points
PID        CLONE_NEWPID     pid_namespaces(7)       Process IDs
Time       CLONE_NEWTIME    time_namespaces(7)      Boot and monotonic clocks
User       CLONE_NEWUSER    user_namespaces(7)      User and group IDs
UTS        CLONE_NEWUTS     uts_namespaces(7)       Hostname and NIS domain name
The namespace that we need for the initial step is the Mount namespace, the first of all the namespaces that were introduced to Linux.
According to mount_namespaces(7) this namespace was introduced in Linux 2.4.19, while the fs/namespace.c file seems to have been introduced in 2.4.12 by Al Viro and somehow made it into the changelog in 2.4.11.
This happened more than 20 years ago and is the base for the modern container stack.
But how exactly can Mount namespaces help us out in restricting a process to a new root directory?
By using a technique called bind-mount. With bind-mounts it is possible to have a directory on the machine show up as a mount point. Basically a portal into a directory which we can then move into our secure environment before starting a chroot in it.
In the root namespace of the machine (every process on a modern Linux system is inside a namespace) it is just a normal directory that can be inspected.
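As a small aside, the claim that every process already lives in a namespace can be checked without root. The sketch below assumes a Linux system with /proc mounted, and the helper name ns_id is made up for illustration. Each entry under /proc/<pid>/ns is a symbolic link whose target names the namespace type and an inode number, and two processes share a namespace exactly when those targets match.

```shell
# ns_id is a hypothetical helper: print the namespace identifier of a process.
# Usage: ns_id <pid|self> <type>, where type is e.g. mnt, pid, net, uts, ipc, user
ns_id() {
    readlink "/proc/$1/ns/$2" 2>/dev/null || echo "unavailable"
}

# Print the namespaces of the current shell. The lookup runs in a child
# process, which inherits the namespaces of the shell that started it.
for type in mnt pid net; do
    printf '%-4s %s\n' "$type" "$(ns_id self "$type")"
done
```

Running the same loop again inside the secure environment built below should show a different identifier for every namespace type that was unshared.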
So how does this look in commands?
# Precheck all the available mounts in the root namespace
findmnt
# Step 1: create test-root with bash binary and dependencies
cd /
mkdir -p test-root/{bin,proc,old-root}
cp /bin/bash test-root/bin/bash
cp -a /usr /lib /lib64 test-root
# Step 2: jump into a new namespaced environment -> our secure environment
unshare --mount
# Step 3: create a bind mount for test-root
mount --bind test-root test-root
# Step 4: switch the root folder to test-root and keep the old root mounted at old-root
## also mount the process information via the procfs after switching the root
cd test-root
pivot_root . old-root
mount -t proc proc /proc
# Step 5: unmount the old root from the secure environment so that only the new root is available
## `--lazy` is needed as our /bin/bash process is still attached to the old mount
## if we made the namespace persistent and executed a process in it later, the old mount would not be available in it
umount --lazy old-root
# Step 6: chroot into the new secure environment
exec chroot . /bin/bash
Here is the process as a small video:
Nice, the first step is done and we are on par with part 1. But now let's reap some benefits when we get to the unsolved isolation issues.
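Before moving on, one way to confirm that unshare --mount really creates a separate Mount namespace is to compare namespace identifiers from outside and inside. This is only a sketch: the variable names are made up, and when it is not run as root it assumes that unprivileged User namespaces are enabled so that unshare --user --map-root-user is permitted. If unshare is refused, the script reports "unavailable" instead of failing.

```shell
# Assumption: without root, fall back to a User namespace so unshare is permitted.
if [ "$(id -u)" -eq 0 ]; then extra=""; else extra="--user --map-root-user"; fi

# Identifier of the Mount namespace we start in.
outer=$(readlink /proc/self/ns/mnt 2>/dev/null) || outer="unavailable"

# $extra is intentionally unquoted so that it splits into separate flags.
inner=$(unshare $extra --mount readlink /proc/self/ns/mnt 2>/dev/null) || inner="unavailable"

echo "outside: $outer"
echo "inside:  $inner"
```

When unshare succeeds, the two lines show different inode numbers, which is the kernel's way of saying that the two processes no longer share a mount table.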
Processes
When creating the secure environment, the binary that actually did the namespace switching is unshare.
In Step 2 it is called with --mount, which will only create a new Mount namespace.
To further restrict the environment we can now add the PID namespace, which separates all the processes into a new namespace of their own.
At this point it is important to note that the information for PID and Mount namespaces is stored in separate structures inside the kernel. This applies to all namespaces and means that you can mix and match them as you desire, giving the environments the exact configurations that you want.
For our example this means that we can simply continue after the last command before entering the chroot.
To create the new PID namespace let's simply call unshare again but use the --pid flag this time.
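Without further flags, the first command started from the new shell (ls in this case) still runs, but every command after it fails, typically with bash: fork: Cannot allocate memory.
Hm, so what is the issue here? Let's go digging.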
UNSHARE(1)
-p, --pid[=file]
Unshare the PID namespace. If file is specified, then a persistent namespace is created by a bind mount. (Creation of a persistent PID namespace will fail if the --fork option
is not also specified.)
See also the --fork and --mount-proc options.
....
-f, --fork
Fork the specified program as a child process of unshare rather than running it directly. This is useful when creating a new PID namespace. Note that when unshare is waiting
for the child process, then it ignores SIGINT and SIGTERM and does not forward any signals to the child. It is necessary to send signals to the child process.
....
--mount-proc[=mountpoint]
Just before running the program, mount the proc filesystem at mountpoint (default is /proc). This is useful when creating a new PID namespace. It also implies creating a new
mount namespace since the /proc mount would otherwise mess up existing programs on the system. The new proc filesystem is explicitly mounted as private (with
MS_PRIVATE|MS_REC).
--mount-proc is not relevant for us as we have already created a new Mount namespace and do not want another one. But more on that in a second.
The fix is the --fork flag.
So why is this?
Checking the man page of pid_namespaces(7) gives the answer:
The namespace init process
The first process created in a new namespace (i.e., the process created using clone(2) with the CLONE_NEWPID flag, or the first child created by a process after a call to unshare(2) using the CLONE_NEWPID flag) has the PID 1, and is the "init" process for the namespace (see init(1)). This process becomes the parent of any child processes that are orphaned because a process that resides in this PID namespace terminated (see below for further details).
The important part is "or the first child created by a process after a call to unshare(2)".
The first process that was called after unshare is ls. So ls becomes PID 1 of the new namespace, runs and then exits.
Once the init process of a PID namespace has exited, the kernel no longer allows new processes to be created in that namespace, so every later fork from our shell fails.
In order to get the correct behaviour we want bash to be forked when calling unshare, therefore making it the init process of the new namespace.
So the correct way is to run unshare --pid --fork.
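The difference between the two invocations can be made visible by asking a shell for its own PID. The following is a minimal sketch, not part of the original walkthrough: the variable names are made up, and without root it assumes that unprivileged User namespaces are enabled. If unshare is refused, the values fall back to "unavailable".

```shell
# Assumption: without root, fall back to a User namespace so unshare is permitted.
if [ "$(id -u)" -eq 0 ]; then extra=""; else extra="--user --map-root-user"; fi

# With --fork, unshare forks before exec, so the shell is the first child
# in the new PID namespace and sees itself as PID 1.
forked=$(unshare $extra --pid --fork sh -c 'echo $$' 2>/dev/null) || forked="unavailable"

# Without --fork, unshare execs the shell in place. The shell itself stays
# in the old PID namespace and keeps its ordinary PID.
direct=$(unshare $extra --pid sh -c 'echo $$' 2>/dev/null) || direct="unavailable"

echo "with --fork:    $forked"
echo "without --fork: $direct"
```

With --fork the shell reports 1. Without it the shell reports an ordinary PID from the old namespace, and only its first child would become PID 1, which is exactly the trap described above.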
But at this point there is still one thing missing and that relates to the --mount-proc flag mentioned above.
When checking the active processes with ps -aux the following is printed on a Fedora Workstation:
-bash-5.1# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
0 1 0.7 0.8 171724 16900 ? Ss 11:35 0:01 /usr/lib/systemd/systemd rhgb --switched-root --sys
...........
0 156 0.0 0.0 0 0 ? I< 11:35 0:00 [ipv6_addrconf]
0 157 0.0 0.0 0 0 ? I 11:35 0:00 [kworker/u8:8]
0 701 0.0 0.6 393636 13424 ? Ssl 11:35 0:00 /usr/libexec/udisks2/udisksd
983 703 0.0 0.1 84648 3676 ? S 11:35 0:00 /usr/sbin/chronyd -F 2
0 705 0.0 0.4 238196 8248 ? Ssl 11:35 0:00 /usr/libexec/upowerd
0 706 0.0 0.7 255924 14712 ? Ssl 11:35 0:00 /usr/sbin/abrtd -d -s
70 708 0.0 0.0 8516 356 ? S 11:35 0:00 avahi-daemon: chroot helper
1000 1351 0.6 0.5 310964 10624 ? Sl 11:35 0:01 ibus-daemon --panel disable -r --xim
1000 1353 0.1 3.5 581004 71036 ? Ssl 11:35 0:00 /usr/libexec/gsd-xsettings
...........
0 1798 0.0 0.4 18684 9744 ? Ss 11:37 0:00 /usr/lib/systemd/systemd-hostnamed
0 1822 0.0 0.2 8804 5684 ? S 11:38 0:00 -bash
0 1848 0.0 0.2 8804 5720 ? S 11:38 0:00 -bash
0 1877 0.2 0.2 8804 5680 ? S 11:38 0:00 -bash
0 1904 0.2 0.2 8804 5656 ? S 11:38 0:00 -bash
0 1933 0.0 0.0 5580 252 ? S 11:39 0:00 unshare --pid --fork
0 1934 0.0 0.1 7380 3984 ? S 11:39 0:00 -bash
0 1935 0.0 0.1 9888 2368 ? R+ 11:39 0:00 ps -aux
All processes are displayed.
Why does this happen? Because the procfs we mounted at /proc earlier still belongs to the previous PID namespace.
Skipping the --mount-proc flag leaves us in the desired Mount namespace but also does not automatically mount the procfs that belongs to our new PID namespace.
Simply unmounting the old procfs and mounting the new one with umount /proc && mount -t proc proc /proc will fix this and voila, ps -aux now prints:
-bash-5.1# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
0 1 0.0 0.2 7380 4088 ? S 11:39 0:00 -bash
0 5 0.0 0.1 9888 2380 ? R+ 11:41 0:00 ps -aux
and we will not be able to kill arbitrary processes anymore.
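The effect of a fresh procfs can also be reproduced in one non-interactive call. This sketch is independent of the test-root setup: it creates throwaway Mount and PID namespaces, mounts a new procfs on top of /proc and lists the numeric entries, each of which stands for one visible process. The variable names are made up, and without root it assumes that unprivileged User namespaces are enabled. Some sandboxes refuse the proc mount, in which case the result is "unavailable".

```shell
# Assumption: without root, fall back to a User namespace so unshare is permitted.
if [ "$(id -u)" -eq 0 ]; then extra=""; else extra="--user --map-root-user"; fi

# New Mount and PID namespaces in one call; the fresh procfs is mounted by hand
# (instead of using --mount-proc) to mirror the manual step from the text.
visible=$(unshare $extra --mount --pid --fork sh -c '
    mount -t proc proc /proc || exit 1
    # Every numeric directory in /proc is one visible process.
    for entry in /proc/[0-9]*; do echo "${entry#/proc/}"; done
' 2>/dev/null) || visible="unavailable"

echo "PIDs visible inside: $visible"
```

Because the loop uses only shell builtins, the only process left in the namespace at that point is the shell itself, so the list should contain just PID 1.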
Now there is only 1 thing left to address from part 1.
Network
A quick man unshare reveals:
-n, --net[=file]
Unshare the network namespace. If file is specified, then a persistent namespace is created by a bind mount.
And that's it. After calling unshare --net we are in an environment that is isolated at the filesystem, process and network level, notably without any network access.
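To see what "without any network access" means in practice, the interfaces of a fresh Network namespace can be listed through /proc/net/dev, which always reflects the namespace of the process reading it. Again this is only a sketch with made-up variable names, and without root it assumes that unprivileged User namespaces are enabled.

```shell
# Assumption: without root, fall back to a User namespace so unshare is permitted.
if [ "$(id -u)" -eq 0 ]; then extra=""; else extra="--user --map-root-user"; fi

# /proc/net/dev lists the interfaces of the reading process's Network namespace.
# The first two lines are headers; the interface name precedes the colon.
ifaces=$(unshare $extra --net cat /proc/net/dev 2>/dev/null \
    | awk -F: 'NR > 2 { gsub(/ /, "", $1); print $1 }')
[ -n "$ifaces" ] || ifaces="unavailable"

echo "interfaces inside: $ifaces"
```

A new Network namespace typically contains only the loopback device lo, and even that one starts out in the down state. Depending on which tunnel modules are loaded, the kernel may add a few fallback tunnel devices as well.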
Here are all the commands again:
# Step 1: create test-root with bash binary and dependencies
cd /
mkdir -p test-root/{bin,proc,old-root}
cp /bin/bash test-root/bin/bash
cp -a /usr /lib /lib64 test-root
# Step 2: jump into a new namespaced environment -> our secure environment
unshare --mount
# Step 3: create a bind mount for test-root
mount --bind test-root test-root
# Step 4: switch the root folder to test-root and keep the old root mounted at old-root
## also mount the process information via the procfs after switching the root
cd test-root
pivot_root . old-root
mount -t proc proc /proc
# Step 5: unmount the old root from the secure environment so that only the new root is available
## `--lazy` is needed as our /bin/bash process is still attached to the old mount
umount --lazy old-root
# Step 6: enter a new PID namespace and remount /proc so that it only shows the new namespace
unshare --pid --fork
umount /proc && mount -t proc proc /proc
# Step 7: enter a new network namespace
unshare --net
# Step 8: chroot into the new secure environment
exec chroot . /bin/bash
Recap
Let's look at the environment we have now: a folder mounted as the root, a new process hierarchy and an isolated network.
Quite a few moving parts, but really elegant when they work together like this and become the modern container.
What now?
The current Linux kernel 5.16 has 8 namespace types and this post only covered 3 of them.
The User and Cgroup namespaces are especially important for modern containers.
The first one can map unprivileged users from the root namespace to the root user inside a namespace, which is really important for a safe container. The Cgroup namespace virtualizes a process's view of the cgroup hierarchy, while cgroups themselves are what limit the amount of resources (RAM and CPU for example) a container will get.
You should definitely try extending the example from this post with those two. Check man namespaces for all the documentation needed for this.
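As a starting point for the User namespace, here is a minimal sketch of the mapping described above. It uses the --user and --map-root-user flags of unshare and assumes that unprivileged User namespaces are enabled on the machine. The variable names are made up for illustration.

```shell
# UID of the calling user in the root namespace.
outside=$(id -u)

# --map-root-user maps the calling user to UID 0 inside the new User namespace.
inside=$(unshare --user --map-root-user id -u 2>/dev/null) || inside="unavailable"

echo "UID outside: $outside"
echo "UID inside:  $inside"
```

When the call is permitted, the inner id reports 0 even for an unprivileged caller. That root identity only carries privileges over resources owned by the new User namespace, not over the host.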
As far as this series of posts is concerned the basics are now covered. Part 3 is going to dig into the current ecosystem. Think podman vs docker.
Until then, happy hacking!