User Namespaces in Kubernetes, Part II: Mappings and File Ownership

This blog post is part of a series on user namespaces in Kubernetes.

Although user namespaces (userns) have been in Linux for a long time, limited support for volumes has held back wider adoption in the container world.

Mappings and files

When we create a userns, we need to specify a mapping: which UIDs and GIDs inside the container correspond to which ones outside. For example:

UID inside userns	UID outside userns	count
0	100000	1

This maps UID 0 inside the userns to UID 100000 (100k) outside. Processes inside the userns see themselves as UID 0 (even whoami says root), but from the host’s point of view they run as UID 100k. Because they are UID 0 inside the userns, tools that expect to run as root, like apt, can also run inside the container.
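To make the arithmetic behind a mapping concrete, here is a minimal sketch in shell. The helper name to_host_uid is made up for this post; it only assumes the three-field layout from the table above (inside start, outside start, count), which is also the layout the kernel exposes in /proc/<pid>/uid_map.

```shell
# to_host_uid is a hypothetical helper, not an existing tool.
# Usage: to_host_uid INSIDE_UID [MAP_FILE]
# MAP_FILE holds lines of "inside-start outside-start count".
# It defaults to the map of the current process.
to_host_uid() {
    awk -v uid="$1" '
        # The UID falls inside this range: shift it by the same offset.
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        # No range matched: report that through the exit status.
        END { if (!found) exit 1 }
    ' "${2:-/proc/self/uid_map}"
}

# Try it with the mapping from the table above.
map_file=$(mktemp)
printf '0 100000 1\n' > "$map_file"
to_host_uid 0 "$map_file"    # prints 100000
```

Called without a second argument, the helper reads /proc/self/uid_map, so running it inside a userns should translate against that userns’s own mapping.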

Let’s create a userns with those mappings for UIDs/GIDs and play in the console. As root, run:

# unshare --user --map-users 0:100000:1 --map-groups 0:100000:1 --setuid 0 --setgid 0
-bash: /root/.bash_profile: Permission denied
# whoami 
root
# cd /tmp
# touch a
# ls -l a
-rw-r--r-- 1 root root 0 Apr  7 22:02 a

So, inside the userns it seems we are running as root. But let’s check from the host what we see:

$ ps faux
...
100000    771844  0.0  0.0   8432  5268 pts/22   S+   22:01   0:00  \_ -bash
$ cd /tmp
$ ls -l a
-rw-r--r-- 1 100000 100000 0 Apr  7 22:02 a

Files created inside the userns are stored with UID 100k on disk, even though inside the userns they appear owned by root. Two things are happening here:

When we write, UID 0 inside the container is mapped to 100k, and 100k is what gets written to disk.
When we read (the first ls command, run inside the userns), the mapping is used in the other direction: from 100k (the UID in the inode) to 0. That is why we see the file as owned by root inside the container.

The immediate question that comes to mind, at least for me, is: what happens if we ls a file owned by a UID not in our mapping?

Let’s create a file outside the userns:

$ touch not-mapped
$ ls -ln not-mapped
-rw-rw-r-- 1 1000 1000 0 Apr  7 22:13 not-mapped

Let’s check how we read it from the userns:

# ls -l not-mapped 
-rw-rw-r-- 1 nobody nogroup 0 Apr  7 22:13 not-mapped
# ls -ln not-mapped 
-rw-rw-r-- 1 65534 65534 0 Apr  7 22:13 not-mapped

This is because when a UID/GID is not mapped, the overflow UID/GID is used to represent it (configured in /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid; 65534 by default). My next question was: so, if I run as UID 65534 inside the userns, can I just change any file? The answer is: of course not. That would be a very simple security exploit. But you can try it yourself now, if you want.
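The read direction, including the overflow fallback, can be sketched the same way. Again, from_host_uid is a made-up helper name, and the 65534 default is just the usual kernel default for the overflow UID; this only models what ls displays, not any permission check.

```shell
# from_host_uid is a hypothetical helper, not an existing tool.
# Usage: from_host_uid HOST_UID [MAP_FILE] [OVERFLOW_UID]
# MAP_FILE holds lines of "inside-start outside-start count".
from_host_uid() {
    awk -v uid="$1" -v overflow="${3:-65534}" '
        # The host UID falls inside the outside range: map it back.
        uid >= $2 && uid < $2 + $3 { print $1 + (uid - $2); found = 1; exit }
        # Unmapped UIDs are displayed as the overflow UID.
        END { if (!found) print overflow }
    ' "${2:-/proc/self/uid_map}"
}

# Same mapping as in the example: 0 inside is 100000 outside.
map_file=$(mktemp)
printf '0 100000 1\n' > "$map_file"
from_host_uid 100000 "$map_file"   # prints 0 (shown as root)
from_host_uid 1000 "$map_file"     # prints 65534 (shown as nobody)
```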

The problem

Userns improves isolation. But it’s not obvious how to use it with containers that share files.

The first solution that comes to mind is to use the same mapping for all containers that share files. This way, all containers will use the same UID on disk and everything will work fine, as long as they run as the same UID inside the container.

This is a valid option, but it has several downsides:

Flag day (a change that requires all consumers to update at once): If we want to share files with containers not using userns, or just be able to toggle userns on and off, we can’t easily do it. We’d need to chown the volumes every time, which can be very expensive.
Lateral movement: if different containers use the same mappings and one escapes, it can read the other’s files, send signals, etc.

This is not perfect, but it is still far better than running as plain root. Docker userns remap does exactly this: one mapping for all containers. This is also what the first incarnations of the userns KEP tried to do in 2016 (as did Phase I of the revisited userns KEP, to some extent).

Because of these downsides, userns was not adopted in Kubernetes for a long time. To address these limitations, we need changes in the kernel.

The solution: idmap mounts

Shortly after we started the work on userns for Kubernetes, Christian Brauner (Linux VFS maintainer and co-founder of Amutable, where I work) merged support for idmap mounts.

An idmap mount is a mount that applies an ID transformation to UIDs and GIDs, but only for accesses that go through that mount. The transformation is local to that mount point and lasts only as long as the mount exists.

Remember that a userns transforms UIDs in one direction when writing files and in the other when reading them? An idmap mount does the same kind of transformation. We can combine them to “revert” the effect userns has on file UID/GIDs. We can have UID 0 inside the container write to a file and have UID 0 on the inode of that file written to disk.

Let’s create some files and directories:

$ sudo mkdir src dst
$ sudo touch src/a
$ ls -l src/
total 0
-rw-r--r-- 1 root root 0 Apr  7 22:32 a

Now, as root, let’s bind-mount src onto dst, using an ID transformation (idmap) for that mount, and then create a new userns:

# mount -o bind,X-mount.idmap=b:0:100000:1 ./src/ ./dst/
# unshare --user --map-users 0:100000:1 --map-groups 0:100000:1 --setuid 0 --setgid 0
-bash: /root/.bash_profile: Permission denied
# ls -l dst/
total 0
-rw-r--r-- 1 root root 0 Apr  7 22:32 a
# touch dst/b
# ls -l dst/b
-rw-r--r-- 1 root root 0 Apr  7 22:42 dst/b

The file we created as root on the host still shows as root inside the container, even though we’re running with userns and a mapping. Let’s see what the host shows for the file we just created inside the container:

$ ls -l src/
total 0
-rw-r--r-- 1 root root 0 Apr  7 22:32 a
-rw-r--r-- 1 root root 0 Apr  7 22:42 b
$ ls -l dst/
total 0
-rw-r--r-- 1 100000 100000 0 Apr  7 22:32 a
-rw-r--r-- 1 100000 100000 0 Apr  7 22:42 b

Through src (no idmap), both files belong to root. Through dst (with idmap), they show as 100k. That’s why inside the userns (where 100k maps back to 0) they appear owned by root.

The details can get tricky, but the underlying idea is simple: a userns transforms UIDs/GIDs when we read/write files, and idmap mounts let us undo those transformations.
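As a rough sketch of how the two transformations cancel out on the read path, the snippet below chains them for the single-ID range used in this example. The function names are made up for illustration, and the direction of each step is modelled on the ls output above (on-disk 0 shows as 100k through dst, and 100k shows as 0 inside the userns). IDs outside the range are not modelled.

```shell
# Shared range from the example: 0 <-> 100000, one ID wide.
INNER=0
OUTER=100000
COUNT=1

# On-disk UID -> UID shown through the idmapped mount (dst in the example).
via_idmap_mount() {
    if [ "$1" -ge "$INNER" ] && [ "$1" -lt $((INNER + COUNT)) ]; then
        echo $(( $1 - INNER + OUTER ))
    else
        return 1   # IDs outside the range are not modelled in this sketch
    fi
}

# UID shown through the mount -> UID seen inside the userns.
via_userns() {
    if [ "$1" -ge "$OUTER" ] && [ "$1" -lt $((OUTER + COUNT)) ]; then
        echo $(( $1 - OUTER + INNER ))
    else
        return 1   # unmapped IDs are not modelled in this sketch
    fi
}

# Chain both steps: what a process in the container sees for an on-disk UID.
seen_in_container() {
    mounted=$(via_idmap_mount "$1") || return 1
    via_userns "$mounted"
}

seen_in_container 0   # prints 0: root on disk is still root in the container
```

The intermediate value (100000) is what the host sees when it lists dst, which matches the last ls output above.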

We can use idmap mounts for other stuff too. In fact, other tools (like systemd-homed) use them in a different way. But that is out of scope for this post :)

Using idmap mounts, containers can share files no matter which userns mappings a container is using. There are still more pieces of the puzzle to have a full solution for Kubernetes, but don’t worry, I’ll explain them in the next post.
