This blog post is part of a series on user namespaces in Kubernetes.

Although userns have been in Linux for a long time, limited support for volumes has held back wider adoption in the container world.

Mappings and files

When we create a userns, we need to specify a mapping: which UIDs and GIDs inside the container correspond to which ones outside. For example:

UID inside userns    UID outside userns    count
0                    100000                1

This maps UID 0 inside the userns to UID 100k outside. Processes inside the userns see themselves as UID 0 (even whoami says root), but from the host’s point of view they run as UID 100k.
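To see which mapping a process is actually running with, the kernel exposes it in /proc/self/uid_map, in the same three-column layout (ID inside, ID outside, count). A minimal sketch for reading it; the function name show_uid_map is made up for illustration:

```shell
#!/bin/sh
# Print a uid_map file in the inside/outside/count layout used above.
# Defaults to the current process; pass another path to read a different file.
# show_uid_map is a hypothetical helper name, not a standard tool.
show_uid_map() {
    while read -r inside outside count; do
        echo "inside=$inside outside=$outside count=$count"
    done < "${1:-/proc/self/uid_map}"
}

show_uid_map
```

In a shell that is not inside any userns this typically prints a single identity mapping covering the whole ID range. GIDs work the same way via /proc/self/gid_map.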

Let’s create a userns with those mappings for UIDs/GIDs and play in the console. As root run:

# unshare --user --map-users 0:100000:1 --map-groups 0:100000:1 --setuid 0 --setgid 0
-bash: /root/.bash_profile: Permission denied
# whoami 
root
# touch a
# ls -l a
-rw-r--r-- 1 root root 0 Apr  7 22:02 a

So, inside the userns it seems we are running as root. But let’s check from the host what we see:

$ ps faux
...
100000    771844  0.0  0.0   8432  5268 pts/22   S+   22:01   0:00  \_ -bash
$ ls -l a
-rw-r--r-- 1 100000 100000 0 Apr  7 22:02 a

Files created inside the userns are stored with UID 100k on disk, even though inside the userns they appear owned by root. Two things are happening here:

  • When we write, the UID 0 inside the container is mapped to 100k and written to disk
  • When we read (like the first ls command), the UID of the file is 100k and the reverse mapping is used to show it as UID 0 inside the userns
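The two directions can be sketched as plain arithmetic. This is only a simplified model of the translation, not how the kernel implements it, and the names to_disk and from_disk are made up for illustration:

```shell
#!/bin/sh
# Mapping from the example: inside 0 -> outside 100000, count 1.
INSIDE=0 OUTSIDE=100000 COUNT=1

# Write path: ID as seen inside the userns -> ID stored on disk.
to_disk() {
    if [ "$1" -ge "$INSIDE" ] && [ "$1" -lt $((INSIDE + COUNT)) ]; then
        echo $(($1 - INSIDE + OUTSIDE))
    else
        echo "unmapped"
    fi
}

# Read path: ID stored on disk -> ID as seen inside the userns.
from_disk() {
    if [ "$1" -ge "$OUTSIDE" ] && [ "$1" -lt $((OUTSIDE + COUNT)) ]; then
        echo $(($1 - OUTSIDE + INSIDE))
    else
        echo "unmapped"
    fi
}

to_disk 0          # prints 100000
from_disk 100000   # prints 0
from_disk 1000     # prints unmapped
```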

The immediate question that comes to mind after this is: what happens if we ls a file owned by a UID not in our mapping?

Let’s create a file outside the userns:

$ touch not-mapped
$ ls -ln not-mapped
-rw-rw-r-- 1 1000 1000 0 Apr  7 22:13 not-mapped

Let’s check how we read it from the userns:

# ls -l not-mapped 
-rw-rw-r-- 1 nobody nogroup 0 Apr  7 22:13 not-mapped
# ls -ln not-mapped 
-rw-rw-r-- 1 65534 65534 0 Apr  7 22:13 not-mapped

This is because when a UID/GID is not mapped, the overflow UID/GID is used to represent it (configured in /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid, 65534 by default). My next question was: so, if I run as UID 65534 inside the userns, can I just change any file? The answer is: of course not. That would be a very simple security exploit. But you can try it yourself now, if you want.
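To check which overflow IDs your system uses, you can read both procfs files. A small sketch; the function name show_overflow_ids is made up for illustration:

```shell
#!/bin/sh
# Print the overflow UID and GID the kernel uses for unmapped owners.
# Both paths are standard procfs entries; show_overflow_ids is a
# hypothetical helper name.
show_overflow_ids() {
    for f in /proc/sys/kernel/overflowuid /proc/sys/kernel/overflowgid; do
        if [ -r "$f" ]; then
            echo "$f = $(cat "$f")"
        else
            echo "$f is not readable here"
        fi
    done
}

show_overflow_ids
```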

The problem

Userns gives us better security isolation. But it’s not obvious how to use it with containers that share files.

The first solution that comes to mind is to use the same mapping for all containers that share files. This way, all containers will use the same UID on disk and all will work fine, as long as the UID they run inside the container is the same.

This is a perfectly valid option, but it has several downsides:

  • Flag day (a change that requires all consumers to update at once): If we want to share files with containers not using userns, or just be able to toggle userns on and off, we can’t easily do it. We’d need to chown the volumes every time, which can be very expensive.
  • Lateral movement: if different containers use the same mappings and one escapes, it can read the other’s files, send signals, etc.

This is not perfect, but this is what the first incarnations of the userns KEP in 2016 tried to do. Docker userns remap does exactly this: one mapping for all containers.

The solution: idmap mounts

Shortly after we started the work on userns for Kubernetes, Christian Brauner (Linux VFS maintainer and co-founder of Amutable, where I work) merged support for idmap mounts.

An idmap mount is a mount that does some ID transformation for UIDs and GIDs, just when accessed via that mount. So it’s localized to only accesses via that location and it only lasts as long as the mount.

Remember that a userns transforms UIDs in one direction when writing files and in the other when reading them? An idmap mount does the same kind of transformation. We can combine them to “revert” the effect userns has on file UID/GIDs. We can have UID 0 inside the container write to a file and have UID 0 on the inode of that file written on disk.
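Here is that combination as plain arithmetic, for a userns mapping of 0:100000:1 and an idmap mount that applies the opposite shift. It is a simplified model rather than the kernel's actual implementation, and shift_id is a made-up helper name:

```shell
#!/bin/sh
# shift_id ID FROM TO COUNT: move ID from the range starting at FROM
# to the range starting at TO; IDs outside the range are reported as unmapped.
shift_id() {
    if [ "$1" -ge "$2" ] && [ "$1" -lt $(($2 + $4)) ]; then
        echo $(($1 - $2 + $3))
    else
        echo "unmapped"
    fi
}

# Write path: UID 0 in the container -> host view -> on-disk owner.
host_id=$(shift_id 0 0 100000 1)            # userns: 0 -> 100000
disk_id=$(shift_id "$host_id" 100000 0 1)   # idmap mount: 100000 -> 0
echo "write: container 0 -> host $host_id -> disk $disk_id"

# Read path: on-disk owner -> view through the mount -> container view.
mount_id=$(shift_id "$disk_id" 0 100000 1)        # idmap mount: 0 -> 100000
container_id=$(shift_id "$mount_id" 100000 0 1)   # userns: 100000 -> 0
echo "read: disk $disk_id -> mount $mount_id -> container $container_id"
```

The two shifts cancel out: UID 0 in the container ends up as UID 0 on disk, and a file owned by UID 0 on disk reads back as UID 0 in the container.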

Let’s create some files and directories:

$ mkdir src dst
$ sudo touch src/a
$ ls -l src/
total 0
-rw-r--r-- 1 root root 0 Apr  7 22:32 a

Let’s bind-mount src into dst, using an ID transformation (idmap) for that mount, and then create a userns.

# mount -o bind,X-mount.idmap=b:0:100000:1 ./src/ ./dst/
# unshare --user --map-users 0:100000:1 --map-groups 0:100000:1 --setuid 0 --setgid 0
-bash: /root/.bash_profile: Permission denied
# ls -l dst/
total 0
-rw-r--r-- 1 root root 0 Apr  7 22:32 a
# touch dst/b
# ls -l dst/b
-rw-r--r-- 1 root root 0 Apr  7 22:42 dst/b

The file we created as root on the host still shows as root inside the container, even though we’re running with userns and a mapping. Now let’s check from the host how the file we just created inside the container looks:

$ ls -l src/
total 0
-rw-r--r-- 1 root root 0 Apr  7 22:32 a
-rw-r--r-- 1 root root 0 Apr  7 22:42 b
$ ls -l dst/
total 0
-rw-r--r-- 1 100000 100000 0 Apr  7 22:32 a
-rw-r--r-- 1 100000 100000 0 Apr  7 22:42 b

Through src (no idmap), both files belong to root. Through dst (with idmap), they show as 100k. That’s why inside the userns — where 100k maps back to 0 — they appear owned by root.

The details can get tricky, but the underlying idea is simple: a userns transforms UID/GIDs when we read/write files, and idmap mounts let us undo those transformations.

Using idmap mounts, containers can share files no matter which userns mappings a container is using. There are still more pieces of the puzzle to have a full solution for Kubernetes, but don’t worry, I’ll explain them in the next post.