This blog post is part of a series on user namespaces in Kubernetes.

In the previous post, we saw how idmap mounts let containers with different userns mappings share volumes. Now let’s see what other questions we needed to answer for a Kubernetes implementation:

  • Who decides the mapping: the kubelet or the runtime? Kubernetes supports running different runtimes on one node, so the kubelet needs to decide the mappings. Otherwise, runtimes have no way to know if a range is already used by another runtime.
  • How large should the mapping be for each pod? Most container images already use IDs up to 65535. If a UID in use is not mapped, it is shown as the overflow ID (65534, usually nobody) and can’t be modified. Mapping 0-65535 is a sensible choice here, and it also divides the 32-bit UID space evenly.
  • How do we choose which UID/GID range a pod will use? Each pod gets a unique ID range, chosen on the node at pod creation time, that doesn’t overlap with the ranges used by other pods. After the last post, we know we can do that without issues if we use idmap mounts. This gives pods better isolation in case of a container breakout: they can’t read/write inodes owned by a different UID/GID (unless the inodes grant permissions to others), they can’t send signals to processes of other pods, etc. Furthermore, we can also reserve a separate range for the host’s files and processes, extending the same isolation to the host.

The implementation

The UID/GID space in Linux is 32 bits. We divide the ID space into chunks of 16 bits each:

  • The range 0-65535 (the first 16 bits) is reserved for the host. This is so the host’s files and processes have no overlap with pods’ files and processes.
  • The rest is available for pods in chunks of 16 bits each. This allows running ~65k pods (2³² / 2¹⁶ = 2¹⁶) per node, far more than the default limit of 110 pods that Kubernetes has today.
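The arithmetic behind the chunks can be sketched in a few lines of Go (a hypothetical helper to illustrate the scheme, not the kubelet’s actual code):

```go
package main

import "fmt"

const chunkSize = 1 << 16 // 65536 IDs per pod

// hostIDForPod returns the first host UID/GID of the chunk assigned to
// the n-th pod slot. Slot 0 (host IDs 0-65535) is reserved for the host
// itself, so pod slots start at 1.
func hostIDForPod(slot uint32) uint32 {
	return slot * chunkSize
}

func main() {
	// The first pod slot maps container ID 0 to host ID 65536.
	fmt.Println(hostIDForPod(1)) // 65536
	// 2^32 / 2^16 = 65536 chunks in total, one of them kept for the host.
	fmt.Println((1 << 32) / chunkSize) // 65536
}
```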

Using fixed-size chunks lets us avoid fragmentation of the ID space. With variable-size ranges, three pods could claim small consecutive ranges, and after deletion the gaps might be too small for a new pod that needs a larger range. Fixed-size chunks guarantee that every freed slot fits any new pod.

To track which 16-bit range is used by a pod, we use a bitmap. We can track the full 32-bit ID space in ~8KB (2³² / 2¹⁶ / 8 / 1024 = 8).
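A minimal sketch of such a bitmap allocator looks like this (names and error handling are made up for illustration; the kubelet’s real code differs):

```go
package main

import (
	"errors"
	"fmt"
)

const (
	chunkSize = 1 << 16               // 65536 IDs per pod
	numChunks = (1 << 32) / chunkSize // 65536 chunks in the 32-bit space
)

// idBitmap tracks which 16-bit chunk is in use: one bit per chunk,
// 65536 bits = 8 KiB.
type idBitmap [numChunks / 8]byte

// allocate finds a free chunk, marks it used and returns its first host
// ID. Chunk 0 is reserved for the host, so the search starts at 1.
func (b *idBitmap) allocate() (uint32, error) {
	for chunk := uint32(1); chunk < numChunks; chunk++ {
		byteIdx, bit := chunk/8, byte(1)<<(chunk%8)
		if b[byteIdx]&bit == 0 {
			b[byteIdx] |= bit
			return chunk * chunkSize, nil
		}
	}
	return 0, errors.New("no free ID range")
}

// release returns a chunk (identified by its first host ID) to the pool.
func (b *idBitmap) release(hostID uint32) {
	chunk := hostID / chunkSize
	b[chunk/8] &^= byte(1) << (chunk % 8)
}

func main() {
	var b idBitmap
	first, _ := b.allocate()
	second, _ := b.allocate()
	fmt.Println(first, second) // 65536 131072
	b.release(first)
	third, _ := b.allocate() // a freed slot fits any new pod, so it is reused
	fmt.Println(third)       // 65536
}
```

Because every chunk has the same size, `release` followed by `allocate` can always hand the freed slot to the next pod, which is exactly why fixed-size chunks avoid fragmentation.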

Once we allocate a range for a specific pod, we send the information to the container runtime. Let’s use containerd and runc as an example, although with CRI-O and crun the story is basically the same.

Containerd downloads the container image, creates the root filesystem (rootfs) using the requested mappings, and writes all that information into the config.json that runc uses to create the containers (these files follow the OCI runtime specification). This includes using idmap mounts for the container’s mounts.
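To make the hand-off concrete, here is a trimmed, illustrative config.json fragment. The paths and ID values are made up; the per-mount uidMappings/gidMappings fields are the OCI runtime-spec way of requesting an idmap mount, alongside the pod-level mappings under linux:

```json
{
  "linux": {
    "uidMappings": [
      { "containerID": 0, "hostID": 65536, "size": 65536 }
    ],
    "gidMappings": [
      { "containerID": 0, "hostID": 65536, "size": 65536 }
    ]
  },
  "mounts": [
    {
      "destination": "/data",
      "type": "bind",
      "source": "/var/lib/kubelet/pods/example-pod/volumes/data",
      "options": ["rbind"],
      "uidMappings": [
        { "containerID": 0, "hostID": 65536, "size": 65536 }
      ],
      "gidMappings": [
        { "containerID": 0, "hostID": 65536, "size": 65536 }
      ]
    }
  ]
}
```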

That’s the happy path from kubelet to a running container. Two things along the way deserve a closer look.

Avoid privilege escalation using new binaries

The rootfs typically uses overlayfs. Overlayfs takes the container base image and adds a scratch space (called the upperdir) to track any modifications done by the container. The base image can be shared across containers and stays untouched; runtime modifications go into the upperdir. Overlayfs exposes a merged view of both to the container: the original container image with the modifications made in the upperdir.

When adding idmap mount support for the rootfs, we made sure the rootfs image sits in a directory only accessible by the GID allocated for the container on the host. No other process on the host can see the files created in the container image (because no other process has this GID).

We also don’t create an idmap mount of the upperdir. So any binary the container drops into the rootfs is seen from the host as owned by the container GID, not root.

Because the rootfs is only visible to that GID and any new binaries aren’t owned by root on the host, we close the door on privilege-escalation attacks where a host-side process tries to exec a set-user-ID binary from the rootfs.

Honor idmap mounts or fail

The OCI runtime-spec mandates that runtimes ignore unknown fields. One field we recently added is the per-mount mapping needed to create an idmap mount. If that field is ignored, no idmap mounts are used. And as we saw in the previous post, the file ownership on disk will be whatever unique UID the kubelet found for us. The net result is that the container can’t access its files.

To avoid that scenario, we need to know whether the runc version running supports idmap mounts. So we added idmap mount information to the features subcommand. Running runc features shows a mountExtensions field that indicates whether idmap is supported (an absent field is treated as not supported).

This way, containerd can query runc and only ask it to create the container if idmap mounts are supported. Otherwise, it just returns an error to avoid the volume ownership problem.
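A runtime-manager doing this check might parse the features JSON along these lines (a sketch assuming the field layout of the OCI features document, with only the idmap-related subset modeled; absent means unsupported):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// features models the minimal subset of the `runc features` JSON output
// needed to detect idmap mount support. A pointer lets us distinguish
// "absent" from "false": an old runc that predates the field simply
// won't emit it.
type features struct {
	Linux struct {
		MountExtensions struct {
			IDMap struct {
				Enabled *bool `json:"enabled"`
			} `json:"idmap"`
		} `json:"mountExtensions"`
	} `json:"linux"`
}

// supportsIDMapMounts reports whether a features JSON document declares
// idmap mount support. Any parse error or absent field is treated as
// "not supported", which is the safe default.
func supportsIDMapMounts(raw []byte) bool {
	var f features
	if err := json.Unmarshal(raw, &f); err != nil {
		return false
	}
	return f.Linux.MountExtensions.IDMap.Enabled != nil &&
		*f.Linux.MountExtensions.IDMap.Enabled
}

func main() {
	newRunc := []byte(`{"linux":{"mountExtensions":{"idmap":{"enabled":true}}}}`)
	oldRunc := []byte(`{"linux":{}}`) // field absent: predates idmap support
	fmt.Println(supportsIDMapMounts(newRunc)) // true
	fmt.Println(supportsIDMapMounts(oldRunc)) // false
}
```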

As far as I know, the spec decision to ignore unknown fields is historical. At the time, a lot of innovation was happening at the OCI runtime level. It was expected for different runtimes to expose runtime-specific options that other runtimes could safely ignore for portability. This is not common nowadays, but it was hard to predict at the time.

A note on some configuration knobs

The description above makes the implementation sound rigid, but that is only to keep things simple. There are a few things that are configurable:

  • The mapping size doesn’t need to be 16 bits; it can be configured to be larger. This is useful to run Kubernetes inside Kubernetes, for example. The size is still fixed (to avoid fragmentation), and the first range (0 to the mapping size) is reserved for the host, so its files and processes don’t overlap with any container.
  • You can restrict the kubelet to allocate IDs in a specific range. This doesn’t change much internally: the bitmap is efficient enough to model the whole 32-bit space anyway.
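The trade-off of a larger mapping size is simply fewer pod ranges per node, as this hypothetical helper (not the kubelet’s code) shows:

```go
package main

import "fmt"

// maxPods returns how many pod ID ranges fit in the 32-bit ID space for
// a given per-pod mapping size, keeping one range reserved for the host.
func maxPods(mappingSize uint64) uint64 {
	return (1<<32)/mappingSize - 1
}

func main() {
	fmt.Println(maxPods(1 << 16)) // default 16-bit chunks: 65535 pod ranges
	fmt.Println(maxPods(1 << 20)) // larger mappings: 4095 pod ranges
}
```

Even the smaller number is well above the pods-per-node limits used in practice.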

Wrap-up

Hopefully now you have a better sense of how userns has been implemented in Kubernetes and why we made several decisions along the way. Maybe it was a fun read too? :)