This blog post is part of a series on user namespaces in Kubernetes.
In the previous post, we saw how idmap mounts let containers with different userns mappings share volumes. Now let’s see what other questions we need to answer for the implementation:
- Who decides the mapping: the kubelet or the runtime? Kubernetes supports running different runtimes on one node, so the simplest approach is for the kubelet to decide the mappings. Otherwise, runtimes have no way to know if a range is already used by another runtime.
- How large should the mapping be for each pod? Most container images already use IDs up to 65535. If a UID that is in use is not mapped, its files show up as owned by the overflow ID (65534, i.e. `nobody`, by default) and can't be modified. So mapping the range 0-65535 seems like a simple choice here.
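To make this concrete, suppose the kubelet assigns a pod the host range 65536-131071 (the numbers are just an example). The pod's `/proc/<pid>/uid_map` would then contain a single line mapping container IDs 0-65535 onto that host range; the columns are ID-inside-namespace, ID-outside-namespace, and length:

```
0  65536  65536
```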
## The implementation
The UID/GID space in Linux is 32 bits. We divide the ID space into chunks of 16 bits each:
- The range 0-65535 (the first 16 bits) is reserved for the host. This is so the host’s files and processes have no overlap with pods’ files and processes.
- The rest is available for pods, in chunks of 16 bits each. This allows running ~65k pods per node, far more than the Kubernetes default of 110 pods per node.
Using fixed-size chunks lets us avoid fragmentation of the ID space. With variable-size ranges, three pods could claim small consecutive ranges, and after deletion the gaps might be too small for a new pod that needs a larger range. Fixed-size chunks guarantee that every freed slot fits any new pod.
To track which 16-bit range is used by each pod, we use a bitmap with one bit per chunk: 2^16 bits, or 8 KiB, are enough to cover the full 32-bit ID space.
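A minimal sketch of such an allocator (this is illustrative, not the kubelet's actual code): one bit per 64Ki-ID chunk, with chunk 0 reserved for the host, allocating the first free chunk and reusing freed ones.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	chunkSize = 1 << 16               // 65536 IDs per pod
	numChunks = (1 << 32) / chunkSize // 65536 chunks in the 32-bit space
)

type allocator struct {
	bitmap [numChunks / 8]byte // 8 KiB covers the whole 32-bit ID space
}

func newAllocator() *allocator {
	a := &allocator{}
	a.bitmap[0] |= 1 // chunk 0 (IDs 0-65535) is reserved for the host
	return a
}

// allocate marks the first free chunk as used and returns its first host ID.
func (a *allocator) allocate() (uint32, error) {
	for i := 0; i < numChunks; i++ {
		if a.bitmap[i/8]&(1<<(i%8)) == 0 {
			a.bitmap[i/8] |= 1 << (i % 8)
			return uint32(i) * chunkSize, nil
		}
	}
	return 0, errors.New("ID space exhausted")
}

// release frees the chunk that starts at the given host ID.
func (a *allocator) release(start uint32) {
	i := start / chunkSize
	a.bitmap[i/8] &^= 1 << (i % 8)
}

func main() {
	a := newAllocator()
	first, _ := a.allocate()
	second, _ := a.allocate()
	fmt.Println(first, second) // 65536 131072
	a.release(first)
	third, _ := a.allocate()
	fmt.Println(third) // 65536: the freed slot fits the new pod
}
```

Because every chunk has the same size, the "does this freed gap fit?" question never arises: any free bit is a valid slot for any new pod.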
Once we allocate a range for a specific pod, we send the information to the container runtime. Let’s use containerd and runc as an example.
Containerd then creates the rootfs using the requested mappings and puts all the information in the config.json for runc to create the container (these files follow the OCI runtime specification). This includes requesting idmap mounts. Unfortunately, the OCI runtime spec mandates that unknown settings be ignored instead of causing a failure.
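A trimmed, illustrative config.json fragment for such a pod might look roughly like this. The field names follow the OCI runtime spec (`linux.uidMappings` for the user namespace, and per-mount `uidMappings`/`gidMappings` for idmap mounts, added in runtime-spec 1.1), but the paths and ID values here are made up for the example:

```json
{
  "linux": {
    "namespaces": [{ "type": "user" }],
    "uidMappings": [{ "containerID": 0, "hostID": 65536, "size": 65536 }],
    "gidMappings": [{ "containerID": 0, "hostID": 65536, "size": 65536 }]
  },
  "mounts": [
    {
      "destination": "/data",
      "type": "bind",
      "source": "/var/lib/kubelet/pods/example/volumes/data",
      "options": ["rbind"],
      "uidMappings": [{ "containerID": 0, "hostID": 65536, "size": 65536 }],
      "gidMappings": [{ "containerID": 0, "hostID": 65536, "size": 65536 }]
    }
  ]
}
```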
If we start a container with a user namespace and runc silently ignores the idmap request, the IDs used in the volume will be completely different from what we expect, and that is quite hard to recover from.
So, when Alexey added support in the runtime-spec to specify mappings for mounts, we also needed to add that information to the `features` subcommand. Running `runc features` shows a `mountExtensions` field that indicates whether idmap mounts are supported.
This way, containerd can query the runtime and only ask it to create the container if idmap mounts are supported. Otherwise, it just returns an error.
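For example, on a runc build with idmap-mount support, `runc features` reports something like the following (output heavily trimmed; exact version fields vary):

```json
{
  "ociVersionMin": "1.0.0",
  "ociVersionMax": "1.1.0",
  "mountExtensions": {
    "idmap": {
      "enabled": true
    }
  }
}
```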
## Configuration knobs
We allow some configurations too:
- The mapping size doesn't need to be 65536 IDs; it can be configured to be larger. This is useful for running Kubernetes inside Kubernetes, for example. The size is still fixed (to avoid fragmentation), and the first range (0 to the mapping size) is reserved for the host, so its files and processes don't overlap with any container's.
- You can restrict the kubelet to allocate IDs only within a specific range. This doesn't change much for us: the bitmap is cheap enough to model the whole 32-bit space anyway.
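The capacity trade-off of a larger mapping size is simple arithmetic; here is a small sketch (the function name `maxPods` is mine, not a real kubelet API):

```go
package main

import "fmt"

// maxPods returns how many pods fit in the 32-bit ID space for a given
// chunk size; one chunk is always reserved for the host.
func maxPods(chunkSize uint64) uint64 {
	const idSpace = uint64(1) << 32 // Linux UIDs/GIDs are 32 bits
	return idSpace/chunkSize - 1
}

func main() {
	fmt.Println(maxPods(1 << 16)) // default 64Ki-ID chunks: 65535 pods
	fmt.Println(maxPods(1 << 20)) // larger chunks (e.g. k8s-in-k8s): 4095 pods
}
```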
## Conclusion
Hopefully now you have a better sense of how userns has been implemented in Kubernetes and why we made several decisions along the way.