This blog post is part of a series that will deep dive into user-namespaces support in Kubernetes.
New to user namespaces? See question 6 for introductory material.
User-namespaces (userns) support reached GA in Kubernetes 1.36. This means you can have pods that run inside a user-namespace. The most common reasons people want to do that are:
- Improve isolation: adopting it significantly increases isolation from the host and reduces lateral movement. UIDs/GIDs don’t overlap with any other pod or the host, and capabilities are only valid inside the pod.
- Secure nested containers: It’s possible to create a container inside a container with userns, so you can run dockerd inside a Kubernetes pod (with some other adjustments, but all available now), you can build container images, etc.
How to use it
One of the design goals was to make it trivial to adopt. All you need to do is set hostUsers to
false in your pod spec. If you have the right versions of the stack, all will just work:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns
spec:
  hostUsers: false # <-- add this field.
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian
```
This tells Kubernetes not to use the host’s UIDs/GIDs, and instead a user namespace is created for the pod. Most applications will work just fine with this, completely unmodified.
Setting that field also takes care of using non-overlapping UIDs/GIDs for a pod’s processes and
files. You can use runAsUser and other fields that just affect the user inside the container.
Inside the container nothing changes: if you use runAsUser: 0 you will still see that, but from
the host point of view (e.g. if you run ps) you will see the pod running as an unprivileged user.
When using volumes, the files created there will belong to the user you choose to run the container as.
For example, if you use runAsUser: 0, it will create files owned by root in the volume. This means
you can easily share volumes with containers that are not running with user-namespaces too.
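As a sketch of that behavior (the pod, container, and volume names are illustrative), a pod like this writes root-owned files into its volume, even though on the host its processes run under an unprivileged mapped UID:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-volume # illustrative name
spec:
  hostUsers: false
  containers:
  - name: writer
    image: debian
    command: ["sh", "-c", "touch /data/hello && sleep infinity"]
    securityContext:
      runAsUser: 0 # root inside the container; unprivileged on the host
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    emptyDir: {}
```

Running `ls -l /data` inside the container shows `hello` owned by root, the user chosen with runAsUser.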
Common questions
1. Can I set hostUsers: false in pods with volumes and then remove it if I see some problems?
Yes, you can turn it on and off. Everything, including volumes, will continue to work just fine.
2. Are there any considerations with the filesystems used?
Yes, the filesystems used by the pod need to support idmap mounts on the kernel you are using. Support for idmap mounts is per filesystem, and most popular filesystems support idmap mounts already. The notable exception is NFS, which still doesn’t support idmap mounts.
For example, Linux 6.3 supports idmap mounts with: xfs, ext4, fat, btrfs, ntfs3, f2fs, erofs, overlayfs, squashfs and tmpfs.
You can check which Linux version added support for each fs in the NOTES section for the
mount_setattr manpage. While we try to update the manpage, it is sometimes out of
date. For an authoritative list, clone the Linux repo and grep for FS_ALLOW_IDMAP in the fs/
folder.
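The authoritative check described above can be scripted. Here is a small sketch (`list_idmap_fs` is a hypothetical helper, and the path to your kernel clone is an assumption you need to adjust):

```shell
# list_idmap_fs: list the filesystem directories in a Linux source tree
# that reference FS_ALLOW_IDMAP, i.e. register idmap-mounts support.
list_idmap_fs() {
    local src="${1:?usage: list_idmap_fs /path/to/linux}"
    grep -rl FS_ALLOW_IDMAP "$src/fs" | xargs -r -n1 dirname | sort -u
}

# Example (assuming you cloned the kernel to ~/src/linux):
#   list_idmap_fs ~/src/linux
```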
Keep in mind that the service account token each pod gets by default is usually mounted on a tmpfs filesystem. Also, because files like /etc/resolv.conf and similar are bind-mounted from /var/lib/kubelet/..., you need idmap mounts support on that filesystem too.
3. Is there anything else I need to take into account to use this?
If you meet the stack requirements and have fs support, you are almost there. There are a few other things to take into account, but they are probably a no-op for most apps:
- Running inside a userns makes some operations completely impossible (like loading kernel modules). But unless you need to do something very privileged on the host, you can probably enable userns and just use it without any other changes.
- The container must use UIDs/GIDs from the range 0-65535 for processes and files.
4. Are there any PSS changes when using userns?
Yes, check out the docs for this. Basically, the Pod Security Standards (PSS) checks for “does this pod run as root?” are relaxed. You can run as root inside the container and the PSS won’t complain if you are using userns.
Please note that while capabilities are also namespaced (valid only inside the pod userns, not the
host), they are not relaxed. Capabilities gate access to specific kernel code paths. CAP_SYS_ADMIN,
for example, allows mounts and many other syscalls. Userns does restrict some of what the capability
lets you do (e.g., you can only mount a subset of filesystems inside a userns), but holding it still
lets you reach more kernel code paths from inside the pod. A use-after-free bug that can be used to
elevate privileges is still a potential risk. That’s why we kept capability checks unchanged in PSS.
In other words, if you want CAP_SYS_ADMIN, even if it’s much safer to grant it with userns, the
pod still needs to be privileged to get it.
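As a sketch of how this plays out (the namespace and pod names are illustrative, and the exact set of relaxed checks is described in the Kubernetes docs), a pod running as root inside a userns can satisfy a restricted namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: userns-demo # illustrative name
  labels:
    pod-security.kubernetes.io/enforce: restricted
---
apiVersion: v1
kind: Pod
metadata:
  name: root-in-userns
  namespace: userns-demo
spec:
  hostUsers: false # userns: the "runs as root?" checks are relaxed
  containers:
  - name: shell
    image: debian
    command: ["sleep", "infinity"]
    securityContext:
      runAsUser: 0 # root inside the container only
      allowPrivilegeEscalation: false # still required by restricted
      capabilities:
        drop: ["ALL"] # capability checks are NOT relaxed
      seccompProfile:
        type: RuntimeDefault
```

Note that the other restricted requirements (dropped capabilities, no privilege escalation, a seccomp profile) still apply as usual.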
5. How can I change the kubelet configuration for userns on running nodes?
If you want to change which UID/GID range is used for pods with userns, or how many IDs are allocated per pod, you need to drain the node first.
The kubelet guarantees that no two pods with userns use the same range. When these settings change, the pods need to be recreated so the kubelet can honor the new setting and guarantee no two pods overlap.
6. Where can I find more documentation?
I also wrote the Kubernetes user namespaces documentation. It covers what userns are, kubelet configuration options, PSS integration in more detail, metrics, and a step-by-step example.
While there is a little bit of overlap, I tried to make this post and the docs complementary: each covers things the other doesn’t. Most of the stuff here doesn’t fit in the Kubernetes documentation. The next posts in this series cover the problems with volumes and the technical details of the implementation.
I’ve also written a few introductory blog posts on the Kubernetes site. One worth highlighting is this one, which includes several demos showing how userns mitigates high-severity CVEs. For those demos I built an exploit from the public CVE info, adapted it to work in Kubernetes, and then showed how userns mitigates it.
For completeness, here are all the introductory blog posts I wrote or co-authored, along with a KubeCon talk in the same vein. None of them dive into the technical details the way the next posts in this series do (part II and part III). Also, bear in mind some of them are quite old and details may have changed since:
- K8s blog - Kubernetes 1.25: alpha support for running Pods with user namespaces
- K8s blog - User Namespaces: Now Supports Running Stateful Pods in Alpha!
- K8s blog - Kubernetes 1.30: Beta Support For Pods With User Namespaces
- K8s blog - Kubernetes v1.33: User Namespaces enabled by default!
- K8s blog - Kubernetes v1.36: User Namespaces in Kubernetes are finally GA
- KubeCon NA 2022 talk - Run As “Root”, Not Root: User Namespaces In K8s
- Kinvolk blog - Improving Kubernetes and container security with user namespaces: written by Alban, a coworker at Kinvolk. I was already working with him on the project at the time.
- Kinvolk blog - Tips and tricks for user namespaces with Kubernetes and containerd
Stack requirements
To make a feature that significantly improves the isolation and is so simple to adopt, we needed to make changes in every layer of the stack: the Linux kernel, OCI runtimes (runc, crun), high-level container runtimes (containerd, cri-o) and Kubernetes.
| Component | Version | Notes |
|---|---|---|
| Kubernetes | 1.25 | Stateless pods support, enable alpha feature gate UserNamespacesStatelessPodsSupport |
| Kubernetes | 1.27 | Stateless pods support reworked to use idmap mounts |
| Kubernetes | 1.28 | Stateless and stateful pods support, enable alpha feature gate UserNamespacesSupport |
| Kubernetes | 1.30 | Beta, enable beta feature gate UserNamespacesSupport |
| Kubernetes | 1.33 | Beta, enabled by default, no need to enable feature gate |
| Kubernetes | 1.36 | GA, no need to have beta features enabled |
| containerd | 1.7 | Only works with Kubernetes 1.25–1.26 |
| containerd | 2.0+ | Needed for Kubernetes 1.27+ |
| CRI-O | 1.25+ | Supports all features of the same Kubernetes version |
| runc | 1.2 | Support for idmap mounts |
| crun | 1.9 | 1.13+ recommended for better userns-related error messages |
| Linux | 6.3 | Most popular filesystems support idmap mounts. With care, you can also use 5.19 and 5.12 |
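As a quick sanity check against the table above, something like this sketch can be run on a node (it assumes the tools are on the node's PATH; any missing one is just reported and skipped):

```shell
# userns_stack_versions: print the version of each component relevant
# for userns support; tools missing from PATH are reported and skipped.
userns_stack_versions() {
    local cmd
    for cmd in uname runc crun containerd kubelet; do
        if command -v "$cmd" >/dev/null 2>&1; then
            case "$cmd" in
                uname) echo "kernel: $(uname -r)" ;;
                *) "$cmd" --version ;;
            esac
        else
            echo "$cmd: not installed"
        fi
    done
}

userns_stack_versions
```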
How does this compare with other security mechanisms?
It’s quite different from the other security mechanisms, even compared to “regular unprivileged pods” (pods without user-namespaces that run as an unprivileged user, with restricted capabilities).
Pods with seccomp/apparmor
Compared with pods that rely only on seccomp/apparmor for hardening:
- The pod still runs as root on the host and all capabilities are valid on the host
- We use seccomp/apparmor to limit what an already very privileged pod can do
But scenarios like container breakouts can still have a very big impact.
Unprivileged pods (without userns)
Running as an unprivileged user is a significant improvement over running as root. However, it’s still different from userns:
- It’s harder than it seems: not even the regular “nginx” image works unprivileged. Google engineers had a KubeCon talk about all the problems they faced trying to adopt it for GKE components. It’s not intuitive: sometimes they needed to split the app into several initContainers, rearchitect parts, and even create new KEPs. They ended up choosing userns whenever possible.
- Capabilities are still valid on the host: any capabilities you grant are usable after a container breakout. With userns, capabilities are only valid inside the pod.
- No lateral movement protection: most people pick the same UID (e.g. 65534), so all unprivileged pods share it. With userns, each pod gets a unique range on the host.
Let’s think about it for a moment
Running processes as different UIDs/GIDs is probably one of the most basic security measures we can take. Yet in the container world, we run as root on the host, giving a lot of privileges, and then try to restrict what root can do with seccomp/apparmor/others. It’s like playing whack-a-mole. It’s not that hard to find an escape as root! A LOT of CVEs rated HIGH happened because of running as root.
User namespaces let us change this: instead of granting permissions to processes that shouldn’t have them and then trying to restrict what they can do, we simply don’t grant those permissions, or grant them only within the container. That is exactly what we want.
Linux distros learned this lesson years ago: services like bind don’t run as root. Systemd brought more options to the table (e.g. socket activation). When containers arrived we moved fast and left some of those lessons behind. I don’t want to criticize the past; we had reasons to do things that way. But it’s time to revisit those decisions.
A personal note
I’m honestly super happy that userns is finally a GA feature in Kubernetes. Giuseppe Scrivano and I have been working on this for the last 6 years. I can’t believe it’s finally here and available in all major clouds!
Conclusion
I’ve shared, I hope, everything you need to know to use user-namespaces in Kubernetes and how userns compares to what was already available.
If you got curious about how pods that run with non-overlapping UIDs can still share volumes without issues, check out the next post. I’ll explore the problems and solutions we had at all layers of the stack.