[{"content":" This blog post is part of a series on user namespaces in Kubernetes.\nIn the previous post, we saw how idmap mounts let containers with different userns mappings share volumes. Now let\u0026rsquo;s see what other questions we need to answer for the implementation:\nWho decides the mapping: the kubelet or the runtime? Kubernetes supports running different runtimes on one node, so the simplest approach is for the kubelet to decide the mappings. Otherwise, runtimes have no way to know if a range is already used by another runtime.\nHow large should the mapping be for each pod? Most container images already use IDs up to 65535. If a UID in use is not mapped, it will be shown as the overflow ID and you can\u0026rsquo;t modify it. So using 0-65535 seems like a simple choice here.\nThe implementation The UID/GID space in Linux is 32 bits. We divide the ID space into chunks of 16 bits each:\nThe range 0-65535 (the first 16 bits) is reserved for the host. This is so the host\u0026rsquo;s files and processes have no overlap with pods\u0026rsquo; files and processes.\nThe rest is available for pods in chunks of 16 bits each. This allows running ~65k pods per node, far more than the default of 110 that Kubernetes has today.\nUsing fixed-size chunks lets us avoid fragmentation of the ID space. With variable-size ranges, three pods could claim small consecutive ranges, and after deletion the gaps might be too small for a new pod that needs a larger range. Fixed-size chunks guarantee that every freed slot fits any new pod.\nTo track which 16-bit range is used by a pod, we use a bitmap. We can track the full 32-bit ID space in ~8MB.\nOnce we allocate a range for a specific pod, we send the information to the container runtime. 
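Before getting into the runtimes, the allocation scheme described above can be sketched in a few lines. This is a toy illustration of the idea, not the kubelet's actual code; the class and method names are made up:

```python
CHUNK = 1 << 16                  # each pod gets 65536 contiguous IDs
NUM_CHUNKS = (1 << 32) // CHUNK  # 65536 chunks cover the 32-bit ID space

class IDRangeAllocator:
    """Toy fixed-size allocator: one entry per 16-bit chunk of the ID space."""

    def __init__(self):
        self.used = [False] * NUM_CHUNKS
        self.used[0] = True  # chunk 0 (IDs 0-65535) is reserved for the host

    def allocate(self):
        """Reserve a free chunk and return (first_host_id, length)."""
        for i, used in enumerate(self.used):
            if not used:
                self.used[i] = True
                return i * CHUNK, CHUNK
        raise RuntimeError("ID space exhausted")

    def release(self, first_host_id):
        """Free a pod's chunk; any later pod fits in it again (no fragmentation)."""
        self.used[first_host_id // CHUNK] = False

alloc = IDRangeAllocator()
print(alloc.allocate())  # (65536, 65536): the first pod starts right after the host range
```

Because every chunk has the same size, `release` followed by `allocate` can never fail due to fragmentation: any freed slot fits any new pod.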
Let\u0026rsquo;s use containerd and runc as an example.\nContainerd then creates the rootfs using the mappings requested and puts all the information in the config.json for runc to create the containers (these files follow the OCI runtime specification). This includes requesting idmap mounts. Unfortunately, the OCI runtime spec ignores unknown settings instead of failing.\nIf we start a container with userns and the idmap is ignored by runc, the IDs used in the volume will be completely different from what we expect. And it\u0026rsquo;s quite hard to recover from that.\nSo, when Alexey added support in the runtime-spec to specify mappings for mounts, we also needed to add the information to the features subcommand. Running runc features shows a mountExtensions field that indicates whether idmap is supported.\nThis way, containerd can query the runtime and only ask it to create the container if idmap mounts are supported. Otherwise, containerd just returns an error.\nConfiguration knobs We allow some configurations too:\nThe mapping size doesn\u0026rsquo;t need to be 16-bit — it can be configured to be larger. This is useful to run Kubernetes inside Kubernetes, for example. The size is still fixed (to avoid fragmentation), and the first range (0 to mapping size) is reserved for the host, so its files and processes don\u0026rsquo;t overlap with any container.\nYou can restrict the kubelet to allocate IDs in a specific range. This doesn\u0026rsquo;t change much for us. The bitmap is efficient enough to model the whole 32 bits. 
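Coming back to the features check: the way a runtime manager consumes `runc features` can be approximated like this. The trimmed JSON sample and the helper name are mine, and the exact `mountExtensions` layout shown here should be verified against your runc version; the key point is failing closed when the field is absent:

```python
import json

# Example output of `runc features`, heavily trimmed; the real output has
# many more fields. Treat the exact structure as an assumption to verify.
features_json = """
{
  "ociVersionMin": "1.0.0",
  "ociVersionMax": "1.2.0",
  "mountExtensions": {
    "idmap": {
      "enabled": true
    }
  }
}
"""

def supports_idmap_mounts(features_output: str) -> bool:
    """Return True only if the runtime explicitly advertises idmap mounts."""
    features = json.loads(features_output)
    idmap = features.get("mountExtensions", {}).get("idmap", {})
    # A missing field means "unknown": fail closed, because a runtime that
    # silently ignores the idmap request would corrupt volume ownership.
    return idmap.get("enabled") is True

print(supports_idmap_mounts(features_json))  # True
print(supports_idmap_mounts("{}"))           # False
```

Failing closed is the important design choice: since the runtime spec tolerates unknown settings, the absence of a positive signal must be treated as "unsupported".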
Conclusion Hopefully now you have a better sense of how userns has been implemented in Kubernetes and why we made several decisions along the way.\n","permalink":"https://blog.sdfg.com.ar/posts/userns-in-kubernetes-implementation/","summary":"\u003cblockquote\u003e\n\u003cp\u003eThis blog post is part of a \u003ca href=\"/tags/kubernetes-userns-series/\"\u003eseries\u003c/a\u003e on user namespaces in\nKubernetes.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eIn the \u003ca href=\"/posts/userns-in-kubernetes-mappings\"\u003eprevious post\u003c/a\u003e, we saw how idmap mounts let\ncontainers with different userns mappings share volumes. Now let\u0026rsquo;s see what other questions we need\nto answer for the implementation:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eWho decides the mapping: the kubelet or the runtime?\u003c/strong\u003e Kubernetes supports running different\nruntimes on one node, so the simplest approach is for the kubelet to decide the mappings.\nOtherwise, runtimes have no way to know if a range is already used by another runtime.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eHow large should the mapping be for each pod?\u003c/strong\u003e Most container images already use IDs up to\n65535. If a UID in use is not mapped, it will be shown as the overflow id and you can\u0026rsquo;t modify\nit. So using 0-65535 seems like a simple choice here.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"the-implementation\"\u003eThe implementation\u003c/h2\u003e\n\u003cp\u003eThe UID/GID space in Linux is 32 bits. 
We divide the ID space into chunks of 16 bits each:\u003c/p\u003e","title":"User Namespaces in Kubernetes: The Implementation"},{"content":" This blog post is part of a series on user namespaces in Kubernetes.\nAlthough userns have been in Linux for a long time, limited support for volumes has held back wider adoption in the container world.\nMappings and files When we create a userns, we need to specify a mapping: which UIDs and GIDs inside the container correspond to which ones outside. For example:\nUID inside userns | UID outside userns | count\n0 | 100000 | 1\nThis maps UID 0 inside the userns to UID 100k outside. Processes inside the userns see themselves as UID 0 (even whoami says root), but from the host\u0026rsquo;s point of view they run as UID 100k.\nLet\u0026rsquo;s create a userns with those mappings for UIDs/GIDs and play in the console. As root run:\n# unshare --user --map-users 0:100000:1 --map-groups 0:100000:1 --setuid 0 --setgid 0\n-bash: /root/.bash_profile: Permission denied\n# whoami\nroot\n# touch a\n# ls -l a\n-rw-r--r-- 1 root root 0 Apr 7 22:02 a\nSo, inside the userns it seems we are running as root. But let\u0026rsquo;s check from the host what we see:\n$ ps faux\n...\n100000 771844 0.0 0.0 8432 5268 pts/22 S+ 22:01 0:00 \\_ -bash\n$ ls -l a\n-rw-r--r-- 1 100000 100000 0 Apr 7 22:02 a\nFiles created inside the userns are stored with UID 100k on disk, even though inside the userns they appear owned by root. 
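The translation in this demo can be modeled in a few lines of Python. This is an illustration of the kernel's behavior, not kernel code; 65534 is the usual value of /proc/sys/kernel/overflowuid, used when an ID falls outside the mapping:

```python
OVERFLOW_UID = 65534  # typical value of /proc/sys/kernel/overflowuid

# One mapping entry, as in `unshare --map-users 0:100000:1`:
# (first UID inside, first UID outside, count)
MAPPING = [(0, 100000, 1)]

def to_host(inside_uid, mapping=MAPPING):
    """Write path: translate a UID inside the userns to the UID stored on disk."""
    for inside, outside, count in mapping:
        if inside <= inside_uid < inside + count:
            return outside + (inside_uid - inside)
    return None  # not representable through this userns

def to_container(host_uid, mapping=MAPPING):
    """Read path: reverse-translate the on-disk UID for display inside the userns."""
    for inside, outside, count in mapping:
        if outside <= host_uid < outside + count:
            return inside + (host_uid - outside)
    return OVERFLOW_UID  # unmapped IDs show up as the overflow UID

print(to_host(0))            # 100000: files written by root inside land as 100000
print(to_container(100000))  # 0: read back, they look owned by root again
```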
Two things are happening here:\nWhen we write, the UID 0 inside the container is mapped to 100k and written to disk\nWhen we read (like the first ls command), the UID of the file is 100k and the reverse mapping is used to show it as UID 0 inside the userns\nThe immediate question that comes to mind after this is: what happens if we ls a file owned by a UID not in our mapping?\nLet\u0026rsquo;s create a file outside the userns:\n$ touch not-mapped\n$ ls -ln not-mapped\n-rw-rw-r-- 1 1000 1000 0 Apr 7 22:13 not-mapped\nLet\u0026rsquo;s check how we read it from the userns:\n# ls -l not-mapped\n-rw-rw-r-- 1 nobody nogroup 0 Apr 7 22:13 not-mapped\n# ls -ln not-mapped\n-rw-rw-r-- 1 65534 65534 0 Apr 7 22:13 not-mapped\nThis is because when a UID/GID is not mapped, the overflow UID/GID is used to represent it (configured in /proc/sys/kernel/overflowuid). My next question was: so, if I run as UID 65534 inside the userns, can I just change any file? The answer is: of course not. That would be a very simple security exploit. But you can try it yourself now, if you want.\nThe problem Userns gives us better security isolation. But it\u0026rsquo;s not obvious how to use it with containers that share files.\nThe first solution that comes to mind is to use the same mapping for all containers that share files. This way, all containers will use the same UID on disk and all will work fine, as long as the UID they run as inside the container is the same.\nThis is certainly an option, but it has several downsides:\nFlag day (a change that requires all consumers to update at once): If we want to share files with containers not using userns, or just be able to toggle userns on and off, we can\u0026rsquo;t easily do it. We\u0026rsquo;d need to chown the volumes every time, which can be very expensive.\nLateral movement: if different containers use the same mappings and one escapes, it can read the other\u0026rsquo;s files, send signals, etc. 
This is not perfect, but it is what the first incarnations of the userns KEP in 2016 tried to do. Docker userns remap does exactly this: one mapping for all containers.\nThe solution: idmap mounts Shortly after we started the work on userns for Kubernetes, Christian Brauner (Linux VFS maintainer and co-founder of Amutable, where I work) merged support for idmap mounts.\nAn idmap mount is a mount that does some ID transformation for UIDs and GIDs, just when accessed via that mount. So it applies only to accesses via that location, and it only lasts as long as the mount.\nRemember that a userns transforms UIDs in one direction when writing files and in the other when reading them? An idmap mount does the same kind of transformation. We can combine them to \u0026ldquo;revert\u0026rdquo; the effect userns has on file UID/GIDs. We can have UID 0 inside the container write to a file and have UID 0 on the inode of that file written on disk.\nLet\u0026rsquo;s create some files and directories:\n$ mkdir src dst\n$ sudo touch src/a\n$ ls -l src/\ntotal 0\n-rw-r--r-- 1 root root 0 Apr 7 22:32 a\nLet\u0026rsquo;s create a userns and bind-mount src into dst, using an ID transformation (idmap) for that mount.\n# mount -o bind,X-mount.idmap=b:0:100000:1 ./src/ ./dst/\n# unshare --user --map-users 0:100000:1 --map-groups 0:100000:1 --setuid 0 --setgid 0\n-bash: /root/.bash_profile: Permission denied\n# ls -l dst/\ntotal 0\n-rw-r--r-- 1 root root 0 Apr 7 22:32 a\n# touch dst/b\n# ls -l dst/b\n-rw-r--r-- 1 root root 0 Apr 7 22:42 dst/b\nThe file we created as root on the host still shows as root inside the container, even though we\u0026rsquo;re running with userns and a mapping. 
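To see why the two transformations cancel out, we can compose them for the single-entry mapping 0:100000:1 used above. This is a toy model of the behavior, not kernel code:

```python
# Toy model of the 0:100000:1 mapping used in the demo above.
# Writing a file through the idmapped mount, from inside the userns:
#   container UID --(userns mapping)--> host UID --(idmap mount)--> UID on disk

def userns_write(container_uid):
    """The userns maps container UID 0 to host UID 100000 on write."""
    if 0 <= container_uid < 1:
        return container_uid + 100000
    return None  # outside the single-entry mapping

def idmap_mount(host_uid):
    """The mount option X-mount.idmap=b:0:100000:1 maps host UID 100000 back to 0."""
    if 100000 <= host_uid < 100001:
        return host_uid - 100000
    return None

# root inside the container ends up as root on disk: the transformations cancel.
print(idmap_mount(userns_write(0)))  # 0
```

Reading goes through the same two steps in the opposite order, which is why files owned by root on disk also appear owned by root inside the container.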
Let\u0026rsquo;s see what we see from the host for the file we just created inside the container:\n$ ls -l src/\ntotal 0\n-rw-r--r-- 1 root root 0 Apr 7 22:32 a\n-rw-r--r-- 1 root root 0 Apr 7 22:42 b\n$ ls -l dst/\ntotal 0\n-rw-r--r-- 1 100000 100000 0 Apr 7 22:32 a\n-rw-r--r-- 1 100000 100000 0 Apr 7 22:42 b\nThrough src (no idmap), both files belong to root. Through dst (with idmap), they show as 100k. That\u0026rsquo;s why inside the userns — where 100k maps back to 0 — they appear owned by root.\nThe details can get tricky, but the underlying idea is simple: a userns transforms UID/GIDs when we read/write files, and idmap mounts let us undo those transformations.\nUsing idmap mounts, containers can share files no matter which userns mappings a container is using. There are still more pieces of the puzzle to have a full solution for Kubernetes, but don\u0026rsquo;t worry, I\u0026rsquo;ll explain them in the next post.\n","permalink":"https://blog.sdfg.com.ar/posts/userns-in-kubernetes-mappings/","summary":"\u003cblockquote\u003e\n\u003cp\u003eThis blog post is part of a \u003ca href=\"/tags/kubernetes-userns-series/\"\u003eseries\u003c/a\u003e on user namespaces in\nKubernetes.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eAlthough userns have been in Linux for a long time, limited support for volumes has held back wider\nadoption in the container world.\u003c/p\u003e\n\u003ch2 id=\"mappings-and-files\"\u003eMappings and files\u003c/h2\u003e\n\u003cp\u003eWhen we create a userns, we need to specify a mapping: which UIDs and GIDs inside the container\ncorrespond to which ones outside. 
For example:\u003c/p\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003eUID inside userns\u003c/th\u003e\n          \u003cth\u003eUID outside userns\u003c/th\u003e\n          \u003cth\u003ecount\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e0\u003c/td\u003e\n          \u003ctd\u003e100000\u003c/td\u003e\n          \u003ctd\u003e1\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eThis maps UID 0 inside the userns to UID 100k outside. Processes inside the userns see themselves as\nUID 0 (even \u003ccode\u003ewhoami\u003c/code\u003e says root), but from the host\u0026rsquo;s point of view they run as UID 100k.\u003c/p\u003e","title":"User namespaces in Kubernetes: Mappings and File Ownership"},{"content":" This blog post is part of a series that will deep dive into user-namespaces support in Kubernetes.\nUser-namespaces (userns) support reached GA in Kubernetes 1.36. This means you can have pods that run inside a user-namespace. The most common reasons people want to do that are:\nImprove isolation: Adopting it will significantly increase the host isolation and reduce lateral movement. UIDs/GIDs don\u0026rsquo;t overlap with any other pod or the host, and capabilities are only valid inside the pod.\nSecure nested containers: It\u0026rsquo;s possible to create a container inside a container with userns, so you can run dockerd inside a Kubernetes pod (with some other adjustments, but all available now), you can build container images, etc.\nHow to use it One of the design goals was to make it trivial to adopt. All you need to do is set hostUsers to false in your pod spec. If you have the right versions of the stack, all will just work:\napiVersion: v1\nkind: Pod\nmetadata:\n  name: userns\nspec:\n  hostUsers: false # \u0026lt;-- add this field.\n  containers:\n  - name: shell\n    command: [\u0026#34;sleep\u0026#34;, \u0026#34;infinity\u0026#34;]\n    image: debian\nThis tells Kubernetes not to use the host\u0026rsquo;s UIDs/GIDs, and instead a user namespace is created for the pod. Most applications will work just fine with this, completely unmodified.\nSetting that field also takes care of using non-overlapping UIDs/GIDs for a pod\u0026rsquo;s processes and files. You can use runAsUser and other fields that just affect the user inside the container. Inside the container nothing changes: if you use runAsUser: 0 you will still see that, but from the host point of view (e.g. if you run ps) you will see the pod running as an unprivileged user.\nWhen using volumes, the files created there will belong to the user you choose to run the container as. For example, if you use runAsUser: 0, it will create files owned by root in the volume. This means you can easily share volumes with containers that are not running with user-namespaces too.\nLet\u0026rsquo;s get practical with some questions and answers:\n1. Can I set hostUsers: false in pods with volumes and then remove it if I see some problems?\nYes, you can turn it on and off — everything, including volumes, will continue to work just fine.\n2. Are there any considerations regarding the file-systems used?\nYes, the file-systems used by the pod need to support idmap mounts on the kernel you are using. Support for idmap mounts is per file-system and kernel version. You can check which Linux version added support for each fs in the NOTES section of the mount_setattr manpage.\nWhile we try to keep the manpage updated, it is sometimes out of date. You can clone the Linux repo and grep for FS_ALLOW_IDMAP in the fs/ folder for an authoritative list.\nBear in mind that the service account token each pod has by default is usually backed by a tmpfs file-system. 
Also, because files like /etc/resolv.conf and similar are bind-mounted from /var/lib/kubelet/..., you need idmap mount support on that fs too.\n3. Is there anything else I need to take into account to use this?\nIf you meet the stack requirements and have fs support, you are almost there. There are a few other things to take into account, but they are probably a no-op for most apps:\nRunning inside a userns makes some operations completely impossible (like loading kernel modules). But unless you need to do something very privileged on the host, you can probably enable userns and just use it without any other changes.\nThe container must use UIDs/GIDs from the range 0-65535 for processes and files.\n4. Are there any Pod Security Standards (PSS) changes when using userns?\nYes, check out the docs for this. Basically, the PSS checks for \u0026ldquo;does this pod run as root?\u0026rdquo; are relaxed. You can run as root inside the container and the PSS won\u0026rsquo;t complain if you are using userns.\nPlease note that while capabilities are also namespaced (valid only inside the pod userns, not on the host), they are not relaxed. Capabilities do allow you to do more operations and potentially hit some kernel CVEs, which is why we decided to keep them unchanged with userns.\nIn other words, if you want CAP_SYS_ADMIN, even if it\u0026rsquo;s much safer to grant it with userns, the pod still needs to be privileged to get it.\n5. 
How can I change the kubelet configuration for userns on running nodes?\nIf you want to change which UID/GID range is used for pods with userns, or how many IDs are allocated per pod, you need to drain the node first.\nThe kubelet guarantees that no two pods with userns use the same range, and when these settings change, the pods need to be recreated so the kubelet can honor the new setting and guarantee no two pods overlap.\nStack requirements To make a feature that significantly improves the isolation and is so simple to adopt, we needed to make changes in every layer of the stack: the Linux kernel, OCI runtimes (runc, crun), high-level container runtimes (containerd, cri-o) and Kubernetes.\nComponent | Version | Notes\nKubernetes | 1.25 | Stateless pods support, requires enabling the alpha feature gate UserNamespacesStatelessSupport\nKubernetes | 1.28 | Stateless and stateful pods support, requires enabling the alpha feature gate UserNamespacesSupport\nKubernetes | 1.30 | Beta, requires enabling the beta feature gate UserNamespacesSupport\nKubernetes | 1.33 | Beta, enabled by default, no need to enable the feature gate\nKubernetes | 1.36 | GA, no need to have beta features enabled\ncontainerd | 2.0 | Needed for k8s \u0026gt;=1.27. v1.7 has limited support, works only with Kubernetes 1.25–1.26\nCRI-O | 1.25 | Supports all features of the same Kubernetes version\nrunc | 1.2 | Support for idmap mounts\ncrun | 1.9 | 1.13+ recommended for better error messages\nLinux | 6.3 | Most popular file-systems support idmap mounts. With care, you can also use 5.19 and 5.12\nHow does this compare with what we have today? 
It\u0026rsquo;s quite different from what we have today, in several dimensions — even compared to \u0026ldquo;regular unprivileged pods\u0026rdquo; (pods without user-namespaces that run as an unprivileged user, with restricted capabilities).\nPods with seccomp/apparmor If we compare with pods just using seccomp/apparmor to secure them, it\u0026rsquo;s quite different:\nThe pod still runs as root on the host and all capabilities are valid on the host\nWe use seccomp/apparmor to limit what an already very privileged pod can do\nBut\u0026hellip; scenarios like container breakouts can still have a very big impact.\nUnprivileged pods (without userns) Running as an unprivileged user is a significant improvement over running as root. However, it\u0026rsquo;s still different from userns:\nIt\u0026rsquo;s harder than it seems: not even the regular \u0026ldquo;nginx\u0026rdquo; image works unprivileged. Google engineers had a Kubecon talk about all the problems they faced trying to adopt it for GKE components. They ended up choosing userns instead.\nCapabilities are still valid on the host: any capabilities you grant are usable after a container breakout. With userns, capabilities are only valid inside the pod.\nNo lateral movement protection: most people pick the same UID (e.g. 65534), so all unprivileged pods share it. With userns, each pod gets a unique range on the host.\nLet\u0026rsquo;s think about it for a moment Running processes as different UIDs/GIDs is probably one of the most basic security measures we can take. Yet in the container world, we run as root on the host, giving a lot of privileges, and then try to restrict what root can do with seccomp/apparmor/others. It\u0026rsquo;s like playing whack-a-mole; it\u0026rsquo;s not that hard to find an escape as root! 
A LOT of CVEs rated HIGH happened because of running as root.\nUser-namespaces allow us to change this: instead of giving permissions to processes that shouldn\u0026rsquo;t have them, we just don\u0026rsquo;t give them those permissions, or we give them only within the container. That is exactly what we want.\nLinux distros learned this lesson years ago — services like bind don\u0026rsquo;t run as root. When containers started we moved fast and left some of those lessons behind. I don\u0026rsquo;t want to criticize the past; we had reasons to do that. But it\u0026rsquo;s time to revisit those decisions.\nA personal note I\u0026rsquo;m honestly super-happy that userns is finally a GA feature in Kubernetes. I\u0026rsquo;ve been working on this for the last 6 years. I can\u0026rsquo;t believe it\u0026rsquo;s finally there and available in all major clouds!\nConclusion I\u0026rsquo;ve shared, I hope, everything you need to know to use user-namespaces in Kubernetes and how userns compares to what was already available.\nIf you got curious about how pods that run with non-overlapping UIDs can still share volumes without issues, check out the next post. I\u0026rsquo;ll explore the problems and solutions we had at all layers of the stack.\n","permalink":"https://blog.sdfg.com.ar/posts/all-about-userns-in-kubernetes/","summary":"\u003cblockquote\u003e\n\u003cp\u003eThis blog post is part of a \u003ca href=\"/tags/kubernetes-userns-series/\"\u003eseries\u003c/a\u003e that will deep dive into\nuser-namespaces support in Kubernetes.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eUser-namespaces (userns) support reached GA in Kubernetes 1.36. This means you can have pods that\nrun inside a user-namespace. The most common reasons people want to do that are:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eImprove isolation\u003c/strong\u003e: Adopting it will significantly increase the host isolation and reduce lateral\nmovement. 
UIDs/GIDs don\u0026rsquo;t overlap with any other pod or the host, and capabilities are only\nvalid inside the pod.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eSecure nested containers\u003c/strong\u003e: It\u0026rsquo;s possible to create a container inside a container with userns,\nso you can run dockerd inside a Kubernetes pod (with some other adjustments, but all available\nnow), you can build container images, etc.\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 id=\"how-to-use-it\"\u003eHow to use it\u003c/h2\u003e\n\u003cp\u003eOne of the design goals was to make it trivial to adopt. All you need to do is set \u003ccode\u003ehostUsers\u003c/code\u003e to\n\u003ccode\u003efalse\u003c/code\u003e in your pod spec. If you have the right versions of the stack, all will just work:\u003c/p\u003e","title":"All You Need to Know to Use User Namespaces in Kubernetes"}]