blog: Add post about userns for stateful pods

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
pull/42778/head
Rodrigo Campos 2023-08-29 16:30:21 +02:00
parent 58c519905d
commit 93d8268b09
1 changed files with 148 additions and 0 deletions

View File

@ -0,0 +1,148 @@
---
layout: blog
title: "User Namespaces: Now Supports Running Stateful Pods in Alpha!"
date: 2023-09-13
slug: userns-alpha
---
**Authors:** Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat)
Kubernetes v1.25 introduced support for user namespaces for only stateless
pods. Kubernetes 1.28 lifted that restriction, after some design changes were
done in 1.27.
The beauty of this feature is that:
* it is trivial to adopt (you just need to set a bool in the pod spec)
* doesn't need any changes for **most** applications
* improves security by _drastically_ enhancing the isolation of containers and
mitigating CVEs rated HIGH and CRITICAL.
This post explains the basics of user namespaces and also shows:
* the changes that arrived in the recent Kubernetes v1.28 release
* a **demo of a vulnerability rated as HIGH** that is not exploitable with user namespaces
* the runtime requirements to use this feature
* what you can expect in future releases regarding user namespaces.
## What is a user namespace?
A user namespace is a Linux feature that isolates the user and group identifiers
(UIDs and GIDs) of the containers from the ones on the host. The indentifiers
in the container can be mapped to indentifiers on the host in a way where the
host UID/GIDs used for different containers never overlap. Even more, the
identifiers can be mapped to *unprivileged* non-overlapping UIDs and GIDs on the
host. This basically means two things:
* As the UIDs and GIDs for different containers are mapped to different UIDs
and GIDs on the host, containers have a harder time to attack each other even
if they escape the container boundaries. For example, if container A is running
with different UIDs and GIDs on the host than container B, the operations it
can do on container B's files and process are limited: only read/write what a
file allows to others, as it will never have permission for the owner or
group (the UIDs/GIDs on the host are guaranteed to be different for
different containers).
* As the UIDs and GIDs are mapped to unprivileged users on the host, if a
container escapes the container boundaries, even if it is running as root
inside the container, it has no privileges on the host. This greatly
protects what host files it can read/write, which process it can send signals
to, etc.
Furthermore, capabilities granted are only valid inside the user namespace and
not on the host.
Without using a user namespace a container running as root, in the case of a
container breakout, has root privileges on the node. And if some capabilities
were granted to the container, the capabilities are valid on the host too. None
of this is true when using user namespaces (modulo bugs, of course 🙂).
## Changes in 1.28
As already mentioned, starting from 1.28, Kubernetes supports user namespaces
with stateful pods. This means that pods with user namespaces can use any type
of volume, they are no longer limited to only some volume types as before.
The feature gate to activate this feature was renamed, it is no longer
`UserNamespacesStatelessPodsSupport` but from 1.28 onwards you should use
`UserNamespacesSupport`. There were many changes done and the requirements on
the node hosts changed. So with Kubernetes 1.28 the feature flag was renamed to
reflect this.
## Demo
Rodrigo created a demo which exploits [CVE 2022-0492][cve-link] and shows how
the exploit can occur without user namespaces. He also shows how it is not
possible to use this exploit from a Pod where the containers are using this
feature.
This vulnerability is rated **HIGH** and allows **a container with no special
privileges to read/write to any path on the host** and launch processes as root
on the host too.
{{< youtube id="M4a2b4KkXN8" title="Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support">}}
Most applications in containers run as root today, or as a semi-predictable
non-root user (user ID 65534 is a somewhat popular choice). When you run a Pod
with containers using a userns, Kubernetes runs those containers as unprivileged
users, with no changes needed in your app.
This means two containers running as user 65534 will effectively be mapped to
different users on the host, limiting what they can do to each other in case of
an escape, and if they are running as root, the privileges on the host are
reduced to the one of an unprivileged user.
[cve-link]: https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/
## Node system requirements
There are requirements on the Linux kernel version as well as the container
runtime to use this feature.
On Linux you need Linux 6.3 or greater. This is because the feature relies on a
kernel feature named idmap mounts, and support to use idmap mounts with tmpfs
was merged in Linux 6.3.
If you are using CRI-O with crun, this is [supported in CRI-O
1.28.1][CRIO-release] and crun 1.9 or greater. If you are using CRI-O with runc,
this is still not supported.
containerd support is currently targeted for containerd 2.0; it is likely that
it won't matter if you use it with crun or runc.
Please note that containerd 1.7 added _experimental_ support for user
namespaces as implemented in Kubernetes 1.25 and 1.26. The redesign done in 1.27
is not supported by containerd 1.7, therefore it only works, in terms of user
namespaces support, with Kubernetes 1.25 and 1.26.
One limitation present in containerd 1.7 is that it needs to change the
ownership of every file and directory inside the container image, during Pod
startup. This means it has a storage overhead and can significantly impact the
container startup latency. Containerd 2.0 will probably include a implementation
that will eliminate the startup latency added and the storage overhead. Take
this into account if you plan to use containerd 1.7 with user namespaces in
production.
None of these containerd limitations apply to [CRI-O 1.28][CRIO-release].
[CRIO-release]: https://github.com/cri-o/cri-o/releases/tag/v1.28.1
## Whats next?
Looking ahead to Kubernetes 1.29, the plan is to work with SIG Auth to integrate user
namespaces to Pod Security Standards (PSS) and the Pod Security Admission. For
the time being, the plan is to relax checks in PSS policies when user namespaces are
in use. This means that the fields `spec[.*].securityContext` `runAsUser`,
`runAsNonRoot`, `allowPrivilegeEscalation` and `capabilities` will not trigger a
violation if user namespaces are in use. The behavior will probably be controlled by
utilizing a API Server feature gate, like `UserNamespacesPodSecurityStandards`
or similar.
## How do I get involved?
You can reach SIG Node by several means:
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
You can also contact us directly:
- GitHub: @rata @giuseppe @saschagrunert
- Slack: @rata @giuseppe @sascha