blog: Add post about userns for stateful pods
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>pull/42778/head
parent
58c519905d
commit
93d8268b09
|
@ -0,0 +1,148 @@
|
|||
---
|
||||
layout: blog
|
||||
title: "User Namespaces: Now Supports Running Stateful Pods in Alpha!"
|
||||
date: 2023-09-13
|
||||
slug: userns-alpha
|
||||
---
|
||||
|
||||
**Authors:** Rodrigo Campos Catelin (Microsoft), Giuseppe Scrivano (Red Hat), Sascha Grunert (Red Hat)
|
||||
|
||||
Kubernetes v1.25 introduced support for user namespaces for only stateless
|
||||
pods. Kubernetes 1.28 lifted that restriction, after some design changes were
|
||||
done in 1.27.
|
||||
|
||||
The beauty of this feature is that:
|
||||
* it is trivial to adopt (you just need to set a bool in the pod spec)
|
||||
* doesn't need any changes for **most** applications
|
||||
* improves security by _drastically_ enhancing the isolation of containers and
|
||||
mitigating CVEs rated HIGH and CRITICAL.
|
||||
|
||||
This post explains the basics of user namespaces and also shows:
|
||||
* the changes that arrived in the recent Kubernetes v1.28 release
|
||||
* a **demo of a vulnerability rated as HIGH** that is not exploitable with user namespaces
|
||||
* the runtime requirements to use this feature
|
||||
* what you can expect in future releases regarding user namespaces.
|
||||
|
||||
## What is a user namespace?
|
||||
|
||||
A user namespace is a Linux feature that isolates the user and group identifiers
|
||||
(UIDs and GIDs) of the containers from the ones on the host. The indentifiers
|
||||
in the container can be mapped to indentifiers on the host in a way where the
|
||||
host UID/GIDs used for different containers never overlap. Even more, the
|
||||
identifiers can be mapped to *unprivileged* non-overlapping UIDs and GIDs on the
|
||||
host. This basically means two things:
|
||||
|
||||
* As the UIDs and GIDs for different containers are mapped to different UIDs
|
||||
and GIDs on the host, containers have a harder time to attack each other even
|
||||
if they escape the container boundaries. For example, if container A is running
|
||||
with different UIDs and GIDs on the host than container B, the operations it
|
||||
can do on container B's files and process are limited: only read/write what a
|
||||
file allows to others, as it will never have permission for the owner or
|
||||
group (the UIDs/GIDs on the host are guaranteed to be different for
|
||||
different containers).
|
||||
|
||||
* As the UIDs and GIDs are mapped to unprivileged users on the host, if a
|
||||
container escapes the container boundaries, even if it is running as root
|
||||
inside the container, it has no privileges on the host. This greatly
|
||||
protects what host files it can read/write, which process it can send signals
|
||||
to, etc.
|
||||
|
||||
Furthermore, capabilities granted are only valid inside the user namespace and
|
||||
not on the host.
|
||||
|
||||
Without using a user namespace a container running as root, in the case of a
|
||||
container breakout, has root privileges on the node. And if some capabilities
|
||||
were granted to the container, the capabilities are valid on the host too. None
|
||||
of this is true when using user namespaces (modulo bugs, of course 🙂).
|
||||
|
||||
## Changes in 1.28
|
||||
|
||||
As already mentioned, starting from 1.28, Kubernetes supports user namespaces
|
||||
with stateful pods. This means that pods with user namespaces can use any type
|
||||
of volume, they are no longer limited to only some volume types as before.
|
||||
|
||||
The feature gate to activate this feature was renamed, it is no longer
|
||||
`UserNamespacesStatelessPodsSupport` but from 1.28 onwards you should use
|
||||
`UserNamespacesSupport`. There were many changes done and the requirements on
|
||||
the node hosts changed. So with Kubernetes 1.28 the feature flag was renamed to
|
||||
reflect this.
|
||||
|
||||
## Demo
|
||||
|
||||
Rodrigo created a demo which exploits [CVE 2022-0492][cve-link] and shows how
|
||||
the exploit can occur without user namespaces. He also shows how it is not
|
||||
possible to use this exploit from a Pod where the containers are using this
|
||||
feature.
|
||||
|
||||
This vulnerability is rated **HIGH** and allows **a container with no special
|
||||
privileges to read/write to any path on the host** and launch processes as root
|
||||
on the host too.
|
||||
|
||||
{{< youtube id="M4a2b4KkXN8" title="Mitigation of CVE-2022-0492 on Kubernetes by enabling User Namespace support">}}
|
||||
|
||||
Most applications in containers run as root today, or as a semi-predictable
|
||||
non-root user (user ID 65534 is a somewhat popular choice). When you run a Pod
|
||||
with containers using a userns, Kubernetes runs those containers as unprivileged
|
||||
users, with no changes needed in your app.
|
||||
|
||||
This means two containers running as user 65534 will effectively be mapped to
|
||||
different users on the host, limiting what they can do to each other in case of
|
||||
an escape, and if they are running as root, the privileges on the host are
|
||||
reduced to the one of an unprivileged user.
|
||||
|
||||
[cve-link]: https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/
|
||||
|
||||
## Node system requirements
|
||||
|
||||
There are requirements on the Linux kernel version as well as the container
|
||||
runtime to use this feature.
|
||||
|
||||
On Linux you need Linux 6.3 or greater. This is because the feature relies on a
|
||||
kernel feature named idmap mounts, and support to use idmap mounts with tmpfs
|
||||
was merged in Linux 6.3.
|
||||
|
||||
If you are using CRI-O with crun, this is [supported in CRI-O
|
||||
1.28.1][CRIO-release] and crun 1.9 or greater. If you are using CRI-O with runc,
|
||||
this is still not supported.
|
||||
|
||||
containerd support is currently targeted for containerd 2.0; it is likely that
|
||||
it won't matter if you use it with crun or runc.
|
||||
|
||||
Please note that containerd 1.7 added _experimental_ support for user
|
||||
namespaces as implemented in Kubernetes 1.25 and 1.26. The redesign done in 1.27
|
||||
is not supported by containerd 1.7, therefore it only works, in terms of user
|
||||
namespaces support, with Kubernetes 1.25 and 1.26.
|
||||
|
||||
One limitation present in containerd 1.7 is that it needs to change the
|
||||
ownership of every file and directory inside the container image, during Pod
|
||||
startup. This means it has a storage overhead and can significantly impact the
|
||||
container startup latency. Containerd 2.0 will probably include a implementation
|
||||
that will eliminate the startup latency added and the storage overhead. Take
|
||||
this into account if you plan to use containerd 1.7 with user namespaces in
|
||||
production.
|
||||
|
||||
None of these containerd limitations apply to [CRI-O 1.28][CRIO-release].
|
||||
|
||||
[CRIO-release]: https://github.com/cri-o/cri-o/releases/tag/v1.28.1
|
||||
|
||||
## What’s next?
|
||||
|
||||
Looking ahead to Kubernetes 1.29, the plan is to work with SIG Auth to integrate user
|
||||
namespaces to Pod Security Standards (PSS) and the Pod Security Admission. For
|
||||
the time being, the plan is to relax checks in PSS policies when user namespaces are
|
||||
in use. This means that the fields `spec[.*].securityContext` `runAsUser`,
|
||||
`runAsNonRoot`, `allowPrivilegeEscalation` and `capabilities` will not trigger a
|
||||
violation if user namespaces are in use. The behavior will probably be controlled by
|
||||
utilizing a API Server feature gate, like `UserNamespacesPodSecurityStandards`
|
||||
or similar.
|
||||
|
||||
## How do I get involved?
|
||||
|
||||
You can reach SIG Node by several means:
|
||||
- Slack: [#sig-node](https://kubernetes.slack.com/messages/sig-node)
|
||||
- [Mailing list](https://groups.google.com/forum/#!forum/kubernetes-sig-node)
|
||||
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/sig%2Fnode)
|
||||
|
||||
You can also contact us directly:
|
||||
- GitHub: @rata @giuseppe @saschagrunert
|
||||
- Slack: @rata @giuseppe @sascha
|
Loading…
Reference in New Issue