diff --git a/content/en/docs/concepts/workloads/pods/user-namespaces.md b/content/en/docs/concepts/workloads/pods/user-namespaces.md
new file mode 100644
index 0000000000..aa376d7657
--- /dev/null
+++ b/content/en/docs/concepts/workloads/pods/user-namespaces.md
@@ -0,0 +1,164 @@
+---
+title: User Namespaces
+reviewers:
+content_type: concept
+weight: 160
+min-kubernetes-server-version: v1.25
+---
+
+
+{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
+
+This page explains how user namespaces are used in Kubernetes pods. A user
+namespace isolates the user running inside the container from the one
+in the host.
+
+A process running as root in a container can run as a different (non-root) user
+in the host; in other words, the process has full privileges for operations
+inside the user namespace, but is unprivileged for operations outside the
+namespace.
+
+You can use this feature to reduce the damage a compromised container can do to
+the host or other pods in the same node. There are [several security
+vulnerabilities][KEP-vulns] rated either **HIGH** or **CRITICAL** that were not
+exploitable when user namespaces are active. User namespaces are expected to
+mitigate some future vulnerabilities too.
+
+[KEP-vulns]: https://github.com/kubernetes/enhancements/tree/217d790720c5aef09b8bd4d6ca96284a0affe6c2/keps/sig-node/127-user-namespaces#motivation
+
+
+## {{% heading "prerequisites" %}}
+
+{{% thirdparty-content single="true" %}}
+
+
+This is a Linux-only feature. In addition, support is needed in the
+{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}
+to use this feature with Kubernetes stateless pods:
+
+* CRI-O: v1.25 has support for user namespaces.
+
+* containerd: support is planned for the 1.7 release. See containerd
+  issue [#7063][containerd-userns-issue] for more details.
+
+Support for this in [cri-dockerd is not planned][CRI-dockerd-issue] yet.
+
+[CRI-dockerd-issue]: https://github.com/Mirantis/cri-dockerd/issues/74
+[containerd-userns-issue]: https://github.com/containerd/containerd/issues/7063
+
+## Introduction
+
+User namespaces are a Linux feature that maps users in the container to
+different users in the host. Furthermore, the capabilities granted to a pod in
+a user namespace are valid only in the namespace and void outside of it.
+
+A pod can opt in to user namespaces by setting the `pod.spec.hostUsers` field
+to `false`.
+
+The kubelet will pick host UIDs/GIDs a pod is mapped to, and will do so in a way
+to guarantee that no two stateless pods on the same node use the same mapping.
+
+The `runAsUser`, `runAsGroup`, `fsGroup`, etc. fields in the `pod.spec` always
+refer to the user inside the container.
+
+When this feature is enabled, the valid UIDs/GIDs are in the range 0-65535. This
+applies to files and processes (`runAsUser`, `runAsGroup`, etc.).
+
+Files using a UID/GID outside this range will be seen as belonging to the
+overflow ID, usually 65534 (configured in `/proc/sys/kernel/overflowuid` and
+`/proc/sys/kernel/overflowgid`). However, it is not possible to modify those
+files, even by running as the 65534 user/group.
+
+Most applications that need to run as root but don't access other host
+namespaces or resources should continue to run fine without any changes needed
+if user namespaces are activated.
+
+## Understanding user namespaces for stateless pods
+
+Several container runtimes with their default configuration (like Docker Engine,
+containerd, CRI-O) use Linux namespaces for isolation. Other technologies exist
+and can be used with those runtimes too (e.g. Kata Containers uses VMs instead of
+Linux namespaces). This page is applicable for container runtimes using Linux
+namespaces for isolation.
+
+When creating a pod, by default, several new namespaces are used for isolation:
+a network namespace to isolate the network of the container, a PID namespace to
+isolate the view of processes, etc. If a user namespace is used, this will
+isolate the users in the container from the users in the node.
+
+This means containers can run as root and be mapped to a non-root user on the
+host. Inside the container the process will think it is running as root (and
+therefore tools like `apt`, `yum`, etc. work fine), while in reality the process
+doesn't have privileges on the host. You can verify this, for example, by
+running `ps` on the host to check which user the container process runs as.
+The user `ps` shows is not the same as the user you see if you run the command
+`id` inside the container.
+
+This abstraction limits what can happen, for example, if the container manages
+to escape to the host. Given that the container is running as a non-privileged
+user on the host, what it can do to the host is limited.
+
+Furthermore, as users on each pod will be mapped to different non-overlapping
+users in the host, what they can do to other pods is limited too.
+
+Capabilities granted to a pod are also limited to the pod user namespace and
+mostly invalid outside of it; some are even completely void. Here are two examples:
+- `CAP_SYS_MODULE` does not have any effect if granted to a pod using user
+namespaces; the pod isn't able to load kernel modules.
+- `CAP_SYS_ADMIN` is limited to the pod's user namespace and invalid outside
+of it.
+
+Without a user namespace, a container running as root has, in the case of a
+container breakout, root privileges on the node. And if some capabilities were
+granted to the container, those capabilities are valid on the host too. None of
+this is true when we use user namespaces.
+
+If you want to know more details about what changes when user namespaces are in
+use, see `man 7 user_namespaces`.
+
+## Set up a node to support user namespaces
+
+It is recommended that the host's files and host's processes use UIDs/GIDs in
+the range of 0-65535.
+
+The kubelet will assign UIDs/GIDs higher than that to pods. Therefore, to
+guarantee as much isolation as possible, the UIDs/GIDs used by the host's files
+and host's processes should be in the range 0-65535.
+
+Note that this recommendation is important to mitigate the impact of CVEs like
+[CVE-2021-25741][CVE-2021-25741], where a pod can potentially read arbitrary
+files on the host. If the UIDs/GIDs of the pod and the host don't overlap, what
+a pod would be able to do is limited: the pod UID/GID won't match the
+host's file owner/group.
+
+[CVE-2021-25741]: https://github.com/kubernetes/kubernetes/issues/104980
+
+## Limitations
+
+When using a user namespace for the pod, the pod cannot use other host
+namespaces. In particular, if you set `hostUsers: false` then you are not
+allowed to set any of:
+
+ * `hostNetwork: true`
+ * `hostIPC: true`
+ * `hostPID: true`
+
+The pod is allowed to use no volumes at all or, if using volumes, only these
+volume types are allowed:
+
+ * configmap
+ * secret
+ * projected
+ * downwardAPI
+ * emptyDir
+
+To guarantee that the pod can read the files of such volumes, volumes are
+created as if you specified `.spec.securityContext.fsGroup` as `0` for the Pod.
+If you specify a different value for this field, that value is honored
+instead.
+
+As a by-product of this, folders and files for these volumes will have
+permissions for the group, even if `defaultMode` or `mode` for specific items of
+the volumes was specified without group permissions. For example, it is not
+possible to mount these volumes in a way that their files have permissions only
+for the owner.
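
The volume restriction in the Limitations section above can be sketched as a small shell check. This is only an illustration and not part of the change: `userns_volume_allowed` is a hypothetical helper name, and the list of types is copied from the page text rather than taken from the kubelet or API server validation code.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kubernetes): succeeds when the given volume
# type is one the page lists as allowed for pods with `hostUsers: false`.
userns_volume_allowed() {
  # Lowercase the input so that "configMap" and "configmap" are treated alike.
  vol_type=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$vol_type" in
    configmap|secret|projected|downwardapi|emptydir) return 0 ;;
    *) return 1 ;;
  esac
}
```

For instance, `userns_volume_allowed emptyDir` succeeds, while `userns_volume_allowed hostPath` fails because `hostPath` is not in the list above.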
diff --git a/content/en/docs/tasks/configure-pod-container/user-namespaces.md b/content/en/docs/tasks/configure-pod-container/user-namespaces.md
new file mode 100644
index 0000000000..e03f38c633
--- /dev/null
+++ b/content/en/docs/tasks/configure-pod-container/user-namespaces.md
@@ -0,0 +1,100 @@
+---
+title: Use a User Namespace With a Pod
+reviewers:
+content_type: task
+weight: 160
+min-kubernetes-server-version: v1.25
+---
+
+
+{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
+
+This page shows how to configure a user namespace for stateless pods. This
+isolates the user running inside the container from the one in the
+host.
+
+A process running as root in a container can run as a different (non-root) user
+in the host; in other words, the process has full privileges for operations
+inside the user namespace, but is unprivileged for operations outside the
+namespace.
+
+You can use this feature to reduce the damage a compromised container can do to
+the host or other pods in the same node. There are [several security
+vulnerabilities][KEP-vulns] rated either **HIGH** or **CRITICAL** that were not
+exploitable when user namespaces are active. User namespaces are expected to
+mitigate some future vulnerabilities too.
+
+Without a user namespace, a container running as root has, in the case of a
+container breakout, root privileges on the node. And if some capabilities were
+granted to the container, those capabilities are valid on the host too. None of
+this is true when user namespaces are used.
+
+[KEP-vulns]: https://github.com/kubernetes/enhancements/tree/217d790720c5aef09b8bd4d6ca96284a0affe6c2/keps/sig-node/127-user-namespaces#motivation
+
+## {{% heading "prerequisites" %}}
+
+{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
+
+{{% thirdparty-content single="true" %}}
+
+
+* The node OS needs to be Linux
+* You need to be able to run commands on the host
+* You need to be able to exec into pods
+* The feature gate `UserNamespacesStatelessPodsSupport` needs to be enabled.
+
+In addition, support is needed in the
+{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}
+to use this feature with Kubernetes stateless pods:
+
+* CRI-O: v1.25 has support for user namespaces.
+
+Please note that **if your container runtime doesn't support user namespaces, the
+new `pod.spec` field will be silently ignored and the pod will be created without
+user namespaces.**
+
+
+
+## Run a Pod that uses a user namespace {#create-pod}
+
+A user namespace for a stateless pod is enabled by setting the `hostUsers` field
+of `.spec` to `false`. For example:
+
+{{< codenew file="pods/user-namespaces-stateless.yaml" >}}
+
+1. Create the pod on your cluster:
+
+   ```shell
+   kubectl apply -f https://k8s.io/examples/pods/user-namespaces-stateless.yaml
+   ```
+
+1. Open a shell in the container and run `readlink /proc/self/ns/user`:
+
+   ```shell
+   kubectl exec -it userns -- bash
+   ```
+
+Run the command. The output is similar to this:
+
+```none
+readlink /proc/self/ns/user
+user:[4026531837]
+cat /proc/self/uid_map
+0 0 4294967295
+```
+
+Then, open a shell in the host and run the same command.
+
+The output must be different. This means the host and the pod are using a
+different user namespace. When user namespaces are not enabled, the host and the
+pod use the same user namespace.
+
+If you are running the kubelet inside a user namespace, you need to compare the
+output from running the command in the pod to the output of running this
+command in the host:
+
+```none
+readlink /proc/$pid/ns/user
+user:[4026534732]
+```
+
+replacing `$pid` with the kubelet PID.
diff --git a/content/en/examples/pods/user-namespaces-stateless.yaml b/content/en/examples/pods/user-namespaces-stateless.yaml
new file mode 100644
index 0000000000..d254259f66
--- /dev/null
+++ b/content/en/examples/pods/user-namespaces-stateless.yaml
@@ -0,0 +1,10 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: userns
+spec:
+  hostUsers: false
+  containers:
+  - name: shell
+    command: ["sleep", "infinity"]
+    image: debian
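
The host-versus-pod comparison described in the task page can be scripted. The following is a minimal sketch and not part of the change: the helper names `userns_inode` and `same_userns` are hypothetical, and the only assumption about the input is the `user:[<inode>]` format shown in the sample output of `readlink /proc/self/ns/user`.

```shell
#!/bin/sh
# userns_inode STRING
# Print the inode number from readlink output such as "user:[4026531837]".
# Prints nothing if the string does not have that shape.
userns_inode() {
  printf '%s\n' "$1" | sed -n 's/^user:\[\([0-9][0-9]*\)\]$/\1/p'
}

# same_userns A B
# Succeed only when both strings are well-formed and name the same namespace.
same_userns() {
  a=$(userns_inode "$1")
  b=$(userns_inode "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

One possible use, assuming the pod is named `userns` as in the example manifest: capture `readlink /proc/self/ns/user` on the node, capture `kubectl exec userns -- readlink /proc/self/ns/user` for the pod, and pass both strings to `same_userns`. A failing result means the pod runs in a different user namespace than the host shell.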