Merge pull request #35235 from kinvolk/rata/userns

Add docs for user namespaces in pods - phase 1 (KEP-127)
2022-08-18 07:28:23 -07:00
parent 438b42689d 501cde25c7
commit fae72f6cc0
3 changed files with 274 additions and 0 deletions
--- a/content/en/docs/concepts/workloads/pods/user-namespaces.md
+++ b/content/en/docs/concepts/workloads/pods/user-namespaces.md
@ -0,0 +1,164 @@
+---
+title: User Namespaces
+reviewers:
+content_type: concept
+weight: 160
+min-kubernetes-server-version: v1.25
+---
+
+<!-- overview -->
+{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
+
+This page explains how user namespaces are used in Kubernetes pods. A user
+namespace isolates the user running inside the container from the one
+on the host.
+
+A process running as root in a container can run as a different (non-root) user
+on the host; in other words, the process has full privileges for operations
+inside the user namespace, but is unprivileged for operations outside the
+namespace.
+
+You can use this feature to reduce the damage a compromised container can do to
+the host or other pods on the same node. There are [several security
+vulnerabilities][KEP-vulns] rated either **HIGH** or **CRITICAL** that were not
+exploitable when user namespaces are active. User namespaces are expected to
+mitigate some future vulnerabilities too.
+
+[KEP-vulns]: https://github.com/kubernetes/enhancements/tree/217d790720c5aef09b8bd4d6ca96284a0affe6c2/keps/sig-node/127-user-namespaces#motivation
+
+<!-- body -->
+## {{% heading "prerequisites" %}}
+
+{{% thirdparty-content single="true" %}}
+<!-- if adding another runtime in the future, omit the single setting -->
+
+This is a Linux-only feature. In addition, support is needed in the
+{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}
+to use this feature with Kubernetes stateless pods:
+
+* CRI-O: v1.25 has support for user namespaces.
+
+* containerd: support is planned for the 1.7 release. See containerd
+  issue [#7063][containerd-userns-issue] for more details.
+
+Support for this in [cri-dockerd is not planned][CRI-dockerd-issue] yet.
+
+[CRI-dockerd-issue]: https://github.com/Mirantis/cri-dockerd/issues/74
+[containerd-userns-issue]: https://github.com/containerd/containerd/issues/7063
+
+## Introduction
+
+User namespaces are a Linux feature that maps users in the container to
+different users on the host. Furthermore, the capabilities granted to a pod in
+a user namespace are valid only in the namespace and void outside of it.
+
+A pod can opt in to using user namespaces by setting the `pod.spec.hostUsers`
+field to `false`.
+
+The kubelet picks the host UIDs/GIDs a pod is mapped to, and does so in a way
+that guarantees no two stateless pods on the same node use the same mapping.
+
+The `runAsUser`, `runAsGroup`, `fsGroup`, etc. fields in the `pod.spec` always
+refer to the user inside the container.
+
+The valid UIDs/GIDs when this feature is enabled are in the range 0-65535. This
+applies to files and processes (`runAsUser`, `runAsGroup`, etc.).
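+
+As a minimal sketch of the points above, the following manifest combines
+`hostUsers: false` with IDs inside the 0-65535 range. The pod name
+`userns-ids` and the specific ID values are hypothetical, chosen only for
+illustration; the IDs refer to users and groups inside the container, not on
+the host.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: userns-ids        # hypothetical name
+spec:
+  hostUsers: false        # opt in to a user namespace
+  securityContext:
+    runAsUser: 1000       # UID inside the container (assumed value, within 0-65535)
+    runAsGroup: 3000      # GID inside the container (assumed value, within 0-65535)
+  containers:
+  - name: shell
+    command: ["sleep", "infinity"]
+    image: debian
+```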
+
+Files using a UID/GID outside this range will be seen as belonging to the
+overflow ID, usually 65534 (configured in `/proc/sys/kernel/overflowuid` and
+`/proc/sys/kernel/overflowgid`). However, it is not possible to modify those
+files, even by running as the 65534 user/group.
+
+Most applications that need to run as root but don't access other host
+namespaces or resources should continue to run fine without any changes
+when user namespaces are activated.
+
+## Understanding user namespaces for stateless pods
+
+Several container runtimes with their default configuration (like Docker Engine,
+containerd, CRI-O) use Linux namespaces for isolation. Other technologies exist
+and can be used with those runtimes too (e.g. Kata Containers uses VMs instead of
+Linux namespaces). This page is applicable for container runtimes using Linux
+namespaces for isolation.
+
+When creating a pod, by default, several new namespaces are used for isolation:
+a network namespace to isolate the network of the container, a PID namespace to
+isolate the view of processes, etc. If a user namespace is used, this will
+isolate the users in the container from the users in the node.
+
+This means containers can run as root and be mapped to a non-root user on the
+host. Inside the container the process thinks it is running as root (and
+therefore tools like `apt`, `yum`, etc. work fine), while in reality the process
+doesn't have privileges on the host. You can verify this, for example, by
+running `ps` on the host to check which user the container process runs as. The
+user `ps` shows is not the same as the user you see if you run `id` inside the
+container.
+
+This abstraction limits what can happen, for example, if the container manages
+to escape to the host. Because the container is running as a non-privileged
+user on the host, what it can do to the host is limited.
+
+Furthermore, as users on each pod are mapped to different non-overlapping
+users on the host, what they can do to other pods is limited too.
+
+Capabilities granted to a pod are also limited to the pod's user namespace and
+mostly invalid outside of it; some are even completely void. Here are two
+examples:
+
+- `CAP_SYS_MODULE` does not have any effect if granted to a pod using user
+  namespaces; the pod isn't able to load kernel modules.
+- `CAP_SYS_ADMIN` is limited to the pod's user namespace and invalid outside
+  of it.
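+
+A sketch of how this might look in a manifest, assuming a hypothetical pod
+named `userns-caps`: the container is granted `SYS_ADMIN` (the `CAP_` prefix is
+omitted in the Kubernetes API), and because `hostUsers` is `false` the
+capability applies only inside the pod's user namespace.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: userns-caps       # hypothetical name
+spec:
+  hostUsers: false        # capabilities below are scoped to the pod's user namespace
+  containers:
+  - name: shell
+    command: ["sleep", "infinity"]
+    image: debian
+    securityContext:
+      capabilities:
+        add: ["SYS_ADMIN"]   # illustrative choice; invalid outside the user namespace
+```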
+
+Without a user namespace, a container running as root has root privileges on
+the node in the case of a container breakout. And if some capabilities were
+granted to the container, those capabilities are valid on the host too. None of
+this is true when user namespaces are used.
+
+If you want to know more details about what changes when user namespaces are in
+use, see `man 7 user_namespaces`.
+
+## Set up a node to support user namespaces
+
+It is recommended that the host's files and host's processes use UIDs/GIDs in
+the range of 0-65535.
+
+The kubelet assigns UIDs/GIDs higher than that to pods, so keeping the host
+within this range guarantees as much isolation as possible.
+
+Note that this recommendation is important to mitigate the impact of CVEs like
+[CVE-2021-25741][CVE-2021-25741], where a pod can potentially read arbitrary
+files on the host. If the UIDs/GIDs of the pod and the host don't overlap, what
+a pod can do is limited: the pod UID/GID won't match the host's file
+owner/group.
+
+[CVE-2021-25741]: https://github.com/kubernetes/kubernetes/issues/104980
+
+## Limitations
+
+When using a user namespace for the pod, you cannot use other host
+namespaces. In particular, if you set `hostUsers: false` then you are not
+allowed to set any of:
+
+ * `hostNetwork: true`
+ * `hostIPC: true`
+ * `hostPID: true`
+
+The pod is allowed to use no volumes at all or, if using volumes, only these
+volume types are allowed:
+
+ * configMap
+ * secret
+ * projected
+ * downwardAPI
+ * emptyDir
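+
+For illustration, a pod that stays within these limits might look like the
+following sketch. The pod name `userns-volumes`, the ConfigMap name
+`app-config`, and the mount paths are hypothetical.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: userns-volumes    # hypothetical name
+spec:
+  hostUsers: false
+  containers:
+  - name: shell
+    command: ["sleep", "infinity"]
+    image: debian
+    volumeMounts:
+    - name: scratch
+      mountPath: /scratch          # hypothetical path
+    - name: config
+      mountPath: /etc/app-config   # hypothetical path
+      readOnly: true
+  volumes:
+  - name: scratch
+    emptyDir: {}                   # allowed volume type
+  - name: config
+    configMap:                     # allowed volume type
+      name: app-config             # hypothetical ConfigMap, assumed to exist
+```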
+
+To guarantee that the pod can read the files of such volumes, volumes are
+created as if you specified `.spec.securityContext.fsGroup` as `0` for the Pod.
+If you specify a different value, that value is honored instead.
+
+As a by-product of this, folders and files for these volumes have
+permissions for the group, even if `defaultMode` or the `mode` of specific items
+in the volumes was specified without group permissions. For example, it is not
+possible to mount these volumes in a way that their files have permissions only
+for the owner.
--- a/content/en/docs/tasks/configure-pod-container/user-namespaces.md
+++ b/content/en/docs/tasks/configure-pod-container/user-namespaces.md
@ -0,0 +1,100 @@
+---
+title: Use a User Namespace With a Pod
+reviewers:
+content_type: task
+weight: 160
+min-kubernetes-server-version: v1.25
+---
+
+<!-- overview -->
+{{< feature-state for_k8s_version="v1.25" state="alpha" >}}
+
+This page shows how to configure a user namespace for stateless pods. This
+isolates the user running inside the container from the one on the
+host.
+
+A process running as root in a container can run as a different (non-root) user
+on the host; in other words, the process has full privileges for operations
+inside the user namespace, but is unprivileged for operations outside the
+namespace.
+
+You can use this feature to reduce the damage a compromised container can do to
+the host or other pods on the same node. There are [several security
+vulnerabilities][KEP-vulns] rated either **HIGH** or **CRITICAL** that were not
+exploitable when user namespaces are active. User namespaces are expected to
+mitigate some future vulnerabilities too.
+
+Without a user namespace, a container running as root has root privileges on
+the node in the case of a container breakout. And if some capabilities were
+granted to the container, those capabilities are valid on the host too. None of
+this is true when user namespaces are used.
+
+[KEP-vulns]: https://github.com/kubernetes/enhancements/tree/217d790720c5aef09b8bd4d6ca96284a0affe6c2/keps/sig-node/127-user-namespaces#motivation
+
+## {{% heading "prerequisites" %}}
+
+{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
+
+{{% thirdparty-content single="true" %}}
+<!-- if adding another runtime in the future, omit the single setting -->
+
+* The node OS needs to be Linux
+* You need to be able to run commands on the host
+* You need to be able to exec into pods
+* The feature gate `UserNamespacesStatelessPodsSupport` needs to be enabled.
+
+In addition, support is needed in the
+{{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}
+to use this feature with Kubernetes stateless pods:
+
+* CRI-O: v1.25 has support for user namespaces.
+
+Please note that **if your container runtime doesn't support user namespaces, the
+new `pod.spec` field will be silently ignored and the pod will be created without
+user namespaces.**
+
+<!-- steps -->
+
+## Run a Pod that uses a user namespace {#create-pod}
+
+A user namespace for a stateless pod is enabled by setting the `hostUsers` field
+of `.spec` to `false`. For example:
+
+{{< codenew file="pods/user-namespaces-stateless.yaml" >}}
+
+1. Create the pod on your cluster:
+
+   ```shell
+   kubectl apply -f https://k8s.io/examples/pods/user-namespaces-stateless.yaml
+   ```
+
+1. Open a shell in the container and run `readlink /proc/self/ns/user`:
+
+   ```shell
+   kubectl exec -it userns -- bash
+   ```
+
+Run `readlink /proc/self/ns/user` and `cat /proc/self/uid_map`. The output is similar to this:
+
+```none
+readlink /proc/self/ns/user
+user:[4026531837]
+cat /proc/self/uid_map
+0          0 4294967295
+```
+
+Then, open a shell on the host and run the same commands.
+
+The output must be different. This means the host and the pod are using
+different user namespaces. When user namespaces are not enabled, the host and the
+pod use the same user namespace.
+
+If you are running the kubelet inside a user namespace, you need to compare the
+output from running the command in the pod to the output of running the
+following command on the host:
+
+```none
+readlink /proc/$pid/ns/user
+user:[4026534732]
+```
+
+replacing `$pid` with the kubelet PID.
--- a/content/en/examples/pods/user-namespaces-stateless.yaml
+++ b/content/en/examples/pods/user-namespaces-stateless.yaml
@ -0,0 +1,10 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: userns
+spec:
+  hostUsers: false
+  containers:
+  - name: shell
+    command: ["sleep", "infinity"]
+    image: debian