Merge pull request #35235 from kinvolk/rata/userns

Add docs for user namespaces in pods - phase 1 (KEP-127)
2022-08-18 07:28:23 -07:00 · 2022-08-18 07:28:23 -07:00 · fae72f6cc0
parent 438b42689d 501cde25c7
commit fae72f6cc0
3 changed files with 274 additions and 0 deletions
--- a/content/en/docs/concepts/workloads/pods/user-namespaces.md
+++ b/content/en/docs/concepts/workloads/pods/user-namespaces.md
@ -0,0 +1,164 @@
 ---
 title: User Namespaces
 reviewers:
 content_type: concept
 weight: 160
 min-kubernetes-server-version: v1.25
 ---
 <!-- overview -->
 {{< feature-state for_k8s_version="v1.25" state="alpha" >}}
 This page explains how user namespaces are used in Kubernetes pods. A user
 namespace isolates the user running inside the container from the one
 in the host.
 A process running as root in a container can run as a different (non-root) user
 in the host; in other words, the process has full privileges for operations
 inside the user namespace, but is unprivileged for operations outside the
 namespace.
 You can use this feature to reduce the damage a compromised container can do to
 the host or other pods in the same node. There are [several security
 vulnerabilities][KEP-vulns] rated either **HIGH** or **CRITICAL** that were not
 exploitable when user namespaces are active. User namespaces are expected to
 mitigate some future vulnerabilities too.
 [KEP-vulns]: https://github.com/kubernetes/enhancements/tree/217d790720c5aef09b8bd4d6ca96284a0affe6c2/keps/sig-node/127-user-namespaces#motivation
 <!-- body -->
 ## {{% heading "prerequisites" %}}
 {{% thirdparty-content single="true" %}}
 <!-- if adding another runtime in the future, omit the single setting -->
 This is a Linux-only feature. In addition, support is needed in the
 {{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}
 to use this feature with Kubernetes stateless pods:
 * CRI-O: v1.25 has support for user namespaces.
 * containerd: support is planned for the 1.7 release. See containerd
  issue [#7063][containerd-userns-issue] for more details.
 Support for this in [cri-dockerd is not planned][CRI-dockerd-issue] yet.
 [CRI-dockerd-issue]: https://github.com/Mirantis/cri-dockerd/issues/74
 [containerd-userns-issue]: https://github.com/containerd/containerd/issues/7063
 ## Introduction
 User namespaces are a Linux feature that maps users in the container to
 different users in the host. Furthermore, the capabilities granted to a pod in
 a user namespace are valid only in the namespace and void outside of it.
 A pod can opt in to use user namespaces by setting the `pod.spec.hostUsers` field
 to `false`.
 The kubelet picks the host UIDs/GIDs a pod is mapped to, and does so in a way
 that guarantees that no two stateless pods on the same node use the same mapping.
 The `runAsUser`, `runAsGroup`, `fsGroup`, etc. fields in the `pod.spec` always
 refer to the user inside the container.
 When this feature is enabled, the valid UIDs/GIDs are in the range 0-65535. This
 applies to files and processes (`runAsUser`, `runAsGroup`, etc.).
 Files using a UID/GID outside this range will be seen as belonging to the
 overflow ID, usually 65534 (configured in `/proc/sys/kernel/overflowuid` and
 `/proc/sys/kernel/overflowgid`). However, it is not possible to modify those
 files, even by running as the 65534 user/group.
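 The range rule above can be sketched as a small shell helper. This is only an
 illustration of the rule as stated on this page, not how the kernel implements
 it; `seen_id` is a hypothetical name, and the fallback value 65534 is the usual
 default mentioned above.
 ```shell
 # seen_id ID [OVERFLOW_ID]
 # Print the ID a process in the pod would see for a file that uses ID.
 # seen_id is a hypothetical helper, not a Kubernetes or kernel tool.
 seen_id() {
   id_value="$1"
   # Default to the kernel's configured overflow UID, or 65534 if it cannot be read.
   overflow="${2:-$(cat /proc/sys/kernel/overflowuid 2>/dev/null || echo 65534)}"
   if [ "$id_value" -ge 0 ] && [ "$id_value" -le 65535 ]; then
     echo "$id_value"
   else
     echo "$overflow"
   fi
 }
 # An ID inside the valid range is shown unchanged.
 seen_id 1000
 # An ID outside the range is shown as the overflow ID.
 seen_id 70000
 ```
 Pass the overflow ID as the second argument to use a value other than the one
 configured on the machine where the helper runs.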
 Most applications that need to run as root but don't access other host
 namespaces or resources should continue to run fine without any changes
 when user namespaces are activated.
 ## Understanding user namespaces for stateless pods
 Several container runtimes with their default configuration (like Docker Engine,
 containerd, CRI-O) use Linux namespaces for isolation. Other technologies exist
 and can be used with those runtimes too (e.g. Kata Containers uses VMs instead of
 Linux namespaces). This page is applicable for container runtimes using Linux
 namespaces for isolation.
 When creating a pod, by default, several new namespaces are used for isolation:
 a network namespace to isolate the network of the container, a PID namespace to
 isolate the view of processes, etc. If a user namespace is used, this will
 isolate the users in the container from the users in the node.
 This means containers can run as root and be mapped to a non-root user on the
 host. Inside the container the process will think it is running as root (and
 therefore tools like `apt`, `yum`, etc. work fine), while in reality the process
 doesn't have privileges on the host. You can verify this, for example, by
 using `ps` on the host to check which user the container process runs as. The
 user `ps` shows is not the same as the user you see if you run the command `id`
 inside the container.
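 The two views differ because of the ID mapping of the user namespace. Each line
 of `/proc/<pid>/uid_map` has three fields: the first ID inside the namespace,
 the first ID outside of it, and the length of the range (see
 `man 7 user_namespaces`). The sketch below translates a container ID to a host
 ID for one such line. `host_id` is a hypothetical helper, and the mapping values
 are made up for illustration; the kubelet chooses the real ones.
 ```shell
 # host_id CONTAINER_ID INSIDE_START OUTSIDE_START LENGTH
 # Translate an ID inside the user namespace to the ID outside of it,
 # using the three fields of a single uid_map or gid_map line.
 host_id() {
   cid="$1"; inside="$2"; outside="$3"; length="$4"
   if [ "$cid" -ge "$inside" ] && [ "$cid" -lt $((inside + length)) ]; then
     echo $((outside + cid - inside))
   else
     echo "unmapped"
   fi
 }
 # Hypothetical mapping: container IDs 0-65535 start at host ID 131072.
 host_id 0 0 131072 65536      # root in the container, prints 131072
 host_id 1000 0 131072 65536   # prints 132072
 ```
 With such a mapping, `id` inside the container reports UID 0 while `ps` on the
 host reports the translated, non-root UID.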
 This abstraction limits what can happen, for example, if the container manages
 to escape to the host. Given that the container is running as a non-privileged
 user on the host, what it can do to the host is limited.
 Furthermore, as users on each pod will be mapped to different non-overlapping
 users in the host, what they can do to other pods is limited too.
 Capabilities granted to a pod are also limited to the pod user namespace and
 mostly invalid outside of it; some are even completely void. Here are two examples:
 - `CAP_SYS_MODULE` does not have any effect if granted to a pod using user
 namespaces; the pod isn't able to load kernel modules.
 - `CAP_SYS_ADMIN` is limited to the pod's user namespace and invalid outside
 of it.
 Without a user namespace, a container running as root has root privileges on
 the node in the case of a container breakout. If some capabilities were
 granted to the container, those capabilities are valid on the host too. None of
 this is true when user namespaces are used.
 If you want to know more details about what changes when user namespaces are in
 use, see `man 7 user_namespaces`.
 ## Set up a node to support user namespaces
 It is recommended that the host's files and host's processes use UIDs/GIDs in
 the range of 0-65535.
 The kubelet assigns UIDs/GIDs higher than that to pods, so keeping the host's
 files and processes within this range guarantees as much isolation as possible.
 Note that this recommendation is important to mitigate the impact of CVEs like
 [CVE-2021-25741][CVE-2021-25741], where a pod can potentially read arbitrary
 files on the host. If the UIDs/GIDs of the pod and the host don't overlap,
 what a pod can do is limited: the pod UID/GID won't match the
 host's file owner/group.
 [CVE-2021-25741]: https://github.com/kubernetes/kubernetes/issues/104980
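 One rough way to review a node against this recommendation is to look for
 accounts whose UID or GID is above 65535. The sketch below reads a file in
 `passwd` format; `high_ids` is a hypothetical helper, and it only covers account
 entries, not the ownership of files already on disk.
 ```shell
 # high_ids FILE
 # Print the names of entries in a passwd-format FILE whose UID (field 3)
 # or GID (field 4) is higher than 65535.
 high_ids() {
   awk -F: '$3 > 65535 || $4 > 65535 { print $1 }' "$1"
 }
 # On a node, check the local account database.
 high_ids /etc/passwd
 ```
 No output means that no local account uses an ID outside the recommended range.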
 ## Limitations
 When using a user namespace for the pod, you cannot use other host
 namespaces. In particular, if you set `hostUsers: false` then you are not
 allowed to set any of:
 * `hostNetwork: true`
 * `hostIPC: true`
 * `hostPID: true`
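 The restriction above can be checked before applying a manifest. The sketch
 below is a naive text scan, not a YAML parser and not the validation the API
 server performs; `conflicts_with_userns` is a hypothetical helper and it assumes
 one pod per file.
 ```shell
 # conflicts_with_userns FILE
 # Print "conflict" if FILE sets hostUsers: false together with
 # hostNetwork, hostIPC or hostPID set to true; print "ok" otherwise.
 conflicts_with_userns() {
   if grep -Eq '^[[:space:]]*hostUsers:[[:space:]]*false' "$1" &&
      grep -Eq '^[[:space:]]*host(Network|IPC|PID):[[:space:]]*true' "$1"; then
     echo "conflict"
   else
     echo "ok"
   fi
 }
 # Example with a made-up manifest fragment written to a temporary file.
 manifest="$(mktemp)"
 printf 'spec:\n  hostUsers: false\n  hostPID: true\n' > "$manifest"
 conflicts_with_userns "$manifest"   # prints: conflict
 rm -f "$manifest"
 ```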
 The pod is allowed to use no volumes at all or, if using volumes, only these
 volume types are allowed:
 * configmap
 * secret
 * projected
 * downwardAPI
 * emptyDir
 To guarantee that the pod can read the files of such volumes, volumes are
 created as if you specified `.spec.securityContext.fsGroup` as `0` for the Pod.
 If you set it to a different value, that value is
 honored instead.
 As a by-product of this, folders and files for these volumes will have
 permissions for the group, even if `defaultMode` or the `mode` of specific items of
 the volumes was specified without permissions for groups. For example, it is not
 possible to mount these volumes in a way that their files have permissions only
 for the owner.
--- a/content/en/docs/tasks/configure-pod-container/user-namespaces.md
+++ b/content/en/docs/tasks/configure-pod-container/user-namespaces.md
@ -0,0 +1,100 @@
 ---
 title: Use a User Namespace With a Pod
 reviewers:
 content_type: task
 weight: 160
 min-kubernetes-server-version: v1.25
 ---
 <!-- overview -->
 {{< feature-state for_k8s_version="v1.25" state="alpha" >}}
 This page shows how to configure a user namespace for stateless pods. This
 isolates the user running inside the container from the one in the
 host.
 A process running as root in a container can run as a different (non-root) user
 in the host; in other words, the process has full privileges for operations
 inside the user namespace, but is unprivileged for operations outside the
 namespace.
 You can use this feature to reduce the damage a compromised container can do to
 the host or other pods in the same node. There are [several security
 vulnerabilities][KEP-vulns] rated either **HIGH** or **CRITICAL** that were not
 exploitable when user namespaces are active. User namespaces are expected to
 mitigate some future vulnerabilities too.
 Without a user namespace, a container running as root has root privileges on
 the node in the case of a container breakout. If some capabilities were
 granted to the container, those capabilities are valid on the host too. None of
 this is true when user namespaces are used.
 [KEP-vulns]: https://github.com/kubernetes/enhancements/tree/217d790720c5aef09b8bd4d6ca96284a0affe6c2/keps/sig-node/127-user-namespaces#motivation
 ## {{% heading "prerequisites" %}}
 {{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
 {{% thirdparty-content single="true" %}}
 <!-- if adding another runtime in the future, omit the single setting -->
 * The node OS needs to be Linux.
 * You need to be able to run commands on the host.
 * You need to be able to exec into pods.
 * The feature gate `UserNamespacesStatelessPodsSupport` needs to be enabled.
 In addition, support is needed in the
 {{< glossary_tooltip text="container runtime" term_id="container-runtime" >}}
 to use this feature with Kubernetes stateless pods:
 * CRI-O: v1.25 has support for user namespaces.
 Please note that **if your container runtime doesn't support user namespaces, the
 new `pod.spec` field will be silently ignored and the pod will be created without
 user namespaces.**
 <!-- steps -->
 ## Run a Pod that uses a user namespace {#create-pod}
 A user namespace for a stateless pod is enabled by setting the `hostUsers` field of
 `.spec` to `false`. For example:
 {{< codenew file="pods/user-namespaces-stateless.yaml" >}}
 1. Create the pod on your cluster:
   ```shell
   kubectl apply -f https://k8s.io/examples/pods/user-namespaces-stateless.yaml
   ```
 1. Open a shell in the container and run `readlink /proc/self/ns/user`:
   ```shell
   kubectl exec -it userns -- bash
   ```
 Then run the command. The output is similar to this:
 ```none
 readlink /proc/self/ns/user
 user:[4026531837]
 cat /proc/self/uid_map
 0          0 4294967295
 ```
 Then, open a shell in the host and run the same command.
 The output must be different. This means the host and the pod are using
 different user namespaces. When user namespaces are not enabled, the host and the
 pod use the same user namespace.
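 The comparison can be scripted once both values have been collected. This is a
 minimal sketch: `same_userns` is a hypothetical helper, and the two sample
 identifiers are example values only; real ones differ per system.
 ```shell
 # same_userns POD_NS HOST_NS
 # Compare two values printed by `readlink /proc/<pid>/ns/user`.
 same_userns() {
   if [ "$1" = "$2" ]; then
     echo "same"
   else
     echo "different"
   fi
 }
 # Sample values; collect the real ones in the pod and on the host.
 same_userns 'user:[4026534732]' 'user:[4026531837]'   # prints: different
 ```
 On a real cluster, the first argument would come from running
 `readlink /proc/self/ns/user` in the pod and the second from running it on the
 host. A result of `different` means the pod runs in its own user namespace.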
 If you are running the kubelet inside a user namespace, you need to compare the
 output from running the command in the pod to the output of running in the host:
 ```none
 readlink /proc/$pid/ns/user
 user:[4026534732]
 ```
 replacing `$pid` with the kubelet PID.
--- a/content/en/examples/pods/user-namespaces-stateless.yaml
+++ b/content/en/examples/pods/user-namespaces-stateless.yaml
@ -0,0 +1,10 @@
 apiVersion: v1
 kind: Pod
 metadata:
  name: userns
 spec:
  hostUsers: false
  containers:
  - name: shell
    command: ["sleep", "infinity"]
    image: debian