From a96eb1118fed5d29fd7d7d66e906c4cba0bb758e Mon Sep 17 00:00:00 2001 From: Manish Kumar Date: Mon, 11 Jul 2022 10:31:02 +0530 Subject: [PATCH] Convert CRLF to LF --- .../2020-05-21-wsl2-dockerdesktop-k8s.md | 1178 ++++++++--------- ...14-using-finalizers-to-control-deletion.md | 538 ++++---- .../docs/concepts/security/multi-tenancy.md | 542 ++++---- .../en/docs/reference/glossary/eviction.md | 36 +- .../administer-cluster/topology-manager.md | 540 ++++---- 5 files changed, 1417 insertions(+), 1417 deletions(-) diff --git a/content/en/blog/_posts/2020-05-21-wsl2-dockerdesktop-k8s.md b/content/en/blog/_posts/2020-05-21-wsl2-dockerdesktop-k8s.md index cfc0cc53fc..1166d8b766 100644 --- a/content/en/blog/_posts/2020-05-21-wsl2-dockerdesktop-k8s.md +++ b/content/en/blog/_posts/2020-05-21-wsl2-dockerdesktop-k8s.md @@ -1,589 +1,589 @@ ---- -layout: blog -title: "WSL+Docker: Kubernetes on the Windows Desktop" -date: 2020-05-21 -slug: wsl-docker-kubernetes-on-the-windows-desktop ---- - -**Authors**: [Nuno do Carmo](https://twitter.com/nunixtech) Docker Captain and WSL Corsair; [Ihor Dvoretskyi](https://twitter.com/idvoretskyi), Developer Advocate, Cloud Native Computing Foundation - -# Introduction - -New to Windows 10 and WSL2, or new to Docker and Kubernetes? Welcome to this blog post where we will install from scratch Kubernetes in Docker [KinD](https://kind.sigs.k8s.io/) and [Minikube](https://minikube.sigs.k8s.io/docs/). - - -# Why Kubernetes on Windows? - -For the last few years, Kubernetes became a de-facto standard platform for running containerized services and applications in distributed environments. While a wide variety of distributions and installers exist to deploy Kubernetes in the cloud environments (public, private or hybrid), or within the bare metal environments, there is still a need to deploy and run Kubernetes locally, for example, on the developer's workstation. - -Kubernetes has been originally designed to be deployed and used in the Linux environments. However, a good number of users (and not only application developers) use Windows OS as their daily driver. When Microsoft revealed WSL - [the Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/), the line between Windows and Linux environments became even less visible. - - -Also, WSL brought an ability to run Kubernetes on Windows almost seamlessly! - - -Below, we will cover in brief how to install and use various solutions to run Kubernetes locally. - -# Prerequisites - -Since we will explain how to install KinD, we won't go into too much detail around the installation of KinD's dependencies. - -However, here is the list of the prerequisites needed and their version/lane: - -- OS: Windows 10 version 2004, Build 19041 -- [WSL2 enabled](https://docs.microsoft.com/en-us/windows/wsl/wsl2-install) - - In order to install the distros as WSL2 by default, once WSL2 installed, run the command `wsl.exe --set-default-version 2` in Powershell -- WSL2 distro installed from the Windows Store - the distro used is Ubuntu-18.04 -- [Docker Desktop for Windows](https://hub.docker.com/editions/community/docker-ce-desktop-windows), stable channel - the version used is 2.2.0.4 -- [Optional] Microsoft Terminal installed from the Windows Store - - Open the Windows store and type "Terminal" in the search, it will be (normally) the first option - -![Windows Store Terminal](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-windows-store-terminal.png) - -And that's actually it. 
For Docker Desktop for Windows, no need to configure anything yet as we will explain it in the next section. - -# WSL2: First contact - -Once everything is installed, we can launch the WSL2 terminal from the Start menu, and type "Ubuntu" for searching the applications and documents: - -![Start Menu Search](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-start-menu-search.png) - -Once found, click on the name and it will launch the default Windows console with the Ubuntu bash shell running. - -Like for any normal Linux distro, you need to create a user and set a password: - -![User-Password](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-user-password.png) - -## [Optional] Update the `sudoers` - -As we are working, normally, on our local computer, it might be nice to update the `sudoers` and set the group `%sudo` to be password-less: - -```bash -# Edit the sudoers with the visudo command -sudo visudo - -# Change the %sudo group to be password-less -%sudo ALL=(ALL:ALL) NOPASSWD: ALL - -# Press CTRL+X to exit -# Press Y to save -# Press Enter to confirm -``` - -![visudo](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-visudo.png) - -## Update Ubuntu - -Before we move to the Docker Desktop settings, let's update our system and ensure we start in the best conditions: - -```bash -# Update the repositories and list of the packages available -sudo apt update -# Update the system based on the packages installed > the "-y" will approve the change automatically -sudo apt upgrade -y -``` - -![apt-update-upgrade](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-apt-update-upgrade.png) - -# Docker Desktop: faster with WSL2 - -Before we move into the settings, let's do a small test, it will display really how cool the new integration with Docker Desktop is: - -```bash -# Try to see if the docker cli and daemon are installed -docker version -# Same for kubectl -kubectl version -``` - -![kubectl-error](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-kubectl-error.png) - -You got an error? Perfect! It's actually good news, so let's now move on to the settings. - -## Docker Desktop settings: enable WSL2 integration - -First let's start Docker Desktop for Windows if it's not still the case. Open the Windows start menu and type "docker", click on the name to start the application: - -![docker-start](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-start.png) - -You should now see the Docker icon with the other taskbar icons near the clock: - -![docker-taskbar](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-taskbar.png) - -Now click on the Docker icon and choose settings. A new window will appear: - -![docker-settings-general](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-settings-general.png) - -By default, the WSL2 integration is not active, so click the "Enable the experimental WSL 2 based engine" and click "Apply & Restart": - -![docker-settings-wsl2](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-settings-wsl2-activated.png) - -What this feature did behind the scenes was to create two new distros in WSL2, containing and running all the needed backend sockets, daemons and also the CLI tools (read: docker and kubectl command). - -Still, this first setting is still not enough to run the commands inside our distro. If we try, we will have the same error as before. 
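
To see concretely what is missing at this stage, here is a small check you can run inside the distro (a sketch, assuming a default Ubuntu installation):

```bash
# The CLI may or may not be on the PATH yet, and the daemon socket is not
# shared with this distro until the integration is enabled below
which docker || echo "docker CLI not found in this distro"
docker version > /dev/null 2>&1 || echo "docker daemon not reachable from this distro yet"
test -S /var/run/docker.sock || echo "no /var/run/docker.sock here yet"
```
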
- -In order to fix it, and finally be able to use the commands, we need to tell the Docker Desktop to "attach" itself to our distro also: - -![docker-resources-wsl](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-resources-wsl-integration.png) - -Let's now switch back to our WSL2 terminal and see if we can (finally) launch the commands: - -```bash -# Try to see if the docker cli and daemon are installed -docker version -# Same for kubectl -kubectl version -``` - -![docker-kubectl-success](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-kubectl-success.png) - -> Tip: if nothing happens, restart Docker Desktop and restart the WSL process in Powershell: `Restart-Service LxssManager` and launch a new Ubuntu session - -And success! The basic settings are now done and we move to the installation of KinD. - -# KinD: Kubernetes made easy in a container - -Right now, we have Docker that is installed, configured and the last test worked fine. - -However, if we look carefully at the `kubectl` command, it found the "Client Version" (1.15.5), but it didn't find any server. - -This is normal as we didn't enable the Docker Kubernetes cluster. So let's install KinD and create our first cluster. - -And as sources are always important to mention, we will follow (partially) the how-to on the [official KinD website](https://kind.sigs.k8s.io/docs/user/quick-start/): - -```bash -# Download the latest version of KinD -curl -Lo ./kind https://github.com/kubernetes-sigs/kind/releases/download/v0.7.0/kind-linux-amd64 -# Make the binary executable -chmod +x ./kind -# Move the binary to your executable path -sudo mv ./kind /usr/local/bin/ -``` - -![kind-install](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-install.png) - -## KinD: the first cluster - -We are ready to create our first cluster: - -```bash -# Check if the KUBECONFIG is not set -echo $KUBECONFIG -# Check if the .kube directory is created > if not, no need to create it -ls $HOME/.kube -# Create the cluster and give it a name (optional) -kind create cluster --name wslkind -# Check if the .kube has been created and populated with files -ls $HOME/.kube -``` - -![kind-cluster-create](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-cluster-create.png) - -> Tip: as you can see, the Terminal was changed so the nice icons are all displayed - -The cluster has been successfully created, and because we are using Docker Desktop, the network is all set for us to use "as is". - -So we can open the `Kubernetes master` URL in our Windows browser: - -![kind-browser-k8s-master](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-browse-k8s-master.png) - -And this is the real strength from Docker Desktop for Windows with the WSL2 backend. Docker really did an amazing integration. 
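
Before moving on, a couple of optional sanity checks are worth running from the WSL2 terminal (the context name should normally be the cluster name prefixed with `kind-`):

```bash
# List the KinD clusters known to the kind CLI
kind get clusters
# Confirm kubectl points at the new cluster and can reach its API server
kubectl config current-context
kubectl cluster-info --context kind-wslkind
```
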
- -## KinD: counting 1 - 2 - 3 - -Our first cluster was created and it's the "normal" one node cluster: - -```bash -# Check how many nodes it created -kubectl get nodes -# Check the services for the whole cluster -kubectl get all --all-namespaces -``` - -![kind-list-nodes-services](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-list-nodes-services.png) - -While this will be enough for most people, let's leverage one of the coolest feature, multi-node clustering: - - -```bash -# Delete the existing cluster -kind delete cluster --name wslkind -# Create a config file for a 3 nodes cluster -cat << EOF > kind-3nodes.yaml -kind: Cluster -apiVersion: kind.x-k8s.io/v1alpha4 -nodes: - - role: control-plane - - role: worker - - role: worker -EOF -# Create a new cluster with the config file -kind create cluster --name wslkindmultinodes --config ./kind-3nodes.yaml -# Check how many nodes it created -kubectl get nodes -``` - -![kind-cluster-create-multinodes](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-cluster-create-multinodes.png) - -> Tip: depending on how fast we run the "get nodes" command, it can be that not all the nodes are ready, wait few seconds and run it again, everything should be ready - -And that's it, we have created a three-node cluster, and if we look at the services one more time, we will see several that have now three replicas: - - -```bash -# Check the services for the whole cluster -kubectl get all --all-namespaces -``` - -![wsl2-kind-list-services-multinodes](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-list-services-multinodes.png) - -## KinD: can I see a nice dashboard? - -Working on the command line is always good and very insightful. However, when dealing with Kubernetes we might want, at some point, to have a visual overview. - -For that, the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard) project has been created. The installation and first connection test is quite fast, so let's do it: - -```bash -# Install the Dashboard application into our cluster -kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-rc6/aio/deploy/recommended.yaml -# Check the resources it created based on the new namespace created -kubectl get all -n kubernetes-dashboard -``` - -![kind-install-dashboard](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-install-dashboard.png) - -As it created a service with a ClusterIP (read: internal network address), we cannot reach it if we type the URL in our Windows browser: - -![kind-browse-dashboard-error](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-browse-dashboard-error.png) - -That's because we need to create a temporary proxy: - - -```bash -# Start a kubectl proxy -kubectl proxy -# Enter the URL on your browser: http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ -``` - -![kind-browse-dashboard-success](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-browse-dashboard-success.png) - -Finally to login, we can either enter a Token, which we didn't create, or enter the `kubeconfig` file from our Cluster. - -If we try to login with the `kubeconfig`, we will get the error "Internal error (500): Not enough data to create auth info structure". This is due to the lack of credentials in the `kubeconfig` file. - -So to avoid you ending with the same error, let's follow the [recommended RBAC approach](https://github.com/kubernetes/dashboard/blob/master/docs/user/access-control/creating-sample-user.md). 
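
Here is a minimal sketch of what that setup can look like, closely following the Dashboard project's sample-user guide for the v2.0.0-rc6 era; the account name is illustrative, and on newer Kubernetes versions the token is no longer stored in an auto-created Secret, so the final lookup applies to the versions used in this post:

```bash
# Sketch: create a ServiceAccount, bind it to cluster-admin, then print its token
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: kubernetes-dashboard
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: admin-user
    namespace: kubernetes-dashboard
EOF

# Look up the ServiceAccount token and paste it into the Dashboard login screen
kubectl -n kubernetes-dashboard get secret \
  "$(kubectl -n kubernetes-dashboard get serviceaccount admin-user -o jsonpath='{.secrets[0].name}')" \
  -o jsonpath='{.data.token}' | base64 --decode
```
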
- -Let's open a new WSL2 session: - -```bash -# Create a new ServiceAccount -kubectl apply -f - < Tip: as you can see, the Terminal was changed so the nice icons are all displayed - -So let's fix the issue by installing the missing package: - -```bash -# Install the conntrack package -sudo apt install -y conntrack -``` - -![minikube-install-conntrack](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-install-conntrack.png) - -Let's try to launch it again: - -```bash -# Create a minikube one node cluster -minikube start --driver=none -# We got a permissions error > try again with sudo -sudo minikube start --driver=none -``` - -![minikube-start-error-systemd](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-start-error-systemd.png) - -Ok, this error cloud be problematic ... in the past. Luckily for us, there's a solution - -## Minikube: enabling SystemD - -In order to enable SystemD on WSL2, we will apply the [scripts](https://forum.snapcraft.io/t/running-snaps-on-wsl2-insiders-only-for-now/13033) from [Daniel Llewellyn](https://twitter.com/diddledan). - -I invite you to read the full blog post and how he came to the solution, and the various iterations he did to fix several issues. - -So in a nutshell, here are the commands: - -```bash -# Install the needed packages -sudo apt install -yqq daemonize dbus-user-session fontconfig -``` - -![minikube-systemd-packages](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-systemd-packages.png) - -```bash -# Create the start-systemd-namespace script -sudo vi /usr/sbin/start-systemd-namespace -#!/bin/bash - -SYSTEMD_PID=$(ps -ef | grep '/lib/systemd/systemd --system-unit=basic.target$' | grep -v unshare | awk '{print $2}') -if [ -z "$SYSTEMD_PID" ] || [ "$SYSTEMD_PID" != "1" ]; then - export PRE_NAMESPACE_PATH="$PATH" - (set -o posix; set) | \ - grep -v "^BASH" | \ - grep -v "^DIRSTACK=" | \ - grep -v "^EUID=" | \ - grep -v "^GROUPS=" | \ - grep -v "^HOME=" | \ - grep -v "^HOSTNAME=" | \ - grep -v "^HOSTTYPE=" | \ - grep -v "^IFS='.*"$'\n'"'" | \ - grep -v "^LANG=" | \ - grep -v "^LOGNAME=" | \ - grep -v "^MACHTYPE=" | \ - grep -v "^NAME=" | \ - grep -v "^OPTERR=" | \ - grep -v "^OPTIND=" | \ - grep -v "^OSTYPE=" | \ - grep -v "^PIPESTATUS=" | \ - grep -v "^POSIXLY_CORRECT=" | \ - grep -v "^PPID=" | \ - grep -v "^PS1=" | \ - grep -v "^PS4=" | \ - grep -v "^SHELL=" | \ - grep -v "^SHELLOPTS=" | \ - grep -v "^SHLVL=" | \ - grep -v "^SYSTEMD_PID=" | \ - grep -v "^UID=" | \ - grep -v "^USER=" | \ - grep -v "^_=" | \ - cat - > "$HOME/.systemd-env" - echo "PATH='$PATH'" >> "$HOME/.systemd-env" - exec sudo /usr/sbin/enter-systemd-namespace "$BASH_EXECUTION_STRING" -fi -if [ -n "$PRE_NAMESPACE_PATH" ]; then - export PATH="$PRE_NAMESPACE_PATH" -fi -``` - -```bash -# Create the enter-systemd-namespace -sudo vi /usr/sbin/enter-systemd-namespace -#!/bin/bash - -if [ "$UID" != 0 ]; then - echo "You need to run $0 through sudo" - exit 1 -fi - -SYSTEMD_PID="$(ps -ef | grep '/lib/systemd/systemd --system-unit=basic.target$' | grep -v unshare | awk '{print $2}')" -if [ -z "$SYSTEMD_PID" ]; then - /usr/sbin/daemonize /usr/bin/unshare --fork --pid --mount-proc /lib/systemd/systemd --system-unit=basic.target - while [ -z "$SYSTEMD_PID" ]; do - SYSTEMD_PID="$(ps -ef | grep '/lib/systemd/systemd --system-unit=basic.target$' | grep -v unshare | awk '{print $2}')" - done -fi - -if [ -n "$SYSTEMD_PID" ] && [ "$SYSTEMD_PID" != "1" ]; then - if [ -n "$1" ] && [ "$1" != "bash --login" ] && [ "$1" != "/bin/bash --login" ]; then - exec 
/usr/bin/nsenter -t "$SYSTEMD_PID" -a \ - /usr/bin/sudo -H -u "$SUDO_USER" \ - /bin/bash -c 'set -a; source "$HOME/.systemd-env"; set +a; exec bash -c '"$(printf "%q" "$@")" - else - exec /usr/bin/nsenter -t "$SYSTEMD_PID" -a \ - /bin/login -p -f "$SUDO_USER" \ - $(/bin/cat "$HOME/.systemd-env" | grep -v "^PATH=") - fi - echo "Existential crisis" -fi -``` - -```bash -# Edit the permissions of the enter-systemd-namespace script -sudo chmod +x /usr/sbin/enter-systemd-namespace -# Edit the bash.bashrc file -sudo sed -i 2a"# Start or enter a PID namespace in WSL2\nsource /usr/sbin/start-systemd-namespace\n" /etc/bash.bashrc -``` - -![minikube-systemd-files](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-systemd-files.png) - -Finally, exit and launch a new session. You **do not** need to stop WSL2, a new session is enough: - -![minikube-systemd-enabled](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-systemd-enabled.png) - -## Minikube: the first cluster - -We are ready to create our first cluster: - -```bash -# Check if the KUBECONFIG is not set -echo $KUBECONFIG -# Check if the .kube directory is created > if not, no need to create it -ls $HOME/.kube -# Check if the .minikube directory is created > if yes, delete it -ls $HOME/.minikube -# Create the cluster with sudo -sudo minikube start --driver=none -``` - -In order to be able to use `kubectl` with our user, and not `sudo`, Minikube recommends running the `chown` command: - -```bash -# Change the owner of the .kube and .minikube directories -sudo chown -R $USER $HOME/.kube $HOME/.minikube -# Check the access and if the cluster is running -kubectl cluster-info -# Check the resources created -kubectl get all --all-namespaces -``` - -![minikube-start-fixed](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-start-fixed.png) - -The cluster has been successfully created, and Minikube used the WSL2 IP, which is great for several reasons, and one of them is that we can open the `Kubernetes master` URL in our Windows browser: - -![minikube-browse-k8s-master](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-browse-k8s-master.png) - -And the real strength of WSL2 integration, the port `8443` once open on WSL2 distro, it actually forwards it to Windows, so instead of the need to remind the IP address, we can also reach the `Kubernetes master` URL via `localhost`: - -![minikube-browse-k8s-master-localhost](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-browse-k8s-master-localhost.png) - -## Minikube: can I see a nice dashboard? - -Working on the command line is always good and very insightful. However, when dealing with Kubernetes we might want, at some point, to have a visual overview. - -For that, Minikube embeded the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard). Thanks to it, running and accessing the Dashboard is very simple: - -```bash -# Enable the Dashboard service -sudo minikube dashboard -# Access the Dashboard from a browser on Windows side -``` - -![minikube-browse-dashboard](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-browse-dashboard.png) - -The command creates also a proxy, which means that once we end the command, by pressing `CTRL+C`, the Dashboard will no more be accessible. 
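
If you would rather keep the same terminal usable, a hedged alternative (assuming a reasonably recent Minikube and the password-less `sudo` configured earlier) is to only print the URL and keep the proxy running in the background:

```bash
# Print the Dashboard URL without opening a browser and background the proxy;
# bring it back with "fg" and stop it with CTRL+C when you are done
sudo minikube dashboard --url &
```
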
- -Still, if we look at the namespace `kubernetes-dashboard`, we will see that the service is still created: - -```bash -# Get all the services from the dashboard namespace -kubectl get all --namespace kubernetes-dashboard -``` - -![minikube-dashboard-get-all](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-dashboard-get-all.png) - -Let's edit the service and change it's type to `LoadBalancer`: - -```bash -# Edit the Dashoard service -kubectl edit service/kubernetes-dashboard --namespace kubernetes-dashboard -# Go to the very end and remove the last 2 lines -status: - loadBalancer: {} -# Change the type from ClusterIO to LoadBalancer - type: LoadBalancer -# Save the file -``` - -![minikube-dashboard-type-loadbalancer](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-dashboard-type-loadbalancer.png) - -Check again the Dashboard service and let's access the Dashboard via the LoadBalancer: - -```bash -# Get all the services from the dashboard namespace -kubectl get all --namespace kubernetes-dashboard -# Access the Dashboard from a browser on Windows side with the URL: localhost: -``` - -![minikube-browse-dashboard-loadbalancer](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-browse-dashboard-loadbalancer.png) - -# Conclusion - -It's clear that we are far from done as we could have some LoadBalancing implemented and/or other services (storage, ingress, registry, etc...). - -Concerning Minikube on WSL2, as it needed to enable SystemD, we can consider it as an intermediate level to be implemented. - -So with two solutions, what could be the "best for you"? Both bring their own advantages and inconveniences, so here an overview from our point of view solely: - -| Criteria | KinD | Minikube | -| -------------------- | ----------------------------- | -------- | -| Installation on WSL2 | Very Easy | Medium | -| Multi-node | Yes | No | -| Plugins | Manual install | Yes | -| Persistence | Yes, however not designed for | Yes | -| Alternatives | K3d | Microk8s | - -We hope you could have a real taste of the integration between the different components: WSL2 - Docker Desktop - KinD/Minikube. And that gave you some ideas or, even better, some answers to your Kubernetes workflows with KinD and/or Minikube on Windows and WSL2. - -See you soon for other adventures in the Kubernetes ocean. - -[Nuno](https://twitter.com/nunixtech) & [Ihor](https://twitter.com/idvoretskyi) +--- +layout: blog +title: "WSL+Docker: Kubernetes on the Windows Desktop" +date: 2020-05-21 +slug: wsl-docker-kubernetes-on-the-windows-desktop +--- + +**Authors**: [Nuno do Carmo](https://twitter.com/nunixtech) Docker Captain and WSL Corsair; [Ihor Dvoretskyi](https://twitter.com/idvoretskyi), Developer Advocate, Cloud Native Computing Foundation + +# Introduction + +New to Windows 10 and WSL2, or new to Docker and Kubernetes? Welcome to this blog post where we will install from scratch Kubernetes in Docker [KinD](https://kind.sigs.k8s.io/) and [Minikube](https://minikube.sigs.k8s.io/docs/). + + +# Why Kubernetes on Windows? + +For the last few years, Kubernetes became a de-facto standard platform for running containerized services and applications in distributed environments. While a wide variety of distributions and installers exist to deploy Kubernetes in the cloud environments (public, private or hybrid), or within the bare metal environments, there is still a need to deploy and run Kubernetes locally, for example, on the developer's workstation. 
+ +Kubernetes has been originally designed to be deployed and used in the Linux environments. However, a good number of users (and not only application developers) use Windows OS as their daily driver. When Microsoft revealed WSL - [the Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/), the line between Windows and Linux environments became even less visible. + + +Also, WSL brought an ability to run Kubernetes on Windows almost seamlessly! + + +Below, we will cover in brief how to install and use various solutions to run Kubernetes locally. + +# Prerequisites + +Since we will explain how to install KinD, we won't go into too much detail around the installation of KinD's dependencies. + +However, here is the list of the prerequisites needed and their version/lane: + +- OS: Windows 10 version 2004, Build 19041 +- [WSL2 enabled](https://docs.microsoft.com/en-us/windows/wsl/wsl2-install) + - In order to install the distros as WSL2 by default, once WSL2 installed, run the command `wsl.exe --set-default-version 2` in Powershell +- WSL2 distro installed from the Windows Store - the distro used is Ubuntu-18.04 +- [Docker Desktop for Windows](https://hub.docker.com/editions/community/docker-ce-desktop-windows), stable channel - the version used is 2.2.0.4 +- [Optional] Microsoft Terminal installed from the Windows Store + - Open the Windows store and type "Terminal" in the search, it will be (normally) the first option + +![Windows Store Terminal](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-windows-store-terminal.png) + +And that's actually it. For Docker Desktop for Windows, no need to configure anything yet as we will explain it in the next section. + +# WSL2: First contact + +Once everything is installed, we can launch the WSL2 terminal from the Start menu, and type "Ubuntu" for searching the applications and documents: + +![Start Menu Search](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-start-menu-search.png) + +Once found, click on the name and it will launch the default Windows console with the Ubuntu bash shell running. 
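
Once the shell is up (after the first-run user setup described next), a quick sanity check confirms the distro really runs on the WSL2 kernel; the release string normally contains `microsoft-standard`:

```bash
# WSL2 distros run on the real Linux kernel shipped by Microsoft
uname -r
```
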
+ +Like for any normal Linux distro, you need to create a user and set a password: + +![User-Password](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-user-password.png) + +## [Optional] Update the `sudoers` + +As we are working, normally, on our local computer, it might be nice to update the `sudoers` and set the group `%sudo` to be password-less: + +```bash +# Edit the sudoers with the visudo command +sudo visudo + +# Change the %sudo group to be password-less +%sudo ALL=(ALL:ALL) NOPASSWD: ALL + +# Press CTRL+X to exit +# Press Y to save +# Press Enter to confirm +``` + +![visudo](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-visudo.png) + +## Update Ubuntu + +Before we move to the Docker Desktop settings, let's update our system and ensure we start in the best conditions: + +```bash +# Update the repositories and list of the packages available +sudo apt update +# Update the system based on the packages installed > the "-y" will approve the change automatically +sudo apt upgrade -y +``` + +![apt-update-upgrade](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-apt-update-upgrade.png) + +# Docker Desktop: faster with WSL2 + +Before we move into the settings, let's do a small test, it will display really how cool the new integration with Docker Desktop is: + +```bash +# Try to see if the docker cli and daemon are installed +docker version +# Same for kubectl +kubectl version +``` + +![kubectl-error](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-kubectl-error.png) + +You got an error? Perfect! It's actually good news, so let's now move on to the settings. + +## Docker Desktop settings: enable WSL2 integration + +First let's start Docker Desktop for Windows if it's not still the case. Open the Windows start menu and type "docker", click on the name to start the application: + +![docker-start](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-start.png) + +You should now see the Docker icon with the other taskbar icons near the clock: + +![docker-taskbar](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-taskbar.png) + +Now click on the Docker icon and choose settings. A new window will appear: + +![docker-settings-general](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-settings-general.png) + +By default, the WSL2 integration is not active, so click the "Enable the experimental WSL 2 based engine" and click "Apply & Restart": + +![docker-settings-wsl2](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-settings-wsl2-activated.png) + +What this feature did behind the scenes was to create two new distros in WSL2, containing and running all the needed backend sockets, daemons and also the CLI tools (read: docker and kubectl command). + +Still, this first setting is still not enough to run the commands inside our distro. If we try, we will have the same error as before. 
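
Those two new distros are easy to verify without leaving bash, assuming Windows interop is enabled (it is by default); `docker-desktop` and `docker-desktop-data` should show up alongside Ubuntu:

```bash
# wsl.exe can be called from inside the distro thanks to Windows interop;
# its output is UTF-16, so strip the NUL bytes to keep it readable
wsl.exe --list --verbose | tr -d '\0'
```
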
+ +In order to fix it, and finally be able to use the commands, we need to tell the Docker Desktop to "attach" itself to our distro also: + +![docker-resources-wsl](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-resources-wsl-integration.png) + +Let's now switch back to our WSL2 terminal and see if we can (finally) launch the commands: + +```bash +# Try to see if the docker cli and daemon are installed +docker version +# Same for kubectl +kubectl version +``` + +![docker-kubectl-success](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-docker-kubectl-success.png) + +> Tip: if nothing happens, restart Docker Desktop and restart the WSL process in Powershell: `Restart-Service LxssManager` and launch a new Ubuntu session + +And success! The basic settings are now done and we move to the installation of KinD. + +# KinD: Kubernetes made easy in a container + +Right now, we have Docker that is installed, configured and the last test worked fine. + +However, if we look carefully at the `kubectl` command, it found the "Client Version" (1.15.5), but it didn't find any server. + +This is normal as we didn't enable the Docker Kubernetes cluster. So let's install KinD and create our first cluster. + +And as sources are always important to mention, we will follow (partially) the how-to on the [official KinD website](https://kind.sigs.k8s.io/docs/user/quick-start/): + +```bash +# Download the latest version of KinD +curl -Lo ./kind https://github.com/kubernetes-sigs/kind/releases/download/v0.7.0/kind-linux-amd64 +# Make the binary executable +chmod +x ./kind +# Move the binary to your executable path +sudo mv ./kind /usr/local/bin/ +``` + +![kind-install](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-install.png) + +## KinD: the first cluster + +We are ready to create our first cluster: + +```bash +# Check if the KUBECONFIG is not set +echo $KUBECONFIG +# Check if the .kube directory is created > if not, no need to create it +ls $HOME/.kube +# Create the cluster and give it a name (optional) +kind create cluster --name wslkind +# Check if the .kube has been created and populated with files +ls $HOME/.kube +``` + +![kind-cluster-create](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-cluster-create.png) + +> Tip: as you can see, the Terminal was changed so the nice icons are all displayed + +The cluster has been successfully created, and because we are using Docker Desktop, the network is all set for us to use "as is". + +So we can open the `Kubernetes master` URL in our Windows browser: + +![kind-browser-k8s-master](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-browse-k8s-master.png) + +And this is the real strength from Docker Desktop for Windows with the WSL2 backend. Docker really did an amazing integration. 
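
Because KinD runs every node as a plain container, the new cluster is also visible from the Docker side; with the single-node cluster created above you should see one container, typically named `wslkind-control-plane`:

```bash
# List the containers backing the KinD cluster named "wslkind"
docker ps --filter "name=wslkind"
```
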
+ +## KinD: counting 1 - 2 - 3 + +Our first cluster was created and it's the "normal" one node cluster: + +```bash +# Check how many nodes it created +kubectl get nodes +# Check the services for the whole cluster +kubectl get all --all-namespaces +``` + +![kind-list-nodes-services](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-list-nodes-services.png) + +While this will be enough for most people, let's leverage one of the coolest feature, multi-node clustering: + + +```bash +# Delete the existing cluster +kind delete cluster --name wslkind +# Create a config file for a 3 nodes cluster +cat << EOF > kind-3nodes.yaml +kind: Cluster +apiVersion: kind.x-k8s.io/v1alpha4 +nodes: + - role: control-plane + - role: worker + - role: worker +EOF +# Create a new cluster with the config file +kind create cluster --name wslkindmultinodes --config ./kind-3nodes.yaml +# Check how many nodes it created +kubectl get nodes +``` + +![kind-cluster-create-multinodes](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-cluster-create-multinodes.png) + +> Tip: depending on how fast we run the "get nodes" command, it can be that not all the nodes are ready, wait few seconds and run it again, everything should be ready + +And that's it, we have created a three-node cluster, and if we look at the services one more time, we will see several that have now three replicas: + + +```bash +# Check the services for the whole cluster +kubectl get all --all-namespaces +``` + +![wsl2-kind-list-services-multinodes](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-list-services-multinodes.png) + +## KinD: can I see a nice dashboard? + +Working on the command line is always good and very insightful. However, when dealing with Kubernetes we might want, at some point, to have a visual overview. + +For that, the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard) project has been created. The installation and first connection test is quite fast, so let's do it: + +```bash +# Install the Dashboard application into our cluster +kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-rc6/aio/deploy/recommended.yaml +# Check the resources it created based on the new namespace created +kubectl get all -n kubernetes-dashboard +``` + +![kind-install-dashboard](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-install-dashboard.png) + +As it created a service with a ClusterIP (read: internal network address), we cannot reach it if we type the URL in our Windows browser: + +![kind-browse-dashboard-error](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-browse-dashboard-error.png) + +That's because we need to create a temporary proxy: + + +```bash +# Start a kubectl proxy +kubectl proxy +# Enter the URL on your browser: http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ +``` + +![kind-browse-dashboard-success](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-kind-browse-dashboard-success.png) + +Finally to login, we can either enter a Token, which we didn't create, or enter the `kubeconfig` file from our Cluster. + +If we try to login with the `kubeconfig`, we will get the error "Internal error (500): Not enough data to create auth info structure". This is due to the lack of credentials in the `kubeconfig` file. + +So to avoid you ending with the same error, let's follow the [recommended RBAC approach](https://github.com/kubernetes/dashboard/blob/master/docs/user/access-control/creating-sample-user.md). 
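
As an aside, `kubectl proxy` is not the only way in: the Dashboard Service can also be reached with a port-forward (a sketch; the recommended manifest exposes the Service on port 443 with a self-signed certificate, so the browser will warn about it):

```bash
# Forward the Dashboard service locally, then browse https://localhost:8443
kubectl -n kubernetes-dashboard port-forward service/kubernetes-dashboard 8443:443
```
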
+ +Let's open a new WSL2 session: + +```bash +# Create a new ServiceAccount +kubectl apply -f - < Tip: as you can see, the Terminal was changed so the nice icons are all displayed + +So let's fix the issue by installing the missing package: + +```bash +# Install the conntrack package +sudo apt install -y conntrack +``` + +![minikube-install-conntrack](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-install-conntrack.png) + +Let's try to launch it again: + +```bash +# Create a minikube one node cluster +minikube start --driver=none +# We got a permissions error > try again with sudo +sudo minikube start --driver=none +``` + +![minikube-start-error-systemd](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-start-error-systemd.png) + +Ok, this error cloud be problematic ... in the past. Luckily for us, there's a solution + +## Minikube: enabling SystemD + +In order to enable SystemD on WSL2, we will apply the [scripts](https://forum.snapcraft.io/t/running-snaps-on-wsl2-insiders-only-for-now/13033) from [Daniel Llewellyn](https://twitter.com/diddledan). + +I invite you to read the full blog post and how he came to the solution, and the various iterations he did to fix several issues. + +So in a nutshell, here are the commands: + +```bash +# Install the needed packages +sudo apt install -yqq daemonize dbus-user-session fontconfig +``` + +![minikube-systemd-packages](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-systemd-packages.png) + +```bash +# Create the start-systemd-namespace script +sudo vi /usr/sbin/start-systemd-namespace +#!/bin/bash + +SYSTEMD_PID=$(ps -ef | grep '/lib/systemd/systemd --system-unit=basic.target$' | grep -v unshare | awk '{print $2}') +if [ -z "$SYSTEMD_PID" ] || [ "$SYSTEMD_PID" != "1" ]; then + export PRE_NAMESPACE_PATH="$PATH" + (set -o posix; set) | \ + grep -v "^BASH" | \ + grep -v "^DIRSTACK=" | \ + grep -v "^EUID=" | \ + grep -v "^GROUPS=" | \ + grep -v "^HOME=" | \ + grep -v "^HOSTNAME=" | \ + grep -v "^HOSTTYPE=" | \ + grep -v "^IFS='.*"$'\n'"'" | \ + grep -v "^LANG=" | \ + grep -v "^LOGNAME=" | \ + grep -v "^MACHTYPE=" | \ + grep -v "^NAME=" | \ + grep -v "^OPTERR=" | \ + grep -v "^OPTIND=" | \ + grep -v "^OSTYPE=" | \ + grep -v "^PIPESTATUS=" | \ + grep -v "^POSIXLY_CORRECT=" | \ + grep -v "^PPID=" | \ + grep -v "^PS1=" | \ + grep -v "^PS4=" | \ + grep -v "^SHELL=" | \ + grep -v "^SHELLOPTS=" | \ + grep -v "^SHLVL=" | \ + grep -v "^SYSTEMD_PID=" | \ + grep -v "^UID=" | \ + grep -v "^USER=" | \ + grep -v "^_=" | \ + cat - > "$HOME/.systemd-env" + echo "PATH='$PATH'" >> "$HOME/.systemd-env" + exec sudo /usr/sbin/enter-systemd-namespace "$BASH_EXECUTION_STRING" +fi +if [ -n "$PRE_NAMESPACE_PATH" ]; then + export PATH="$PRE_NAMESPACE_PATH" +fi +``` + +```bash +# Create the enter-systemd-namespace +sudo vi /usr/sbin/enter-systemd-namespace +#!/bin/bash + +if [ "$UID" != 0 ]; then + echo "You need to run $0 through sudo" + exit 1 +fi + +SYSTEMD_PID="$(ps -ef | grep '/lib/systemd/systemd --system-unit=basic.target$' | grep -v unshare | awk '{print $2}')" +if [ -z "$SYSTEMD_PID" ]; then + /usr/sbin/daemonize /usr/bin/unshare --fork --pid --mount-proc /lib/systemd/systemd --system-unit=basic.target + while [ -z "$SYSTEMD_PID" ]; do + SYSTEMD_PID="$(ps -ef | grep '/lib/systemd/systemd --system-unit=basic.target$' | grep -v unshare | awk '{print $2}')" + done +fi + +if [ -n "$SYSTEMD_PID" ] && [ "$SYSTEMD_PID" != "1" ]; then + if [ -n "$1" ] && [ "$1" != "bash --login" ] && [ "$1" != "/bin/bash --login" ]; then + exec 
/usr/bin/nsenter -t "$SYSTEMD_PID" -a \ + /usr/bin/sudo -H -u "$SUDO_USER" \ + /bin/bash -c 'set -a; source "$HOME/.systemd-env"; set +a; exec bash -c '"$(printf "%q" "$@")" + else + exec /usr/bin/nsenter -t "$SYSTEMD_PID" -a \ + /bin/login -p -f "$SUDO_USER" \ + $(/bin/cat "$HOME/.systemd-env" | grep -v "^PATH=") + fi + echo "Existential crisis" +fi +``` + +```bash +# Edit the permissions of the enter-systemd-namespace script +sudo chmod +x /usr/sbin/enter-systemd-namespace +# Edit the bash.bashrc file +sudo sed -i 2a"# Start or enter a PID namespace in WSL2\nsource /usr/sbin/start-systemd-namespace\n" /etc/bash.bashrc +``` + +![minikube-systemd-files](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-systemd-files.png) + +Finally, exit and launch a new session. You **do not** need to stop WSL2, a new session is enough: + +![minikube-systemd-enabled](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-systemd-enabled.png) + +## Minikube: the first cluster + +We are ready to create our first cluster: + +```bash +# Check if the KUBECONFIG is not set +echo $KUBECONFIG +# Check if the .kube directory is created > if not, no need to create it +ls $HOME/.kube +# Check if the .minikube directory is created > if yes, delete it +ls $HOME/.minikube +# Create the cluster with sudo +sudo minikube start --driver=none +``` + +In order to be able to use `kubectl` with our user, and not `sudo`, Minikube recommends running the `chown` command: + +```bash +# Change the owner of the .kube and .minikube directories +sudo chown -R $USER $HOME/.kube $HOME/.minikube +# Check the access and if the cluster is running +kubectl cluster-info +# Check the resources created +kubectl get all --all-namespaces +``` + +![minikube-start-fixed](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-start-fixed.png) + +The cluster has been successfully created, and Minikube used the WSL2 IP, which is great for several reasons, and one of them is that we can open the `Kubernetes master` URL in our Windows browser: + +![minikube-browse-k8s-master](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-browse-k8s-master.png) + +And the real strength of WSL2 integration, the port `8443` once open on WSL2 distro, it actually forwards it to Windows, so instead of the need to remind the IP address, we can also reach the `Kubernetes master` URL via `localhost`: + +![minikube-browse-k8s-master-localhost](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-browse-k8s-master-localhost.png) + +## Minikube: can I see a nice dashboard? + +Working on the command line is always good and very insightful. However, when dealing with Kubernetes we might want, at some point, to have a visual overview. + +For that, Minikube embeded the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard). Thanks to it, running and accessing the Dashboard is very simple: + +```bash +# Enable the Dashboard service +sudo minikube dashboard +# Access the Dashboard from a browser on Windows side +``` + +![minikube-browse-dashboard](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-browse-dashboard.png) + +The command creates also a proxy, which means that once we end the command, by pressing `CTRL+C`, the Dashboard will no more be accessible. 
+ +Still, if we look at the namespace `kubernetes-dashboard`, we will see that the service is still created: + +```bash +# Get all the services from the dashboard namespace +kubectl get all --namespace kubernetes-dashboard +``` + +![minikube-dashboard-get-all](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-dashboard-get-all.png) + +Let's edit the service and change it's type to `LoadBalancer`: + +```bash +# Edit the Dashoard service +kubectl edit service/kubernetes-dashboard --namespace kubernetes-dashboard +# Go to the very end and remove the last 2 lines +status: + loadBalancer: {} +# Change the type from ClusterIO to LoadBalancer + type: LoadBalancer +# Save the file +``` + +![minikube-dashboard-type-loadbalancer](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-dashboard-type-loadbalancer.png) + +Check again the Dashboard service and let's access the Dashboard via the LoadBalancer: + +```bash +# Get all the services from the dashboard namespace +kubectl get all --namespace kubernetes-dashboard +# Access the Dashboard from a browser on Windows side with the URL: localhost: +``` + +![minikube-browse-dashboard-loadbalancer](/images/blog/2020-05-21-wsl2-dockerdesktop-k8s/wsl2-minikube-browse-dashboard-loadbalancer.png) + +# Conclusion + +It's clear that we are far from done as we could have some LoadBalancing implemented and/or other services (storage, ingress, registry, etc...). + +Concerning Minikube on WSL2, as it needed to enable SystemD, we can consider it as an intermediate level to be implemented. + +So with two solutions, what could be the "best for you"? Both bring their own advantages and inconveniences, so here an overview from our point of view solely: + +| Criteria | KinD | Minikube | +| -------------------- | ----------------------------- | -------- | +| Installation on WSL2 | Very Easy | Medium | +| Multi-node | Yes | No | +| Plugins | Manual install | Yes | +| Persistence | Yes, however not designed for | Yes | +| Alternatives | K3d | Microk8s | + +We hope you could have a real taste of the integration between the different components: WSL2 - Docker Desktop - KinD/Minikube. And that gave you some ideas or, even better, some answers to your Kubernetes workflows with KinD and/or Minikube on Windows and WSL2. + +See you soon for other adventures in the Kubernetes ocean. + +[Nuno](https://twitter.com/nunixtech) & [Ihor](https://twitter.com/idvoretskyi) diff --git a/content/en/blog/_posts/2021-05-14-using-finalizers-to-control-deletion.md b/content/en/blog/_posts/2021-05-14-using-finalizers-to-control-deletion.md index c868b1bd5c..dd5e14ee33 100644 --- a/content/en/blog/_posts/2021-05-14-using-finalizers-to-control-deletion.md +++ b/content/en/blog/_posts/2021-05-14-using-finalizers-to-control-deletion.md @@ -1,269 +1,269 @@ ---- -layout: blog -title: 'Using Finalizers to Control Deletion' -date: 2021-05-14 -slug: using-finalizers-to-control-deletion ---- - -**Authors:** Aaron Alpar (Kasten) - -Deleting objects in Kubernetes can be challenging. You may think you’ve deleted something, only to find it still persists. While issuing a `kubectl delete` command and hoping for the best might work for day-to-day operations, understanding how Kubernetes `delete` commands operate will help you understand why some objects linger after deletion. 
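
A quick way to see this in practice is to inspect an object's metadata while a deletion is pending; a hedged example, using the `mymap` ConfigMap that appears in the examples below. A non-empty `finalizers` list combined with a `deletionTimestamp` means the object is waiting for its finalizers to be cleared:

```bash
# Show the finalizers and the deletion timestamp (empty until a delete is issued)
kubectl get configmap mymap -o jsonpath='{.metadata.finalizers}{"\n"}{.metadata.deletionTimestamp}{"\n"}'
```
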
- -In this post, I’ll look at: - -- What properties of a resource govern deletion -- How finalizers and owner references impact object deletion -- How the propagation policy can be used to change the order of deletions -- How deletion works, with examples - -For simplicity, all examples will use ConfigMaps and basic shell commands to demonstrate the process. We’ll explore how the commands work and discuss repercussions and results from using them in practice. - -## The basic `delete` - -Kubernetes has several different commands you can use that allow you to create, read, update, and delete objects. For the purpose of this blog post, we’ll focus on four `kubectl` commands: `create`, `get`, `patch`, and `delete`. - -Here are examples of the basic `kubectl delete` command: - -``` -kubectl create configmap mymap -configmap/mymap created -``` - -``` -kubectl get configmap/mymap -NAME DATA AGE -mymap 0 12s -``` - -``` -kubectl delete configmap/mymap -configmap "mymap" deleted -``` - -``` -kubectl get configmap/mymap -Error from server (NotFound): configmaps "mymap" not found -``` - -Shell commands preceded by `$` are followed by their output. You can see that we begin with a `kubectl create configmap mymap`, which will create the empty configmap `mymap`. Next, we need to `get` the configmap to prove it exists. We can then delete that configmap. Attempting to `get` it again produces an HTTP 404 error, which means the configmap is not found. - -The state diagram for the basic `delete` command is very simple: - - -{{
}} - -Although this operation is straightforward, other factors may interfere with the deletion, including finalizers and owner references. - -## Understanding Finalizers - -When it comes to understanding resource deletion in Kubernetes, knowledge of how finalizers work is helpful and can help you understand why some objects don’t get deleted. - -Finalizers are keys on resources that signal pre-delete operations. They control the garbage collection on resources, and are designed to alert controllers what cleanup operations to perform prior to removing a resource. However, they don’t necessarily name code that should be executed; finalizers on resources are basically just lists of keys much like annotations. Like annotations, they can be manipulated. - -Some common finalizers you’ve likely encountered are: - -- `kubernetes.io/pv-protection` -- `kubernetes.io/pvc-protection` - -The finalizers above are used on volumes to prevent accidental deletion. Similarly, some finalizers can be used to prevent deletion of any resource but are not managed by any controller. - -Below with a custom configmap, which has no properties but contains a finalizer: - -``` -cat <}} - -So, if you attempt to delete an object that has a finalizer on it, it will remain in finalization until the controller has removed the finalizer keys or the finalizers are removed using Kubectl. Once that finalizer list is empty, the object can actually be reclaimed by Kubernetes and put into a queue to be deleted from the registry. - -## Owner References - -Owner references describe how groups of objects are related. They are properties on resources that specify the relationship to one another, so entire trees of resources can be deleted. - -Finalizer rules are processed when there are owner references. An owner reference consists of a name and a UID. Owner references link resources within the same namespace, and it also needs a UID for that reference to work. Pods typically have owner references to the owning replica set. So, when deployments or stateful sets are deleted, then the child replica sets and pods are deleted in the process. - -Here are some examples of owner references and how they work. In the first example, we create a parent object first, then the child. The result is a very simple configmap that contains an owner reference to its parent: - -``` -cat <}} +--- +layout: blog +title: 'Using Finalizers to Control Deletion' +date: 2021-05-14 +slug: using-finalizers-to-control-deletion +--- + +**Authors:** Aaron Alpar (Kasten) + +Deleting objects in Kubernetes can be challenging. You may think you’ve deleted something, only to find it still persists. While issuing a `kubectl delete` command and hoping for the best might work for day-to-day operations, understanding how Kubernetes `delete` commands operate will help you understand why some objects linger after deletion. + +In this post, I’ll look at: + +- What properties of a resource govern deletion +- How finalizers and owner references impact object deletion +- How the propagation policy can be used to change the order of deletions +- How deletion works, with examples + +For simplicity, all examples will use ConfigMaps and basic shell commands to demonstrate the process. We’ll explore how the commands work and discuss repercussions and results from using them in practice. + +## The basic `delete` + +Kubernetes has several different commands you can use that allow you to create, read, update, and delete objects. 
For the purpose of this blog post, we’ll focus on four `kubectl` commands: `create`, `get`, `patch`, and `delete`. + +Here are examples of the basic `kubectl delete` command: + +``` +kubectl create configmap mymap +configmap/mymap created +``` + +``` +kubectl get configmap/mymap +NAME DATA AGE +mymap 0 12s +``` + +``` +kubectl delete configmap/mymap +configmap "mymap" deleted +``` + +``` +kubectl get configmap/mymap +Error from server (NotFound): configmaps "mymap" not found +``` + +Shell commands preceded by `$` are followed by their output. You can see that we begin with a `kubectl create configmap mymap`, which will create the empty configmap `mymap`. Next, we need to `get` the configmap to prove it exists. We can then delete that configmap. Attempting to `get` it again produces an HTTP 404 error, which means the configmap is not found. + +The state diagram for the basic `delete` command is very simple: + + +{{
}} + +Although this operation is straightforward, other factors may interfere with the deletion, including finalizers and owner references. + +## Understanding Finalizers + +When it comes to understanding resource deletion in Kubernetes, knowledge of how finalizers work is helpful and can help you understand why some objects don’t get deleted. + +Finalizers are keys on resources that signal pre-delete operations. They control the garbage collection on resources, and are designed to alert controllers what cleanup operations to perform prior to removing a resource. However, they don’t necessarily name code that should be executed; finalizers on resources are basically just lists of keys much like annotations. Like annotations, they can be manipulated. + +Some common finalizers you’ve likely encountered are: + +- `kubernetes.io/pv-protection` +- `kubernetes.io/pvc-protection` + +The finalizers above are used on volumes to prevent accidental deletion. Similarly, some finalizers can be used to prevent deletion of any resource but are not managed by any controller. + +Below with a custom configmap, which has no properties but contains a finalizer: + +``` +cat <}} + +So, if you attempt to delete an object that has a finalizer on it, it will remain in finalization until the controller has removed the finalizer keys or the finalizers are removed using Kubectl. Once that finalizer list is empty, the object can actually be reclaimed by Kubernetes and put into a queue to be deleted from the registry. + +## Owner References + +Owner references describe how groups of objects are related. They are properties on resources that specify the relationship to one another, so entire trees of resources can be deleted. + +Finalizer rules are processed when there are owner references. An owner reference consists of a name and a UID. Owner references link resources within the same namespace, and it also needs a UID for that reference to work. Pods typically have owner references to the owning replica set. So, when deployments or stateful sets are deleted, then the child replica sets and pods are deleted in the process. + +Here are some examples of owner references and how they work. In the first example, we create a parent object first, then the child. The result is a very simple configmap that contains an owner reference to its parent: + +``` +cat <}} diff --git a/content/en/docs/concepts/security/multi-tenancy.md b/content/en/docs/concepts/security/multi-tenancy.md index a1c70fe370..68fa631185 100755 --- a/content/en/docs/concepts/security/multi-tenancy.md +++ b/content/en/docs/concepts/security/multi-tenancy.md @@ -1,271 +1,271 @@ ---- -title: Multi-tenancy -content_type: concept -weight: 70 ---- - - - -This page provides an overview of available configuration options and best practices for cluster multi-tenancy. - -Sharing clusters saves costs and simplifies administration. However, sharing clusters also presents challenges such as security, fairness, and managing _noisy neighbors_. - -Clusters can be shared in many ways. In some cases, different applications may run in the same cluster. In other cases, multiple instances of the same application may run in the same cluster, one for each end user. All these types of sharing are frequently described using the umbrella term _multi-tenancy_. - -While Kubernetes does not have first-class concepts of end users or tenants, it provides several features to help manage different tenancy requirements. These are discussed below. 
- - -## Use cases - -The first step to determining how to share your cluster is understanding your use case, so you can evaluate the patterns and tools available. In general, multi-tenancy in Kubernetes clusters falls into two broad categories, though many variations and hybrids are also possible. - -### Multiple teams - -A common form of multi-tenancy is to share a cluster between multiple teams within an organization, each of whom may operate one or more workloads. These workloads frequently need to communicate with each other, and with other workloads located on the same or different clusters. - -In this scenario, members of the teams often have direct access to Kubernetes resources via tools such as `kubectl`, or indirect access through GitOps controllers or other types of release automation tools. There is often some level of trust between members of different teams, but Kubernetes policies such as RBAC, quotas, and network policies are essential to safely and fairly share clusters. - -### Multiple customers - -The other major form of multi-tenancy frequently involves a Software-as-a-Service (SaaS) vendor running multiple instances of a workload for customers. This business model is so strongly associated with this deployment style that many people call it "SaaS tenancy." However, a better term might be "multi-customer tenancy,” since SaaS vendors may also use other deployment models, and this deployment model can also be used outside of SaaS. - - -In this scenario, the customers do not have access to the cluster; Kubernetes is invisible from their perspective and is only used by the vendor to manage the workloads. Cost optimization is frequently a critical concern, and Kubernetes policies are used to ensure that the workloads are strongly isolated from each other. - - -## Terminology - -### Tenants - -When discussing multi-tenancy in Kubernetes, there is no single definition for a "tenant". Rather, the definition of a tenant will vary depending on whether multi-team or multi-customer tenancy is being discussed. - -In multi-team usage, a tenant is typically a team, where each team typically deploys a small number of workloads that scales with the complexity of the service. However, the definition of "team" may itself be fuzzy, as teams may be organized into higher-level divisions or subdivided into smaller teams. - - -By contrast, if each team deploys dedicated workloads for each new client, they are using a multi-customer model of tenancy. In this case, a "tenant" is simply a group of users who share a single workload. This may be as large as an entire company, or as small as a single team at that company. - -In many cases, the same organization may use both definitions of "tenants" in different contexts. For example, a platform team may offer shared services such as security tools and databases to multiple internal “customers” and a SaaS vendor may also have multiple teams sharing a development cluster. Finally, hybrid architectures are also possible, such as a SaaS provider using a combination of per-customer workloads for sensitive data, combined with multi-tenant shared services. - - -{{< figure src="/images/docs/multi-tenancy.png" title="A cluster showing coexisting tenancy models" class="diagram-large" >}} - - -### Isolation - -There are several ways to design and build multi-tenant solutions with Kubernetes. Each of these methods comes with its own set of tradeoffs that impact the isolation level, implementation effort, operational complexity, and cost of service. 
- - -A Kubernetes cluster consists of a control plane which runs Kubernetes software, and a data plane consisting of worker nodes where tenant workloads are executed as pods. Tenant isolation can be applied in both the control plane and the data plane based on organizational requirements. - -The level of isolation offered is sometimes described using terms like “hard” multi-tenancy, which implies strong isolation, and “soft” multi-tenancy, which implies weaker isolation. In particular, "hard" multi-tenancy is often used to describe cases where the tenants do not trust each other, often from security and resource sharing perspectives (e.g. guarding against attacks such as data exfiltration or DoS). Since data planes typically have much larger attack surfaces, "hard" multi-tenancy often requires extra attention to isolating the data-plane, though control plane isolation also remains critical. - -However, the terms "hard" and "soft" can often be confusing, as there is no single definition that will apply to all users. Rather, "hardness" or "softness" is better understood as a broad spectrum, with many different techniques that can be used to maintain different types of isolation in your clusters, based on your requirements. - - -In more extreme cases, it may be easier or necessary to forgo any cluster-level sharing at all and assign each tenant their dedicated cluster, possibly even running on dedicated hardware if VMs are not considered an adequate security boundary. This may be easier with managed Kubernetes clusters, where the overhead of creating and operating clusters is at least somewhat taken on by a cloud provider. The benefit of stronger tenant isolation must be evaluated against the cost and complexity of managing multiple clusters. The [Multi-cluster SIG](https://git.k8s.io/community/sig-multicluster/README.md) is responsible for addressing these types of use cases. - - - -The remainder of this page focuses on isolation techniques used for shared Kubernetes clusters. However, even if you are considering dedicated clusters, it may be valuable to review these recommendations, as it will give you the flexibility to shift to shared clusters in the future if your needs or capabilities change. - - -## Control plane isolation - -Control plane isolation ensures that different tenants cannot access or affect each others' Kubernetes API resources. - -### Namespaces - -In Kubernetes, a {{< glossary_tooltip text="Namespace" term_id="namespace" >}} provides a mechanism for isolating groups of API resources within a single cluster. This isolation has two key dimensions: - -1. Object names within a namespace can overlap with names in other namespaces, similar to files in folders. This allows tenants to name their resources without having to consider what other tenants are doing. - -2. Many Kubernetes security policies are scoped to namespaces. For example, RBAC Roles and Network Policies are namespace-scoped resources. Using RBAC, Users and Service Accounts can be restricted to a namespace. - -In a multi-tenant environment, a Namespace helps segment a tenant's workload into a logical and distinct management unit. In fact, a common practice is to isolate every workload in its own namespace, even if multiple workloads are operated by the same tenant. This ensures that each workload has its own identity and can be configured with an appropriate security policy. 
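For illustration, here is a minimal sketch of such a per-workload namespace, assuming a hypothetical tenant `team-a` that owns a `billing` workload; the label keys are only an example convention, not something Kubernetes prescribes:

```yaml
# Hypothetical namespace for one workload owned by one tenant.
# The tenant/workload labels are an example convention; choose labels
# that your own RBAC, quota, and network policies can select on.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a-billing
  labels:
    tenant: team-a
    workload: billing
```

The access controls, quotas, and network policies discussed later on this page can then be scoped to this namespace or select on its labels.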
- -The namespace isolation model requires configuration of several other Kubernetes resources, networking plugins, and adherence to security best practices to properly isolate tenant workloads. These considerations are discussed below. - -### Access controls - -The most important type of isolation for the control plane is authorization. If teams or their workloads can access or modify each others' API resources, they can change or disable all other types of policies thereby negating any protection those policies may offer. As a result, it is critical to ensure that each tenant has the appropriate access to only the namespaces they need, and no more. This is known as the "Principle of Least Privilege." - - -Role-based access control (RBAC) is commonly used to enforce authorization in the Kubernetes control plane, for both users and workloads (service accounts). [Roles](/docs/reference/access-authn-authz/rbac/#role-and-clusterrole) and [role bindings](/docs/reference/access-authn-authz/rbac/#rolebinding-and-clusterrolebinding) are Kubernetes objects that are used at a namespace level to enforce access control in your application; similar objects exist for authorizing access to cluster-level objects, though these are less useful for multi-tenant clusters. - -In a multi-team environment, RBAC must be used to restrict tenants' access to the appropriate namespaces, and ensure that cluster-wide resources can only be accessed or modified by privileged users such as cluster administrators. - -If a policy ends up granting a user more permissions than they need, this is likely a signal that the namespace containing the affected resources should be refactored into finer-grained namespaces. Namespace management tools may simplify the management of these finer-grained namespaces by applying common RBAC policies to different namespaces, while still allowing fine-grained policies where necessary. - -### Quotas - -Kubernetes workloads consume node resources, like CPU and memory. In a multi-tenant environment, you can use -[Resource Quotas](/docs/concepts/policy/resource-quotas/) to manage resource usage of tenant workloads. -For the multiple teams use case, where tenants have access to the Kubernetes API, you can use resource quotas -to limit the number of API resources (for example: the number of Pods, or the number of ConfigMaps) -that a tenant can create. Limits on object count ensure fairness and aim to avoid _noisy neighbor_ issues from -affecting other tenants that share a control plane. - -Resource quotas are namespaced objects. By mapping tenants to namespaces, cluster admins can use quotas to ensure that a tenant cannot monopolize a cluster's resources or overwhelm its control plane. Namespace management tools simplify the administration of quotas. In addition, while Kubernetes quotas only apply within a single namespace, some namespace management tools allow groups of namespaces to share quotas, giving administrators far more flexibility with less effort than built-in quotas. - -Quotas prevent a single tenant from consuming greater than their allocated share of resources hence minimizing the “noisy neighbor” issue, where one tenant negatively impacts the performance of other tenants' workloads. - -When you apply a quota to namespace, Kubernetes requires you to also specify resource requests and limits for each container. Limits are the upper bound for the amount of resources that a container can consume. 
Containers that attempt to consume resources that exceed the configured limits will either be throttled or killed, based on the resource type. When resource requests are set lower than limits, each container is guaranteed the requested amount but there may still be some potential for impact across workloads. - -Quotas cannot protect against all kinds of resource sharing, such as network traffic. Node isolation (described below) may be a better solution for this problem. - -## Data Plane Isolation - -Data plane isolation ensures that pods and workloads for different tenants are sufficiently isolated. - -### Network isolation - -By default, all pods in a Kubernetes cluster are allowed to communicate with each other, and all network traffic is unencrypted. This can lead to security vulnerabilities where traffic is accidentally or maliciously sent to an unintended destination, or is intercepted by a workload on a compromised node. - -Pod-to-pod communication can be controlled using [Network Policies](/docs/concepts/services-networking/network-policies/), which restrict communication between pods using namespace labels or IP address ranges. In a multi-tenant environment where strict network isolation between tenants is required, starting with a default policy that denies communication between pods is recommended with another rule that allows all pods to query the DNS server for name resolution. With such a default policy in place, you can begin adding more permissive rules that allow for communication within a namespace. This scheme can be further refined as required. Note that this only applies to pods within a single control plane; pods that belong to different virtual control planes cannot talk to each other via Kubernetes networking. - -Namespace management tools may simplify the creation of default or common network policies. In addition, some of these tools allow you to enforce a consistent set of namespace labels across your cluster, ensuring that they are a trusted basis for your policies. - -{{< warning >}} -Network policies require a [CNI plugin](/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#cni) that supports the implementation of network policies. Otherwise, NetworkPolicy resources will be ignored. -{{< /warning >}} - -More advanced network isolation may be provided by service meshes, which provide OSI Layer 7 policies based on workload identity, in addition to namespaces. These higher-level policies can make it easier to manage namespace-based multi-tenancy, especially when multiple namespaces are dedicated to a single tenant. They frequently also offer encryption using mutual TLS, protecting your data even in the presence of a compromised node, and work across dedicated or virtual clusters. However, they can be significantly more complex to manage and may not be appropriate for all users. - -### Storage isolation - -Kubernetes offers several types of volumes that can be used as persistent storage for workloads. For security and data-isolation, [dynamic volume provisioning](/docs/concepts/storage/dynamic-provisioning/) is recommended and volume types that use node resources should be avoided. - -[StorageClasses](/docs/concepts/storage/storage-classes/) allow you to describe custom "classes" of storage offered by your cluster, based on quality-of-service levels, backup policies, or custom policies determined by the cluster administrators. - -Pods can request storage using a [PersistentVolumeClaim](/docs/concepts/storage/persistent-volumes/). 
A PersistentVolumeClaim is a namespaced resource, which enables isolating portions of the storage system and dedicating it to tenants within the shared Kubernetes cluster. However, it is important to note that a PersistentVolume is a cluster-wide resource and has a lifecycle independent of workloads and namespaces. - -For example, you can configure a separate StorageClass for each tenant and use this to strengthen isolation. -If a StorageClass is shared, you should set a [reclaim policy of `Delete`](/docs/concepts/storage/storage-classes/#reclaim-policy) -to ensure that a PersistentVolume cannot be reused across different namespaces. - -### Sandboxing containers - -{{% thirdparty-content %}} - -Kubernetes pods are composed of one or more containers that execute on worker nodes. Containers utilize OS-level virtualization and hence offer a weaker isolation boundary than virtual machines that utilize hardware-based virtualization. - -In a shared environment, unpatched vulnerabilities in the application and system layers can be exploited by attackers for container breakouts and remote code execution that allow access to host resources. In some applications, like a Content Management System (CMS), customers may be allowed the ability to upload and execute untrusted scripts or code. In either case, mechanisms to further isolate and protect workloads using strong isolation are desirable. - -Sandboxing provides a way to isolate workloads running in a shared cluster. It typically involves running each pod in a separate execution environment such as a virtual machine or a userspace kernel. Sandboxing is often recommended when you are running untrusted code, where workloads are assumed to be malicious. Part of the reason this type of isolation is necessary is because containers are processes running on a shared kernel; they mount file systems like /sys and /proc from the underlying host, making them less secure than an application that runs on a virtual machine which has its own kernel. While controls such as seccomp, AppArmor, and SELinux can be used to strengthen the security of containers, it is hard to apply a universal set of rules to all workloads running in a shared cluster. Running workloads in a sandbox environment helps to insulate the host from container escapes, where an attacker exploits a vulnerability to gain access to the host system and all the processes/files running on that host. - -Virtual machines and userspace kernels are 2 popular approaches to sandboxing. The following sandboxing implementations are available: -* [gVisor](https://gvisor.dev/) intercepts syscalls from containers and runs them through a userspace kernel, written in Go, with limited access to the underlying host. -* [Kata Containers](https://katacontainers.io/) is an OCI compliant runtime that allows you to run containers in a VM. The hardware virtualization available in Kata offers an added layer of security for containers running untrusted code. - -### Node Isolation - -Node isolation is another technique that you can use to isolate tenant workloads from each other. With node isolation, a set of nodes is dedicated to running pods from a particular tenant and co-mingling of tenant pods is prohibited. This configuration reduces the noisy tenant issue, as all pods running on a node will belong to a single tenant. 
The risk of information disclosure is slightly lower with node isolation because an attacker that manages to escape from a container will only have access to the containers and volumes mounted to that node. - -Although workloads from different tenants are running on different nodes, it is important to be aware that the kubelet and (unless using virtual control planes) the API service are still shared services. A skilled attacker could use the permissions assigned to the kubelet or other pods running on the node to move laterally within the cluster and gain access to tenant workloads running on other nodes. If this is a major concern, consider implementing compensating controls such as seccomp, AppArmor or SELinux or explore using sandboxed containers or creating separate clusters for each tenant. - -Node isolation is a little easier to reason about from a billing standpoint than sandboxing containers since you can charge back per node rather than per pod. It also has fewer compatibility and performance issues and may be easier to implement than sandboxing containers. For example, nodes for each tenant can be configured with taints so that only pods with the corresponding toleration can run on them. A mutating webhook could then be used to automatically add tolerations and node affinities to pods deployed into tenant namespaces so that they run on a specific set of nodes designated for that tenant. - -Node isolation can be implemented using an [pod node selectors](/docs/concepts/scheduling-eviction/assign-pod-node/) or a [Virtual Kubelet](https://github.com/virtual-kubelet). - -## Additional Considerations - -This section discusses other Kubernetes constructs and patterns that are relevant for multi-tenancy. - -### API Priority and Fairness - -[API priority and fairness](/docs/concepts/cluster-administration/flow-control/) is a Kubernetes feature that allows you to assign a priority to certain pods running within the cluster. When an application calls the Kubernetes API, the API server evaluates the priority assigned to pod. Calls from pods with higher priority are fulfilled before those with a lower priority. When contention is high, lower priority calls can be queued until the server is less busy or you can reject the requests. - -Using API priority and fairness will not be very common in SaaS environments unless you are allowing customers to run applications that interface with the Kubernetes API, e.g. a controller. - -### Quality-of-Service (QoS) {#qos} - -When you’re running a SaaS application, you may want the ability to offer different Quality-of-Service (QoS) tiers of service to different tenants. For example, you may have freemium service that comes with fewer performance guarantees and features and a for-fee service tier with specific performance guarantees. Fortunately, there are several Kubernetes constructs that can help you accomplish this within a shared cluster, including network QoS, storage classes, and pod priority and preemption. The idea with each of these is to provide tenants with the quality of service that they paid for. Let’s start by looking at networking QoS. - -Typically, all pods on a node share a network interface. Without network QoS, some pods may consume an unfair share of the available bandwidth at the expense of other pods. 
The Kubernetes [bandwidth plugin](https://www.cni.dev/plugins/current/meta/bandwidth/) creates an [extended resource](/docs/concepts/configuration/manage-resources-containers/#extended-resources) for networking that allows you to use Kubernetes resources constructs, i.e. requests/limits, to apply rate limits to pods by using Linux tc queues. Be aware that the plugin is considered experimental as per the [Network Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#support-traffic-shaping) documentation and should be thoroughly tested before use in production environments. - -For storage QoS, you will likely want to create different storage classes or profiles with different performance characteristics. Each storage profile can be associated with a different tier of service that is optimized for different workloads such IO, redundancy, or throughput. Additional logic might be necessary to allow the tenant to associate the appropriate storage profile with their workload. - -Finally, there’s [pod priority and preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/) where you can assign priority values to pods. When scheduling pods, the scheduler will try evicting pods with lower priority when there are insufficient resources to schedule pods that are assigned a higher priority. If you have a use case where tenants have different service tiers in a shared cluster e.g. free and paid, you may want to give higher priority to certain tiers using this feature. - -### DNS - -Kubernetes clusters include a Domain Name System (DNS) service to provide translations from names to IP addresses, for all Services and Pods. By default, the Kubernetes DNS service allows lookups across all namespaces in the cluster. - -In multi-tenant environments where tenants can access pods and other Kubernetes resources, or where -stronger isolation is required, it may be necessary to prevent pods from looking up services in other -Namespaces. -You can restrict cross-namespace DNS lookups by configuring security rules for the DNS service. -For example, CoreDNS (the default DNS service for Kubernetes) can leverage Kubernetes metadata -to restrict queries to Pods and Services within a namespace. For more information, read an -[example](https://github.com/coredns/policy#kubernetes-metadata-multi-tenancy-policy) of configuring -this within the CoreDNS documentation. - -When a [Virtual Control Plane per tenant](#virtual-control-plane-per-tenant) model is used, a DNS service must be configured per tenant or a multi-tenant DNS service must be used. Here is an example of a [customized version of CoreDNS](https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/main/virtualcluster/doc/tenant-dns.md) that supports multiple tenants. - -### Operators - -[Operators](/docs/concepts/extend-kubernetes/operator/) are Kubernetes controllers that manage applications. Operators can simplify the management of multiple instances of an application, like a database service, which makes them a common building block in the multi-consumer (SaaS) multi-tenancy use case. - -Operators used in a multi-tenant environment should follow a stricter set of guidelines. Specifically, the Operator should: -* Support creating resources within different tenant namespaces, rather than just in the namespace in which the Operator is deployed. -* Ensure that the Pods are configured with resource requests and limits, to ensure scheduling and fairness. 
-* Support configuration of Pods for data-plane isolation techniques such as node isolation and sandboxed containers. - -## Implementations - -{{% thirdparty-content %}} - -There are two primary ways to share a Kubernetes cluster for multi-tenancy: using Namespaces (i.e. a Namespace per tenant) or by virtualizing the control plane (i.e. Virtual control plane per tenant). - -In both cases, data plane isolation, and management of additional considerations such as API Priority and Fairness, is also recommended. - -Namespace isolation is well-supported by Kubernetes, has a negligible resource cost, and provides mechanisms to allow tenants to interact appropriately, such as by allowing service-to-service communication. However, it can be difficult to configure, and doesn't apply to Kubernetes resources that can't be namespaced, such as Custom Resource Definitions, Storage Classes, and Webhooks. - -Control plane virtualization allows for isolation of non-namespaced resources at the cost of somewhat higher resource usage and more difficult cross-tenant sharing. It is a good option when namespace isolation is insufficient but dedicated clusters are undesirable, due to the high cost of maintaining them (especially on-prem) or due to their higher overhead and lack of resource sharing. However, even within a virtualized control plane, you will likely see benefits by using namespaces as well. - -The two options are discussed in more detail in the following sections: - -### Namespace per tenant - -As previously mentioned, you should consider isolating each workload in its own namespace, even if you are using dedicated clusters or virtualized control planes. This ensures that each workload only has access to its own resources, such as Config Maps and Secrets, and allows you to tailor dedicated security policies for each workload. In addition, it is a best practice to give each namespace names that are unique across your entire fleet (i.e., even if they are in separate clusters), as this gives you the flexibility to switch between dedicated and shared clusters in the future, or to use multi-cluster tooling such as service meshes. - -Conversely, there are also advantages to assigning namespaces at the tenant level, not just the workload level, since there are often policies that apply to all workloads owned by a single tenant. However, this raises its own problems. Firstly, this makes it difficult or impossible to customize policies to individual workloads, and secondly, it may be challenging to come up with a single level of "tenancy" that should be given a namespace. For example, an organization may have divisions, teams, and subteams - which should be assigned a namespace? - -To solve this, Kubernetes provides the [Hierarchical Namespace Controller (HNC)](https://github.com/kubernetes-sigs/hierarchical-namespaces), which allows you to organize your namespaces into hierarchies, and share certain policies and resources between them. It also helps you manage namespace labels, namespace lifecycles, and delegated management, and share resource quotas across related namespaces. These capabilities can be useful in both multi-team and multi-customer scenarios. 
- -Other projects that provide similar capabilities and aid in managing namespaced resources are listed below: - -#### Multi-team tenancy - -* [Capsule](https://github.com/clastix/capsule) -* [Kiosk](https://github.com/loft-sh/kiosk) - -#### Multi-customer tenancy - -* [Kubeplus](https://github.com/cloud-ark/kubeplus) - -#### Policy engines - -Policy engines provide features to validate and generate tenant configurations: - -* [Kyverno](https://kyverno.io/) -* [OPA/Gatekeeper](https://github.com/open-policy-agent/gatekeeper) - -### Virtual control plane per tenant - -Another form of control-plane isolation is to use Kubernetes extensions to provide each tenant a virtual control-plane that enables segmentation of cluster-wide API resources. [Data plane isolation](#data-plane-isolation) techniques can be used with this model to securely manage worker nodes across tenants. - -The virtual control plane based multi-tenancy model extends namespace-based multi-tenancy by providing each tenant with dedicated control plane components, and hence complete control over cluster-wide resources and add-on services. Worker nodes are shared across all tenants, and are managed by a Kubernetes cluster that is normally inaccessible to tenants. This cluster is often referred to as a _super-cluster_ (or sometimes as a _host-cluster_). Since a tenant’s control-plane is not directly associated with underlying compute resources it is referred to as a _virtual control plane_. - -A virtual control plane typically consists of the Kubernetes API server, the controller manager, and the etcd data store. It interacts with the super cluster via a metadata synchronization controller which coordinates changes across tenant control planes and the control plane of the super--cluster. - -By using per-tenant dedicated control planes, most of the isolation problems due to sharing one API server among all tenants are solved. Examples include noisy neighbors in the control plane, uncontrollable blast radius of policy misconfigurations, and conflicts between cluster scope objects such as webhooks and CRDs. Hence, the virtual control plane model is particularly suitable for cases where each tenant requires access to a Kubernetes API server and expects the full cluster manageability. - -The improved isolation comes at the cost of running and maintaining an individual virtual control plane per tenant. In addition, per-tenant control planes do not solve isolation problems in the data plane, such as node-level noisy neighbors or security threats. These must still be addressed separately. - -The Kubernetes [Cluster API - Nested (CAPN)](https://github.com/kubernetes-sigs/cluster-api-provider-nested/tree/main/virtualcluster) project provides an implementation of virtual control planes. - -#### Other implementations -* [Kamaji](https://github.com/clastix/kamaji) -* [vcluster](https://github.com/loft-sh/vcluster) - +--- +title: Multi-tenancy +content_type: concept +weight: 70 +--- + + + +This page provides an overview of available configuration options and best practices for cluster multi-tenancy. + +Sharing clusters saves costs and simplifies administration. However, sharing clusters also presents challenges such as security, fairness, and managing _noisy neighbors_. + +Clusters can be shared in many ways. In some cases, different applications may run in the same cluster. In other cases, multiple instances of the same application may run in the same cluster, one for each end user. 
All these types of sharing are frequently described using the umbrella term _multi-tenancy_. + +While Kubernetes does not have first-class concepts of end users or tenants, it provides several features to help manage different tenancy requirements. These are discussed below. + + +## Use cases + +The first step to determining how to share your cluster is understanding your use case, so you can evaluate the patterns and tools available. In general, multi-tenancy in Kubernetes clusters falls into two broad categories, though many variations and hybrids are also possible. + +### Multiple teams + +A common form of multi-tenancy is to share a cluster between multiple teams within an organization, each of whom may operate one or more workloads. These workloads frequently need to communicate with each other, and with other workloads located on the same or different clusters. + +In this scenario, members of the teams often have direct access to Kubernetes resources via tools such as `kubectl`, or indirect access through GitOps controllers or other types of release automation tools. There is often some level of trust between members of different teams, but Kubernetes policies such as RBAC, quotas, and network policies are essential to safely and fairly share clusters. + +### Multiple customers + +The other major form of multi-tenancy frequently involves a Software-as-a-Service (SaaS) vendor running multiple instances of a workload for customers. This business model is so strongly associated with this deployment style that many people call it "SaaS tenancy." However, a better term might be "multi-customer tenancy,” since SaaS vendors may also use other deployment models, and this deployment model can also be used outside of SaaS. + + +In this scenario, the customers do not have access to the cluster; Kubernetes is invisible from their perspective and is only used by the vendor to manage the workloads. Cost optimization is frequently a critical concern, and Kubernetes policies are used to ensure that the workloads are strongly isolated from each other. + + +## Terminology + +### Tenants + +When discussing multi-tenancy in Kubernetes, there is no single definition for a "tenant". Rather, the definition of a tenant will vary depending on whether multi-team or multi-customer tenancy is being discussed. + +In multi-team usage, a tenant is typically a team, where each team typically deploys a small number of workloads that scales with the complexity of the service. However, the definition of "team" may itself be fuzzy, as teams may be organized into higher-level divisions or subdivided into smaller teams. + + +By contrast, if each team deploys dedicated workloads for each new client, they are using a multi-customer model of tenancy. In this case, a "tenant" is simply a group of users who share a single workload. This may be as large as an entire company, or as small as a single team at that company. + +In many cases, the same organization may use both definitions of "tenants" in different contexts. For example, a platform team may offer shared services such as security tools and databases to multiple internal “customers” and a SaaS vendor may also have multiple teams sharing a development cluster. Finally, hybrid architectures are also possible, such as a SaaS provider using a combination of per-customer workloads for sensitive data, combined with multi-tenant shared services. 
+ + +{{< figure src="/images/docs/multi-tenancy.png" title="A cluster showing coexisting tenancy models" class="diagram-large" >}} + + +### Isolation + +There are several ways to design and build multi-tenant solutions with Kubernetes. Each of these methods comes with its own set of tradeoffs that impact the isolation level, implementation effort, operational complexity, and cost of service. + + +A Kubernetes cluster consists of a control plane which runs Kubernetes software, and a data plane consisting of worker nodes where tenant workloads are executed as pods. Tenant isolation can be applied in both the control plane and the data plane based on organizational requirements. + +The level of isolation offered is sometimes described using terms like “hard” multi-tenancy, which implies strong isolation, and “soft” multi-tenancy, which implies weaker isolation. In particular, "hard" multi-tenancy is often used to describe cases where the tenants do not trust each other, often from security and resource sharing perspectives (e.g. guarding against attacks such as data exfiltration or DoS). Since data planes typically have much larger attack surfaces, "hard" multi-tenancy often requires extra attention to isolating the data-plane, though control plane isolation also remains critical. + +However, the terms "hard" and "soft" can often be confusing, as there is no single definition that will apply to all users. Rather, "hardness" or "softness" is better understood as a broad spectrum, with many different techniques that can be used to maintain different types of isolation in your clusters, based on your requirements. + + +In more extreme cases, it may be easier or necessary to forgo any cluster-level sharing at all and assign each tenant their dedicated cluster, possibly even running on dedicated hardware if VMs are not considered an adequate security boundary. This may be easier with managed Kubernetes clusters, where the overhead of creating and operating clusters is at least somewhat taken on by a cloud provider. The benefit of stronger tenant isolation must be evaluated against the cost and complexity of managing multiple clusters. The [Multi-cluster SIG](https://git.k8s.io/community/sig-multicluster/README.md) is responsible for addressing these types of use cases. + + + +The remainder of this page focuses on isolation techniques used for shared Kubernetes clusters. However, even if you are considering dedicated clusters, it may be valuable to review these recommendations, as it will give you the flexibility to shift to shared clusters in the future if your needs or capabilities change. + + +## Control plane isolation + +Control plane isolation ensures that different tenants cannot access or affect each others' Kubernetes API resources. + +### Namespaces + +In Kubernetes, a {{< glossary_tooltip text="Namespace" term_id="namespace" >}} provides a mechanism for isolating groups of API resources within a single cluster. This isolation has two key dimensions: + +1. Object names within a namespace can overlap with names in other namespaces, similar to files in folders. This allows tenants to name their resources without having to consider what other tenants are doing. + +2. Many Kubernetes security policies are scoped to namespaces. For example, RBAC Roles and Network Policies are namespace-scoped resources. Using RBAC, Users and Service Accounts can be restricted to a namespace. 
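To make the second point concrete, here is a minimal sketch of confining a tenant group to a single namespace with a namespaced Role and RoleBinding; the group name `team-a` and the namespace are placeholders, and a production policy would likely grant a narrower set of resources and verbs:

```yaml
# Example only: grants the (hypothetical) group "team-a" edit-style access,
# but only inside the "team-a-billing" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workload-editor
  namespace: team-a-billing
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-workload-editors
  namespace: team-a-billing
subjects:
- kind: Group
  name: team-a            # placeholder; group names come from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: workload-editor
  apiGroup: rbac.authorization.k8s.io
```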
+ +In a multi-tenant environment, a Namespace helps segment a tenant's workload into a logical and distinct management unit. In fact, a common practice is to isolate every workload in its own namespace, even if multiple workloads are operated by the same tenant. This ensures that each workload has its own identity and can be configured with an appropriate security policy. + +The namespace isolation model requires configuration of several other Kubernetes resources, networking plugins, and adherence to security best practices to properly isolate tenant workloads. These considerations are discussed below. + +### Access controls + +The most important type of isolation for the control plane is authorization. If teams or their workloads can access or modify each others' API resources, they can change or disable all other types of policies thereby negating any protection those policies may offer. As a result, it is critical to ensure that each tenant has the appropriate access to only the namespaces they need, and no more. This is known as the "Principle of Least Privilege." + + +Role-based access control (RBAC) is commonly used to enforce authorization in the Kubernetes control plane, for both users and workloads (service accounts). [Roles](/docs/reference/access-authn-authz/rbac/#role-and-clusterrole) and [role bindings](/docs/reference/access-authn-authz/rbac/#rolebinding-and-clusterrolebinding) are Kubernetes objects that are used at a namespace level to enforce access control in your application; similar objects exist for authorizing access to cluster-level objects, though these are less useful for multi-tenant clusters. + +In a multi-team environment, RBAC must be used to restrict tenants' access to the appropriate namespaces, and ensure that cluster-wide resources can only be accessed or modified by privileged users such as cluster administrators. + +If a policy ends up granting a user more permissions than they need, this is likely a signal that the namespace containing the affected resources should be refactored into finer-grained namespaces. Namespace management tools may simplify the management of these finer-grained namespaces by applying common RBAC policies to different namespaces, while still allowing fine-grained policies where necessary. + +### Quotas + +Kubernetes workloads consume node resources, like CPU and memory. In a multi-tenant environment, you can use +[Resource Quotas](/docs/concepts/policy/resource-quotas/) to manage resource usage of tenant workloads. +For the multiple teams use case, where tenants have access to the Kubernetes API, you can use resource quotas +to limit the number of API resources (for example: the number of Pods, or the number of ConfigMaps) +that a tenant can create. Limits on object count ensure fairness and aim to avoid _noisy neighbor_ issues from +affecting other tenants that share a control plane. + +Resource quotas are namespaced objects. By mapping tenants to namespaces, cluster admins can use quotas to ensure that a tenant cannot monopolize a cluster's resources or overwhelm its control plane. Namespace management tools simplify the administration of quotas. In addition, while Kubernetes quotas only apply within a single namespace, some namespace management tools allow groups of namespaces to share quotas, giving administrators far more flexibility with less effort than built-in quotas. 
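As a rough sketch, a quota for a single tenant namespace that caps both compute usage and object counts might look like the following; the namespace name and the specific values are illustrative, not recommendations:

```yaml
# Illustrative per-namespace quota for one tenant; tune the values to your cluster.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: team-a-billing   # placeholder tenant namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    configmaps: "50"
```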
+ +Quotas prevent a single tenant from consuming greater than their allocated share of resources hence minimizing the “noisy neighbor” issue, where one tenant negatively impacts the performance of other tenants' workloads. + +When you apply a quota to namespace, Kubernetes requires you to also specify resource requests and limits for each container. Limits are the upper bound for the amount of resources that a container can consume. Containers that attempt to consume resources that exceed the configured limits will either be throttled or killed, based on the resource type. When resource requests are set lower than limits, each container is guaranteed the requested amount but there may still be some potential for impact across workloads. + +Quotas cannot protect against all kinds of resource sharing, such as network traffic. Node isolation (described below) may be a better solution for this problem. + +## Data Plane Isolation + +Data plane isolation ensures that pods and workloads for different tenants are sufficiently isolated. + +### Network isolation + +By default, all pods in a Kubernetes cluster are allowed to communicate with each other, and all network traffic is unencrypted. This can lead to security vulnerabilities where traffic is accidentally or maliciously sent to an unintended destination, or is intercepted by a workload on a compromised node. + +Pod-to-pod communication can be controlled using [Network Policies](/docs/concepts/services-networking/network-policies/), which restrict communication between pods using namespace labels or IP address ranges. In a multi-tenant environment where strict network isolation between tenants is required, starting with a default policy that denies communication between pods is recommended with another rule that allows all pods to query the DNS server for name resolution. With such a default policy in place, you can begin adding more permissive rules that allow for communication within a namespace. This scheme can be further refined as required. Note that this only applies to pods within a single control plane; pods that belong to different virtual control planes cannot talk to each other via Kubernetes networking. + +Namespace management tools may simplify the creation of default or common network policies. In addition, some of these tools allow you to enforce a consistent set of namespace labels across your cluster, ensuring that they are a trusted basis for your policies. + +{{< warning >}} +Network policies require a [CNI plugin](/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#cni) that supports the implementation of network policies. Otherwise, NetworkPolicy resources will be ignored. +{{< /warning >}} + +More advanced network isolation may be provided by service meshes, which provide OSI Layer 7 policies based on workload identity, in addition to namespaces. These higher-level policies can make it easier to manage namespace-based multi-tenancy, especially when multiple namespaces are dedicated to a single tenant. They frequently also offer encryption using mutual TLS, protecting your data even in the presence of a compromised node, and work across dedicated or virtual clusters. However, they can be significantly more complex to manage and may not be appropriate for all users. + +### Storage isolation + +Kubernetes offers several types of volumes that can be used as persistent storage for workloads. 
For security and data-isolation, [dynamic volume provisioning](/docs/concepts/storage/dynamic-provisioning/) is recommended and volume types that use node resources should be avoided. + +[StorageClasses](/docs/concepts/storage/storage-classes/) allow you to describe custom "classes" of storage offered by your cluster, based on quality-of-service levels, backup policies, or custom policies determined by the cluster administrators. + +Pods can request storage using a [PersistentVolumeClaim](/docs/concepts/storage/persistent-volumes/). A PersistentVolumeClaim is a namespaced resource, which enables isolating portions of the storage system and dedicating it to tenants within the shared Kubernetes cluster. However, it is important to note that a PersistentVolume is a cluster-wide resource and has a lifecycle independent of workloads and namespaces. + +For example, you can configure a separate StorageClass for each tenant and use this to strengthen isolation. +If a StorageClass is shared, you should set a [reclaim policy of `Delete`](/docs/concepts/storage/storage-classes/#reclaim-policy) +to ensure that a PersistentVolume cannot be reused across different namespaces. + +### Sandboxing containers + +{{% thirdparty-content %}} + +Kubernetes pods are composed of one or more containers that execute on worker nodes. Containers utilize OS-level virtualization and hence offer a weaker isolation boundary than virtual machines that utilize hardware-based virtualization. + +In a shared environment, unpatched vulnerabilities in the application and system layers can be exploited by attackers for container breakouts and remote code execution that allow access to host resources. In some applications, like a Content Management System (CMS), customers may be allowed the ability to upload and execute untrusted scripts or code. In either case, mechanisms to further isolate and protect workloads using strong isolation are desirable. + +Sandboxing provides a way to isolate workloads running in a shared cluster. It typically involves running each pod in a separate execution environment such as a virtual machine or a userspace kernel. Sandboxing is often recommended when you are running untrusted code, where workloads are assumed to be malicious. Part of the reason this type of isolation is necessary is because containers are processes running on a shared kernel; they mount file systems like /sys and /proc from the underlying host, making them less secure than an application that runs on a virtual machine which has its own kernel. While controls such as seccomp, AppArmor, and SELinux can be used to strengthen the security of containers, it is hard to apply a universal set of rules to all workloads running in a shared cluster. Running workloads in a sandbox environment helps to insulate the host from container escapes, where an attacker exploits a vulnerability to gain access to the host system and all the processes/files running on that host. + +Virtual machines and userspace kernels are 2 popular approaches to sandboxing. The following sandboxing implementations are available: +* [gVisor](https://gvisor.dev/) intercepts syscalls from containers and runs them through a userspace kernel, written in Go, with limited access to the underlying host. +* [Kata Containers](https://katacontainers.io/) is an OCI compliant runtime that allows you to run containers in a VM. The hardware virtualization available in Kata offers an added layer of security for containers running untrusted code. 
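If one of these sandboxed runtimes is installed on the nodes, workloads typically opt in through a RuntimeClass. The sketch below assumes a gVisor installation whose container runtime handler is named `runsc`; the handler name and the image are placeholders that depend on how the node is configured:

```yaml
# Assumes a node-level runtime handler called "runsc" (gVisor); the handler
# name is defined by your container runtime configuration, not by Kubernetes.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
# A pod that asks to run in the sandboxed runtime.
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: gvisor
  containers:
  - name: app
    image: registry.example/untrusted-app:latest   # placeholder image
```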
+ +### Node Isolation + +Node isolation is another technique that you can use to isolate tenant workloads from each other. With node isolation, a set of nodes is dedicated to running pods from a particular tenant and co-mingling of tenant pods is prohibited. This configuration reduces the noisy tenant issue, as all pods running on a node will belong to a single tenant. The risk of information disclosure is slightly lower with node isolation because an attacker that manages to escape from a container will only have access to the containers and volumes mounted to that node. + +Although workloads from different tenants are running on different nodes, it is important to be aware that the kubelet and (unless using virtual control planes) the API service are still shared services. A skilled attacker could use the permissions assigned to the kubelet or other pods running on the node to move laterally within the cluster and gain access to tenant workloads running on other nodes. If this is a major concern, consider implementing compensating controls such as seccomp, AppArmor or SELinux or explore using sandboxed containers or creating separate clusters for each tenant. + +Node isolation is a little easier to reason about from a billing standpoint than sandboxing containers since you can charge back per node rather than per pod. It also has fewer compatibility and performance issues and may be easier to implement than sandboxing containers. For example, nodes for each tenant can be configured with taints so that only pods with the corresponding toleration can run on them. A mutating webhook could then be used to automatically add tolerations and node affinities to pods deployed into tenant namespaces so that they run on a specific set of nodes designated for that tenant. + +Node isolation can be implemented using an [pod node selectors](/docs/concepts/scheduling-eviction/assign-pod-node/) or a [Virtual Kubelet](https://github.com/virtual-kubelet). + +## Additional Considerations + +This section discusses other Kubernetes constructs and patterns that are relevant for multi-tenancy. + +### API Priority and Fairness + +[API priority and fairness](/docs/concepts/cluster-administration/flow-control/) is a Kubernetes feature that allows you to assign a priority to certain pods running within the cluster. When an application calls the Kubernetes API, the API server evaluates the priority assigned to pod. Calls from pods with higher priority are fulfilled before those with a lower priority. When contention is high, lower priority calls can be queued until the server is less busy or you can reject the requests. + +Using API priority and fairness will not be very common in SaaS environments unless you are allowing customers to run applications that interface with the Kubernetes API, e.g. a controller. + +### Quality-of-Service (QoS) {#qos} + +When you’re running a SaaS application, you may want the ability to offer different Quality-of-Service (QoS) tiers of service to different tenants. For example, you may have freemium service that comes with fewer performance guarantees and features and a for-fee service tier with specific performance guarantees. Fortunately, there are several Kubernetes constructs that can help you accomplish this within a shared cluster, including network QoS, storage classes, and pod priority and preemption. The idea with each of these is to provide tenants with the quality of service that they paid for. Let’s start by looking at networking QoS. 
+ +Typically, all pods on a node share a network interface. Without network QoS, some pods may consume an unfair share of the available bandwidth at the expense of other pods. The Kubernetes [bandwidth plugin](https://www.cni.dev/plugins/current/meta/bandwidth/) creates an [extended resource](/docs/concepts/configuration/manage-resources-containers/#extended-resources) for networking that allows you to use Kubernetes resources constructs, i.e. requests/limits, to apply rate limits to pods by using Linux tc queues. Be aware that the plugin is considered experimental as per the [Network Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#support-traffic-shaping) documentation and should be thoroughly tested before use in production environments. + +For storage QoS, you will likely want to create different storage classes or profiles with different performance characteristics. Each storage profile can be associated with a different tier of service that is optimized for different workloads such IO, redundancy, or throughput. Additional logic might be necessary to allow the tenant to associate the appropriate storage profile with their workload. + +Finally, there’s [pod priority and preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/) where you can assign priority values to pods. When scheduling pods, the scheduler will try evicting pods with lower priority when there are insufficient resources to schedule pods that are assigned a higher priority. If you have a use case where tenants have different service tiers in a shared cluster e.g. free and paid, you may want to give higher priority to certain tiers using this feature. + +### DNS + +Kubernetes clusters include a Domain Name System (DNS) service to provide translations from names to IP addresses, for all Services and Pods. By default, the Kubernetes DNS service allows lookups across all namespaces in the cluster. + +In multi-tenant environments where tenants can access pods and other Kubernetes resources, or where +stronger isolation is required, it may be necessary to prevent pods from looking up services in other +Namespaces. +You can restrict cross-namespace DNS lookups by configuring security rules for the DNS service. +For example, CoreDNS (the default DNS service for Kubernetes) can leverage Kubernetes metadata +to restrict queries to Pods and Services within a namespace. For more information, read an +[example](https://github.com/coredns/policy#kubernetes-metadata-multi-tenancy-policy) of configuring +this within the CoreDNS documentation. + +When a [Virtual Control Plane per tenant](#virtual-control-plane-per-tenant) model is used, a DNS service must be configured per tenant or a multi-tenant DNS service must be used. Here is an example of a [customized version of CoreDNS](https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/main/virtualcluster/doc/tenant-dns.md) that supports multiple tenants. + +### Operators + +[Operators](/docs/concepts/extend-kubernetes/operator/) are Kubernetes controllers that manage applications. Operators can simplify the management of multiple instances of an application, like a database service, which makes them a common building block in the multi-consumer (SaaS) multi-tenancy use case. + +Operators used in a multi-tenant environment should follow a stricter set of guidelines. Specifically, the Operator should: +* Support creating resources within different tenant namespaces, rather than just in the namespace in which the Operator is deployed. 
+* Ensure that the Pods are configured with resource requests and limits, to ensure scheduling and fairness. +* Support configuration of Pods for data-plane isolation techniques such as node isolation and sandboxed containers. + +## Implementations + +{{% thirdparty-content %}} + +There are two primary ways to share a Kubernetes cluster for multi-tenancy: using Namespaces (i.e. a Namespace per tenant) or by virtualizing the control plane (i.e. Virtual control plane per tenant). + +In both cases, data plane isolation, and management of additional considerations such as API Priority and Fairness, is also recommended. + +Namespace isolation is well-supported by Kubernetes, has a negligible resource cost, and provides mechanisms to allow tenants to interact appropriately, such as by allowing service-to-service communication. However, it can be difficult to configure, and doesn't apply to Kubernetes resources that can't be namespaced, such as Custom Resource Definitions, Storage Classes, and Webhooks. + +Control plane virtualization allows for isolation of non-namespaced resources at the cost of somewhat higher resource usage and more difficult cross-tenant sharing. It is a good option when namespace isolation is insufficient but dedicated clusters are undesirable, due to the high cost of maintaining them (especially on-prem) or due to their higher overhead and lack of resource sharing. However, even within a virtualized control plane, you will likely see benefits by using namespaces as well. + +The two options are discussed in more detail in the following sections: + +### Namespace per tenant + +As previously mentioned, you should consider isolating each workload in its own namespace, even if you are using dedicated clusters or virtualized control planes. This ensures that each workload only has access to its own resources, such as Config Maps and Secrets, and allows you to tailor dedicated security policies for each workload. In addition, it is a best practice to give each namespace names that are unique across your entire fleet (i.e., even if they are in separate clusters), as this gives you the flexibility to switch between dedicated and shared clusters in the future, or to use multi-cluster tooling such as service meshes. + +Conversely, there are also advantages to assigning namespaces at the tenant level, not just the workload level, since there are often policies that apply to all workloads owned by a single tenant. However, this raises its own problems. Firstly, this makes it difficult or impossible to customize policies to individual workloads, and secondly, it may be challenging to come up with a single level of "tenancy" that should be given a namespace. For example, an organization may have divisions, teams, and subteams - which should be assigned a namespace? + +To solve this, Kubernetes provides the [Hierarchical Namespace Controller (HNC)](https://github.com/kubernetes-sigs/hierarchical-namespaces), which allows you to organize your namespaces into hierarchies, and share certain policies and resources between them. It also helps you manage namespace labels, namespace lifecycles, and delegated management, and share resource quotas across related namespaces. These capabilities can be useful in both multi-team and multi-customer scenarios. 
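As a hedged illustration, with HNC installed a child namespace can be requested by creating a SubnamespaceAnchor object in the parent namespace; the names below are placeholders, and the API version may differ between HNC releases:

```yaml
# Requires the Hierarchical Namespace Controller to be installed in the cluster.
# Creating this anchor in the parent namespace asks HNC to create and manage
# a child namespace named "team-a-billing" under "team-a".
apiVersion: hnc.x-k8s.io/v1alpha2
kind: SubnamespaceAnchor
metadata:
  name: team-a-billing
  namespace: team-a
```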
+ +Other projects that provide similar capabilities and aid in managing namespaced resources are listed below: + +#### Multi-team tenancy + +* [Capsule](https://github.com/clastix/capsule) +* [Kiosk](https://github.com/loft-sh/kiosk) + +#### Multi-customer tenancy + +* [Kubeplus](https://github.com/cloud-ark/kubeplus) + +#### Policy engines + +Policy engines provide features to validate and generate tenant configurations: + +* [Kyverno](https://kyverno.io/) +* [OPA/Gatekeeper](https://github.com/open-policy-agent/gatekeeper) + +### Virtual control plane per tenant + +Another form of control-plane isolation is to use Kubernetes extensions to provide each tenant a virtual control-plane that enables segmentation of cluster-wide API resources. [Data plane isolation](#data-plane-isolation) techniques can be used with this model to securely manage worker nodes across tenants. + +The virtual control plane based multi-tenancy model extends namespace-based multi-tenancy by providing each tenant with dedicated control plane components, and hence complete control over cluster-wide resources and add-on services. Worker nodes are shared across all tenants, and are managed by a Kubernetes cluster that is normally inaccessible to tenants. This cluster is often referred to as a _super-cluster_ (or sometimes as a _host-cluster_). Since a tenant’s control-plane is not directly associated with underlying compute resources it is referred to as a _virtual control plane_. + +A virtual control plane typically consists of the Kubernetes API server, the controller manager, and the etcd data store. It interacts with the super cluster via a metadata synchronization controller which coordinates changes across tenant control planes and the control plane of the super--cluster. + +By using per-tenant dedicated control planes, most of the isolation problems due to sharing one API server among all tenants are solved. Examples include noisy neighbors in the control plane, uncontrollable blast radius of policy misconfigurations, and conflicts between cluster scope objects such as webhooks and CRDs. Hence, the virtual control plane model is particularly suitable for cases where each tenant requires access to a Kubernetes API server and expects the full cluster manageability. + +The improved isolation comes at the cost of running and maintaining an individual virtual control plane per tenant. In addition, per-tenant control planes do not solve isolation problems in the data plane, such as node-level noisy neighbors or security threats. These must still be addressed separately. + +The Kubernetes [Cluster API - Nested (CAPN)](https://github.com/kubernetes-sigs/cluster-api-provider-nested/tree/main/virtualcluster) project provides an implementation of virtual control planes. + +#### Other implementations +* [Kamaji](https://github.com/clastix/kamaji) +* [vcluster](https://github.com/loft-sh/vcluster) + diff --git a/content/en/docs/reference/glossary/eviction.md b/content/en/docs/reference/glossary/eviction.md index 4437e43354..4a33a1b4ba 100644 --- a/content/en/docs/reference/glossary/eviction.md +++ b/content/en/docs/reference/glossary/eviction.md @@ -1,18 +1,18 @@ ---- -title: Eviction -id: eviction -date: 2021-05-08 -full_link: /docs/concepts/scheduling-eviction/ -short_description: > - Process of terminating one or more Pods on Nodes -aka: -tags: -- operation ---- - -Eviction is the process of terminating one or more Pods on Nodes. 
- - -There are two kinds of eviction: -* [Node-pressure eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/) -* [API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/) +--- +title: Eviction +id: eviction +date: 2021-05-08 +full_link: /docs/concepts/scheduling-eviction/ +short_description: > + Process of terminating one or more Pods on Nodes +aka: +tags: +- operation +--- + +Eviction is the process of terminating one or more Pods on Nodes. + + +There are two kinds of eviction: +* [Node-pressure eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/) +* [API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/) diff --git a/content/en/docs/tasks/administer-cluster/topology-manager.md b/content/en/docs/tasks/administer-cluster/topology-manager.md index 4002537f0c..fed109fdde 100644 --- a/content/en/docs/tasks/administer-cluster/topology-manager.md +++ b/content/en/docs/tasks/administer-cluster/topology-manager.md @@ -1,270 +1,270 @@ ---- -title: Control Topology Management Policies on a node - -reviewers: -- ConnorDoyle -- klueska -- lmdaly -- nolancon -- bg-chun - -content_type: task -min-kubernetes-server-version: v1.18 ---- - - - -{{< feature-state state="beta" for_k8s_version="v1.18" >}} - -An increasing number of systems leverage a combination of CPUs and hardware accelerators to support latency-critical execution and high-throughput parallel computation. These include workloads in fields such as telecommunications, scientific computing, machine learning, financial services and data analytics. Such hybrid systems comprise a high performance environment. - -In order to extract the best performance, optimizations related to CPU isolation, memory and device locality are required. However, in Kubernetes, these optimizations are handled by a disjoint set of components. - -_Topology Manager_ is a Kubelet component that aims to coordinate the set of components that are responsible for these optimizations. - - - -## {{% heading "prerequisites" %}} - - -{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}} - - - - - -## How Topology Manager Works - -Prior to the introduction of Topology Manager, the CPU and Device Manager in Kubernetes make resource allocation decisions independently of each other. -This can result in undesirable allocations on multiple-socketed systems, performance/latency sensitive applications will suffer due to these undesirable allocations. - Undesirable in this case meaning for example, CPUs and devices being allocated from different NUMA Nodes thus, incurring additional latency. - -The Topology Manager is a Kubelet component, which acts as a source of truth so that other Kubelet components can make topology aligned resource allocation choices. - -The Topology Manager provides an interface for components, called *Hint Providers*, to send and receive topology information. Topology Manager has a set of node level policies which are explained below. - -The Topology manager receives Topology information from the *Hint Providers* as a bitmask denoting NUMA Nodes available and a preferred allocation indication. The Topology Manager policies perform a set of operations on the hints provided and converge on the hint determined by the policy to give the optimal result, if an undesirable hint is stored the preferred field for the hint will be set to false. In the current policies preferred is the narrowest preferred mask. -The selected hint is stored as part of the Topology Manager. 
Depending on the policy configured the pod can be accepted or rejected from the node based on the selected hint. -The hint is then stored in the Topology Manager for use by the *Hint Providers* when making the resource allocation decisions. - -### Enable the Topology Manager feature - -Support for the Topology Manager requires `TopologyManager` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) to be enabled. It is enabled by default starting with Kubernetes 1.18. - -## Topology Manager Scopes and Policies - -The Topology Manager currently: - - Aligns Pods of all QoS classes. - - Aligns the requested resources that Hint Provider provides topology hints for. - -If these conditions are met, the Topology Manager will align the requested resources. - -In order to customise how this alignment is carried out, the Topology Manager provides two distinct knobs: `scope` and `policy`. - -The `scope` defines the granularity at which you would like resource alignment to be performed (e.g. at the `pod` or `container` level). And the `policy` defines the actual strategy used to carry out the alignment (e.g. `best-effort`, `restricted`, `single-numa-node`, etc.). - -Details on the various `scopes` and `policies` available today can be found below. - -{{< note >}} -To align CPU resources with other requested resources in a Pod Spec, the CPU Manager should be enabled and proper CPU Manager policy should be configured on a Node. See [control CPU Management Policies](/docs/tasks/administer-cluster/cpu-management-policies/). -{{< /note >}} - -{{< note >}} -To align memory (and hugepages) resources with other requested resources in a Pod Spec, the Memory Manager should be enabled and proper Memory Manager policy should be configured on a Node. Examine [Memory Manager](/docs/tasks/administer-cluster/memory-manager/) documentation. -{{< /note >}} - -### Topology Manager Scopes - -The Topology Manager can deal with the alignment of resources in a couple of distinct scopes: - -* `container` (default) -* `pod` - -Either option can be selected at a time of the kubelet startup, with `--topology-manager-scope` flag. - -### container scope - -The `container` scope is used by default. - -Within this scope, the Topology Manager performs a number of sequential resource alignments, i.e., for each container (in a pod) a separate alignment is computed. In other words, there is no notion of grouping the containers to a specific set of NUMA nodes, for this particular scope. In effect, the Topology Manager performs an arbitrary alignment of individual containers to NUMA nodes. - -The notion of grouping the containers was endorsed and implemented on purpose in the following scope, for example the `pod` scope. - -### pod scope - -To select the `pod` scope, start the kubelet with the command line option `--topology-manager-scope=pod`. - -This scope allows for grouping all containers in a pod to a common set of NUMA nodes. That is, the Topology Manager treats a pod as a whole and attempts to allocate the entire pod (all containers) to either a single NUMA node or a common set of NUMA nodes. The following examples illustrate the alignments produced by the Topology Manager on different occasions: - -* all containers can be and are allocated to a single NUMA node; -* all containers can be and are allocated to a shared set of NUMA nodes. 
- -The total amount of particular resource demanded for the entire pod is calculated according to [effective requests/limits](/docs/concepts/workloads/pods/init-containers/#resources) formula, and thus, this total value is equal to the maximum of: -* the sum of all app container requests, -* the maximum of init container requests, -for a resource. - -Using the `pod` scope in tandem with `single-numa-node` Topology Manager policy is specifically valuable for workloads that are latency sensitive or for high-throughput applications that perform IPC. By combining both options, you are able to place all containers in a pod onto a single NUMA node; hence, the inter-NUMA communication overhead can be eliminated for that pod. - -In the case of `single-numa-node` policy, a pod is accepted only if a suitable set of NUMA nodes is present among possible allocations. Reconsider the example above: - -* a set containing only a single NUMA node - it leads to pod being admitted, -* whereas a set containing more NUMA nodes - it results in pod rejection (because instead of one NUMA node, two or more NUMA nodes are required to satisfy the allocation). - -To recap, Topology Manager first computes a set of NUMA nodes and then tests it against Topology Manager policy, which either leads to the rejection or admission of the pod. - -### Topology Manager Policies - -Topology Manager supports four allocation policies. You can set a policy via a Kubelet flag, `--topology-manager-policy`. -There are four supported policies: - -* `none` (default) -* `best-effort` -* `restricted` -* `single-numa-node` - -{{< note >}} -If Topology Manager is configured with the **pod** scope, the container, which is considered by the policy, is reflecting requirements of the entire pod, and thus each container from the pod will result with **the same** topology alignment decision. -{{< /note >}} - -### none policy {#policy-none} - -This is the default policy and does not perform any topology alignment. - -### best-effort policy {#policy-best-effort} - -For each container in a Pod, the kubelet, with `best-effort` topology -management policy, calls each Hint Provider to discover their resource availability. -Using this information, the Topology Manager stores the -preferred NUMA Node affinity for that container. If the affinity is not preferred, -Topology Manager will store this and admit the pod to the node anyway. - -The *Hint Providers* can then use this information when making the -resource allocation decision. - -### restricted policy {#policy-restricted} - -For each container in a Pod, the kubelet, with `restricted` topology -management policy, calls each Hint Provider to discover their resource availability. -Using this information, the Topology Manager stores the -preferred NUMA Node affinity for that container. If the affinity is not preferred, -Topology Manager will reject this pod from the node. This will result in a pod in a `Terminated` state with a pod admission failure. - -Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to reschedule the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of the pod. -An external control loop could be also implemented to trigger a redeployment of pods that have the `Topology Affinity` error. - -If the pod is admitted, the *Hint Providers* can then use this information when making the -resource allocation decision. 
- -### single-numa-node policy {#policy-single-numa-node} - -For each container in a Pod, the kubelet, with `single-numa-node` topology -management policy, calls each Hint Provider to discover their resource availability. -Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. -If it is, Topology Manager will store this and the *Hint Providers* can then use this information when making the -resource allocation decision. -If, however, this is not possible then the Topology Manager will reject the pod from the node. This will result in a pod in a `Terminated` state with a pod admission failure. - -Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to reschedule the pod. It is recommended to use a Deployment with replicas to trigger a redeploy of the Pod. -An external control loop could be also implemented to trigger a redeployment of pods that have the `Topology Affinity` error. - -### Pod Interactions with Topology Manager Policies - -Consider the containers in the following pod specs: - -```yaml -spec: - containers: - - name: nginx - image: nginx -``` - -This pod runs in the `BestEffort` QoS class because no resource `requests` or -`limits` are specified. - -```yaml -spec: - containers: - - name: nginx - image: nginx - resources: - limits: - memory: "200Mi" - requests: - memory: "100Mi" -``` - -This pod runs in the `Burstable` QoS class because requests are less than limits. - -If the selected policy is anything other than `none`, Topology Manager would consider these Pod specifications. The Topology Manager would consult the Hint Providers to get topology hints. In the case of the `static`, the CPU Manager policy would return default topology hint, because these Pods do not have explicitly request CPU resources. - - -```yaml -spec: - containers: - - name: nginx - image: nginx - resources: - limits: - memory: "200Mi" - cpu: "2" - example.com/device: "1" - requests: - memory: "200Mi" - cpu: "2" - example.com/device: "1" -``` - -This pod with integer CPU request runs in the `Guaranteed` QoS class because `requests` are equal to `limits`. - - -```yaml -spec: - containers: - - name: nginx - image: nginx - resources: - limits: - memory: "200Mi" - cpu: "300m" - example.com/device: "1" - requests: - memory: "200Mi" - cpu: "300m" - example.com/device: "1" -``` - -This pod with sharing CPU request runs in the `Guaranteed` QoS class because `requests` are equal to `limits`. - - -```yaml -spec: - containers: - - name: nginx - image: nginx - resources: - limits: - example.com/deviceA: "1" - example.com/deviceB: "1" - requests: - example.com/deviceA: "1" - example.com/deviceB: "1" -``` -This pod runs in the `BestEffort` QoS class because there are no CPU and memory requests. - -The Topology Manager would consider the above pods. The Topology Manager would consult the Hint Providers, which are CPU and Device Manager to get topology hints for the pods. - -In the case of the `Guaranteed` pod with integer CPU request, the `static` CPU Manager policy would return topology hints relating to the exclusive CPU and the Device Manager would send back hints for the requested device. - -In the case of the `Guaranteed` pod with sharing CPU request, the `static` CPU Manager policy would return default topology hint as there is no exclusive CPU request and the Device Manager would send back hints for the requested device. - -In the above two cases of the `Guaranteed` pod, the `none` CPU Manager policy would return default topology hint. 
- -In the case of the `BestEffort` pod, the `static` CPU Manager policy would send back the default topology hint as there is no CPU request and the Device Manager would send back the hints for each of the requested devices. - -Using this information the Topology Manager calculates the optimal hint for the pod and stores this information, which will be used by the Hint Providers when they are making their resource assignments. - -### Known Limitations -1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes there will be a state explosion when trying to enumerate the possible NUMA affinities and generating their hints. - -2. The scheduler is not topology-aware, so it is possible to be scheduled on a node and then fail on the node due to the Topology Manager. +--- +title: Control Topology Management Policies on a node + +reviewers: +- ConnorDoyle +- klueska +- lmdaly +- nolancon +- bg-chun + +content_type: task +min-kubernetes-server-version: v1.18 +--- + + + +{{< feature-state state="beta" for_k8s_version="v1.18" >}} + +An increasing number of systems leverage a combination of CPUs and hardware accelerators to support latency-critical execution and high-throughput parallel computation. These include workloads in fields such as telecommunications, scientific computing, machine learning, financial services and data analytics. Such hybrid systems comprise a high performance environment. + +In order to extract the best performance, optimizations related to CPU isolation, memory and device locality are required. However, in Kubernetes, these optimizations are handled by a disjoint set of components. + +_Topology Manager_ is a Kubelet component that aims to coordinate the set of components that are responsible for these optimizations. + + + +## {{% heading "prerequisites" %}} + + +{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}} + + + + + +## How Topology Manager Works + +Prior to the introduction of Topology Manager, the CPU and Device Manager in Kubernetes make resource allocation decisions independently of each other. +This can result in undesirable allocations on multiple-socketed systems, performance/latency sensitive applications will suffer due to these undesirable allocations. + Undesirable in this case meaning for example, CPUs and devices being allocated from different NUMA Nodes thus, incurring additional latency. + +The Topology Manager is a Kubelet component, which acts as a source of truth so that other Kubelet components can make topology aligned resource allocation choices. + +The Topology Manager provides an interface for components, called *Hint Providers*, to send and receive topology information. Topology Manager has a set of node level policies which are explained below. + +The Topology manager receives Topology information from the *Hint Providers* as a bitmask denoting NUMA Nodes available and a preferred allocation indication. The Topology Manager policies perform a set of operations on the hints provided and converge on the hint determined by the policy to give the optimal result, if an undesirable hint is stored the preferred field for the hint will be set to false. In the current policies preferred is the narrowest preferred mask. +The selected hint is stored as part of the Topology Manager. Depending on the policy configured the pod can be accepted or rejected from the node based on the selected hint. 
+The hint is then stored in the Topology Manager for use by the *Hint Providers* when making the resource allocation decisions. + +### Enable the Topology Manager feature + +Support for the Topology Manager requires `TopologyManager` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) to be enabled. It is enabled by default starting with Kubernetes 1.18. + +## Topology Manager Scopes and Policies + +The Topology Manager currently: + - Aligns Pods of all QoS classes. + - Aligns the requested resources that Hint Provider provides topology hints for. + +If these conditions are met, the Topology Manager will align the requested resources. + +In order to customise how this alignment is carried out, the Topology Manager provides two distinct knobs: `scope` and `policy`. + +The `scope` defines the granularity at which you would like resource alignment to be performed (e.g. at the `pod` or `container` level). And the `policy` defines the actual strategy used to carry out the alignment (e.g. `best-effort`, `restricted`, `single-numa-node`, etc.). + +Details on the various `scopes` and `policies` available today can be found below. + +{{< note >}} +To align CPU resources with other requested resources in a Pod Spec, the CPU Manager should be enabled and proper CPU Manager policy should be configured on a Node. See [control CPU Management Policies](/docs/tasks/administer-cluster/cpu-management-policies/). +{{< /note >}} + +{{< note >}} +To align memory (and hugepages) resources with other requested resources in a Pod Spec, the Memory Manager should be enabled and proper Memory Manager policy should be configured on a Node. Examine [Memory Manager](/docs/tasks/administer-cluster/memory-manager/) documentation. +{{< /note >}} + +### Topology Manager Scopes + +The Topology Manager can deal with the alignment of resources in a couple of distinct scopes: + +* `container` (default) +* `pod` + +Either option can be selected at a time of the kubelet startup, with `--topology-manager-scope` flag. + +### container scope + +The `container` scope is used by default. + +Within this scope, the Topology Manager performs a number of sequential resource alignments, i.e., for each container (in a pod) a separate alignment is computed. In other words, there is no notion of grouping the containers to a specific set of NUMA nodes, for this particular scope. In effect, the Topology Manager performs an arbitrary alignment of individual containers to NUMA nodes. + +The notion of grouping the containers was endorsed and implemented on purpose in the following scope, for example the `pod` scope. + +### pod scope + +To select the `pod` scope, start the kubelet with the command line option `--topology-manager-scope=pod`. + +This scope allows for grouping all containers in a pod to a common set of NUMA nodes. That is, the Topology Manager treats a pod as a whole and attempts to allocate the entire pod (all containers) to either a single NUMA node or a common set of NUMA nodes. The following examples illustrate the alignments produced by the Topology Manager on different occasions: + +* all containers can be and are allocated to a single NUMA node; +* all containers can be and are allocated to a shared set of NUMA nodes. 
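+
+To select this scope together with the `single-numa-node` policy (described under Topology Manager Policies below), the kubelet could be started with both flags; this is a minimal sketch, with all other kubelet flags and configuration omitted:
+
+```bash
+# Select the pod scope and the single-numa-node policy
+# (all other kubelet flags and configuration are omitted for brevity)
+kubelet --topology-manager-scope=pod --topology-manager-policy=single-numa-node
+```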
+ +The total amount of particular resource demanded for the entire pod is calculated according to [effective requests/limits](/docs/concepts/workloads/pods/init-containers/#resources) formula, and thus, this total value is equal to the maximum of: +* the sum of all app container requests, +* the maximum of init container requests, +for a resource. + +Using the `pod` scope in tandem with `single-numa-node` Topology Manager policy is specifically valuable for workloads that are latency sensitive or for high-throughput applications that perform IPC. By combining both options, you are able to place all containers in a pod onto a single NUMA node; hence, the inter-NUMA communication overhead can be eliminated for that pod. + +In the case of `single-numa-node` policy, a pod is accepted only if a suitable set of NUMA nodes is present among possible allocations. Reconsider the example above: + +* a set containing only a single NUMA node - it leads to pod being admitted, +* whereas a set containing more NUMA nodes - it results in pod rejection (because instead of one NUMA node, two or more NUMA nodes are required to satisfy the allocation). + +To recap, Topology Manager first computes a set of NUMA nodes and then tests it against Topology Manager policy, which either leads to the rejection or admission of the pod. + +### Topology Manager Policies + +Topology Manager supports four allocation policies. You can set a policy via a Kubelet flag, `--topology-manager-policy`. +There are four supported policies: + +* `none` (default) +* `best-effort` +* `restricted` +* `single-numa-node` + +{{< note >}} +If Topology Manager is configured with the **pod** scope, the container, which is considered by the policy, is reflecting requirements of the entire pod, and thus each container from the pod will result with **the same** topology alignment decision. +{{< /note >}} + +### none policy {#policy-none} + +This is the default policy and does not perform any topology alignment. + +### best-effort policy {#policy-best-effort} + +For each container in a Pod, the kubelet, with `best-effort` topology +management policy, calls each Hint Provider to discover their resource availability. +Using this information, the Topology Manager stores the +preferred NUMA Node affinity for that container. If the affinity is not preferred, +Topology Manager will store this and admit the pod to the node anyway. + +The *Hint Providers* can then use this information when making the +resource allocation decision. + +### restricted policy {#policy-restricted} + +For each container in a Pod, the kubelet, with `restricted` topology +management policy, calls each Hint Provider to discover their resource availability. +Using this information, the Topology Manager stores the +preferred NUMA Node affinity for that container. If the affinity is not preferred, +Topology Manager will reject this pod from the node. This will result in a pod in a `Terminated` state with a pod admission failure. + +Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to reschedule the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of the pod. +An external control loop could be also implemented to trigger a redeployment of pods that have the `Topology Affinity` error. + +If the pod is admitted, the *Hint Providers* can then use this information when making the +resource allocation decision. 
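+
+As noted above for the `restricted` policy, an external control loop could redeploy pods that fail admission with a `Topology Affinity` error. A minimal sketch of the first step of such a loop is to list the pods that failed (the exact failure reason reported in the pod status may vary between Kubernetes versions):
+
+```bash
+# List failed pods across all namespaces; pods rejected by the Topology Manager
+# report an admission failure with a topology affinity related reason in their status
+kubectl get pods --all-namespaces --field-selector=status.phase=Failed
+```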
+
+### single-numa-node policy {#policy-single-numa-node}
+
+For each container in a Pod, the kubelet, with `single-numa-node` topology
+management policy, calls each Hint Provider to discover their resource availability.
+Using this information, the Topology Manager determines if a single NUMA Node affinity is possible.
+If it is, Topology Manager will store this and the *Hint Providers* can then use this information when making the
+resource allocation decision.
+If, however, this is not possible, then the Topology Manager will reject the pod from the node. This will result in a pod in a `Terminated` state with a pod admission failure.
+
+Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to reschedule the pod. It is recommended to use a Deployment with replicas to trigger a redeploy of the Pod.
+An external control loop could also be implemented to trigger a redeployment of pods that have the `Topology Affinity` error.
+
+### Pod Interactions with Topology Manager Policies
+
+Consider the containers in the following pod specs:
+
+```yaml
+spec:
+  containers:
+  - name: nginx
+    image: nginx
+```
+
+This pod runs in the `BestEffort` QoS class because no resource `requests` or
+`limits` are specified.
+
+```yaml
+spec:
+  containers:
+  - name: nginx
+    image: nginx
+    resources:
+      limits:
+        memory: "200Mi"
+      requests:
+        memory: "100Mi"
+```
+
+This pod runs in the `Burstable` QoS class because requests are less than limits.
+
+If the selected policy is anything other than `none`, the Topology Manager would consider these Pod specifications. The Topology Manager would consult the Hint Providers to get topology hints. In the case of the `static` CPU Manager policy, it would return the default topology hint, because these Pods do not explicitly request CPU resources.
+
+```yaml
+spec:
+  containers:
+  - name: nginx
+    image: nginx
+    resources:
+      limits:
+        memory: "200Mi"
+        cpu: "2"
+        example.com/device: "1"
+      requests:
+        memory: "200Mi"
+        cpu: "2"
+        example.com/device: "1"
+```
+
+This pod with an integer CPU request runs in the `Guaranteed` QoS class because `requests` are equal to `limits`.
+
+```yaml
+spec:
+  containers:
+  - name: nginx
+    image: nginx
+    resources:
+      limits:
+        memory: "200Mi"
+        cpu: "300m"
+        example.com/device: "1"
+      requests:
+        memory: "200Mi"
+        cpu: "300m"
+        example.com/device: "1"
+```
+
+This pod with a shared (fractional) CPU request runs in the `Guaranteed` QoS class because `requests` are equal to `limits`.
+
+```yaml
+spec:
+  containers:
+  - name: nginx
+    image: nginx
+    resources:
+      limits:
+        example.com/deviceA: "1"
+        example.com/deviceB: "1"
+      requests:
+        example.com/deviceA: "1"
+        example.com/deviceB: "1"
+```
+
+This pod runs in the `BestEffort` QoS class because there are no CPU or memory requests.
+
+The Topology Manager would consider the above pods. It would consult the Hint Providers, namely the CPU Manager and the Device Manager, to get topology hints for the pods.
+
+In the case of the `Guaranteed` pod with an integer CPU request, the `static` CPU Manager policy would return topology hints relating to the exclusive CPUs, and the Device Manager would send back hints for the requested device.
+
+In the case of the `Guaranteed` pod with a shared CPU request, the `static` CPU Manager policy would return the default topology hint, as there is no exclusive CPU request, and the Device Manager would send back hints for the requested device.
+
+In the above two cases of the `Guaranteed` pod, the `none` CPU Manager policy would return the default topology hint.
+
+In the case of the `BestEffort` pod, the `static` CPU Manager policy would send back the default topology hint, as there is no CPU request, and the Device Manager would send back hints for each of the requested devices.
+
+Using this information, the Topology Manager calculates the optimal hint for the pod and stores it, so that the Hint Providers can use it when making their resource assignments.
+
+### Known Limitations
+1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes, there will be a state explosion when trying to enumerate the possible NUMA affinities and generate their hints.
+
+2. The scheduler is not topology-aware, so it is possible for a pod to be scheduled on a node and then fail on that node due to the Topology Manager.