commit
56e78c2011
|
@ -1,269 +1,269 @@
|
|||
---
|
||||
layout: blog
|
||||
title: 'Using Finalizers to Control Deletion'
|
||||
date: 2021-05-14
|
||||
slug: using-finalizers-to-control-deletion
|
||||
---
|
||||
|
||||
**Authors:** Aaron Alpar (Kasten)
|
||||
|
||||
Deleting objects in Kubernetes can be challenging. You may think you’ve deleted something, only to find it still persists. While issuing a `kubectl delete` command and hoping for the best might work for day-to-day operations, understanding how Kubernetes `delete` commands operate will help you understand why some objects linger after deletion.
|
||||
|
||||
In this post, I’ll look at:
|
||||
|
||||
- What properties of a resource govern deletion
|
||||
- How finalizers and owner references impact object deletion
|
||||
- How the propagation policy can be used to change the order of deletions
|
||||
- How deletion works, with examples
|
||||
|
||||
For simplicity, all examples will use ConfigMaps and basic shell commands to demonstrate the process. We’ll explore how the commands work and discuss repercussions and results from using them in practice.
|
||||
|
||||
## The basic `delete`
|
||||
|
||||
Kubernetes has several different commands you can use that allow you to create, read, update, and delete objects. For the purpose of this blog post, we’ll focus on four `kubectl` commands: `create`, `get`, `patch`, and `delete`.
|
||||
|
||||
Here are examples of the basic `kubectl delete` command:
|
||||
|
||||
```
|
||||
kubectl create configmap mymap
|
||||
configmap/mymap created
|
||||
```
|
||||
|
||||
```
|
||||
kubectl get configmap/mymap
|
||||
NAME DATA AGE
|
||||
mymap 0 12s
|
||||
```
|
||||
|
||||
```
|
||||
kubectl delete configmap/mymap
|
||||
configmap "mymap" deleted
|
||||
```
|
||||
|
||||
```
|
||||
kubectl get configmap/mymap
|
||||
Error from server (NotFound): configmaps "mymap" not found
|
||||
```
|
||||
|
||||
In each block, a shell command is followed by its output. We begin with `kubectl create configmap mymap`, which creates the empty configmap `mymap`. Next, we `get` the configmap to prove it exists. We then delete that configmap. Attempting to `get` it again produces an HTTP 404 error, which means the configmap is not found.
|
||||
|
||||
The state diagram for the basic `delete` command is very simple:
|
||||
|
||||
|
||||
{{<figure width="495" src="/images/blog/2021-05-14-using-finalizers-to-control-deletion/state-diagram-delete.png" caption="State diagram for delete">}}
|
||||
|
||||
Although this operation is straightforward, other factors may interfere with the deletion, including finalizers and owner references.
|
||||
|
||||
## Understanding Finalizers
|
||||
|
||||
When it comes to understanding resource deletion in Kubernetes, knowing how finalizers work helps explain why some objects don’t get deleted.
|
||||
|
||||
Finalizers are keys on resources that signal pre-delete operations. They control the garbage collection on resources, and are designed to tell controllers which cleanup operations to perform prior to removing a resource. However, they don’t necessarily name code that should be executed; finalizers on resources are basically just lists of keys, much like annotations. Like annotations, they can be manipulated.
|
||||
|
||||
Some common finalizers you’ve likely encountered are:
|
||||
|
||||
- `kubernetes.io/pv-protection`
|
||||
- `kubernetes.io/pvc-protection`
|
||||
|
||||
The finalizers above are used on volumes to prevent accidental deletion. Similarly, some finalizers can be used to prevent deletion of any resource but are not managed by any controller.
|
||||
|
||||
Below is a custom configmap, which has no properties but contains a finalizer:
|
||||
|
||||
```
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: mymap
  finalizers:
  - kubernetes
EOF
```
|
||||
|
||||
The configmap resource controller doesn't understand what to do with the `kubernetes` finalizer key. I term such a key a “dead” finalizer for configmaps, as it is normally used on namespaces. Here’s what happens upon attempting to delete the configmap:
|
||||
|
||||
```
|
||||
kubectl delete configmap/mymap &
|
||||
configmap "mymap" deleted
|
||||
jobs
|
||||
[1]+ Running kubectl delete configmap/mymap
|
||||
```
|
||||
|
||||
Kubernetes will report back that the object has been deleted; however, it hasn’t been deleted in the traditional sense. Rather, it’s in the process of deletion. When we attempt to `get` that object again, we discover the object has been modified to include the deletion timestamp.
|
||||
|
||||
```
kubectl get configmap/mymap -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: "2020-10-22T21:30:18Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2020-10-22T21:30:34Z"
  finalizers:
  - kubernetes
  name: mymap
  namespace: default
  resourceVersion: "311456"
  selfLink: /api/v1/namespaces/default/configmaps/mymap
  uid: 93a37fed-23e3-45e8-b6ee-b2521db81638
```
|
||||
|
||||
In short, what’s happened is that the object was updated, not deleted. That’s because Kubernetes saw that the object contained finalizers and blocked removal of the object from etcd. The deletion timestamp signals that deletion was requested, but the deletion will not be complete until we edit the object and remove the finalizer.
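To check whether an object is stuck in this state, it can help to print just the two fields involved. The following is a minimal sketch, not part of the original walkthrough; `deletion_state` is a hypothetical helper name, and the exact formatting of the finalizer list depends on your kubectl version.

```shell
# Hypothetical helper: print the deletion timestamp and the finalizers of an
# object, separated by a tab. An empty field means the field is not set.
deletion_state() {
  # $1 is a resource reference such as configmap/mymap
  kubectl get "$1" \
    -o jsonpath='{.metadata.deletionTimestamp}{"\t"}{.metadata.finalizers}{"\n"}'
}

# Usage:
# deletion_state configmap/mymap
```

A non-empty timestamp together with a non-empty finalizer list indicates that deletion was requested and is waiting on those finalizers.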
|
||||
|
||||
Here's a demonstration of using the `patch` command to remove finalizers. If we want to delete an object, we can simply patch it on the command line to remove the finalizers. In this way, the deletion that was running in the background will complete and the object will be deleted. When we attempt to `get` that configmap, it will be gone.
|
||||
|
||||
```
kubectl patch configmap/mymap \
    --type json \
    --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'
configmap/mymap patched
[1]+ Done kubectl delete configmap/mymap

kubectl get configmap/mymap -o yaml
Error from server (NotFound): configmaps "mymap" not found
```
|
||||
|
||||
Here's a state diagram for finalization:
|
||||
|
||||
{{<figure width="617" src="/images/blog/2021-05-14-using-finalizers-to-control-deletion/state-diagram-finalize.png" caption="State diagram for finalize">}}
|
||||
|
||||
So, if you attempt to delete an object that has a finalizer on it, it will remain in finalization until the controller has removed the finalizer keys or the finalizers are removed using kubectl. Once the finalizer list is empty, the object can actually be reclaimed by Kubernetes and put into a queue to be deleted from the registry.
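Removing the whole finalizer list is a blunt tool. As a sketch of a narrower approach, a JSON Patch can remove a single entry and use a `test` operation so the patch is rejected if that entry is not the one you expect. The helper name `finalizer_remove_patch` is hypothetical, and the sketch assumes you already know the index of the finalizer in the list.

```shell
# Hypothetical helper: build a JSON Patch that removes one finalizer by index.
# The "test" operation makes the whole patch fail if the entry at that index
# is not the expected finalizer name.
# $1 is the list index, $2 is the expected finalizer name.
finalizer_remove_patch() {
  printf '[{"op":"test","path":"/metadata/finalizers/%s","value":"%s"},{"op":"remove","path":"/metadata/finalizers/%s"}]\n' \
    "$1" "$2" "$1"
}

# Usage (assumes "kubernetes" is the first finalizer on mymap):
# kubectl patch configmap/mymap --type json \
#   --patch "$(finalizer_remove_patch 0 kubernetes)"
```

This leaves any other finalizers in place, so controllers that still need to clean up are not bypassed.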
|
||||
|
||||
## Owner References
|
||||
|
||||
Owner references describe how groups of objects are related. They are properties on resources that specify the relationship to one another, so entire trees of resources can be deleted.
|
||||
|
||||
Finalizer rules are still processed when there are owner references. An owner reference identifies its owner by name and UID (along with the owner's `apiVersion` and `kind`, as in the example below). Owner references link resources within the same namespace, and the reference needs the owner's UID to work. Pods typically have owner references to the owning replica set. So, when deployments or stateful sets are deleted, the child replica sets and pods are deleted in the process.
|
||||
|
||||
Here are some examples of owner references and how they work. In the first example, we create a parent object first, then the child. The result is a very simple configmap that contains an owner reference to its parent:
|
||||
|
||||
```
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: mymap-parent
EOF
CM_UID=$(kubectl get configmap mymap-parent -o jsonpath="{.metadata.uid}")

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: mymap-child
  ownerReferences:
  - apiVersion: v1
    kind: ConfigMap
    name: mymap-parent
    uid: $CM_UID
EOF
```
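Before deleting anything, you may want to confirm that the child really carries the owner reference. The following is a small sketch using a JSONPath query; `owners_of` is a hypothetical helper name, not something kubectl provides.

```shell
# Hypothetical helper: print the names of the owners recorded on an object.
# Prints an empty line if the object has no owner references.
owners_of() {
  # $1 is a resource reference such as configmap/mymap-child
  kubectl get "$1" -o jsonpath='{.metadata.ownerReferences[*].name}{"\n"}'
}

# Usage:
# owners_of configmap/mymap-child
```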
|
||||
|
||||
Deleting the child object when an owner reference is involved does not delete the parent:
|
||||
|
||||
```
|
||||
kubectl get configmap
|
||||
NAME DATA AGE
|
||||
mymap-child 0 12m4s
|
||||
mymap-parent 0 12m4s
|
||||
|
||||
kubectl delete configmap/mymap-child
|
||||
configmap "mymap-child" deleted
|
||||
|
||||
kubectl get configmap
|
||||
NAME DATA AGE
|
||||
mymap-parent 0 12m10s
|
||||
```
|
||||
|
||||
In this example, we re-created the parent-child configmaps from above. This time we delete the parent (instead of the child), with the owner reference from the child to the parent still in place. When we `get` the configmaps afterward, none are left in the namespace:
|
||||
|
||||
```
|
||||
kubectl get configmap
|
||||
NAME DATA AGE
|
||||
mymap-child 0 10m2s
|
||||
mymap-parent 0 10m2s
|
||||
|
||||
kubectl delete configmap/mymap-parent
|
||||
configmap "mymap-parent" deleted
|
||||
|
||||
kubectl get configmap
|
||||
No resources found in default namespace.
|
||||
```
|
||||
|
||||
To sum things up, when there's an owner reference from a child to a parent, deleting the parent deletes the children automatically. This is called `cascade`. The default for cascade is `true`; however, you can use the `--cascade=orphan` option for `kubectl delete` to delete an object and orphan its children. *Update: starting with kubectl v1.20, the default for cascade is `background`.*
|
||||
|
||||
|
||||
In the following example, there is a parent and a child. Notice the owner references are still included. If I delete the parent using `--cascade=orphan`, the parent is deleted but the child still exists:
|
||||
|
||||
```
|
||||
kubectl get configmap
|
||||
NAME DATA AGE
|
||||
mymap-child 0 13m8s
|
||||
mymap-parent 0 13m8s
|
||||
|
||||
kubectl delete --cascade=orphan configmap/mymap-parent
|
||||
configmap "mymap-parent" deleted
|
||||
|
||||
kubectl get configmap
|
||||
NAME DATA AGE
|
||||
mymap-child 0 13m21s
|
||||
```
|
||||
|
||||
The `--cascade` option maps to the propagation policy in the API, which allows you to change the order in which objects are deleted within a tree. The following example uses API access to craft a custom delete API call with the background propagation policy:
|
||||
|
||||
```
kubectl proxy --port=8080 &
Starting to serve on 127.0.0.1:8080

curl -X DELETE \
  localhost:8080/api/v1/namespaces/default/configmaps/mymap-parent \
  -d '{ "kind":"DeleteOptions", "apiVersion":"v1", "propagationPolicy":"Background" }' \
  -H "Content-Type: application/json"
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Success",
  "details": { ... }
}
```
|
||||
|
||||
Note that with kubectl versions earlier than v1.20, the full propagation policy cannot be specified on the command line; you have to specify it using a custom API call. (Starting with v1.20, `--cascade` also accepts `background` and `foreground`.) To make the call yourself, create a proxy so that you have access to the API server from the client, then run a `curl` command against the object's URL with a `DeleteOptions` body.
|
||||
|
||||
There are three different options for the propagation policy:
|
||||
|
||||
- `Foreground`: Children are deleted before the parent (post-order)
|
||||
- `Background`: Parent is deleted before the children (pre-order)
|
||||
- `Orphan`: Owner references are ignored
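To experiment with the three policies listed above, the request body can be generated rather than typed by hand. This is a minimal sketch; `delete_options` is a hypothetical helper name, and the usage line assumes the `kubectl proxy` from the earlier example is still listening on `localhost:8080`.

```shell
# Hypothetical helper: print a DeleteOptions body for a propagation policy.
# Rejects anything other than the three policies the API accepts.
delete_options() {
  case "$1" in
    Foreground|Background|Orphan)
      printf '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"%s"}\n' "$1"
      ;;
    *)
      echo "unknown propagation policy: $1" >&2
      return 1
      ;;
  esac
}

# Usage (assumes kubectl proxy is serving on localhost:8080):
# curl -X DELETE \
#   localhost:8080/api/v1/namespaces/default/configmaps/mymap-parent \
#   -H "Content-Type: application/json" \
#   -d "$(delete_options Foreground)"
```

Checking the policy name locally catches typos before the request reaches the API server; the names are case-sensitive.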
|
||||
|
||||
Keep in mind that when you delete an object and owner references have been specified, finalizers will be honored in the process. This can result in trees of objects persisting, and you end up with a partial deletion. At that point, you have to look at any existing owner references on your objects, as well as any finalizers, to understand what’s happening.
|
||||
|
||||
## Forcing a Deletion of a Namespace
|
||||
|
||||
There's one situation that may require forcing finalization for a namespace. If you've deleted a namespace and you've cleaned out all of the objects under it, but the namespace still exists, deletion can be forced by updating the namespace subresource, `finalize`. This informs the namespace controller that it needs to remove the finalizer from the namespace and perform any cleanup:
|
||||
|
||||
```
cat <<EOF | curl -X PUT \
  localhost:8080/api/v1/namespaces/test/finalize \
  -H "Content-Type: application/json" \
  --data-binary @-
{
  "kind": "Namespace",
  "apiVersion": "v1",
  "metadata": {
    "name": "test"
  },
  "spec": {
    "finalizers": null
  }
}
EOF
```
|
||||
|
||||
This should be done with caution, as it may delete only the namespace and leave orphaned objects within the now non-existent namespace, a confusing state for Kubernetes. If this happens, the namespace can be re-created manually, and sometimes the orphaned objects will reappear under the just-created namespace, which allows manual cleanup and recovery.
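One way to reduce that risk is to look for leftover objects before forcing finalization. The sketch below walks every listable namespaced resource type and prints whatever still exists; `remaining_in_namespace` is a hypothetical helper name, and on a large cluster the loop can take a while.

```shell
# Hypothetical helper: list every object still present in a namespace.
# $1 is the namespace name, for example "test".
remaining_in_namespace() {
  kubectl api-resources --verbs=list --namespaced -o name |
    while read -r resource; do
      # --ignore-not-found keeps the output quiet for empty resource types
      kubectl get "$resource" --show-kind --ignore-not-found -n "$1"
    done
}

# Usage:
# remaining_in_namespace test
```

If this prints nothing, the namespace is empty and forcing finalization should not strand any objects.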
|
||||
|
||||
## Key Takeaways
|
||||
|
||||
As these examples demonstrate, finalizers can get in the way of deleting resources in Kubernetes, especially when there are parent-child relationships between objects. Often, there is a reason for adding a finalizer into the code, so you should always investigate before manually deleting it. Owner references allow you to specify and remove trees of resources, although finalizers will be honored in the process. Finally, the propagation policy can be used to specify the order of deletion via a custom API call, giving you control over how objects are deleted. Now that you know a little more about how deletions work in Kubernetes, we recommend you try it out on your own, using a test cluster.
|
||||
|
||||
{{< youtube class="youtube-quote-sm" id="F7-ZxWwf4sY" title="Clean Up Your Room! What Does It Mean to Delete Something in K8s">}}
|
||||
|
|
|
@ -1,271 +1,271 @@
|
|||
---
|
||||
title: Multi-tenancy
|
||||
content_type: concept
|
||||
weight: 70
|
||||
---
|
||||
|
||||
<!-- overview -->
|
||||
|
||||
This page provides an overview of available configuration options and best practices for cluster multi-tenancy.
|
||||
|
||||
Sharing clusters saves costs and simplifies administration. However, sharing clusters also presents challenges such as security, fairness, and managing _noisy neighbors_.
|
||||
|
||||
Clusters can be shared in many ways. In some cases, different applications may run in the same cluster. In other cases, multiple instances of the same application may run in the same cluster, one for each end user. All these types of sharing are frequently described using the umbrella term _multi-tenancy_.
|
||||
|
||||
While Kubernetes does not have first-class concepts of end users or tenants, it provides several features to help manage different tenancy requirements. These are discussed below.
|
||||
|
||||
<!-- body -->
|
||||
## Use cases
|
||||
|
||||
The first step to determining how to share your cluster is understanding your use case, so you can evaluate the patterns and tools available. In general, multi-tenancy in Kubernetes clusters falls into two broad categories, though many variations and hybrids are also possible.
|
||||
|
||||
### Multiple teams
|
||||
|
||||
A common form of multi-tenancy is to share a cluster between multiple teams within an organization, each of whom may operate one or more workloads. These workloads frequently need to communicate with each other, and with other workloads located on the same or different clusters.
|
||||
|
||||
In this scenario, members of the teams often have direct access to Kubernetes resources via tools such as `kubectl`, or indirect access through GitOps controllers or other types of release automation tools. There is often some level of trust between members of different teams, but Kubernetes policies such as RBAC, quotas, and network policies are essential to safely and fairly share clusters.
|
||||
|
||||
### Multiple customers
|
||||
|
||||
The other major form of multi-tenancy frequently involves a Software-as-a-Service (SaaS) vendor running multiple instances of a workload for customers. This business model is so strongly associated with this deployment style that many people call it "SaaS tenancy." However, a better term might be "multi-customer tenancy," since SaaS vendors may also use other deployment models, and this deployment model can also be used outside of SaaS.
|
||||
|
||||
|
||||
In this scenario, the customers do not have access to the cluster; Kubernetes is invisible from their perspective and is only used by the vendor to manage the workloads. Cost optimization is frequently a critical concern, and Kubernetes policies are used to ensure that the workloads are strongly isolated from each other.
|
||||
|
||||
|
||||
## Terminology
|
||||
|
||||
### Tenants
|
||||
|
||||
When discussing multi-tenancy in Kubernetes, there is no single definition for a "tenant". Rather, the definition of a tenant will vary depending on whether multi-team or multi-customer tenancy is being discussed.
|
||||
|
||||
In multi-team usage, a tenant is typically a team, where each team typically deploys a small number of workloads that scales with the complexity of the service. However, the definition of "team" may itself be fuzzy, as teams may be organized into higher-level divisions or subdivided into smaller teams.
|
||||
|
||||
|
||||
By contrast, if each team deploys dedicated workloads for each new client, they are using a multi-customer model of tenancy. In this case, a "tenant" is simply a group of users who share a single workload. This may be as large as an entire company, or as small as a single team at that company.
|
||||
|
||||
In many cases, the same organization may use both definitions of "tenants" in different contexts. For example, a platform team may offer shared services such as security tools and databases to multiple internal “customers” and a SaaS vendor may also have multiple teams sharing a development cluster. Finally, hybrid architectures are also possible, such as a SaaS provider using a combination of per-customer workloads for sensitive data, combined with multi-tenant shared services.
{{< figure src="/images/docs/multi-tenancy.png" title="A cluster showing coexisting tenancy models" class="diagram-large" >}}
### Isolation
There are several ways to design and build multi-tenant solutions with Kubernetes. Each of these methods comes with its own set of tradeoffs that impact the isolation level, implementation effort, operational complexity, and cost of service.
A Kubernetes cluster consists of a control plane which runs Kubernetes software, and a data plane consisting of worker nodes where tenant workloads are executed as pods. Tenant isolation can be applied in both the control plane and the data plane based on organizational requirements.
The level of isolation offered is sometimes described using terms like "hard" multi-tenancy, which implies strong isolation, and "soft" multi-tenancy, which implies weaker isolation. In particular, "hard" multi-tenancy is often used to describe cases where the tenants do not trust each other, often from security and resource sharing perspectives (e.g. guarding against attacks such as data exfiltration or DoS). Since data planes typically have much larger attack surfaces, "hard" multi-tenancy often requires extra attention to isolating the data plane, though control plane isolation also remains critical.
However, the terms "hard" and "soft" can often be confusing, as there is no single definition that will apply to all users. Rather, "hardness" or "softness" is better understood as a broad spectrum, with many different techniques that can be used to maintain different types of isolation in your clusters, based on your requirements.
In more extreme cases, it may be easier or necessary to forgo any cluster-level sharing at all and assign each tenant their own dedicated cluster, possibly even running on dedicated hardware if VMs are not considered an adequate security boundary. This may be easier with managed Kubernetes clusters, where the overhead of creating and operating clusters is at least somewhat taken on by a cloud provider. The benefit of stronger tenant isolation must be evaluated against the cost and complexity of managing multiple clusters. The [Multi-cluster SIG](https://git.k8s.io/community/sig-multicluster/README.md) is responsible for addressing these types of use cases.
The remainder of this page focuses on isolation techniques used for shared Kubernetes clusters. However, even if you are considering dedicated clusters, it may be valuable to review these recommendations, as it will give you the flexibility to shift to shared clusters in the future if your needs or capabilities change.
## Control plane isolation
Control plane isolation ensures that different tenants cannot access or affect each other's Kubernetes API resources.
### Namespaces
In Kubernetes, a {{< glossary_tooltip text="Namespace" term_id="namespace" >}} provides a mechanism for isolating groups of API resources within a single cluster. This isolation has two key dimensions:
1. Object names within a namespace can overlap with names in other namespaces, similar to files in folders. This allows tenants to name their resources without having to consider what other tenants are doing.

2. Many Kubernetes security policies are scoped to namespaces. For example, RBAC Roles and Network Policies are namespace-scoped resources. Using RBAC, Users and Service Accounts can be restricted to a namespace.
In a multi-tenant environment, a Namespace helps segment a tenant's workload into a logical and distinct management unit. In fact, a common practice is to isolate every workload in its own namespace, even if multiple workloads are operated by the same tenant. This ensures that each workload has its own identity and can be configured with an appropriate security policy.
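
As a minimal sketch of the one-namespace-per-workload practice, a Namespace can carry a label that identifies the owning tenant. The namespace name and the `tenant` label key below are hypothetical and chosen only for illustration; Kubernetes does not define a built-in tenant label.

```yaml
# Hypothetical namespace for a single workload owned by tenant "team-a".
apiVersion: v1
kind: Namespace
metadata:
  name: team-a-checkout      # assumed naming scheme: <tenant>-<workload>
  labels:
    tenant: team-a           # assumed label key, usable later in policy selectors
```

Labeling namespaces consistently in this way is one possible basis for the namespace-scoped policies discussed in the rest of this page.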
The namespace isolation model requires configuration of several other Kubernetes resources, networking plugins, and adherence to security best practices to properly isolate tenant workloads. These considerations are discussed below.
### Access controls
The most important type of isolation for the control plane is authorization. If teams or their workloads can access or modify each other's API resources, they can change or disable all other types of policies, thereby negating any protection those policies may offer. As a result, it is critical to ensure that each tenant has the appropriate access to only the namespaces they need, and no more. This is known as the "Principle of Least Privilege."
Role-based access control (RBAC) is commonly used to enforce authorization in the Kubernetes control plane, for both users and workloads (service accounts). [Roles](/docs/reference/access-authn-authz/rbac/#role-and-clusterrole) and [role bindings](/docs/reference/access-authn-authz/rbac/#rolebinding-and-clusterrolebinding) are Kubernetes objects that are used at a namespace level to enforce access control in your application; similar objects exist for authorizing access to cluster-level objects, though these are less useful for multi-tenant clusters.
In a multi-team environment, RBAC must be used to restrict tenants' access to the appropriate namespaces, and ensure that cluster-wide resources can only be accessed or modified by privileged users such as cluster administrators.
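
The following is a sketch of namespace-scoped RBAC for one tenant, not a recommended baseline. The namespace, the Role name, and the group name `team-a-developers` are assumptions made for illustration; the group would need to come from whatever authentication system your cluster uses.

```yaml
# Role: allows managing common workload resources, but only inside one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-developer
  namespace: team-a-checkout        # hypothetical tenant namespace
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "services", "configmaps", "deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# RoleBinding: grants the Role above to a group, within the same namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-developer-binding
  namespace: team-a-checkout
subjects:
- kind: Group
  name: team-a-developers           # hypothetical group name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-developer
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, members of the group gain no access to other tenants' namespaces or to cluster-scoped resources.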
If a policy ends up granting a user more permissions than they need, this is likely a signal that the namespace containing the affected resources should be refactored into finer-grained namespaces. Namespace management tools may simplify the management of these finer-grained namespaces by applying common RBAC policies to different namespaces, while still allowing fine-grained policies where necessary.
### Quotas
Kubernetes workloads consume node resources, like CPU and memory. In a multi-tenant environment, you can use
[Resource Quotas](/docs/concepts/policy/resource-quotas/) to manage resource usage of tenant workloads.
For the multiple teams use case, where tenants have access to the Kubernetes API, you can use resource quotas
to limit the number of API resources (for example: the number of Pods, or the number of ConfigMaps)
that a tenant can create. Limits on object count ensure fairness and aim to avoid _noisy neighbor_ issues from
affecting other tenants that share a control plane.
Resource quotas are namespaced objects. By mapping tenants to namespaces, cluster admins can use quotas to ensure that a tenant cannot monopolize a cluster's resources or overwhelm its control plane. Namespace management tools simplify the administration of quotas. In addition, while Kubernetes quotas only apply within a single namespace, some namespace management tools allow groups of namespaces to share quotas, giving administrators far more flexibility with less effort than built-in quotas.
Quotas prevent a single tenant from consuming more than their allocated share of resources, hence minimizing the "noisy neighbor" issue, where one tenant negatively impacts the performance of other tenants' workloads.
When you apply a quota to a namespace, Kubernetes requires you to also specify resource requests and limits for each container. Limits are the upper bound for the amount of resources that a container can consume. Containers that attempt to consume resources that exceed the configured limits will either be throttled or killed, based on the resource type. When resource requests are set lower than limits, each container is guaranteed the requested amount but there may still be some potential for impact across workloads.
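
A ResourceQuota covering both compute resources and object counts might look like the sketch below. The namespace name and all of the numbers are placeholders, not sizing guidance.

```yaml
# Hypothetical per-tenant quota; every value here is illustrative only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: team-a-checkout
spec:
  hard:
    requests.cpu: "4"          # sum of CPU requests across all pods in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"            # sum of CPU limits across all pods in the namespace
    limits.memory: 16Gi
    pods: "20"                 # object-count limits protect the control plane
    count/configmaps: "50"
```

If tenants should not have to set requests and limits on every container themselves, a LimitRange in the same namespace can supply default values.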
Quotas cannot protect against all kinds of resource sharing, such as network traffic. Node isolation (described below) may be a better solution for this problem.
## Data Plane Isolation
Data plane isolation ensures that pods and workloads for different tenants are sufficiently isolated.
### Network isolation
By default, all pods in a Kubernetes cluster are allowed to communicate with each other, and all network traffic is unencrypted. This can lead to security vulnerabilities where traffic is accidentally or maliciously sent to an unintended destination, or is intercepted by a workload on a compromised node.
Pod-to-pod communication can be controlled using [Network Policies](/docs/concepts/services-networking/network-policies/), which restrict communication between pods using namespace labels or IP address ranges. In a multi-tenant environment where strict network isolation between tenants is required, it is recommended to start with a default policy that denies communication between pods, together with another rule that allows all pods to query the DNS server for name resolution. With such a default policy in place, you can begin adding more permissive rules that allow for communication within a namespace. This scheme can be further refined as required. Note that this only applies to pods within a single control plane; pods that belong to different virtual control planes cannot talk to each other via Kubernetes networking.
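
The default-deny scheme can be sketched with three NetworkPolicy objects per tenant namespace. This example assumes that the cluster DNS service runs in the `kube-system` namespace and that namespaces carry the automatically applied `kubernetes.io/metadata.name` label; the tenant namespace name is hypothetical.

```yaml
# 1. Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a-checkout
spec:
  podSelector: {}              # empty selector matches all pods in the namespace
  policyTypes:
  - Ingress
  - Egress
---
# 2. Allow DNS lookups (assumes the DNS service lives in kube-system).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: team-a-checkout
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
---
# 3. Allow traffic between pods that share this namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-a-checkout
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}          # any pod in the same namespace
  egress:
  - to:
    - podSelector: {}
```

Network policies are additive, so the two allow rules widen the default-deny policy rather than conflict with it.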
Namespace management tools may simplify the creation of default or common network policies. In addition, some of these tools allow you to enforce a consistent set of namespace labels across your cluster, ensuring that they are a trusted basis for your policies.
{{< warning >}}
Network policies require a [CNI plugin](/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#cni) that supports the implementation of network policies. Otherwise, NetworkPolicy resources will be ignored.
{{< /warning >}}
More advanced network isolation may be provided by service meshes, which provide OSI Layer 7 policies based on workload identity, in addition to namespaces. These higher-level policies can make it easier to manage namespace-based multi-tenancy, especially when multiple namespaces are dedicated to a single tenant. They frequently also offer encryption using mutual TLS, protecting your data even in the presence of a compromised node, and work across dedicated or virtual clusters. However, they can be significantly more complex to manage and may not be appropriate for all users.
### Storage isolation
Kubernetes offers several types of volumes that can be used as persistent storage for workloads. For security and data isolation, [dynamic volume provisioning](/docs/concepts/storage/dynamic-provisioning/) is recommended, and volume types that use node resources should be avoided.
[StorageClasses](/docs/concepts/storage/storage-classes/) allow you to describe custom "classes" of storage offered by your cluster, based on quality-of-service levels, backup policies, or custom policies determined by the cluster administrators.
Pods can request storage using a [PersistentVolumeClaim](/docs/concepts/storage/persistent-volumes/). A PersistentVolumeClaim is a namespaced resource, which enables isolating portions of the storage system and dedicating it to tenants within the shared Kubernetes cluster. However, it is important to note that a PersistentVolume is a cluster-wide resource and has a lifecycle independent of workloads and namespaces.
For example, you can configure a separate StorageClass for each tenant and use this to strengthen isolation.
If a StorageClass is shared, you should set a [reclaim policy of `Delete`](/docs/concepts/storage/storage-classes/#reclaim-policy)
to ensure that a PersistentVolume cannot be reused across different namespaces.
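
A per-tenant StorageClass with a `Delete` reclaim policy, and a claim that uses it, could be sketched as follows. The provisioner name is a placeholder for whichever CSI driver your cluster actually runs, and the class, claim, and namespace names are hypothetical.

```yaml
# Hypothetical StorageClass dedicated to one tenant.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: team-a-standard
provisioner: csi.example.com             # placeholder; replace with a real CSI driver name
reclaimPolicy: Delete                    # volume is deleted when its claim is removed
volumeBindingMode: WaitForFirstConsumer
---
# Namespaced claim that dynamically provisions a volume from that class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: team-a-checkout
spec:
  storageClassName: team-a-standard
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```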
### Sandboxing containers
{{% thirdparty-content %}}
Kubernetes pods are composed of one or more containers that execute on worker nodes. Containers utilize OS-level virtualization and hence offer a weaker isolation boundary than virtual machines that utilize hardware-based virtualization.
In a shared environment, unpatched vulnerabilities in the application and system layers can be exploited by attackers for container breakouts and remote code execution that allow access to host resources. In some applications, like a Content Management System (CMS), customers may be allowed the ability to upload and execute untrusted scripts or code. In either case, mechanisms to further isolate and protect workloads using strong isolation are desirable.
Sandboxing provides a way to isolate workloads running in a shared cluster. It typically involves running each pod in a separate execution environment such as a virtual machine or a userspace kernel. Sandboxing is often recommended when you are running untrusted code, where workloads are assumed to be malicious. Part of the reason this type of isolation is necessary is because containers are processes running on a shared kernel; they mount file systems like /sys and /proc from the underlying host, making them less secure than an application that runs on a virtual machine which has its own kernel. While controls such as seccomp, AppArmor, and SELinux can be used to strengthen the security of containers, it is hard to apply a universal set of rules to all workloads running in a shared cluster. Running workloads in a sandbox environment helps to insulate the host from container escapes, where an attacker exploits a vulnerability to gain access to the host system and all the processes/files running on that host.
Virtual machines and userspace kernels are two popular approaches to sandboxing. The following sandboxing implementations are available:

* [gVisor](https://gvisor.dev/) intercepts syscalls from containers and runs them through a userspace kernel, written in Go, with limited access to the underlying host.
* [Kata Containers](https://katacontainers.io/) is an OCI compliant runtime that allows you to run containers in a VM. The hardware virtualization available in Kata offers an added layer of security for containers running untrusted code.
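
In Kubernetes, a sandboxed runtime is typically selected per pod through a RuntimeClass. The sketch below assumes that the nodes' container runtime has already been configured with a handler named `runsc` (the name commonly used for gVisor); the handler name, image, and object names are assumptions and must match your own node configuration.

```yaml
# RuntimeClass pointing at a sandboxed runtime handler configured on the nodes.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sandboxed
handler: runsc                     # assumed handler name; must exist in the CRI configuration
---
# Pod that opts in to the sandboxed runtime.
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
  namespace: team-a-checkout       # hypothetical tenant namespace
spec:
  runtimeClassName: sandboxed
  containers:
  - name: app
    image: registry.example.com/team-a/app:1.0   # placeholder image
```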
### Node Isolation
Node isolation is another technique that you can use to isolate tenant workloads from each other. With node isolation, a set of nodes is dedicated to running pods from a particular tenant and commingling of tenant pods is prohibited. This configuration reduces the noisy tenant issue, as all pods running on a node will belong to a single tenant. The risk of information disclosure is slightly lower with node isolation because an attacker who manages to escape from a container will only have access to the containers and volumes mounted to that node.
Although workloads from different tenants are running on different nodes, it is important to be aware that the kubelet and (unless using virtual control planes) the API service are still shared services. A skilled attacker could use the permissions assigned to the kubelet or other pods running on the node to move laterally within the cluster and gain access to tenant workloads running on other nodes. If this is a major concern, consider implementing compensating controls such as seccomp, AppArmor, or SELinux, or explore using sandboxed containers or creating separate clusters for each tenant.
Node isolation is a little easier to reason about from a billing standpoint than sandboxing containers since you can charge back per node rather than per pod. It also has fewer compatibility and performance issues and may be easier to implement than sandboxing containers. For example, nodes for each tenant can be configured with taints so that only pods with the corresponding toleration can run on them. A mutating webhook could then be used to automatically add tolerations and node affinities to pods deployed into tenant namespaces so that they run on a specific set of nodes designated for that tenant.
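
The taint-and-toleration approach can be sketched as follows, assuming the tenant's nodes have been both tainted and labeled with a hypothetical `tenant=team-a` key and value. In practice the toleration and affinity would usually be injected by the mutating webhook rather than written by hand.

```yaml
# Assumes the tenant's nodes were prepared with, for example:
#   kubectl taint nodes <node-name> tenant=team-a:NoSchedule
#   kubectl label nodes <node-name> tenant=team-a
apiVersion: v1
kind: Pod
metadata:
  name: tenant-pod
  namespace: team-a-checkout       # hypothetical tenant namespace
spec:
  tolerations:
  - key: tenant                    # lets the pod run on the tainted tenant nodes
    operator: Equal
    value: team-a
    effect: NoSchedule
  affinity:
    nodeAffinity:                  # keeps the pod off nodes of other tenants
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: tenant
            operator: In
            values: ["team-a"]
  containers:
  - name: app
    image: registry.example.com/team-a/app:1.0   # placeholder image
```

Both parts are needed: the taint keeps other tenants' pods off these nodes, while the node affinity keeps this tenant's pods from landing elsewhere.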
Node isolation can be implemented using [pod node selectors](/docs/concepts/scheduling-eviction/assign-pod-node/) or a [Virtual Kubelet](https://github.com/virtual-kubelet).
## Additional Considerations
This section discusses other Kubernetes constructs and patterns that are relevant for multi-tenancy.
### API Priority and Fairness
[API priority and fairness](/docs/concepts/cluster-administration/flow-control/) is a Kubernetes feature that allows you to assign a priority to API requests, including those made by certain pods running within the cluster. When an application calls the Kubernetes API, the API server classifies the request, for example by the service account the pod runs as, and evaluates the priority assigned to it. Calls with higher priority are fulfilled before those with a lower priority. When contention is high, lower priority calls can be queued until the server is less busy or you can reject the requests.
Using API priority and fairness will not be very common in SaaS environments unless you are allowing customers to run applications that interface with the Kubernetes API, e.g. a controller.
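
For the case where a tenant does run a controller against the API, the sketch below routes requests from one service account into a dedicated, limited priority level. It assumes a cluster that serves the `flowcontrol.apiserver.k8s.io/v1` API (older clusters serve beta versions instead); all names and numbers are hypothetical.

```yaml
# Priority level with a small share of API server concurrency and bounded queuing.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: tenant-controllers
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 10
    limitResponse:
      type: Queue                  # queue excess requests instead of rejecting them
      queuing:
        queues: 16
        handSize: 4
        queueLengthLimit: 50
---
# FlowSchema that matches requests from one tenant's controller service account.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: team-a-controller
spec:
  priorityLevelConfiguration:
    name: tenant-controllers
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: tenant-controller    # hypothetical service account
        namespace: team-a-checkout
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
      clusterScope: true
```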
### Quality-of-Service (QoS) {#qos}
When you’re running a SaaS application, you may want the ability to offer different Quality-of-Service (QoS) tiers of service to different tenants. For example, you may have a freemium service that comes with fewer performance guarantees and features and a for-fee service tier with specific performance guarantees. Fortunately, there are several Kubernetes constructs that can help you accomplish this within a shared cluster, including network QoS, storage classes, and pod priority and preemption. The idea with each of these is to provide tenants with the quality of service that they paid for. Let’s start by looking at networking QoS.
Typically, all pods on a node share a network interface. Without network QoS, some pods may consume an unfair share of the available bandwidth at the expense of other pods. The Kubernetes [bandwidth plugin](https://www.cni.dev/plugins/current/meta/bandwidth/) creates an [extended resource](/docs/concepts/configuration/manage-resources-containers/#extended-resources) for networking that allows you to use Kubernetes resource constructs, i.e. requests/limits, to apply rate limits to pods by using Linux tc queues. Be aware that the plugin is considered experimental as per the [Network Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#support-traffic-shaping) documentation and should be thoroughly tested before use in production environments.
For storage QoS, you will likely want to create different storage classes or profiles with different performance characteristics. Each storage profile can be associated with a different tier of service that is optimized for different workloads, such as IO, redundancy, or throughput. Additional logic might be necessary to allow the tenant to associate the appropriate storage profile with their workload.
Finally, there’s [pod priority and preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/), where you can assign priority values to pods. When scheduling pods, the scheduler will try evicting pods with lower priority when there are insufficient resources to schedule pods that are assigned a higher priority. If you have a use case where tenants have different service tiers in a shared cluster, e.g. free and paid, you may want to give higher priority to certain tiers using this feature.
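
A two-tier setup could be sketched with one PriorityClass per service tier. The class names and priority values here are arbitrary illustrations; only their relative order matters.

```yaml
# Higher priority for paying tenants.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: paid-tier
value: 100000
globalDefault: false
description: "Pods that belong to paying tenants."
---
# Lower priority for free-tier tenants; these pods never preempt others.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: free-tier
value: 1000
globalDefault: false
preemptionPolicy: Never
description: "Pods that belong to free-tier tenants."
---
# A pod selects its tier through priorityClassName.
apiVersion: v1
kind: Pod
metadata:
  name: paid-workload
  namespace: team-a-checkout       # hypothetical tenant namespace
spec:
  priorityClassName: paid-tier
  containers:
  - name: app
    image: registry.example.com/team-a/app:1.0   # placeholder image
```

When the cluster is full, the scheduler may evict `free-tier` pods to make room for pending `paid-tier` pods.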
### DNS
Kubernetes clusters include a Domain Name System (DNS) service to provide translations from names to IP addresses, for all Services and Pods. By default, the Kubernetes DNS service allows lookups across all namespaces in the cluster.
In multi-tenant environments where tenants can access pods and other Kubernetes resources, or where
stronger isolation is required, it may be necessary to prevent pods from looking up services in other
Namespaces.
You can restrict cross-namespace DNS lookups by configuring security rules for the DNS service.
For example, CoreDNS (the default DNS service for Kubernetes) can leverage Kubernetes metadata
to restrict queries to Pods and Services within a namespace. For more information, read an
[example](https://github.com/coredns/policy#kubernetes-metadata-multi-tenancy-policy) of configuring
this within the CoreDNS documentation.
When a [Virtual Control Plane per tenant](#virtual-control-plane-per-tenant) model is used, a DNS service must be configured per tenant or a multi-tenant DNS service must be used. Here is an example of a [customized version of CoreDNS](https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/main/virtualcluster/doc/tenant-dns.md) that supports multiple tenants.
### Operators
[Operators](/docs/concepts/extend-kubernetes/operator/) are Kubernetes controllers that manage applications. Operators can simplify the management of multiple instances of an application, like a database service, which makes them a common building block in the multi-consumer (SaaS) multi-tenancy use case.
Operators used in a multi-tenant environment should follow a stricter set of guidelines. Specifically, the Operator should:

* Support creating resources within different tenant namespaces, rather than just in the namespace in which the Operator is deployed.
* Ensure that the Pods are configured with resource requests and limits, to ensure scheduling and fairness.
* Support configuration of Pods for data-plane isolation techniques such as node isolation and sandboxed containers.
## Implementations
{{% thirdparty-content %}}
There are two primary ways to share a Kubernetes cluster for multi-tenancy: using Namespaces (i.e. a Namespace per tenant) or by virtualizing the control plane (i.e. a virtual control plane per tenant).
In both cases, data plane isolation and management of additional considerations such as API Priority and Fairness are also recommended.
Namespace isolation is well-supported by Kubernetes, has a negligible resource cost, and provides mechanisms to allow tenants to interact appropriately, such as by allowing service-to-service communication. However, it can be difficult to configure, and doesn't apply to Kubernetes resources that can't be namespaced, such as Custom Resource Definitions, Storage Classes, and Webhooks.
Control plane virtualization allows for isolation of non-namespaced resources at the cost of somewhat higher resource usage and more difficult cross-tenant sharing. It is a good option when namespace isolation is insufficient but dedicated clusters are undesirable, due to the high cost of maintaining them (especially on-prem) or due to their higher overhead and lack of resource sharing. However, even within a virtualized control plane, you will likely see benefits by using namespaces as well.
The two options are discussed in more detail in the following sections:
### Namespace per tenant
As previously mentioned, you should consider isolating each workload in its own namespace, even if you are using dedicated clusters or virtualized control planes. This ensures that each workload only has access to its own resources, such as ConfigMaps and Secrets, and allows you to tailor dedicated security policies for each workload. In addition, it is a best practice to give each namespace a name that is unique across your entire fleet (i.e., even if they are in separate clusters), as this gives you the flexibility to switch between dedicated and shared clusters in the future, or to use multi-cluster tooling such as service meshes.
Conversely, there are also advantages to assigning namespaces at the tenant level, not just the workload level, since there are often policies that apply to all workloads owned by a single tenant. However, this raises its own problems. Firstly, this makes it difficult or impossible to customize policies to individual workloads, and secondly, it may be challenging to come up with a single level of "tenancy" that should be given a namespace. For example, an organization may have divisions, teams, and subteams - which should be assigned a namespace?
To solve this, Kubernetes provides the [Hierarchical Namespace Controller (HNC)](https://github.com/kubernetes-sigs/hierarchical-namespaces), which allows you to organize your namespaces into hierarchies, and share certain policies and resources between them. It also helps you manage namespace labels, namespace lifecycles, and delegated management, and share resource quotas across related namespaces. These capabilities can be useful in both multi-team and multi-customer scenarios.
Other projects that provide similar capabilities and aid in managing namespaced resources are listed below:
#### Multi-team tenancy
* [Capsule](https://github.com/clastix/capsule)
* [Kiosk](https://github.com/loft-sh/kiosk)
#### Multi-customer tenancy
* [Kubeplus](https://github.com/cloud-ark/kubeplus)
#### Policy engines
Policy engines provide features to validate and generate tenant configurations:
* [Kyverno](https://kyverno.io/)
* [OPA/Gatekeeper](https://github.com/open-policy-agent/gatekeeper)
### Virtual control plane per tenant
Another form of control plane isolation is to use Kubernetes extensions to provide each tenant a virtual control plane that enables segmentation of cluster-wide API resources. [Data plane isolation](#data-plane-isolation) techniques can be used with this model to securely manage worker nodes across tenants.
The virtual control plane based multi-tenancy model extends namespace-based multi-tenancy by providing each tenant with dedicated control plane components, and hence complete control over cluster-wide resources and add-on services. Worker nodes are shared across all tenants, and are managed by a Kubernetes cluster that is normally inaccessible to tenants. This cluster is often referred to as a _super-cluster_ (or sometimes as a _host-cluster_). Since a tenant’s control plane is not directly associated with underlying compute resources, it is referred to as a _virtual control plane_.
A virtual control plane typically consists of the Kubernetes API server, the controller manager, and the etcd data store. It interacts with the super-cluster via a metadata synchronization controller which coordinates changes across tenant control planes and the control plane of the super-cluster.
By using per-tenant dedicated control planes, most of the isolation problems due to sharing one API server among all tenants are solved. Examples include noisy neighbors in the control plane, uncontrollable blast radius of policy misconfigurations, and conflicts between cluster scope objects such as webhooks and CRDs. Hence, the virtual control plane model is particularly suitable for cases where each tenant requires access to a Kubernetes API server and expects the full cluster manageability.
The improved isolation comes at the cost of running and maintaining an individual virtual control plane per tenant. In addition, per-tenant control planes do not solve isolation problems in the data plane, such as node-level noisy neighbors or security threats. These must still be addressed separately.
The Kubernetes [Cluster API - Nested (CAPN)](https://github.com/kubernetes-sigs/cluster-api-provider-nested/tree/main/virtualcluster) project provides an implementation of virtual control planes.
#### Other implementations

* [Kamaji](https://github.com/clastix/kamaji)
* [vcluster](https://github.com/loft-sh/vcluster)
---
title: Multi-tenancy
content_type: concept
weight: 70
---
<!-- overview -->
This page provides an overview of available configuration options and best practices for cluster multi-tenancy.
Sharing clusters saves costs and simplifies administration. However, sharing clusters also presents challenges such as security, fairness, and managing _noisy neighbors_.
Clusters can be shared in many ways. In some cases, different applications may run in the same cluster. In other cases, multiple instances of the same application may run in the same cluster, one for each end user. All these types of sharing are frequently described using the umbrella term _multi-tenancy_.
While Kubernetes does not have first-class concepts of end users or tenants, it provides several features to help manage different tenancy requirements. These are discussed below.
<!-- body -->
## Use cases
The first step to determining how to share your cluster is understanding your use case, so you can evaluate the patterns and tools available. In general, multi-tenancy in Kubernetes clusters falls into two broad categories, though many variations and hybrids are also possible.
### Multiple teams
A common form of multi-tenancy is to share a cluster between multiple teams within an organization, each of whom may operate one or more workloads. These workloads frequently need to communicate with each other, and with other workloads located on the same or different clusters.
In this scenario, members of the teams often have direct access to Kubernetes resources via tools such as `kubectl`, or indirect access through GitOps controllers or other types of release automation tools. There is often some level of trust between members of different teams, but Kubernetes policies such as RBAC, quotas, and network policies are essential to safely and fairly share clusters.
### Multiple customers
The other major form of multi-tenancy frequently involves a Software-as-a-Service (SaaS) vendor running multiple instances of a workload for customers. This business model is so strongly associated with this deployment style that many people call it "SaaS tenancy." However, a better term might be "multi-customer tenancy," since SaaS vendors may also use other deployment models, and this deployment model can also be used outside of SaaS.
In this scenario, the customers do not have access to the cluster; Kubernetes is invisible from their perspective and is only used by the vendor to manage the workloads. Cost optimization is frequently a critical concern, and Kubernetes policies are used to ensure that the workloads are strongly isolated from each other.
## Terminology
### Tenants
When discussing multi-tenancy in Kubernetes, there is no single definition for a "tenant". Rather, the definition of a tenant will vary depending on whether multi-team or multi-customer tenancy is being discussed.
In multi-team usage, a tenant is typically a team, where each team typically deploys a small number of workloads that scales with the complexity of the service. However, the definition of "team" may itself be fuzzy, as teams may be organized into higher-level divisions or subdivided into smaller teams.
By contrast, if each team deploys dedicated workloads for each new client, they are using a multi-customer model of tenancy. In this case, a "tenant" is simply a group of users who share a single workload. This may be as large as an entire company, or as small as a single team at that company.
|
||||
|
||||
In many cases, the same organization may use both definitions of "tenants" in different contexts. For example, a platform team may offer shared services such as security tools and databases to multiple internal “customers” and a SaaS vendor may also have multiple teams sharing a development cluster. Finally, hybrid architectures are also possible, such as a SaaS provider using a combination of per-customer workloads for sensitive data, combined with multi-tenant shared services.
|
||||
|
||||
|
||||
{{< figure src="/images/docs/multi-tenancy.png" title="A cluster showing coexisting tenancy models" class="diagram-large" >}}
|
||||
|
||||
|
||||
### Isolation
|
||||
|
||||
There are several ways to design and build multi-tenant solutions with Kubernetes. Each of these methods comes with its own set of tradeoffs that impact the isolation level, implementation effort, operational complexity, and cost of service.
|
||||
|
||||
|
||||
A Kubernetes cluster consists of a control plane which runs Kubernetes software, and a data plane consisting of worker nodes where tenant workloads are executed as pods. Tenant isolation can be applied in both the control plane and the data plane based on organizational requirements.
|
||||
|
||||
The level of isolation offered is sometimes described using terms like “hard” multi-tenancy, which implies strong isolation, and “soft” multi-tenancy, which implies weaker isolation. In particular, "hard" multi-tenancy is often used to describe cases where the tenants do not trust each other, often from security and resource sharing perspectives (e.g. guarding against attacks such as data exfiltration or DoS). Since data planes typically have much larger attack surfaces, "hard" multi-tenancy often requires extra attention to isolating the data-plane, though control plane isolation also remains critical.
|
||||
|
||||
However, the terms "hard" and "soft" can often be confusing, as there is no single definition that will apply to all users. Rather, "hardness" or "softness" is better understood as a broad spectrum, with many different techniques that can be used to maintain different types of isolation in your clusters, based on your requirements.
|
||||
|
||||
|
||||
In more extreme cases, it may be easier or necessary to forgo any cluster-level sharing at all and assign each tenant their dedicated cluster, possibly even running on dedicated hardware if VMs are not considered an adequate security boundary. This may be easier with managed Kubernetes clusters, where the overhead of creating and operating clusters is at least somewhat taken on by a cloud provider. The benefit of stronger tenant isolation must be evaluated against the cost and complexity of managing multiple clusters. The [Multi-cluster SIG](https://git.k8s.io/community/sig-multicluster/README.md) is responsible for addressing these types of use cases.
|
||||
|
||||
|
||||
|
||||
The remainder of this page focuses on isolation techniques used for shared Kubernetes clusters. However, even if you are considering dedicated clusters, it may be valuable to review these recommendations, as it will give you the flexibility to shift to shared clusters in the future if your needs or capabilities change.
|
||||
|
||||
|
||||
## Control plane isolation
|
||||
|
||||
Control plane isolation ensures that different tenants cannot access or affect each others' Kubernetes API resources.
|
||||
|
||||
### Namespaces
|
||||
|
||||
In Kubernetes, a {{< glossary_tooltip text="Namespace" term_id="namespace" >}} provides a mechanism for isolating groups of API resources within a single cluster. This isolation has two key dimensions:
|
||||
|
||||
1. Object names within a namespace can overlap with names in other namespaces, similar to files in folders. This allows tenants to name their resources without having to consider what other tenants are doing.
|
||||
|
||||
2. Many Kubernetes security policies are scoped to namespaces. For example, RBAC Roles and Network Policies are namespace-scoped resources. Using RBAC, Users and Service Accounts can be restricted to a namespace.
|
||||
|
||||
In a multi-tenant environment, a Namespace helps segment a tenant's workload into a logical and distinct management unit. In fact, a common practice is to isolate every workload in its own namespace, even if multiple workloads are operated by the same tenant. This ensures that each workload has its own identity and can be configured with an appropriate security policy.
|
||||
|
||||
The namespace isolation model requires configuration of several other Kubernetes resources, networking plugins, and adherence to security best practices to properly isolate tenant workloads. These considerations are discussed below.
|
||||
|
||||
### Access controls
|
||||
|
||||
The most important type of isolation for the control plane is authorization. If teams or their workloads can access or modify each others' API resources, they can change or disable all other types of policies, thereby negating any protection those policies may offer. As a result, it is critical to ensure that each tenant has the appropriate access to only the namespaces they need, and no more. This is known as the "Principle of Least Privilege."
|
||||
|
||||
|
||||
Role-based access control (RBAC) is commonly used to enforce authorization in the Kubernetes control plane, for both users and workloads (service accounts). [Roles](/docs/reference/access-authn-authz/rbac/#role-and-clusterrole) and [role bindings](/docs/reference/access-authn-authz/rbac/#rolebinding-and-clusterrolebinding) are Kubernetes objects that are used at a namespace level to enforce access control in your application; similar objects exist for authorizing access to cluster-level objects, though these are less useful for multi-tenant clusters.
|
||||
|
||||
In a multi-team environment, RBAC must be used to restrict tenants' access to the appropriate namespaces, and ensure that cluster-wide resources can only be accessed or modified by privileged users such as cluster administrators.
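
As an illustration of namespace-scoped access, the following is a minimal sketch of a Role and RoleBinding. The namespace `tenant-a`, the group `tenant-a-devs`, and the chosen resources and verbs are assumptions made for this example; adapt them to how your identity provider and tenants are set up.

```yaml
# Hypothetical example: grants members of the group "tenant-a-devs"
# access to common workload resources, but only inside "tenant-a".
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-a-developer
  namespace: tenant-a
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "services", "configmaps", "deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-developers
  namespace: tenant-a
subjects:
- kind: Group
  name: tenant-a-devs          # assumed group name from your authenticator
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-a-developer
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, the binding grants nothing outside `tenant-a`, and no ClusterRole or ClusterRoleBinding is involved.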
|
||||
|
||||
If a policy ends up granting a user more permissions than they need, this is likely a signal that the namespace containing the affected resources should be refactored into finer-grained namespaces. Namespace management tools may simplify the management of these finer-grained namespaces by applying common RBAC policies to different namespaces, while still allowing fine-grained policies where necessary.
|
||||
|
||||
### Quotas
|
||||
|
||||
Kubernetes workloads consume node resources, like CPU and memory. In a multi-tenant environment, you can use
[Resource Quotas](/docs/concepts/policy/resource-quotas/) to manage resource usage of tenant workloads.
For the multiple teams use case, where tenants have access to the Kubernetes API, you can use resource quotas
to limit the number of API resources (for example: the number of Pods, or the number of ConfigMaps)
that a tenant can create. Limits on object count ensure fairness and aim to prevent _noisy neighbor_ issues from
affecting other tenants that share a control plane.
|
||||
|
||||
Resource quotas are namespaced objects. By mapping tenants to namespaces, cluster admins can use quotas to ensure that a tenant cannot monopolize a cluster's resources or overwhelm its control plane. Namespace management tools simplify the administration of quotas. In addition, while Kubernetes quotas only apply within a single namespace, some namespace management tools allow groups of namespaces to share quotas, giving administrators far more flexibility with less effort than built-in quotas.
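
A minimal sketch of such a quota might look like the following. The namespace name and all of the numbers are assumptions chosen for illustration, not recommended values.

```yaml
# Hypothetical quota for the namespace "tenant-a": caps aggregate
# compute requests and limits, and the number of selected API objects.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
    count/configmaps: "100"
```

The compute entries bound what the tenant's workloads can consume on nodes, while the object-count entries bound the load the tenant can place on the control plane.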
|
||||
|
||||
Quotas prevent a single tenant from consuming more than their allocated share of resources, hence minimizing the “noisy neighbor” issue, where one tenant negatively impacts the performance of other tenants' workloads.
|
||||
|
||||
When you apply a quota to a namespace, Kubernetes requires you to also specify resource requests and limits for each container. Limits are the upper bound for the amount of resources that a container can consume. Containers that attempt to consume resources that exceed the configured limits will either be throttled or killed, based on the resource type. When resource requests are set lower than limits, each container is guaranteed the requested amount but there may still be some potential for impact across workloads.
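
One way to keep tenants from having to set these values on every container is a LimitRange that supplies defaults. The sketch below assumes a namespace called `tenant-a`, and the values are placeholders.

```yaml
# Hypothetical defaults for containers in "tenant-a" that omit
# their own requests or limits.
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-a-defaults
  namespace: tenant-a
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container sets no request
      cpu: 100m
      memory: 128Mi
    default:               # applied when a container sets no limit
      cpu: 500m
      memory: 512Mi
```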
|
||||
|
||||
Quotas cannot protect against all kinds of resource sharing, such as network traffic. Node isolation (described below) may be a better solution for this problem.
|
||||
|
||||
## Data plane isolation
|
||||
|
||||
Data plane isolation ensures that pods and workloads for different tenants are sufficiently isolated.
|
||||
|
||||
### Network isolation
|
||||
|
||||
By default, all pods in a Kubernetes cluster are allowed to communicate with each other, and all network traffic is unencrypted. This can lead to security vulnerabilities where traffic is accidentally or maliciously sent to an unintended destination, or is intercepted by a workload on a compromised node.
|
||||
|
||||
Pod-to-pod communication can be controlled using [Network Policies](/docs/concepts/services-networking/network-policies/), which restrict communication between pods using namespace labels or IP address ranges. In a multi-tenant environment where strict network isolation between tenants is required, it is recommended to start with a default policy that denies communication between pods, together with another rule that allows all pods to query the DNS server for name resolution. With such a default policy in place, you can begin adding more permissive rules that allow for communication within a namespace. This scheme can be further refined as required. Note that this only applies to pods within a single control plane; pods that belong to different virtual control planes cannot talk to each other via Kubernetes networking.
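
The scheme described above can be sketched with three policies. This is an illustrative example only: the namespace `tenant-a` is hypothetical, and the DNS rule assumes the cluster DNS pods run in `kube-system` with the label `k8s-app: kube-dns`, which is common but not guaranteed.

```yaml
# 1. Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# 2. Allow every pod to reach the cluster DNS service.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns      # assumed label on the DNS pods
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
---
# 3. Allow traffic between pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: tenant-a
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}
  egress:
  - to:
    - podSelector: {}
```

Network policies are additive, so the two allow policies open specific paths on top of the default deny.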
|
||||
|
||||
Namespace management tools may simplify the creation of default or common network policies. In addition, some of these tools allow you to enforce a consistent set of namespace labels across your cluster, ensuring that they are a trusted basis for your policies.
|
||||
|
||||
{{< warning >}}
|
||||
Network policies require a [CNI plugin](/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#cni) that supports the implementation of network policies. Otherwise, NetworkPolicy resources will be ignored.
|
||||
{{< /warning >}}
|
||||
|
||||
More advanced network isolation may be provided by service meshes, which provide OSI Layer 7 policies based on workload identity, in addition to namespaces. These higher-level policies can make it easier to manage namespace-based multi-tenancy, especially when multiple namespaces are dedicated to a single tenant. They frequently also offer encryption using mutual TLS, protecting your data even in the presence of a compromised node, and work across dedicated or virtual clusters. However, they can be significantly more complex to manage and may not be appropriate for all users.
|
||||
|
||||
### Storage isolation
|
||||
|
||||
Kubernetes offers several types of volumes that can be used as persistent storage for workloads. For security and data-isolation, [dynamic volume provisioning](/docs/concepts/storage/dynamic-provisioning/) is recommended and volume types that use node resources should be avoided.
|
||||
|
||||
[StorageClasses](/docs/concepts/storage/storage-classes/) allow you to describe custom "classes" of storage offered by your cluster, based on quality-of-service levels, backup policies, or custom policies determined by the cluster administrators.
|
||||
|
||||
Pods can request storage using a [PersistentVolumeClaim](/docs/concepts/storage/persistent-volumes/). A PersistentVolumeClaim is a namespaced resource, which enables isolating portions of the storage system and dedicating it to tenants within the shared Kubernetes cluster. However, it is important to note that a PersistentVolume is a cluster-wide resource and has a lifecycle independent of workloads and namespaces.
|
||||
|
||||
For example, you can configure a separate StorageClass for each tenant and use this to strengthen isolation.
If a StorageClass is shared, you should set a [reclaim policy of `Delete`](/docs/concepts/storage/storage-classes/#reclaim-policy)
to ensure that a PersistentVolume cannot be reused across different namespaces.
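
As a rough sketch of a per-tenant StorageClass and a claim that uses it, assuming a hypothetical CSI driver named `csi.example.com` and a namespace called `tenant-a`:

```yaml
# Hypothetical per-tenant StorageClass. Replace the provisioner with
# the CSI driver that is actually installed in your cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: tenant-a-standard
provisioner: csi.example.com     # placeholder driver name
reclaimPolicy: Delete            # volume is removed when the claim is deleted
volumeBindingMode: WaitForFirstConsumer
---
# A claim in the tenant's namespace that requests storage from that class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: tenant-a
spec:
  storageClassName: tenant-a-standard
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

StorageClasses are cluster-scoped, so a StorageClass on its own does not stop other tenants from referencing it; restricting which classes a namespace may use requires additional policy, such as a quota or an admission policy.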
|
||||
|
||||
### Sandboxing containers
|
||||
|
||||
{{% thirdparty-content %}}
|
||||
|
||||
Kubernetes pods are composed of one or more containers that execute on worker nodes. Containers utilize OS-level virtualization and hence offer a weaker isolation boundary than virtual machines that utilize hardware-based virtualization.
|
||||
|
||||
In a shared environment, unpatched vulnerabilities in the application and system layers can be exploited by attackers for container breakouts and remote code execution that allow access to host resources. In some applications, like a Content Management System (CMS), customers may be allowed the ability to upload and execute untrusted scripts or code. In either case, mechanisms to further isolate and protect workloads using strong isolation are desirable.
|
||||
|
||||
Sandboxing provides a way to isolate workloads running in a shared cluster. It typically involves running each pod in a separate execution environment such as a virtual machine or a userspace kernel. Sandboxing is often recommended when you are running untrusted code, where workloads are assumed to be malicious. Part of the reason this type of isolation is necessary is because containers are processes running on a shared kernel; they mount file systems like /sys and /proc from the underlying host, making them less secure than an application that runs on a virtual machine which has its own kernel. While controls such as seccomp, AppArmor, and SELinux can be used to strengthen the security of containers, it is hard to apply a universal set of rules to all workloads running in a shared cluster. Running workloads in a sandbox environment helps to insulate the host from container escapes, where an attacker exploits a vulnerability to gain access to the host system and all the processes/files running on that host.
|
||||
|
||||
Virtual machines and userspace kernels are two popular approaches to sandboxing. The following sandboxing implementations are available:
|
||||
* [gVisor](https://gvisor.dev/) intercepts syscalls from containers and runs them through a userspace kernel, written in Go, with limited access to the underlying host.
|
||||
* [Kata Containers](https://katacontainers.io/) is an OCI compliant runtime that allows you to run containers in a VM. The hardware virtualization available in Kata offers an added layer of security for containers running untrusted code.
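
Sandboxed runtimes are typically selected per pod through a RuntimeClass. The sketch below assumes that the nodes' container runtime has been configured with a handler named `runsc` (the name commonly used for gVisor); the handler name, namespace, and image are assumptions for illustration.

```yaml
# Hypothetical RuntimeClass that maps to a gVisor handler configured
# in the container runtime on the nodes.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc                   # must match the handler configured on the node
---
# A pod that opts in to the sandboxed runtime.
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
  namespace: tenant-a
spec:
  runtimeClassName: gvisor
  containers:
  - name: app
    image: registry.example.com/tenant-a/app:1.0   # placeholder image
```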
|
||||
|
||||
### Node isolation
|
||||
|
||||
Node isolation is another technique that you can use to isolate tenant workloads from each other. With node isolation, a set of nodes is dedicated to running pods from a particular tenant and co-mingling of tenant pods is prohibited. This configuration reduces the noisy tenant issue, as all pods running on a node will belong to a single tenant. The risk of information disclosure is slightly lower with node isolation because an attacker that manages to escape from a container will only have access to the containers and volumes mounted to that node.
|
||||
|
||||
Although workloads from different tenants are running on different nodes, it is important to be aware that the kubelet and (unless using virtual control planes) the API service are still shared services. A skilled attacker could use the permissions assigned to the kubelet or other pods running on the node to move laterally within the cluster and gain access to tenant workloads running on other nodes. If this is a major concern, consider implementing compensating controls such as seccomp, AppArmor or SELinux or explore using sandboxed containers or creating separate clusters for each tenant.
|
||||
|
||||
Node isolation is a little easier to reason about from a billing standpoint than sandboxing containers since you can charge back per node rather than per pod. It also has fewer compatibility and performance issues and may be easier to implement than sandboxing containers. For example, nodes for each tenant can be configured with taints so that only pods with the corresponding toleration can run on them. A mutating webhook could then be used to automatically add tolerations and node affinities to pods deployed into tenant namespaces so that they run on a specific set of nodes designated for that tenant.
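
A minimal sketch of the taint and toleration approach follows. The label and taint key `example.com/tenant` is a hypothetical name, and the example assumes the tenant's nodes have already been labeled and tainted with it.

```yaml
# Assumes the tenant's nodes carry the label example.com/tenant=tenant-a
# and the taint example.com/tenant=tenant-a:NoSchedule.
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: tenant-a
spec:
  nodeSelector:
    example.com/tenant: tenant-a   # only schedule onto the tenant's nodes
  tolerations:
  - key: example.com/tenant        # tolerate the taint that keeps other tenants off
    operator: Equal
    value: tenant-a
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/tenant-a/app:1.0   # placeholder image
```

The taint keeps other tenants' pods off the dedicated nodes, while the node selector keeps this tenant's pods from landing elsewhere; both halves are needed. A mutating webhook, as described above, would inject these two fields rather than relying on tenants to add them.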
|
||||
|
||||
Node isolation can be implemented using [pod node selectors](/docs/concepts/scheduling-eviction/assign-pod-node/) or a [Virtual Kubelet](https://github.com/virtual-kubelet).
|
||||
|
||||
## Additional considerations
|
||||
|
||||
This section discusses other Kubernetes constructs and patterns that are relevant for multi-tenancy.
|
||||
|
||||
### API Priority and Fairness
|
||||
|
||||
[API priority and fairness](/docs/concepts/cluster-administration/flow-control/) is a Kubernetes feature that allows you to assign a priority to certain pods running within the cluster. When an application calls the Kubernetes API, the API server evaluates the priority assigned to the pod. Calls from pods with higher priority are fulfilled before those with a lower priority. When contention is high, lower-priority calls can be queued until the server is less busy, or the requests can be rejected.
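
In practice, priorities are attached to API requests through FlowSchema objects, which match requests (for example, by the service account a pod uses) and route them to a priority level. The following is a sketch under several assumptions: the namespace `tenant-a` and the schema name are hypothetical, the `workload-low` priority level is assumed to exist in the cluster, and the `flowcontrol.apiserver.k8s.io/v1` API version is only served by recent Kubernetes releases (older clusters use a beta version).

```yaml
# Hypothetical FlowSchema: routes API requests made by any service
# account in "tenant-a" to the "workload-low" priority level.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: tenant-a-workloads
spec:
  priorityLevelConfiguration:
    name: workload-low           # assumed to exist in the cluster
  matchingPrecedence: 1000       # lower numbers are evaluated first
  distinguisherMethod:
    type: ByUser                 # queue each requesting user separately
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: "*"
        namespace: tenant-a
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
      clusterScope: true
```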
|
||||
|
||||
Using API priority and fairness will not be very common in SaaS environments unless you are allowing customers to run applications that interface with the Kubernetes API, e.g. a controller.
|
||||
|
||||
### Quality-of-Service (QoS) {#qos}
|
||||
|
||||
When you’re running a SaaS application, you may want the ability to offer different Quality-of-Service (QoS) tiers of service to different tenants. For example, you may have a freemium service that comes with fewer performance guarantees and features and a for-fee service tier with specific performance guarantees. Fortunately, there are several Kubernetes constructs that can help you accomplish this within a shared cluster, including network QoS, storage classes, and pod priority and preemption. The idea with each of these is to provide tenants with the quality of service that they paid for. Let’s start by looking at networking QoS.
|
||||
|
||||
Typically, all pods on a node share a network interface. Without network QoS, some pods may consume an unfair share of the available bandwidth at the expense of other pods. The Kubernetes [bandwidth plugin](https://www.cni.dev/plugins/current/meta/bandwidth/) creates an [extended resource](/docs/concepts/configuration/manage-resources-containers/#extended-resources) for networking that allows you to use Kubernetes resource constructs, i.e. requests/limits, to apply rate limits to pods by using Linux tc queues. Be aware that the plugin is considered experimental as per the [Network Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#support-traffic-shaping) documentation and should be thoroughly tested before use in production environments.
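
The Network Plugins documentation linked above describes traffic shaping through pod annotations. A minimal sketch, assuming the bandwidth plugin has been added to the CNI configuration on the node, and using placeholder names and rates:

```yaml
# Hypothetical pod in a lower service tier with capped bandwidth.
# The annotations only take effect if the CNI bandwidth plugin is enabled.
apiVersion: v1
kind: Pod
metadata:
  name: free-tier-app
  namespace: tenant-a
  annotations:
    kubernetes.io/ingress-bandwidth: 10M
    kubernetes.io/egress-bandwidth: 10M
spec:
  containers:
  - name: app
    image: registry.example.com/tenant-a/app:1.0   # placeholder image
```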
|
||||
|
||||
For storage QoS, you will likely want to create different storage classes or profiles with different performance characteristics. Each storage profile can be associated with a different tier of service that is optimized for different workloads such as IO, redundancy, or throughput. Additional logic might be necessary to allow the tenant to associate the appropriate storage profile with their workload.
|
||||
|
||||
Finally, there’s [pod priority and preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/) where you can assign priority values to pods. When scheduling pods, the scheduler will try evicting pods with lower priority when there are insufficient resources to schedule pods that are assigned a higher priority. If you have a use case where tenants have different service tiers in a shared cluster e.g. free and paid, you may want to give higher priority to certain tiers using this feature.
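
For the free and paid tiers mentioned above, a sketch could look like this. The class names, values, and namespace are assumptions for illustration; only the relative order of the values matters to the scheduler.

```yaml
# Hypothetical priority classes for two service tiers.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: paid-tier
value: 100000
globalDefault: false
description: "Workloads for paying tenants; may preempt lower-priority pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: free-tier
value: 1000
globalDefault: false
description: "Best-effort workloads for free-tier tenants."
---
# A pod that is scheduled with the higher priority.
apiVersion: v1
kind: Pod
metadata:
  name: paid-app
  namespace: tenant-a
spec:
  priorityClassName: paid-tier
  containers:
  - name: app
    image: registry.example.com/tenant-a/app:1.0   # placeholder image
```

Because any tenant who can create pods can also name a priority class, you would normally pair this with a quota or admission policy that limits which classes each namespace may use.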
|
||||
|
||||
### DNS
|
||||
|
||||
Kubernetes clusters include a Domain Name System (DNS) service to provide translations from names to IP addresses, for all Services and Pods. By default, the Kubernetes DNS service allows lookups across all namespaces in the cluster.
|
||||
|
||||
In multi-tenant environments where tenants can access pods and other Kubernetes resources, or where
stronger isolation is required, it may be necessary to prevent pods from looking up services in other
Namespaces.
You can restrict cross-namespace DNS lookups by configuring security rules for the DNS service.
For example, CoreDNS (the default DNS service for Kubernetes) can leverage Kubernetes metadata
to restrict queries to Pods and Services within a namespace. For more information, read an
[example](https://github.com/coredns/policy#kubernetes-metadata-multi-tenancy-policy) of configuring
this within the CoreDNS documentation.
|
||||
|
||||
When a [Virtual Control Plane per tenant](#virtual-control-plane-per-tenant) model is used, a DNS service must be configured per tenant or a multi-tenant DNS service must be used. Here is an example of a [customized version of CoreDNS](https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/main/virtualcluster/doc/tenant-dns.md) that supports multiple tenants.
|
||||
|
||||
### Operators
|
||||
|
||||
[Operators](/docs/concepts/extend-kubernetes/operator/) are Kubernetes controllers that manage applications. Operators can simplify the management of multiple instances of an application, like a database service, which makes them a common building block in the multi-consumer (SaaS) multi-tenancy use case.
|
||||
|
||||
Operators used in a multi-tenant environment should follow a stricter set of guidelines. Specifically, the Operator should:
|
||||
* Support creating resources within different tenant namespaces, rather than just in the namespace in which the Operator is deployed.
|
||||
* Ensure that the Pods are configured with resource requests and limits, to ensure scheduling and fairness.
|
||||
* Support configuration of Pods for data-plane isolation techniques such as node isolation and sandboxed containers.
|
||||
|
||||
## Implementations
|
||||
|
||||
{{% thirdparty-content %}}
|
||||
|
||||
There are two primary ways to share a Kubernetes cluster for multi-tenancy: using Namespaces (i.e. a Namespace per tenant) or by virtualizing the control plane (i.e. Virtual control plane per tenant).
|
||||
|
||||
In both cases, data plane isolation, and management of additional considerations such as API Priority and Fairness, is also recommended.
|
||||
|
||||
Namespace isolation is well-supported by Kubernetes, has a negligible resource cost, and provides mechanisms to allow tenants to interact appropriately, such as by allowing service-to-service communication. However, it can be difficult to configure, and doesn't apply to Kubernetes resources that can't be namespaced, such as Custom Resource Definitions, Storage Classes, and Webhooks.
|
||||
|
||||
Control plane virtualization allows for isolation of non-namespaced resources at the cost of somewhat higher resource usage and more difficult cross-tenant sharing. It is a good option when namespace isolation is insufficient but dedicated clusters are undesirable, due to the high cost of maintaining them (especially on-prem) or due to their higher overhead and lack of resource sharing. However, even within a virtualized control plane, you will likely see benefits by using namespaces as well.
|
||||
|
||||
The two options are discussed in more detail in the following sections:
|
||||
|
||||
### Namespace per tenant
|
||||
|
||||
As previously mentioned, you should consider isolating each workload in its own namespace, even if you are using dedicated clusters or virtualized control planes. This ensures that each workload only has access to its own resources, such as Config Maps and Secrets, and allows you to tailor dedicated security policies for each workload. In addition, it is a best practice to give each namespace names that are unique across your entire fleet (i.e., even if they are in separate clusters), as this gives you the flexibility to switch between dedicated and shared clusters in the future, or to use multi-cluster tooling such as service meshes.
|
||||
|
||||
Conversely, there are also advantages to assigning namespaces at the tenant level, not just the workload level, since there are often policies that apply to all workloads owned by a single tenant. However, this raises its own problems. Firstly, this makes it difficult or impossible to customize policies to individual workloads, and secondly, it may be challenging to come up with a single level of "tenancy" that should be given a namespace. For example, an organization may have divisions, teams, and subteams - which should be assigned a namespace?
|
||||
|
||||
To solve this, Kubernetes provides the [Hierarchical Namespace Controller (HNC)](https://github.com/kubernetes-sigs/hierarchical-namespaces), which allows you to organize your namespaces into hierarchies, and share certain policies and resources between them. It also helps you manage namespace labels, namespace lifecycles, and delegated management, and share resource quotas across related namespaces. These capabilities can be useful in both multi-team and multi-customer scenarios.
|
||||
|
||||
Other projects that provide similar capabilities and aid in managing namespaced resources are listed below:
|
||||
|
||||
#### Multi-team tenancy
|
||||
|
||||
* [Capsule](https://github.com/clastix/capsule)
|
||||
* [Kiosk](https://github.com/loft-sh/kiosk)
|
||||
|
||||
#### Multi-customer tenancy
|
||||
|
||||
* [Kubeplus](https://github.com/cloud-ark/kubeplus)
|
||||
|
||||
#### Policy engines
|
||||
|
||||
Policy engines provide features to validate and generate tenant configurations:
|
||||
|
||||
* [Kyverno](https://kyverno.io/)
|
||||
* [OPA/Gatekeeper](https://github.com/open-policy-agent/gatekeeper)
|
||||
|
||||
### Virtual control plane per tenant
|
||||
|
||||
Another form of control-plane isolation is to use Kubernetes extensions to provide each tenant a virtual control-plane that enables segmentation of cluster-wide API resources. [Data plane isolation](#data-plane-isolation) techniques can be used with this model to securely manage worker nodes across tenants.
|
||||
|
||||
The virtual control plane based multi-tenancy model extends namespace-based multi-tenancy by providing each tenant with dedicated control plane components, and hence complete control over cluster-wide resources and add-on services. Worker nodes are shared across all tenants, and are managed by a Kubernetes cluster that is normally inaccessible to tenants. This cluster is often referred to as a _super-cluster_ (or sometimes as a _host-cluster_). Since a tenant’s control-plane is not directly associated with underlying compute resources it is referred to as a _virtual control plane_.
|
||||
|
||||
A virtual control plane typically consists of the Kubernetes API server, the controller manager, and the etcd data store. It interacts with the super-cluster via a metadata synchronization controller which coordinates changes across tenant control planes and the control plane of the super-cluster.
|
||||
|
||||
By using per-tenant dedicated control planes, most of the isolation problems due to sharing one API server among all tenants are solved. Examples include noisy neighbors in the control plane, uncontrollable blast radius of policy misconfigurations, and conflicts between cluster scope objects such as webhooks and CRDs. Hence, the virtual control plane model is particularly suitable for cases where each tenant requires access to a Kubernetes API server and expects the full cluster manageability.
|
||||
|
||||
The improved isolation comes at the cost of running and maintaining an individual virtual control plane per tenant. In addition, per-tenant control planes do not solve isolation problems in the data plane, such as node-level noisy neighbors or security threats. These must still be addressed separately.
|
||||
|
||||
The Kubernetes [Cluster API - Nested (CAPN)](https://github.com/kubernetes-sigs/cluster-api-provider-nested/tree/main/virtualcluster) project provides an implementation of virtual control planes.
|
||||
|
||||
#### Other implementations
|
||||
* [Kamaji](https://github.com/clastix/kamaji)
|
||||
* [vcluster](https://github.com/loft-sh/vcluster)
|
||||
|
||||
|
|
|
@ -1,18 +1,18 @@
|
|||
---
|
||||
title: Eviction
|
||||
id: eviction
|
||||
date: 2021-05-08
|
||||
full_link: /docs/concepts/scheduling-eviction/
|
||||
short_description: >
|
||||
Process of terminating one or more Pods on Nodes
|
||||
aka:
|
||||
tags:
|
||||
- operation
|
||||
---
|
||||
|
||||
Eviction is the process of terminating one or more Pods on Nodes.
|
||||
|
||||
<!--more-->
|
||||
There are two kinds of eviction:
|
||||
* [Node-pressure eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/)
|
||||
* [API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/)
|
||||
|
|
|
@ -1,270 +1,270 @@
|
|||
---
|
||||
title: Control Topology Management Policies on a node
|
||||
|
||||
reviewers:
|
||||
- ConnorDoyle
|
||||
- klueska
|
||||
- lmdaly
|
||||
- nolancon
|
||||
- bg-chun
|
||||
|
||||
content_type: task
|
||||
min-kubernetes-server-version: v1.18
|
||||
---
|
||||
|
||||
<!-- overview -->
|
||||
|
||||
{{< feature-state state="beta" for_k8s_version="v1.18" >}}
|
||||
|
||||
An increasing number of systems leverage a combination of CPUs and hardware accelerators to support latency-critical execution and high-throughput parallel computation. These include workloads in fields such as telecommunications, scientific computing, machine learning, financial services and data analytics. Such hybrid systems comprise a high performance environment.
|
||||
|
||||
In order to extract the best performance, optimizations related to CPU isolation, memory and device locality are required. However, in Kubernetes, these optimizations are handled by a disjoint set of components.
|
||||
|
||||
_Topology Manager_ is a Kubelet component that aims to coordinate the set of components that are responsible for these optimizations.
|
||||
|
||||
|
||||
|
||||
## {{% heading "prerequisites" %}}
|
||||
|
||||
|
||||
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
|
||||
|
||||
|
||||
|
||||
<!-- steps -->
|
||||
|
||||
## How Topology Manager Works
|
||||
|
||||
Prior to the introduction of Topology Manager, the CPU and Device Manager in Kubernetes made resource allocation decisions independently of each other.
This can result in undesirable allocations on multiple-socketed systems, and performance/latency sensitive applications will suffer due to these undesirable allocations.
Undesirable in this case means, for example, CPUs and devices being allocated from different NUMA Nodes, thus incurring additional latency.
|
||||
|
||||
The Topology Manager is a Kubelet component, which acts as a source of truth so that other Kubelet components can make topology aligned resource allocation choices.
|
||||
|
||||
The Topology Manager provides an interface for components, called *Hint Providers*, to send and receive topology information. Topology Manager has a set of node level policies which are explained below.
|
||||
|
||||
The Topology manager receives Topology information from the *Hint Providers* as a bitmask denoting NUMA Nodes available and a preferred allocation indication. The Topology Manager policies perform a set of operations on the hints provided and converge on the hint determined by the policy to give the optimal result, if an undesirable hint is stored the preferred field for the hint will be set to false. In the current policies preferred is the narrowest preferred mask.
|
||||
The selected hint is stored as part of the Topology Manager. Depending on the policy configured the pod can be accepted or rejected from the node based on the selected hint.
|
||||
The hint is then stored in the Topology Manager for use by the *Hint Providers* when making the resource allocation decisions.
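The hint handling described in this section can be pictured with a small Python sketch. It is a simplified model, not the kubelet's implementation: the `Hint` type, the `merge_hints` function and the tie-breaking rule are assumptions made for illustration. Each provider offers candidate hints, candidates are combined with a bitwise AND, a merged hint is preferred only if every contributing hint is preferred, and the narrowest preferred mask wins.

```python
from itertools import product
from typing import List, NamedTuple


class Hint(NamedTuple):
    mask: int        # bit i set => NUMA node i can satisfy the request
    preferred: bool  # the provider's "preferred allocation" indication


def merge_hints(provider_hints: List[List[Hint]], num_numa_nodes: int) -> Hint:
    """Pick one merged hint from the candidates of all providers (illustrative only)."""
    all_nodes = (1 << num_numa_nodes) - 1
    best = Hint(all_nodes, False)  # fallback: any node, not preferred
    for combo in product(*provider_hints):
        mask = all_nodes
        for hint in combo:
            mask &= hint.mask  # keep only NUMA nodes every provider accepts
        if mask == 0:
            continue  # the providers have no NUMA node in common
        candidate = Hint(mask, all(h.preferred for h in combo))
        narrower = bin(candidate.mask).count("1") < bin(best.mask).count("1")
        if (candidate.preferred and not best.preferred) or (
            candidate.preferred == best.preferred and narrower
        ):
            best = candidate
    return best


# Hypothetical hints on a two-node machine: CPUs fit on either node,
# while the requested device is attached to NUMA node 1 only.
cpu_hints = [Hint(0b01, True), Hint(0b10, True), Hint(0b11, False)]
device_hints = [Hint(0b10, True)]
selected = merge_hints([cpu_hints, device_hints], num_numa_nodes=2)
# selected is Hint(mask=0b10, preferred=True)
```

When the providers share no NUMA node, the sketch falls back to a non-preferred hint covering all nodes, which is the kind of hint a policy may then reject.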
### Enable the Topology Manager feature

Support for the Topology Manager requires the `TopologyManager` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/) to be enabled. It is enabled by default starting with Kubernetes 1.18.

## Topology Manager Scopes and Policies

The Topology Manager currently:

- Aligns Pods of all QoS classes.
- Aligns the requested resources for which a Hint Provider provides topology hints.

If these conditions are met, the Topology Manager will align the requested resources.

To customise how this alignment is carried out, the Topology Manager provides two distinct knobs: `scope` and `policy`.

The `scope` defines the granularity at which you would like resource alignment to be performed (for example, at the `pod` or `container` level). The `policy` defines the actual strategy used to carry out the alignment (for example, `best-effort`, `restricted` or `single-numa-node`).

Details on the various `scopes` and `policies` available today can be found below.

{{< note >}}
To align CPU resources with other requested resources in a Pod Spec, the CPU Manager should be enabled and a proper CPU Manager policy should be configured on the Node. See [control CPU Management Policies](/docs/tasks/administer-cluster/cpu-management-policies/).
{{< /note >}}

{{< note >}}
To align memory (and hugepages) resources with other requested resources in a Pod Spec, the Memory Manager should be enabled and a proper Memory Manager policy should be configured on the Node. See the [Memory Manager](/docs/tasks/administer-cluster/memory-manager/) documentation.
{{< /note >}}

### Topology Manager Scopes

The Topology Manager can align resources in one of two distinct scopes:

* `container` (default)
* `pod`

Either option can be selected at kubelet startup with the `--topology-manager-scope` flag.

### container scope

The `container` scope is used by default.

Within this scope, the Topology Manager performs a number of sequential resource alignments; that is, a separate alignment is computed for each container in a pod. In other words, this scope has no notion of grouping the containers to a specific set of NUMA nodes. In effect, the Topology Manager aligns each container to NUMA nodes individually, without regard to the other containers in the pod.

Grouping of containers is deliberately provided by a different scope, the `pod` scope.

### pod scope

To select the `pod` scope, start the kubelet with the command line option `--topology-manager-scope=pod`.

This scope allows for grouping all containers in a pod to a common set of NUMA nodes. That is, the Topology Manager treats a pod as a whole and attempts to allocate the entire pod (all containers) to either a single NUMA node or a common set of NUMA nodes. The following examples illustrate the alignments produced by the Topology Manager on different occasions:

* all containers can be and are allocated to a single NUMA node;
* all containers can be and are allocated to a shared set of NUMA nodes.

The total amount of a particular resource demanded for the entire pod is calculated according to the [effective requests/limits](/docs/concepts/workloads/pods/init-containers/#resources) formula. For each resource, this total is the maximum of:

* the sum of all app container requests,
* the maximum of init container requests.

Using the `pod` scope in tandem with the `single-numa-node` Topology Manager policy is especially valuable for workloads that are latency sensitive or for high-throughput applications that perform IPC. By combining both options, you are able to place all containers in a pod onto a single NUMA node; hence, the inter-NUMA communication overhead can be eliminated for that pod.

In the case of the `single-numa-node` policy, a pod is accepted only if a suitable set of NUMA nodes is present among the possible allocations. Reconsider the example above:

* a set containing only a single NUMA node leads to the pod being admitted,
* whereas a set containing more than one NUMA node results in pod rejection (because two or more NUMA nodes, instead of one, are required to satisfy the allocation).

To recap, Topology Manager first computes a set of NUMA nodes and then tests it against the Topology Manager policy, which leads to either the rejection or the admission of the pod.
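As a worked illustration of the effective request calculation used by the `pod` scope, the following Python sketch computes the pod-level total for a single resource. The function name and the sample values are hypothetical; only the formula (the maximum of the summed app container requests and the largest init container request) comes from the text above.

```python
from typing import Sequence


def effective_pod_request(
    app_requests: Sequence[float], init_requests: Sequence[float]
) -> float:
    """Pod-level request for one resource: max(sum of app requests, largest init request)."""
    return max(sum(app_requests), max(init_requests, default=0))


# Hypothetical CPU requests in cores: two app containers and one init container.
total = effective_pod_request(app_requests=[2, 1], init_requests=[4])
```

Here the init container dominates, so the pod is treated as requesting 4 cores even though the app containers together request only 3.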
### Topology Manager Policies

Topology Manager supports four allocation policies. You can set a policy via a Kubelet flag, `--topology-manager-policy`:

* `none` (default)
* `best-effort`
* `restricted`
* `single-numa-node`

{{< note >}}
If Topology Manager is configured with the **pod** scope, the container considered by the policy reflects the requirements of the entire pod, and thus each container from the pod will result in **the same** topology alignment decision.
{{< /note >}}

### none policy {#policy-none}

This is the default policy and does not perform any topology alignment.

### best-effort policy {#policy-best-effort}

For each container in a Pod, the kubelet, with `best-effort` topology
management policy, calls each Hint Provider to discover their resource availability.
Using this information, the Topology Manager stores the
preferred NUMA Node affinity for that container. If the affinity is not preferred,
Topology Manager will store this and admit the pod to the node anyway.

The *Hint Providers* can then use this information when making the
resource allocation decision.

### restricted policy {#policy-restricted}

For each container in a Pod, the kubelet, with `restricted` topology
management policy, calls each Hint Provider to discover their resource availability.
Using this information, the Topology Manager stores the
preferred NUMA Node affinity for that container. If the affinity is not preferred,
Topology Manager will reject this pod from the node. This will result in a pod in a `Terminated` state with a pod admission failure.

Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to reschedule the pod. It is recommended to use a ReplicaSet or Deployment to trigger a redeploy of the pod.
An external control loop could also be implemented to trigger a redeployment of pods that have the `Topology Affinity` error.

If the pod is admitted, the *Hint Providers* can then use this information when making the
resource allocation decision.

### single-numa-node policy {#policy-single-numa-node}

For each container in a Pod, the kubelet, with `single-numa-node` topology
management policy, calls each Hint Provider to discover their resource availability.
Using this information, the Topology Manager determines if a single NUMA Node affinity is possible.
If it is, Topology Manager will store this and the *Hint Providers* can then use this information when making the
resource allocation decision.
If, however, this is not possible, then the Topology Manager will reject the pod from the node. This will result in a pod in a `Terminated` state with a pod admission failure.

Once the pod is in a `Terminated` state, the Kubernetes scheduler will **not** attempt to reschedule the pod. It is recommended to use a Deployment with replicas to trigger a redeploy of the Pod.
An external control loop could also be implemented to trigger a redeployment of pods that have the `Topology Affinity` error.
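The admission behaviour of the four policies can be summarised in a few lines of Python. This is a sketch of the decisions described in the policy sections, not kubelet code: the `admits` function and its arguments are hypothetical, and the selected hint is modelled as a NUMA bitmask plus a preferred flag.

```python
def admits(policy: str, hint_mask: int, preferred: bool) -> bool:
    """Return True if a pod with the selected hint would be admitted (illustrative)."""
    if policy in ("none", "best-effort"):
        # none performs no alignment; best-effort records the hint but always admits.
        return True
    if policy == "restricted":
        # restricted rejects the pod when the selected hint is not preferred.
        return preferred
    if policy == "single-numa-node":
        # single-numa-node additionally requires the hint to name exactly one NUMA node.
        return preferred and bin(hint_mask).count("1") == 1
    raise ValueError(f"unknown policy: {policy}")


# A preferred hint that spans NUMA nodes 0 and 1 (mask 0b11):
spans_two_nodes = {p: admits(p, 0b11, True) for p in ("best-effort", "restricted", "single-numa-node")}
```

Under these assumptions, a preferred hint that spans two NUMA nodes passes `best-effort` and `restricted` but is rejected by `single-numa-node`.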
### Pod Interactions with Topology Manager Policies

Consider the containers in the following pod specs:

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
```

This pod runs in the `BestEffort` QoS class because no resource `requests` or
`limits` are specified.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
```

This pod runs in the `Burstable` QoS class because requests are less than limits.

If the selected policy is anything other than `none`, Topology Manager would consider these Pod specifications. The Topology Manager would consult the Hint Providers to get topology hints. In the case of the `static` CPU Manager policy, the default topology hint would be returned, because these Pods do not explicitly request CPU resources.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
```

This pod with an integer CPU request runs in the `Guaranteed` QoS class because `requests` are equal to `limits`.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "300m"
        example.com/device: "1"
```

This pod with a shared (fractional) CPU request runs in the `Guaranteed` QoS class because `requests` are equal to `limits`.
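The QoS classes named for pod specs like these can be derived mechanically. The Python sketch below is a simplified, single-container illustration: the `qos_class` function is hypothetical, it compares quantity strings literally (so `"1"` and `"1000m"` would be treated as different), and it considers only `cpu` and `memory`, ignoring extended resources such as `example.com/device`.

```python
from typing import Dict

COMPUTE_RESOURCES = ("cpu", "memory")


def qos_class(requests: Dict[str, str], limits: Dict[str, str]) -> str:
    """Classify a single-container pod by its cpu/memory requests and limits."""
    req = {k: v for k, v in requests.items() if k in COMPUTE_RESOURCES}
    lim = {k: v for k, v in limits.items() if k in COMPUTE_RESOURCES}
    if not req and not lim:
        return "BestEffort"  # no cpu or memory requests or limits at all
    has_all_limits = all(k in lim for k in COMPUTE_RESOURCES)
    # A missing request defaults to the limit, so it counts as equal.
    if has_all_limits and all(req.get(k, lim[k]) == lim[k] for k in COMPUTE_RESOURCES):
        return "Guaranteed"
    return "Burstable"


guaranteed_spec = {"memory": "200Mi", "cpu": "2", "example.com/device": "1"}
classes = [
    qos_class({}, {}),
    qos_class({"memory": "100Mi"}, {"memory": "200Mi"}),
    qos_class(guaranteed_spec, guaranteed_spec),
]
```

Because extended resources are ignored, a container that requests only devices is still classified as `BestEffort`.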
```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        example.com/deviceA: "1"
        example.com/deviceB: "1"
      requests:
        example.com/deviceA: "1"
        example.com/deviceB: "1"
```

This pod runs in the `BestEffort` QoS class because there are no CPU and memory requests.

The Topology Manager would consider the above pods. The Topology Manager would consult the Hint Providers, which are the CPU Manager and the Device Manager, to get topology hints for the pods.

In the case of the `Guaranteed` pod with an integer CPU request, the `static` CPU Manager policy would return topology hints relating to the exclusive CPU and the Device Manager would send back hints for the requested device.

In the case of the `Guaranteed` pod with a shared CPU request, the `static` CPU Manager policy would return the default topology hint, as there is no exclusive CPU request, and the Device Manager would send back hints for the requested device.

In the above two cases of the `Guaranteed` pod, the `none` CPU Manager policy would return the default topology hint.

In the case of the `BestEffort` pod, the `static` CPU Manager policy would send back the default topology hint, as there is no CPU request, and the Device Manager would send back the hints for each of the requested devices.

Using this information, the Topology Manager calculates the optimal hint for the pod and stores this information, which will be used by the Hint Providers when they are making their resource assignments.
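The cases above boil down to one rule for the `static` CPU Manager policy: CPU-specific topology hints are returned only for `Guaranteed` pods with an integer CPU request, and every other pod gets the default hint. The Python sketch below expresses that rule; the function names are hypothetical and the quantity parser handles only plain and millicore (`m`) values.

```python
from typing import Optional


def cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as "2", "0.5" or "300m" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def static_policy_returns_cpu_hints(qos_class: str, cpu_request: Optional[str]) -> bool:
    """True if the static policy would return exclusive-CPU hints rather than the default hint."""
    if qos_class != "Guaranteed" or cpu_request is None:
        return False
    millicores = cpu_millicores(cpu_request)
    return millicores > 0 and millicores % 1000 == 0  # whole cores only


integer_cpu = static_policy_returns_cpu_hints("Guaranteed", "2")
shared_cpu = static_policy_returns_cpu_hints("Guaranteed", "300m")
no_cpu = static_policy_returns_cpu_hints("BestEffort", None)
```

Device hints are independent of this rule: in the examples above, the Device Manager returns hints for requested devices regardless of the pod's QoS class.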
### Known Limitations

1. The maximum number of NUMA nodes that Topology Manager allows is 8. With more than 8 NUMA nodes there will be a state explosion when trying to enumerate the possible NUMA affinities and generate their hints.

2. The scheduler is not topology-aware, so a pod can be scheduled on a node and then fail on that node due to the Topology Manager.
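To give a feel for the state explosion mentioned in the first limitation, the following Python sketch counts the non-empty subsets of NUMA nodes, that is, the number of distinct affinity bitmasks that could be enumerated for a single resource. How the kubelet actually enumerates hints is not described here, so treat the function as a back-of-the-envelope model.

```python
def candidate_affinity_masks(num_numa_nodes: int) -> int:
    """Number of non-empty NUMA node subsets, i.e. possible affinity bitmasks."""
    return 2 ** num_numa_nodes - 1


counts = {n: candidate_affinity_masks(n) for n in (2, 4, 8, 16)}
# counts is {2: 3, 4: 15, 8: 255, 16: 65535}
```

The count doubles with every additional NUMA node, and hints from several providers have to be combined with each other, which is why a hard cap is a pragmatic choice.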