The PodOverhead feature is GA
parent d200945499
commit 0bc8468bfa
@@ -59,12 +59,15 @@ The RuntimeClass resource currently only has 2 significant fields: the RuntimeClass
 (`metadata.name`) and the handler (`handler`). The object definition looks like this:

 ```yaml
-apiVersion: node.k8s.io/v1  # RuntimeClass is defined in the node.k8s.io API group
+# RuntimeClass is defined in the node.k8s.io API group
+apiVersion: node.k8s.io/v1
 kind: RuntimeClass
 metadata:
-  name: myclass  # The name the RuntimeClass will be referenced by
-  # RuntimeClass is a non-namespaced resource
-handler: myconfiguration  # The name of the corresponding CRI configuration
+  # The name the RuntimeClass will be referenced by.
+  # RuntimeClass is a non-namespaced resource.
+  name: myclass
+# The name of the corresponding CRI configuration
+handler: myconfiguration
 ```

 The name of a RuntimeClass object must be a valid
@@ -72,14 +75,14 @@ The name of a RuntimeClass object must be a valid

 {{< note >}}
 It is recommended that RuntimeClass write operations (create/update/patch/delete) be
-restricted to the cluster administrator. This is typically the default. See [Authorization
-Overview](/docs/reference/access-authn-authz/authorization/) for more details.
+restricted to the cluster administrator. This is typically the default. See
+[Authorization Overview](/docs/reference/access-authn-authz/authorization/) for more details.
 {{< /note >}}

 ## Usage

-Once RuntimeClasses are configured for the cluster, using them is very simple. Specify a
-`runtimeClassName` in the Pod spec. For example:
+Once RuntimeClasses are configured for the cluster, you can specify a
+`runtimeClassName` in the Pod spec to use it. For example:

 ```yaml
 apiVersion: v1
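# (The rest of this Pod manifest is elided by the hunk. A minimal sketch of how
#  it might continue, with a hypothetical Pod name and image:)
#
# kind: Pod
# metadata:
#   name: mypod
# spec:
#   runtimeClassName: myclass  # must match the name of a RuntimeClass
#   containers:
#   - name: app
#     image: nginx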
@@ -113,14 +116,14 @@ Runtime handlers are configured through containerd's configuration at
 [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.${HANDLER_NAME}]
 ```
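As a concrete sketch (not part of this commit, and the handler name is an assumption): an entry for the Kata Containers shim, whose containerd runtime type is `io.containerd.kata.v2`, would look something like:

```
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
```

A RuntimeClass with `handler: kata` would then select this runtime.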

-See containerd's config documentation for more details:
-https://github.com/containerd/cri/blob/master/docs/config.md
+See containerd's [config documentation](https://github.com/containerd/cri/blob/master/docs/config.md)
+for more details.

 #### {{< glossary_tooltip term_id="cri-o" >}}

 Runtime handlers are configured through CRI-O's configuration at `/etc/crio/crio.conf`. Valid
-handlers are configured under the [crio.runtime
-table](https://github.com/cri-o/cri-o/blob/master/docs/crio.conf.5.md#crioruntime-table):
+handlers are configured under the
+[crio.runtime table](https://github.com/cri-o/cri-o/blob/master/docs/crio.conf.5.md#crioruntime-table):

 ```
 [crio.runtime.runtimes.${HANDLER_NAME}]
@@ -148,19 +151,17 @@ can add `tolerations` to the RuntimeClass. As with the `nodeSelector`, the tolerations are merged
 with the pod's tolerations in admission, effectively taking the union of the set of nodes tolerated
 by each.

-To learn more about configuring the node selector and tolerations, see [Assigning Pods to
-Nodes](/docs/concepts/scheduling-eviction/assign-pod-node/).
+To learn more about configuring the node selector and tolerations, see
+[Assigning Pods to Nodes](/docs/concepts/scheduling-eviction/assign-pod-node/).
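For illustration (a sketch, not from this commit; the label and taint values are hypothetical), a RuntimeClass that constrains scheduling might look like:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: myclass
handler: myconfiguration
scheduling:
  nodeSelector:
    runtime: myruntime   # hypothetical node label
  tolerations:           # merged with each Pod's tolerations at admission
  - key: runtime
    operator: Equal
    value: myruntime
    effect: NoSchedule
```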

 ### Pod Overhead

-{{< feature-state for_k8s_version="v1.18" state="beta" >}}
+{{< feature-state for_k8s_version="v1.24" state="stable" >}}

 You can specify _overhead_ resources that are associated with running a Pod. Declaring overhead allows
 the cluster (including the scheduler) to account for it when making decisions about Pods and resources.
-To use Pod overhead, you must have the PodOverhead [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
-enabled (it is on by default).

-Pod overhead is defined in RuntimeClass through the `overhead` fields. Through the use of these fields,
+Pod overhead is defined in RuntimeClass through the `overhead` field. Through the use of this field,
 you can specify the overhead of running pods utilizing this RuntimeClass and ensure these overheads
 are accounted for in Kubernetes.
@@ -170,3 +171,4 @@ are accounted for in Kubernetes.

 - [RuntimeClass Scheduling Design](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/585-runtime-class/README.md#runtimeclass-scheduling)
+- Read about the [Pod Overhead](/docs/concepts/scheduling-eviction/pod-overhead/) concept
 - [PodOverhead Feature Design](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/688-pod-overhead)

@@ -10,17 +10,12 @@ weight: 30

 <!-- overview -->

-{{< feature-state for_k8s_version="v1.18" state="beta" >}}
+{{< feature-state for_k8s_version="v1.24" state="stable" >}}

 When you run a Pod on a Node, the Pod itself takes an amount of system resources. These
 resources are additional to the resources needed to run the container(s) inside the Pod.
-_Pod Overhead_ is a feature for accounting for the resources consumed by the Pod infrastructure
-on top of the container requests & limits.
+In Kubernetes, _Pod Overhead_ is a way to account for the resources consumed by the Pod
+infrastructure on top of the container requests & limits.

 <!-- body -->
@@ -29,33 +24,30 @@ In Kubernetes, the Pod's overhead is set at
 time according to the overhead associated with the Pod's
 [RuntimeClass](/docs/concepts/containers/runtime-class/).

-When Pod Overhead is enabled, the overhead is considered in addition to the sum of container
-resource requests when scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing
-the Pod cgroup, and when carrying out Pod eviction ranking.
+A pod's overhead is considered in addition to the sum of container resource requests when
+scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing the Pod cgroup,
+and when carrying out Pod eviction ranking.

-## Enabling Pod Overhead {#set-up}
+## Configuring Pod overhead {#set-up}

-You need to make sure that the `PodOverhead`
-[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is enabled (it is on by default as of 1.18)
-across your cluster, and a `RuntimeClass` is utilized which defines the `overhead` field.
+You need to make sure a `RuntimeClass` is utilized which defines the `overhead` field.

 ## Usage example

-To use the PodOverhead feature, you need a RuntimeClass that defines the `overhead` field. As
-an example, you could use the following RuntimeClass definition with a virtualizing container runtime
-that uses around 120MiB per Pod for the virtual machine and the guest OS:
+To work with Pod overhead, you need a RuntimeClass that defines the `overhead` field. As
+an example, you could use the following RuntimeClass definition with a virtualization container
+runtime that uses around 120MiB per Pod for the virtual machine and the guest OS:

 ```yaml
----
-kind: RuntimeClass
 apiVersion: node.k8s.io/v1
+kind: RuntimeClass
 metadata:
-    name: kata-fc
+  name: kata-fc
 handler: kata-fc
 overhead:
-    podFixed:
-        memory: "120Mi"
-        cpu: "250m"
+  podFixed:
+    memory: "120Mi"
+    cpu: "250m"
 ```

 Workloads which are created which specify the `kata-fc` RuntimeClass handler will take the memory and
@@ -92,13 +84,15 @@ updates the workload's PodSpec to include the `overhead` as described in the RuntimeClass
 the Pod will be rejected. In the given example, since only the RuntimeClass name is specified, the admission controller mutates the Pod
 to include an `overhead`.

-After the RuntimeClass admission controller, you can check the updated PodSpec:
+After the RuntimeClass admission controller has made modifications, you can check the updated
+Pod overhead value:

 ```bash
 kubectl get pod test-pod -o jsonpath='{.spec.overhead}'
 ```

 The output is:

 ```
 map[cpu:250m memory:120Mi]
 ```
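For reference, the `test-pod` manifest itself is not shown in this diff. A sketch consistent with the values used in this walkthrough (the container names and images are assumptions; only limits are set, so the requests default to the same values):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  runtimeClassName: kata-fc
  containers:
  - name: ctr-1          # hypothetical
    image: busybox:1.28
    resources:
      limits:
        cpu: 500m
        memory: 100Mi
  - name: ctr-2          # hypothetical
    image: nginx
    resources:
      limits:
        cpu: 1500m
        memory: 100Mi
```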
@@ -110,44 +104,50 @@ When the kube-scheduler is deciding which node should run a new Pod, the scheduler considers that Pod's
 `overhead` as well as the sum of container requests for that Pod. For this example, the scheduler adds the
 requests and the overhead, then looks for a node that has 2.25 CPU and 320 MiB of memory available.

-Once a Pod is scheduled to a node, the kubelet on that node creates a new {{< glossary_tooltip text="cgroup" term_id="cgroup" >}}
-for the Pod. It is within this pod that the underlying container runtime will create containers.
+Once a Pod is scheduled to a node, the kubelet on that node creates a new {{< glossary_tooltip
+text="cgroup" term_id="cgroup" >}} for the Pod. It is within this pod that the underlying
+container runtime will create containers.

 If the resource has a limit defined for each container (Guaranteed QoS or Burstable QoS with limits defined),
 the kubelet will set an upper limit for the pod cgroup associated with that resource (`cpu.cfs_quota_us` for CPU
 and `memory.limit_in_bytes` for memory). This upper limit is based on the sum of the container limits plus the `overhead`
 defined in the PodSpec.

-For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set `cpu.shares` based on the sum of container
-requests plus the `overhead` defined in the PodSpec.
+For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set `cpu.shares` based on the
+sum of container requests plus the `overhead` defined in the PodSpec.
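To make the arithmetic concrete (a sketch; the kubelet converts milli-CPU to shares as `milliCPU * 1024 / 1000`, and the cgroup v1 path below is an assumption based on the pod cgroup identified later on this page):

```bash
# Run this on the node; 2250m (2000m requests + 250m overhead) -> 2304 shares
cat /sys/fs/cgroup/cpu/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/cpu.shares
# expected output: 2304
```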

 Looking at our example, verify the container requests for the workload:

 ```bash
 kubectl get pod test-pod -o jsonpath='{.spec.containers[*].resources.limits}'
 ```

 The total container requests are 2000m CPU and 200MiB of memory:

 ```
 map[cpu:500m memory:100Mi] map[cpu:1500m memory:100Mi]
 ```

 Check this against what is observed by the node:

 ```bash
 kubectl describe node | grep test-pod -B2
 ```

-The output shows 2250m CPU and 320MiB of memory are requested, which includes PodOverhead:
+The output shows requests for 2250m CPU, and for 320MiB of memory. The requests include Pod overhead:

 ```
   Namespace    Name       CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
   ---------    ----       ------------  ----------   ---------------  -------------  ---
   default      test-pod   2250m (56%)   2250m (56%)  320Mi (1%)       320Mi (1%)     36m
 ```

 ## Verify Pod cgroup limits

-Check the Pod's memory cgroups on the node where the workload is running. In the following example, [`crictl`](https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md)
+Check the Pod's memory cgroups on the node where the workload is running. In the following example,
+[`crictl`](https://github.com/kubernetes-sigs/cri-tools/blob/master/docs/crictl.md)
 is used on the node, which provides a CLI for CRI-compatible container runtimes. This is an
-advanced example to show PodOverhead behavior, and it is not expected that users should need to check
+advanced example to show Pod overhead behavior, and it is not expected that users should need to check
 cgroups directly on the node.

 First, on the particular node, determine the Pod identifier:
@@ -158,17 +158,21 @@ POD_ID="$(sudo crictl pods --name test-pod -q)"
 ```

 From this, you can determine the cgroup path for the Pod:

 ```bash
+# Run this on the node where the Pod is scheduled
 sudo crictl inspectp -o=json $POD_ID | grep cgroupsPath
 ```

 The resulting cgroup path includes the Pod's `pause` container. The Pod level cgroup is one directory above.

 ```
  "cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd16aca4189c952d83487297f3cd760f1bbf09620e206e7d0c27a"
 ```

-In this specific case, the pod cgroup path is `kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2`. Verify the Pod level cgroup setting for memory:
+In this specific case, the pod cgroup path is `kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2`.
+Verify the Pod level cgroup setting for memory:

 ```bash
+# Run this on the node where the Pod is scheduled.
+# Also, change the name of the cgroup to match the cgroup allocated for your pod.
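# (The command itself is elided by this hunk. On a cgroup v1 node it would be
#  something like the following, using the pod cgroup path found above:)
# cat /sys/fs/cgroup/memory/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/memory.limit_in_bytes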
@@ -176,22 +180,20 @@ In this specific case, the pod cgroup path is `kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2`.
 ```

 This is 320 MiB, as expected:

 ```
 335544320
 ```

 ### Observability

-A `kube_pod_overhead` metric is available in [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
-to help identify when PodOverhead is being utilized and to help observe stability of workloads
-running with a defined Overhead. This functionality is not available in the 1.9 release of
-kube-state-metrics, but is expected in a following release. Users will need to build kube-state-metrics
-from source in the meantime.
+Some `kube_pod_overhead_*` metrics are available in [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics)
+to help identify when Pod overhead is being utilized and to help observe stability of workloads
+running with a defined overhead.
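As an illustrative check (a sketch; the service name, namespace, and port name are assumptions about how kube-state-metrics is deployed in your cluster):

```bash
# Proxy through the API server and filter for the overhead metrics
kubectl get --raw \
  "/api/v1/namespaces/kube-system/services/kube-state-metrics:http-metrics/proxy/metrics" \
  | grep kube_pod_overhead
```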

 ## {{% heading "whatsnext" %}}

-* Learn more about [RuntimeClass](/docs/concepts/containers/runtime-class/)
-* Read the [PodOverhead Design](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/688-pod-overhead)
-  enhancement proposal for extra context
+* [RuntimeClass](/docs/concepts/containers/runtime-class/)
+* [PodOverhead Design](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/688-pod-overhead)

@@ -666,6 +666,7 @@ plugins:
 {{< /tabs >}}

+#### Configuration Annotation Format

 `PodNodeSelector` uses the annotation key `scheduler.alpha.kubernetes.io/node-selector` to assign node selectors to namespaces.

 ```yaml
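# (The example manifest is elided by this hunk. A sketch of a namespace carrying
#  the annotation; the name and selector value are hypothetical:)
#
# apiVersion: v1
# kind: Namespace
# metadata:
#   name: namespace-with-selector
#   annotations:
#     scheduler.alpha.kubernetes.io/node-selector: env=test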
@@ -678,6 +679,7 @@ metadata:
 ```

+#### Internal Behavior

 This admission controller has the following behavior:

 1. If the `Namespace` has an annotation with a key `scheduler.alpha.kubernetes.io/node-selector`, use its value as the
@@ -746,7 +748,8 @@ metadata:

 ### Priority {#priority}

-The priority admission controller uses the `priorityClassName` field and populates the integer value of the priority. If the priority class is not found, the Pod is rejected.
+The priority admission controller uses the `priorityClassName` field and populates the integer value of the priority.
+If the priority class is not found, the Pod is rejected.

 ### ResourceQuota {#resourcequota}

@@ -754,19 +757,20 @@ This admission controller will observe the incoming request and ensure that it does not violate any of the constraints
 enumerated in the `ResourceQuota` object in a `Namespace`. If you are using `ResourceQuota`
 objects in your Kubernetes deployment, you MUST use this admission controller to enforce quota constraints.

-See the [resourceQuota design doc](https://git.k8s.io/community/contributors/design-proposals/resource-management/admission_control_resource_quota.md) and the [example of Resource Quota](/docs/concepts/policy/resource-quotas/) for more details.
+See the [resourceQuota design doc](https://git.k8s.io/community/contributors/design-proposals/resource-management/admission_control_resource_quota.md)
+and the [example of Resource Quota](/docs/concepts/policy/resource-quotas/) for more details.

 ### RuntimeClass {#runtimeclass}

-{{< feature-state for_k8s_version="v1.20" state="stable" >}}
-
-If you enable the `PodOverhead` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/), and define a RuntimeClass with [Pod overhead](/docs/concepts/scheduling-eviction/pod-overhead/) configured, this admission controller checks incoming
-Pods. When enabled, this admission controller rejects any Pod create requests that have the overhead already set.
-For Pods that have a RuntimeClass is configured and selected in their `.spec`, this admission controller sets `.spec.overhead` in the Pod based on the value defined in the corresponding RuntimeClass.
-
-{{< note >}}
-The `.spec.overhead` field for Pod and the `.overhead` field for RuntimeClass are both in beta. If you do not enable the `PodOverhead` feature gate, all Pods are treated as if `.spec.overhead` is unset.
-{{< /note >}}
+If you define a RuntimeClass with [Pod overhead](/docs/concepts/scheduling-eviction/pod-overhead/)
+configured, this admission controller checks incoming Pods.
+When enabled, this admission controller rejects any Pod create requests
+that have the overhead already set.
+For Pods that have a RuntimeClass configured and selected in their `.spec`,
+this admission controller sets `.spec.overhead` in the Pod based on the value
+defined in the corresponding RuntimeClass.

+See also [Pod Overhead](/docs/concepts/scheduling-eviction/pod-overhead/)
+for more information.
@@ -823,11 +827,11 @@ If you disable the ValidatingAdmissionWebhook, you must also disable the
 group/version via the `--runtime-config` flag (both are on by default in
 versions 1.9 and later).

 ## Is there a recommended set of admission controllers to use?

-Yes. The recommended admission controllers are enabled by default (shown [here](/docs/reference/command-line-tools-reference/kube-apiserver/#options)), so you do not need to explicitly specify them. You can enable additional admission controllers beyond the default set using the `--enable-admission-plugins` flag (**order doesn't matter**).
+Yes. The recommended admission controllers are enabled by default
+(shown [here](/docs/reference/command-line-tools-reference/kube-apiserver/#options)),
+so you do not need to explicitly specify them.
+You can enable additional admission controllers beyond the default set using the
+`--enable-admission-plugins` flag (**order doesn't matter**).

 {{< note >}}
 `--admission-control` was deprecated in 1.10 and replaced with `--enable-admission-plugins`.
 {{< /note >}}
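For instance (an illustrative invocation; the plugin list here is an arbitrary example, not a recommendation):

```bash
kube-apiserver --enable-admission-plugins=NodeRestriction,RuntimeClass ...
```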
@@ -163,8 +163,6 @@ different Kubernetes components.
 | `PodAndContainerStatsFromCRI` | `false` | Alpha | 1.23 | |
 | `PodDeletionCost` | `false` | Alpha | 1.21 | 1.21 |
 | `PodDeletionCost` | `true` | Beta | 1.22 | |
-| `PodOverhead` | `false` | Alpha | 1.16 | 1.17 |
-| `PodOverhead` | `true` | Beta | 1.18 | |
 | `PodSecurity` | `false` | Alpha | 1.22 | 1.22 |
 | `PodSecurity` | `true` | Beta | 1.23 | |
 | `ProbeTerminationGracePeriod` | `false` | Alpha | 1.21 | 1.21 |
@@ -411,6 +409,9 @@ different Kubernetes components.
 | `PodDisruptionBudget` | `false` | Alpha | 1.3 | 1.4 |
 | `PodDisruptionBudget` | `true` | Beta | 1.5 | 1.20 |
 | `PodDisruptionBudget` | `true` | GA | 1.21 | - |
+| `PodOverhead` | `false` | Alpha | 1.16 | 1.17 |
+| `PodOverhead` | `true` | Beta | 1.18 | 1.23 |
+| `PodOverhead` | `true` | GA | 1.24 | - |
 | `PodPriority` | `false` | Alpha | 1.8 | 1.10 |
 | `PodPriority` | `true` | Beta | 1.11 | 1.13 |
 | `PodPriority` | `true` | GA | 1.14 | - |
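For completeness, feature gates are set via component flags, for example (illustrative; after this commit the gate is GA, on by default, and no longer needs to be set explicitly):

```bash
kube-apiserver --feature-gates="...,PodOverhead=true"
```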