Reword pod priority and preemption concept (#17508)

* Remove mentions of unsupported Kubernetes versions

No need to mention supported-since-this-version details for Kubernetes
releases that are now out of support.

* Warn about disabling preemption

* Tweak wording

* Add What's next section

* Tweak Pod priority troubleshooting advice

Reshape the advice about pod priority troubleshooting to explain
user-induced issues (and define expected behavior). If the reader
detects behavior that does not match the documentation, they have
observed a bug and can report that via the usual routes.

* Reword notes on Pod priority vs. QoS

* Move warning into page body
Tim Bannister 2020-03-16 05:42:35 +00:00 committed by GitHub
parent f29221f4c4
commit 43212d6bc7
1 changed file with 69 additions and 87 deletions


@@ -16,42 +16,25 @@ importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the
 scheduler tries to preempt (evict) lower priority Pods to make scheduling of the
 pending Pod possible.
 
-In Kubernetes 1.9 and later, Priority also affects scheduling order of Pods and
-out-of-resource eviction ordering on the Node.
-
-Pod priority and preemption graduated to beta in Kubernetes 1.11 and to GA in
-Kubernetes 1.14. They have been enabled by default since 1.11.
-
-In Kubernetes versions where Pod priority and preemption is still an alpha-level
-feature, you need to explicitly enable it. To use these features in the older
-versions of Kubernetes, follow the instructions in the documentation for your
-Kubernetes version, by going to the documentation archive version for your
-Kubernetes version.
-
-Kubernetes Version | Priority and Preemption State | Enabled by default
------------------- | :---------------------------: | :----------------:
-1.8                | alpha                         | no
-1.9                | alpha                         | no
-1.10               | alpha                         | no
-1.11               | beta                          | yes
-1.14               | stable                        | yes
-
-{{< warning >}}In a cluster where not all users are trusted, a
-malicious user could create pods at the highest possible priorities, causing
-other pods to be evicted/not get scheduled. To resolve this issue,
-[ResourceQuota](/docs/concepts/policy/resource-quotas/) is
-augmented to support Pod priority. An admin can create ResourceQuota for users
-at specific priority levels, preventing them from creating pods at high
-priorities. This feature is in beta since Kubernetes 1.12.
-{{< /warning >}}
-
 {{% /capture %}}
 
 {{% capture body %}}
 
+{{< warning >}}
+In a cluster where not all users are trusted, a malicious user could create Pods
+at the highest possible priorities, causing other Pods to be evicted/not get
+scheduled.
+An administrator can use ResourceQuota to prevent users from creating pods at
+high priorities.
+See [limit Priority Class consumption by default](/docs/concepts/policy/resource-quotas/#limit-priority-class-consumption-by-default)
+for details.
+{{< /warning >}}
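As an illustrative sketch of such a quota (the namespace and PriorityClass names are placeholders): this ResourceQuota prevents any Pod that uses the named PriorityClass from being created in the namespace.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: limit-high-priority    # placeholder name
  namespace: restricted-team   # placeholder namespace
spec:
  hard:
    pods: "0"                  # allow zero Pods that match the scope below
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["high-priority"]   # placeholder PriorityClass name
```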
 
 ## How to use priority and preemption
 
-To use priority and preemption in Kubernetes 1.11 and later, follow these steps:
+To use priority and preemption:
 
 1. Add one or more [PriorityClasses](#priorityclass).
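As a sketch of that first step, a PriorityClass is a cluster-level object that maps a name to an integer priority (the name, value, and description here are examples only):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority   # example name
value: 1000000          # a higher value means higher priority
globalDefault: false    # do not apply this class to Pods that specify no class
description: "Use this class only for Pods that must not wait behind other workloads."
```

Pods then reference a class by name through `priorityClassName` in their spec.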
@@ -77,21 +60,20 @@ Pods.
 
 ## How to disable preemption
 
-{{< note >}}
-In Kubernetes 1.12+, critical pods rely on scheduler preemption to be scheduled
-when a cluster is under resource pressure. For this reason, it is not
-recommended to disable preemption.
-{{< /note >}}
+{{< caution >}}
+Critical pods rely on scheduler preemption to be scheduled when a cluster
+is under resource pressure. For this reason, it is not recommended to
+disable preemption.
+{{< /caution >}}
 
 {{< note >}}
-In Kubernetes 1.15 and later,
-if the feature `NonPreemptingPriority` is enabled,
+In Kubernetes 1.15 and later, if the feature `NonPreemptingPriority` is enabled,
 PriorityClasses have the option to set `preemptionPolicy: Never`.
 This will prevent pods of that PriorityClass from preempting other pods.
 {{< /note >}}
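A sketch of such a non-preempting class (the name, value, and description are placeholders; this assumes the `NonPreemptingPriority` feature gate mentioned in the note is enabled):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # placeholder name
value: 1000000
preemptionPolicy: Never   # Pods in this class wait in the queue but never evict others
globalDefault: false
description: "High priority without displacing running Pods."
```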
 
-In Kubernetes 1.11 and later, preemption is controlled by a kube-scheduler flag
-`disablePreemption`, which is set to `false` by default.
+Preemption is controlled by a kube-scheduler flag `disablePreemption`, which is
+set to `false` by default.
 If you want to disable preemption despite the above note, you can set
 `disablePreemption` to `true`.
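A minimal sketch of that setting, assuming the `kubescheduler.config.k8s.io/v1alpha1` component config in use around this release; the file is passed to kube-scheduler with `--config`:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider
# ... other scheduler settings ...
disablePreemption: true   # default is false
```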
@@ -240,12 +222,12 @@ spec:
 
 ### Effect of Pod priority on scheduling order
 
-In Kubernetes 1.9 and later, when Pod priority is enabled, scheduler orders
-pending Pods by their priority and a pending Pod is placed ahead of other
-pending Pods with lower priority in the scheduling queue. As a result, the
-higher priority Pod may be scheduled sooner than Pods with lower priority if its
-scheduling requirements are met. If such Pod cannot be scheduled, scheduler will
-continue and tries to schedule other lower priority Pods.
+When Pod priority is enabled, the scheduler orders pending Pods by
+their priority and a pending Pod is placed ahead of other pending Pods
+with lower priority in the scheduling queue. As a result, the higher
+priority Pod may be scheduled sooner than Pods with lower priority if
+its scheduling requirements are met. If such a Pod cannot be scheduled, the
+scheduler continues and tries to schedule other lower priority Pods.
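As a sketch of that ordering (class names and values are made up): given the two classes below, a pending Pod that references `urgent` is placed ahead of a pending Pod that references `routine` in the scheduling queue.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: urgent    # made-up name
value: 100000     # considered first while pending
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: routine   # made-up name
value: 100        # considered after higher-priority pending Pods
```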
 
 ## Preemption
@@ -291,12 +273,12 @@ point that scheduler preempts victims and the time that Pod P is scheduled. In
 order to minimize this gap, one can set graceful termination period of lower
 priority Pods to zero or a small number.
 
-#### PodDisruptionBudget is supported, but not guaranteed!
+#### PodDisruptionBudget is supported, but not guaranteed
 
 A [Pod Disruption Budget (PDB)](/docs/concepts/workloads/pods/disruptions/)
 allows application owners to limit the number of Pods of a replicated application
-that are down simultaneously from voluntary disruptions. Kubernetes 1.9 supports
-PDB when preempting Pods, but respecting PDB is best effort. The Scheduler tries
+that are down simultaneously from voluntary disruptions. Kubernetes supports
+PDB when preempting Pods, but respecting PDB is best effort. The scheduler tries
 to find victims whose PDB are not violated by preemption, but if no such victims
 are found, preemption will still happen, and lower priority Pods will be removed
 despite their PDBs being violated.
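For reference, a typical PDB looks like the sketch below (the name and label are placeholders). During preemption, the scheduler tries, on a best-effort basis, not to pick victims whose removal would take the matching Pods below `minAvailable`:

```yaml
apiVersion: policy/v1beta1   # the policy API version current for this page
kind: PodDisruptionBudget
metadata:
  name: app-pdb              # placeholder name
spec:
  minAvailable: 2            # keep at least 2 matching Pods running
  selector:
    matchLabels:
      app: my-app            # placeholder label
```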
@@ -347,28 +329,23 @@ gone, and Pod P could possibly be scheduled on Node N.
 We may consider adding cross Node preemption in future versions if there is
 enough demand and if we find an algorithm with reasonable performance.
 
-## Debugging Pod Priority and Preemption
+## Troubleshooting
 
-Pod Priority and Preemption is a major feature that could potentially disrupt
-Pod scheduling if it has bugs.
+Pod priority and preemption can have unwanted side effects. Here are some
+examples of potential problems and ways to deal with them.
 
-### Potential problems caused by Priority and Preemption
-
-The followings are some of the potential problems that could be caused by bugs
-in the implementation of the feature. This list is not exhaustive.
-
-#### Pods are preempted unnecessarily
+### Pods are preempted unnecessarily
 
 Preemption removes existing Pods from a cluster under resource pressure to make
-room for higher priority pending Pods. If a user gives high priorities to
-certain Pods by mistake, these unintentional high priority Pods may cause
-preemption in the cluster. As mentioned above, Pod priority is specified by
-setting the `priorityClassName` field of `podSpec`. The integer value of
+room for higher priority pending Pods. If you give high priorities to
+certain Pods by mistake, these unintentionally high priority Pods may cause
+preemption in your cluster. Pod priority is specified by setting the
+`priorityClassName` field in the Pod's specification. The integer value for
 priority is then resolved and populated to the `priority` field of `podSpec`.
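To make that resolution concrete, here is a hedged example (the PriorityClass name is assumed to exist with value 1000000): the Pod is submitted with only a class name, and admission fills in the integer.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-app                # placeholder name
spec:
  priorityClassName: high-priority   # assumes a PriorityClass with value 1000000 exists
  containers:
    - name: app
      image: nginx
# After admission, the control plane resolves the class and populates:
#   spec:
#     priority: 1000000
```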
 
-To resolve the problem, `priorityClassName` of the Pods must be changed to use
-lower priority classes or should be left empty. Empty `priorityClassName` is
-resolved to zero by default.
+To address the problem, you can change the `priorityClassName` for those Pods
+to use lower priority classes, or leave that field empty. An empty
+`priorityClassName` is resolved to zero by default.
 
 When a Pod is preempted, there will be events recorded for the preempted Pod.
 Preemption should happen only when a cluster does not have enough resources for
@@ -377,29 +354,31 @@ Pod (preemptor) is higher than the victim Pods. Preemption must not happen when
 there is no pending Pod, or when the pending Pods have equal or lower priority
 than the victims. If preemption happens in such scenarios, please file an issue.
 
-#### Pods are preempted, but the preemptor is not scheduled
+### Pods are preempted, but the preemptor is not scheduled
 
 When pods are preempted, they receive their requested graceful termination
-period, which is by default 30 seconds, but it can be any different value as
-specified in the PodSpec. If the victim Pods do not terminate within this period,
-they are force-terminated. Once all the victims go away, the preemptor Pod can
-be scheduled.
+period, which is by default 30 seconds. If the victim Pods do not terminate within
+this period, they are forcibly terminated. Once all the victims go away, the
+preemptor Pod can be scheduled.
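Because preemptors wait out this grace period, the earlier advice about setting a zero or small graceful termination period for lower priority Pods applies here. A sketch (names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: expendable-worker            # placeholder name
spec:
  priorityClassName: low-priority    # assumes a low-priority class exists
  terminationGracePeriodSeconds: 0   # default is 30; 0 frees capacity immediately on preemption
  containers:
    - name: worker
      image: busybox
      command: ["sleep", "3600"]
```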
 
 While the preemptor Pod is waiting for the victims to go away, a higher priority
-Pod may be created that fits on the same node. In this case, the scheduler will
+Pod may be created that fits on the same Node. In this case, the scheduler will
 schedule the higher priority Pod instead of the preemptor.
 
-In the absence of such a higher priority Pod, we expect the preemptor Pod to be
-scheduled after the graceful termination period of the victims is over.
+This is expected behavior: the Pod with the higher priority should take the place
+of a Pod with a lower priority. Other controller actions, such as
+[cluster autoscaling](/docs/tasks/administer-cluster/cluster-management/#cluster-autoscaling),
+may eventually provide capacity to schedule the pending Pods.
 
-#### Higher priority Pods are preempted before lower priority pods
+### Higher priority Pods are preempted before lower priority pods
 
-The scheduler tries to find nodes that can run a pending Pod and if no node is
-found, it tries to remove Pods with lower priority from one node to make room
-for the pending pod. If a node with low priority Pods is not feasible to run the
-pending Pod, the scheduler may choose another node with higher priority Pods
-(compared to the Pods on the other node) for preemption. The victims must still
-have lower priority than the preemptor Pod.
+The scheduler tries to find nodes that can run a pending Pod. If no node is
+found, the scheduler tries to remove Pods with lower priority from an arbitrary
+node in order to make room for the pending Pod.
+If a node with low priority Pods is not feasible to run the pending Pod, the scheduler
+may choose another node with higher priority Pods (compared to the Pods on the
+other node) for preemption. The victims must still have lower priority than the
+preemptor Pod.
 
 When there are multiple nodes available for preemption, the scheduler tries to
 choose the node with a set of Pods with lowest priority. However, if such Pods
@@ -407,13 +386,11 @@ have PodDisruptionBudget that would be violated if they are preempted then the
 scheduler may choose another node with higher priority Pods.
 
 When multiple nodes exist for preemption and none of the above scenarios apply,
-we expect the scheduler to choose a node with the lowest priority. If that is
-not the case, it may indicate a bug in the scheduler.
+the scheduler chooses a node with the lowest priority.
 
-## Interactions of Pod priority and QoS
+## Interactions between Pod priority and quality of service {#interactions-of-pod-priority-and-qos}
 
-Pod priority and
-[QoS](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/resource-qos.md)
+Pod priority and {{< glossary_tooltip text="QoS class" term_id="qos-class" >}}
 are two orthogonal features with few interactions and no default restrictions on
 setting the priority of a Pod based on its QoS classes. The scheduler's
 preemption logic does not consider QoS when choosing preemption targets.
@@ -424,15 +401,20 @@ to schedule the preemptor Pod, or if the lowest priority Pods are protected by
 `PodDisruptionBudget`.
 
 The only component that considers both QoS and Pod priority is
-[Kubelet out-of-resource eviction](/docs/tasks/administer-cluster/out-of-resource/).
+[kubelet out-of-resource eviction](/docs/tasks/administer-cluster/out-of-resource/).
 The kubelet ranks Pods for eviction first by whether or not their usage of the
 starved resource exceeds requests, then by Priority, and then by the consumption
 of the starved compute resource relative to the Pods scheduling requests.
 See
-[Evicting end-user pods](/docs/tasks/administer-cluster/out-of-resource/#evicting-end-user-pods)
-for more details. Kubelet out-of-resource eviction does not evict Pods whose
+[evicting end-user pods](/docs/tasks/administer-cluster/out-of-resource/#evicting-end-user-pods)
+for more details.
+
+kubelet out-of-resource eviction does not evict Pods when their
 usage does not exceed their requests. If a Pod with lower priority is not
 exceeding its requests, it won't be evicted. Another Pod with higher priority
 that exceeds its requests may be evicted.
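As an illustration of that ranking (the names and resource figures are invented): if this Pod's actual memory usage stays at or below its request, kubelet out-of-resource eviction will not select it, even though its priority is low.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: steady-worker               # invented name
spec:
  priorityClassName: low-priority   # assumes a low-priority class exists
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          memory: "256Mi"   # usage at or below this request protects the Pod
                            # from out-of-resource eviction
```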
 
 {{% /capture %}}
+
+{{% capture whatsnext %}}
+* Read about using ResourceQuotas in connection with PriorityClasses: [limit Priority Class consumption by default](/docs/concepts/policy/resource-quotas/#limit-priority-class-consumption-by-default)
+{{% /capture %}}