Reword pod priority and preemption concept (#17508)
* Remove mentions of unsupported Kubernetes versions. No need to mention supported-since-this-version details for Kubernetes releases that are now out of support.
* Warn about disabling preemption
* Tweak wording
* Add What's next section
* Tweak Pod priority troubleshooting advice. Reshape the advice about pod priority troubleshooting to explain user-induced issues (and define expected behavior). If the reader detects behavior that does not match the documentation, they have observed a bug and can report that via the usual routes.
* Reword notes on Pod priority vs. QoS
* Move warning into page body
parent f29221f4c4
commit 43212d6bc7
@ -16,42 +16,25 @@ importance of a Pod relative to other Pods. If a Pod cannot be scheduled, the
scheduler tries to preempt (evict) lower priority Pods to make scheduling of the
pending Pod possible.

{{% /capture %}}

{{% capture body %}}

{{< warning >}}
In a cluster where not all users are trusted, a malicious user could create Pods
at the highest possible priorities, causing other Pods to be evicted or to fail
to be scheduled.
An administrator can use ResourceQuota to prevent users from creating pods at
high priorities.

See [limit Priority Class consumption by default](/docs/concepts/policy/resource-quotas/#limit-priority-class-consumption-by-default)
for details.
{{< /warning >}}

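The quota mechanism described in the warning can be sketched as follows. This hypothetical ResourceQuota caps Pods of a PriorityClass named `high` within one namespace; the quota name, namespace, and class name are illustrative, not from this page:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-high        # hypothetical quota name
  namespace: team-a      # hypothetical namespace
spec:
  hard:
    pods: "10"           # at most 10 Pods of this priority class
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["high"] # hypothetical PriorityClass name
```

Pods in `team-a` that reference the `high` PriorityClass then count against this quota, so untrusted users cannot create unbounded numbers of high-priority Pods.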
## How to use priority and preemption

To use priority and preemption:

1. Add one or more [PriorityClasses](#priorityclass).

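As a sketch of step 1, a PriorityClass object might look like this (the class name, value, and description are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority    # hypothetical class name
value: 1000000           # higher values mean higher priority
globalDefault: false     # do not apply to Pods without a priorityClassName
description: "For critical service Pods only."
```

PriorityClass is a cluster-scoped object, so it has no namespace.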
@ -77,21 +60,20 @@ Pods.

## How to disable preemption

{{< caution >}}
Critical pods rely on scheduler preemption to be scheduled when a cluster
is under resource pressure. For this reason, it is not recommended to
disable preemption.
{{< /caution >}}

{{< note >}}
In Kubernetes 1.15 and later, if the feature `NonPreemptingPriority` is enabled,
PriorityClasses have the option to set `preemptionPolicy: Never`.
This will prevent pods of that PriorityClass from preempting other pods.
{{< /note >}}

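A non-preempting PriorityClass as described in that note could be sketched as (the name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting  # hypothetical class name
value: 1000000
preemptionPolicy: Never  # Pods with this class wait in the queue instead of evicting others
globalDefault: false
description: "High priority, but never preempts running Pods."
```

Pods using such a class still jump ahead of lower priority Pods in the scheduling queue; they just never trigger eviction of running Pods.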
Preemption is controlled by a kube-scheduler flag `disablePreemption`, which is
set to `false` by default.
If you want to disable preemption despite the above note, you can set
`disablePreemption` to `true`.

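If you do decide to disable it, the setting typically lives in the scheduler's component configuration file, passed to kube-scheduler with `--config`. The API group version shown below is an assumption and may differ between releases; check the configuration reference for your version:

```yaml
# Sketch of a kube-scheduler configuration file; the apiVersion is an
# assumption for illustration and varies across Kubernetes releases.
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
# Disable preemption cluster-wide; not recommended (see the caution above).
disablePreemption: true
```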
@ -240,12 +222,12 @@ spec:

### Effect of Pod priority on scheduling order

When Pod priority is enabled, the scheduler orders pending Pods by
their priority and a pending Pod is placed ahead of other pending Pods
with lower priority in the scheduling queue. As a result, the higher
priority Pod may be scheduled sooner than Pods with lower priority if
its scheduling requirements are met. If such a Pod cannot be scheduled, the
scheduler continues and tries to schedule other lower priority Pods.

## Preemption
@ -291,12 +273,12 @@ point that scheduler preempts victims and the time that Pod P is scheduled. In
order to minimize this gap, one can set graceful termination period of lower
priority Pods to zero or a small number.

#### PodDisruptionBudget is supported, but not guaranteed

A [Pod Disruption Budget (PDB)](/docs/concepts/workloads/pods/disruptions/)
allows application owners to limit the number of Pods of a replicated application
that are down simultaneously from voluntary disruptions. Kubernetes supports
PDB when preempting Pods, but respecting PDB is best effort. The scheduler tries
to find victims whose PDB are not violated by preemption, but if no such victims
are found, preemption will still happen, and lower priority Pods will be removed
despite their PDBs being violated.

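For reference, a minimal PDB might be sketched as below. The name, label selector, and the beta API version are illustrative; newer clusters serve `policy/v1` instead:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
spec:
  minAvailable: 2          # keep at least 2 matching Pods up during voluntary disruptions
  selector:
    matchLabels:
      app: web             # hypothetical label on the protected Pods
```

As described above, the scheduler prefers victims whose PDBs stay satisfied, but a tight budget like this does not make those Pods immune to preemption.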
@ -347,28 +329,23 @@ gone, and Pod P could possibly be scheduled on Node N.
We may consider adding cross Node preemption in future versions if there is
enough demand and if we find an algorithm with reasonable performance.

## Troubleshooting

Pod priority and preemption can have unwanted side effects. Here are some
examples of potential problems and ways to deal with them.

### Pods are preempted unnecessarily

Preemption removes existing Pods from a cluster under resource pressure to make
room for higher priority pending Pods. If you give high priorities to
certain Pods by mistake, these unintentionally high priority Pods may cause
preemption in your cluster. Pod priority is specified by setting the
`priorityClassName` field in the Pod's specification. The integer value for
priority is then resolved and populated to the `priority` field of `podSpec`.

To address the problem, you can change the `priorityClassName` for those Pods
to use lower priority classes, or leave that field empty. An empty
`priorityClassName` is resolved to zero by default.

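The field in question sits in the Pod spec. This illustrative Pod references a hypothetical PriorityClass; the Pod name, image, and class name are not from this page:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-high                 # hypothetical Pod name
spec:
  priorityClassName: high-priority # must reference an existing PriorityClass
  containers:
    - name: nginx
      image: nginx
```

Omitting `priorityClassName` entirely gives the Pod priority zero (unless a PriorityClass with `globalDefault: true` exists).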
When a Pod is preempted, there will be events recorded for the preempted Pod.
Preemption should happen only when a cluster does not have enough resources for
@ -377,29 +354,31 @@ Pod (preemptor) is higher than the victim Pods. Preemption must not happen when
there is no pending Pod, or when the pending Pods have equal or lower priority
than the victims. If preemption happens in such scenarios, please file an issue.

### Pods are preempted, but the preemptor is not scheduled

When Pods are preempted, they receive their requested graceful termination
period, which is by default 30 seconds. If the victim Pods do not terminate within
this period, they are forcibly terminated. Once all the victims go away, the
preemptor Pod can be scheduled.

While the preemptor Pod is waiting for the victims to go away, a higher priority
Pod may be created that fits on the same Node. In this case, the scheduler will
schedule the higher priority Pod instead of the preemptor.

This is expected behavior: the Pod with the higher priority should take the place
of a Pod with a lower priority. Other controller actions, such as
[cluster autoscaling](/docs/tasks/administer-cluster/cluster-management/#cluster-autoscaling),
may eventually provide capacity to schedule the pending Pods.

### Higher priority Pods are preempted before lower priority pods

The scheduler tries to find nodes that can run a pending Pod. If no node is
found, the scheduler tries to remove Pods with lower priority from an arbitrary
node in order to make room for the pending Pod.
If a node with low priority Pods is not feasible to run the pending Pod, the scheduler
may choose another node with higher priority Pods (compared to the Pods on the
other node) for preemption. The victims must still have lower priority than the
preemptor Pod.

When there are multiple nodes available for preemption, the scheduler tries to
choose the node with a set of Pods with lowest priority. However, if such Pods
@ -407,13 +386,11 @@ have PodDisruptionBudget that would be violated if they are preempted then the
scheduler may choose another node with higher priority Pods.

When multiple nodes exist for preemption and none of the above scenarios apply,
the scheduler chooses a node with the lowest priority.

## Interactions between Pod priority and quality of service {#interactions-of-pod-priority-and-qos}

Pod priority and {{< glossary_tooltip text="QoS class" term_id="qos-class" >}}
are two orthogonal features with few interactions and no default restrictions on
setting the priority of a Pod based on its QoS classes. The scheduler's
preemption logic does not consider QoS when choosing preemption targets.

@ -424,15 +401,20 @@ to schedule the preemptor Pod, or if the lowest priority Pods are protected by
`PodDisruptionBudget`.

The only component that considers both QoS and Pod priority is
[kubelet out-of-resource eviction](/docs/tasks/administer-cluster/out-of-resource/).
The kubelet ranks Pods for eviction first by whether or not their usage of the
starved resource exceeds requests, then by Priority, and then by the consumption
of the starved compute resource relative to the Pods’ scheduling requests.
See
[evicting end-user pods](/docs/tasks/administer-cluster/out-of-resource/#evicting-end-user-pods)
for more details.

The kubelet out-of-resource eviction does not evict Pods when their
usage does not exceed their requests. If a Pod with lower priority is not
exceeding its requests, it won't be evicted. Another Pod with higher priority
that exceeds its requests may be evicted.

{{% /capture %}}
{{% capture whatsnext %}}
* Read about using ResourceQuotas in connection with PriorityClasses: [limit Priority Class consumption by default](/docs/concepts/policy/resource-quotas/#limit-priority-class-consumption-by-default)
{{% /capture %}}