---
layout: blog
title: "Kubernetes 1.26: Pod Scheduling Readiness"
date: 2022-12-26
slug: pod-scheduling-readiness-alpha
author: >
  Wei Huang (Apple),
  Abdullah Gharaibeh (Google)
---

Kubernetes 1.26 introduced a new Pod feature: _scheduling gates_. In Kubernetes, scheduling gates
are keys that tell the scheduler when a Pod is ready to be considered for scheduling.

## What problem does it solve?

When a Pod is created, the scheduler continuously attempts to find a node that fits it. This
retry loop continues until the scheduler either finds a node for the Pod or the Pod gets deleted.

Pods that remain unschedulable for long periods of time (e.g., ones that are blocked on some external event)
waste scheduling cycles. A scheduling cycle may take roughly 20 ms or more, depending on the complexity of
the Pod's scheduling constraints. Therefore, at scale, those wasted cycles significantly impact the
scheduler's performance. See the arrows in the "scheduler" box below.

{{< mermaid >}}
graph LR;
    pod((New Pod))-->queue
    subgraph Scheduler
        queue(scheduler queue)
        sched_cycle[/scheduling cycle/]
        schedulable{schedulable?}

        queue==>|Pop out|sched_cycle
        sched_cycle==>schedulable
        schedulable==>|No|queue
        subgraph note [Cycles wasted on repeatedly rescheduling 'unready' Pods]
        end
    end

classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff;
classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px;
class queue,sched_cycle,schedulable k8s;
class pod plain;
class note note;
class Scheduler Scheduler;
{{< /mermaid >}}

Scheduling gates help address this problem. They let you declare that newly created Pods are not
ready for scheduling. When scheduling gates are present on a Pod, the scheduler ignores the Pod
and therefore avoids unnecessary scheduling attempts. Those Pods are also ignored by Cluster
Autoscaler if you have it installed in the cluster.

Clearing the gates is the responsibility of external controllers that know when the Pod
should be considered for scheduling (e.g., a quota manager).

{{< mermaid >}}
graph LR;
    pod((New Pod))-->queue
    subgraph Scheduler
        queue(scheduler queue)
        sched_cycle[/scheduling cycle/]
        schedulable{schedulable?}
        popout{Pop out?}

        queue==>|PreEnqueue check|popout
        popout-->|Yes|sched_cycle
        popout==>|No|queue
        sched_cycle-->schedulable
        schedulable-->|No|queue
        subgraph note [A knob to gate Pod's scheduling]
        end
    end

classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000;
classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff;
classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5;
classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px;
classDef popout fill:#f96,stroke:#fff,stroke-width:1px;
class queue,sched_cycle,schedulable k8s;
class pod plain;
class note note;
class popout popout;
class Scheduler Scheduler;
{{< /mermaid >}}

## How does it work?

Scheduling gates work much like finalizers. A Pod with a non-empty
`spec.schedulingGates` field shows the status `SchedulingGated` and is blocked from
scheduling. More than one gate can be added, but all of them should be added when the Pod
is created (e.g., as part of the spec or via a mutating webhook).

```
NAME       READY   STATUS            RESTARTS   AGE
test-pod   0/1     SchedulingGated   0          10s
```

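As a minimal sketch of what such a Pod could look like, the snippet below builds a manifest with a
populated `spec.schedulingGates` list and prints it as JSON. The gate names (`example.com/foo`,
`example.com/bar`), the helper name, and the placeholder container image are assumptions chosen
for illustration only.

```python
import json


def gated_pod_manifest(name, gate_names):
    """Return a Pod manifest (as a dict) that starts out scheduling-gated."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Each gate is an object with a single `name` key.
            "schedulingGates": [{"name": gate} for gate in gate_names],
            "containers": [
                # Placeholder container; the image is an assumption.
                {"name": "pause", "image": "registry.k8s.io/pause:3.6"},
            ],
        },
    }


# Hypothetical gate names, used only for this example.
manifest = gated_pod_manifest("test-pod", ["example.com/foo", "example.com/bar"])
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, since kubectl
accepts JSON manifests as well as YAML.
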
To clear the gates, update the Pod by removing all of the items from its `schedulingGates`
field. The gates do not need to be removed all at once, but the scheduler only starts to
consider the Pod for scheduling once all of them have been removed.

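One way a controller might remove a single gate is with a JSON Patch (RFC 6902) against the Pod.
The sketch below only builds the patch document from a Pod dict; the helper names and the gate
names are hypothetical, and sending the patch to the API server is left out.

```python
import json


def gate_removal_patch(pod, gate_name):
    """Build a JSON Patch that removes one scheduling gate, looked up by name."""
    gates = pod.get("spec", {}).get("schedulingGates", [])
    for index, gate in enumerate(gates):
        if gate["name"] == gate_name:
            path = f"/spec/schedulingGates/{index}"
            return [
                # Guard: the patch fails if the list changed since it was read.
                {"op": "test", "path": f"{path}/name", "value": gate_name},
                {"op": "remove", "path": path},
            ]
    return []  # Gate not present: nothing to do.


def is_gated(pod):
    """A Pod stays gated for as long as any gate remains in the list."""
    return bool(pod.get("spec", {}).get("schedulingGates"))


# Hypothetical Pod with two gates; only the second one is cleared here.
pod = {"spec": {"schedulingGates": [{"name": "example.com/foo"},
                                    {"name": "example.com/bar"}]}}
patch = gate_removal_patch(pod, "example.com/bar")
print(json.dumps(patch))
```

A patch like this could be submitted with `kubectl patch pod test-pod --type=json -p '<patch>'`.
The Pod in this example would remain gated afterwards, because `example.com/foo` is still present.
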
Under the hood, scheduling gates are implemented as a scheduler plugin that runs at PreEnqueue,
a new scheduler framework extension point that is invoked at the beginning of each scheduling cycle.

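The effect of such a check can be illustrated with a toy model. The code below is not the
scheduler's implementation: it assumes a fixed number of queue passes, treats every gated Pod as
one that cannot be placed yet, and simply counts how many scheduling cycles are spent with and
without a pre-enqueue check. All names in it are hypothetical.

```python
def count_cycles(pods, rounds, pre_enqueue):
    """Count scheduling cycles in a toy queue model.

    `pods` maps a Pod name to its list of gates. A Pod with gates stands for
    one that is still waiting on an external event, so a cycle spent on it fails.
    """
    pending = dict(pods)
    cycles = 0
    for _ in range(rounds):
        for name in list(pending):
            gated = bool(pending[name])
            if pre_enqueue and gated:
                continue  # check fails: the Pod is never popped for scheduling
            cycles += 1  # the Pod is popped and one scheduling cycle runs
            if not gated:
                del pending[name]  # placed on a node, so it is never retried
    return cycles


# Two ready Pods and three Pods waiting on a hypothetical gate.
pods = {
    "ready-1": [],
    "ready-2": [],
    "waiting-1": ["example.com/foo"],
    "waiting-2": ["example.com/foo"],
    "waiting-3": ["example.com/foo"],
}
without_check = count_cycles(pods, rounds=10, pre_enqueue=False)
with_check = count_cycles(pods, rounds=10, pre_enqueue=True)
print(without_check, with_check)  # prints "32 2"
```

In this model the three waiting Pods account for 30 of the 32 cycles when no check is in place,
which is the waste that gating is meant to avoid.
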
## Use Cases

An important use case this feature enables is dynamic quota management. Kubernetes supports
[ResourceQuota](/docs/concepts/policy/resource-quotas/); however, the API Server enforces quota at
the time you attempt Pod creation. For example, if a new Pod exceeds the CPU quota, it gets rejected.
The API Server doesn't queue the Pod; therefore, whoever created the Pod needs to repeatedly attempt
to recreate it. This means either a delay between resources becoming available and the Pod
actually running, or load on the API server and scheduler due to constant attempts.

Scheduling gates allow an external quota manager to address the above limitation of ResourceQuota.
Specifically, the manager could add an `example.com/quota-check` scheduling gate to all Pods created in the
cluster (using a mutating webhook). The manager would then remove the gate when there is quota to
start the Pod.

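The flow described above can be sketched in memory. The code below is a simplified illustration,
not a real webhook or controller: `admit` stands in for the mutating webhook, `release` stands in
for the quota manager, and the top-level `name` and `cpu` keys are simplifications rather than
Pod API fields. The first-fit release policy is also an assumption.

```python
QUOTA_GATE = "example.com/quota-check"


def admit(pod):
    """Stand-in for the mutating webhook: add the quota gate at creation."""
    gates = pod["spec"].setdefault("schedulingGates", [])
    if all(gate["name"] != QUOTA_GATE for gate in gates):
        gates.append({"name": QUOTA_GATE})
    return pod


def release(pods, free_cpu):
    """Stand-in for the quota manager: ungate Pods while CPU quota remains.

    Returns the names of the Pods that were ungated and the quota left over.
    """
    released = []
    for pod in pods:
        gates = pod["spec"].get("schedulingGates", [])
        if all(gate["name"] != QUOTA_GATE for gate in gates):
            continue  # this Pod is not waiting on quota
        if pod["cpu"] <= free_cpu:
            free_cpu -= pod["cpu"]
            # Remove only the quota gate; gates owned by others are kept.
            pod["spec"]["schedulingGates"] = [
                gate for gate in gates if gate["name"] != QUOTA_GATE
            ]
            released.append(pod["name"])
    return released, free_cpu


# Three hypothetical Pods requesting 2, 3, and 1 CPUs, with 3 CPUs of quota free.
pods = [admit({"name": n, "cpu": c, "spec": {}}) for n, c in [("a", 2), ("b", 3), ("c", 1)]]
released, remaining = release(pods, free_cpu=3)
print(released, remaining)  # prints "['a', 'c'] 0"
```

Pod `b` keeps its gate and stays out of the scheduler's way until a later call to `release` finds
enough quota for it, so nobody has to keep recreating it.
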
## What's next?

To use this feature, the `PodSchedulingReadiness` feature gate must be enabled in the API Server
and scheduler. You're more than welcome to test it out and tell us (SIG Scheduling) what you think!

## Additional resources

- [Pod Scheduling Readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)
  in the Kubernetes documentation
- [Kubernetes Enhancement Proposal](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness/README.md)