Doc for Alpha feature PodSchedulingReadiness

2022-11-02 10:22:35 -07:00 · 2022-11-02 10:22:35 -07:00 · 21a7c4cc7e
parent b8fc810198
commit 21a7c4cc7e
5 changed files with 132 additions and 0 deletions
--- a/content/en/docs/concepts/scheduling-eviction/_index.md
+++ b/content/en/docs/concepts/scheduling-eviction/_index.md
@@ -28,6 +28,7 @@ of terminating one or more Pods on Nodes.
 * [Scheduling Framework](/docs/concepts/scheduling-eviction/scheduling-framework)
 * [Scheduler Performance Tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/)
 * [Resource Bin Packing for Extended Resources](/docs/concepts/scheduling-eviction/resource-bin-packing/)
+* [Pod Scheduling Readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness/)

 ## Pod Disruption

--- a/content/en/docs/concepts/scheduling-eviction/pod-scheduling-readiness.md
+++ b/content/en/docs/concepts/scheduling-eviction/pod-scheduling-readiness.md
@@ -0,0 +1,110 @@
+---
+title: Pod Scheduling Readiness
+content_type: concept
+weight: 40
+---
+
+<!-- overview -->
+
+{{< feature-state for_k8s_version="v1.26" state="alpha" >}}
+
+Pods were considered ready for scheduling as soon as they were created. The Kubernetes scheduler
+does its due diligence to find nodes to place all pending Pods. In practice, however,
+some Pods may stay in a "miss-essential-resources" state for a long period.
+These Pods churn the scheduler (and downstream integrators such as Cluster Autoscaler)
+unnecessarily.
+
+By setting or removing a Pod's `.spec.schedulingGates`, you can control when the Pod is ready
+to be considered for scheduling.
+
+<!-- body -->
+
+## Configuring Pod schedulingGates
+
+The `schedulingGates` field contains a list of named gates, and each gate is treated as a
+criterion that must be satisfied before the Pod is considered schedulable. This field can be set
+only when a Pod is created (either by the client, or by mutation during admission). After creation,
+each scheduling gate can be removed in arbitrary order, but adding a new scheduling gate is disallowed.
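+
+Because this feature is alpha in v1.26, the `PodSchedulingReadiness` feature gate is off by default.
+As a rough sketch, assuming you launch the control plane components yourself and that both the
+API server and the scheduler need the gate, you could enable it with the `--feature-gates` flag.
+How flags are actually set depends on how your cluster is deployed:
+
+```bash
+# Assumption: add this flag to the existing command line of each component;
+# all other flags your cluster already uses are omitted here.
+kube-apiserver --feature-gates=PodSchedulingReadiness=true
+kube-scheduler --feature-gates=PodSchedulingReadiness=true
+```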
+
+{{<mermaid>}}
+stateDiagram-v2
+    s1: pod created
+    s2: pod scheduling gated
+    s3: pod scheduling ready
+    s4: pod running
+    if: empty scheduling gates?
+    state if <<choice>>
+    [*] --> s1
+    s1 --> if
+    s2 --> if: scheduling gate removed
+    if --> s2: no
+    if --> s3: yes  
+    s3 --> s4
+    s4 --> [*]
+{{< /mermaid >}}
+
+## Usage example
+
+To mark a Pod not-ready for scheduling, you can create it with one or more scheduling gates like this:
+
+{{< codenew file="pods/pod-with-scheduling-gates.yaml" >}}
+
+After the Pod's creation, you can check its state using:
+
+```bash
+kubectl get pod test-pod
+```
+
+The output shows that the Pod is in the `SchedulingGated` state:
+
+```bash
+NAME       READY   STATUS            RESTARTS   AGE
+test-pod   0/1     SchedulingGated   0          7s
+```
+
+You can also check its `schedulingGates` field by running:
+
+```bash
+kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'
+```
+
+The output is:
+
+```bash
+[{"name":"foo"},{"name":"bar"}]
+```
+
+To inform the scheduler that this Pod is ready for scheduling, you can remove its `schedulingGates`
+entirely by re-applying a modified manifest:
+
+{{< codenew file="pods/pod-without-scheduling-gates.yaml" >}}
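+
+As an alternative sketch, assuming your cluster accepts a JSON Patch against the Pod
+(this approach is not part of the example manifests), you could remove the gates with
+`kubectl patch` instead of re-applying a manifest. Gates can be removed one at a time,
+in any order, or all at once:
+
+```bash
+# Remove only the first gate in the list ("foo" in this example).
+# The Pod stays gated because "bar" is still present.
+kubectl patch pod test-pod --type=json -p='[{"op":"remove","path":"/spec/schedulingGates/0"}]'
+
+# Remove the whole schedulingGates list, so that no gate remains on the Pod.
+kubectl patch pod test-pod --type=json -p='[{"op":"remove","path":"/spec/schedulingGates"}]'
+```
+
+Either way, the result can be verified with the same `jsonpath` query used earlier on this page.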
+
+You can check whether `schedulingGates` has been cleared by running:
+
+```bash
+kubectl get pod test-pod -o jsonpath='{.spec.schedulingGates}'
+```
+
+The output is expected to be empty. You can then check the Pod's latest status by running:
+
+```bash
+kubectl get pod test-pod -o wide
+```
+
+Given that test-pod doesn't request any CPU or memory resources, its state is expected to change
+from `SchedulingGated` to `Running`:
+
+```bash
+NAME       READY   STATUS    RESTARTS   AGE   IP         NODE  
+test-pod   1/1     Running   0          15s   10.0.0.4   node-2
+```
+
+## Observability
+
+The metric `scheduler_pending_pods` has a new value, `gated`, for its `queue` label. It distinguishes
+Pods that the scheduler tried and found unschedulable from Pods explicitly marked as not ready for
+scheduling. You can use `scheduler_pending_pods{queue="gated"}` to check the metric result.
+
+## {{% heading "whatsnext" %}}
+
+* Read the [PodSchedulingReadiness KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/3521-pod-scheduling-readiness) for more details
--- a/content/en/docs/reference/command-line-tools-reference/feature-gates.md
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates.md
@@ -152,6 +152,7 @@ For a reference to old feature gates that are removed, please refer to
 | `PodDeletionCost` | `true` | Beta | 1.22 | |
 | `PodDisruptionConditions` | `false` | Alpha | 1.25 | - |
 | `PodHasNetworkCondition` | `false` | Alpha | 1.25 | |
+| `PodSchedulingReadiness` | `false` | Alpha | 1.26 | |
 | `ProbeTerminationGracePeriod` | `false` | Alpha | 1.21 | 1.21 |
 | `ProbeTerminationGracePeriod` | `false` | Beta | 1.22 | 1.24 |
 | `ProbeTerminationGracePeriod` | `true` | Beta | 1.25 | |
@@ -652,6 +653,7 @@ Each feature gate is designed for enabling/disabling a specific feature:
  pod stats from the CRI container runtime rather than gathering them from cAdvisor.
 - `PodDisruptionConditions`: Enables support for appending a dedicated pod condition indicating that the pod is being deleted due to a disruption.
 - `PodHasNetworkCondition`: Enable the kubelet to mark the [PodHasNetwork](/docs/concepts/workloads/pods/pod-lifecycle/#pod-has-network) condition on pods.
+- `PodSchedulingReadiness`: Enable setting the `schedulingGates` field to control a Pod's [scheduling readiness](/docs/concepts/scheduling-eviction/pod-scheduling-readiness).
 - `PodSecurity`: Enables the `PodSecurity` admission plugin.
 - `PreferNominatedNode`: This flag tells the scheduler whether the nominated
  nodes will be checked first before looping through all the other nodes in
--- a/content/en/examples/pods/pod-with-scheduling-gates.yaml
+++ b/content/en/examples/pods/pod-with-scheduling-gates.yaml
@@ -0,0 +1,11 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  schedulingGates:
+  - name: foo
+  - name: bar
+  containers:
+  - name: pause
+    image: registry.k8s.io/pause:3.6
--- a/content/en/examples/pods/pod-without-scheduling-gates.yaml
+++ b/content/en/examples/pods/pod-without-scheduling-gates.yaml
@@ -0,0 +1,8 @@
+apiVersion: v1
+kind: Pod
+metadata:
+  name: test-pod
+spec:
+  containers:
+  - name: pause
+    image: registry.k8s.io/pause:3.6