Clarify pod scheduling during node graceful termination (#41061)

* Clarify Pod scheduling during graceful termination

* Update content/en/docs/concepts/architecture/nodes.md

Co-authored-by: Qiming Teng <tengqm@outlook.com>

---------

Co-authored-by: Qiming Teng <tengqm@outlook.com>
Sergey Kanzhelev 2023-05-15 13:39:35 -07:00 committed by GitHub
parent b9c88e7ffe
commit d22f3b970b
1 changed file with 26 additions and 1 deletion


@@ -396,7 +396,8 @@ The kubelet attempts to detect node system shutdown and terminates pods running
Kubelet ensures that pods follow the normal
[pod termination process](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
during the node shutdown. During node shutdown, the kubelet does not accept new
Pods (even if those Pods are already bound to the node).
The Graceful node shutdown feature depends on systemd since it takes advantage of
[systemd inhibitor locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit/) to
@@ -412,6 +413,20 @@ thus not activating the graceful node shutdown functionality.
To activate the feature, the two kubelet config settings should be configured appropriately and
set to non-zero values.
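As a sketch, the two settings can be placed in the kubelet configuration file; the field names are the real `KubeletConfiguration` fields, while the values below are purely illustrative:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Total time by which the node shutdown is delayed (illustrative value).
shutdownGracePeriod: "30s"
# Portion of shutdownGracePeriod reserved for critical pods;
# must be less than shutdownGracePeriod (illustrative value).
shutdownGracePeriodCriticalPods: "10s"
```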
Once systemd detects or notifies node shutdown, the kubelet sets a `NotReady` condition on
the Node, with the `reason` set to `"node is shutting down"`. The kube-scheduler honors this condition
and does not schedule any Pods onto the affected node; other third-party schedulers are
expected to follow the same logic. This means that new Pods won't be scheduled onto that node
and therefore none will start.
The kubelet **also** rejects Pods during the `PodAdmission` phase if an ongoing
node shutdown has been detected, so that even Pods with a
{{< glossary_tooltip text="toleration" term_id="toleration" >}} for
`node.kubernetes.io/not-ready:NoSchedule` do not start there.
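For illustration, even a Pod carrying such a toleration is rejected at admission while the shutdown is in progress. A hypothetical sketch (the Pod name and container are made up for this example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod   # hypothetical example name
spec:
  tolerations:
  # Tolerating the not-ready taint does not bypass kubelet admission
  # during an ongoing node shutdown.
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```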
At the point when the kubelet is setting that condition on its Node via the API, the kubelet also begins
terminating any Pods that are running locally.
During a graceful shutdown, kubelet terminates pods in two phases:
1. Terminate regular pods running on the node.
@@ -430,6 +445,16 @@ Graceful node shutdown feature is configured with two
[critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
during a node shutdown. This value should be less than `shutdownGracePeriod`.
{{< note >}}
There are cases when node shutdown is cancelled by the system (or perhaps manually
by an administrator). In either of those situations the Node will return to the
`Ready` state. However, Pods that have already started the termination process
will not be restored by the kubelet and will need to be re-scheduled.
{{< /note >}}
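The split between the two shutdown phases follows directly from the two settings; a minimal sketch of the arithmetic (the function name is my own, not part of any Kubernetes API):

```python
def shutdown_phase_windows(shutdown_grace_period: float,
                           critical_pods_grace_period: float) -> tuple[float, float]:
    """Return (regular_pod_window, critical_pod_window) in seconds.

    Sketch of the documented behavior: the first part of
    shutdownGracePeriod is reserved for terminating regular pods, and
    the final shutdownGracePeriodCriticalPods seconds for critical pods.
    """
    if not 0 < critical_pods_grace_period < shutdown_grace_period:
        raise ValueError("shutdownGracePeriodCriticalPods must be "
                         "non-zero and less than shutdownGracePeriod")
    regular = shutdown_grace_period - critical_pods_grace_period
    return regular, critical_pods_grace_period

# With shutdownGracePeriod=30s and shutdownGracePeriodCriticalPods=10s,
# regular pods get the first 20 seconds, critical pods the last 10.
print(shutdown_phase_windows(30, 10))  # → (20, 10)
```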
For example, if `shutdownGracePeriod=30s`, and
`shutdownGracePeriodCriticalPods=10s`, kubelet will delay the node shutdown by
30 seconds. During the shutdown, the first 20 (30-10) seconds would be reserved