From d22f3b970b32ed76ae44f0579c81007d429049ef Mon Sep 17 00:00:00 2001
From: Sergey Kanzhelev <S.Kanzhelev@live.com>
Date: Mon, 15 May 2023 13:39:35 -0700
Subject: [PATCH] Clarify pod scheduling during node graceful termination
 (#41061)

* Clarify Pod scheduling during graceful node termination

* Update content/en/docs/concepts/architecture/nodes.md

Co-authored-by: Qiming Teng <tengqm@outlook.com>

---------

Co-authored-by: Qiming Teng <tengqm@outlook.com>
---
 .../en/docs/concepts/architecture/nodes.md    | 27 ++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/content/en/docs/concepts/architecture/nodes.md b/content/en/docs/concepts/architecture/nodes.md
index e963f4fc0e6..fedadc40ef7 100644
--- a/content/en/docs/concepts/architecture/nodes.md
+++ b/content/en/docs/concepts/architecture/nodes.md
@@ -396,7 +396,8 @@ The kubelet attempts to detect node system shutdown and terminates pods running
 
 Kubelet ensures that pods follow the normal
 [pod termination process](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination)
-during the node shutdown.
+during the node shutdown. During node shutdown, the kubelet does not accept new
+Pods (even if those Pods are already bound to the node).
 
 The Graceful node shutdown feature depends on systemd since it takes advantage of
 [systemd inhibitor locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit/) to
@@ -412,6 +413,20 @@ thus not activating the graceful node shutdown functionality.
 To activate the feature, the two kubelet config settings should be configured appropriately and
 set to non-zero values.
 
+Once systemd detects or signals a node shutdown, the kubelet sets a `NotReady` condition on
+the Node, with the `reason` set to `"node is shutting down"`. The kube-scheduler honors this condition
+and does not schedule any Pods onto the affected node; other third-party schedulers are
+expected to follow the same logic. This means that new Pods won't be scheduled onto that node
+and therefore none will start.
+
+The kubelet **also** rejects Pods during the `PodAdmission` phase if an ongoing
+node shutdown has been detected, so that even Pods with a
+{{< glossary_tooltip text="toleration" term_id="toleration" >}} for
+`node.kubernetes.io/not-ready:NoSchedule` do not start there.
+
+At the same time as the kubelet sets that condition on its Node via the API, it also begins
+terminating any Pods that are running locally.
+
 During a graceful shutdown, kubelet terminates pods in two phases:
 
 1. Terminate regular pods running on the node.
@@ -430,6 +445,16 @@ Graceful node shutdown feature is configured with two
     [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical)
     during a node shutdown. This value should be less than `shutdownGracePeriod`.
 
+{{< note >}}
+
+There are cases when Node termination is cancelled by the system (or perhaps manually
+by an administrator). In either of those situations the
+Node will return to the `Ready` state. However, Pods that have already started the process
+of termination
+will not be restored by the kubelet and will need to be re-scheduled.
+
+{{< /note >}}
+
 For example, if `shutdownGracePeriod=30s`, and
 `shutdownGracePeriodCriticalPods=10s`, kubelet will delay the node shutdown by
 30 seconds. During the shutdown, the first 20 (30-10) seconds would be reserved