From d22f3b970b32ed76ae44f0579c81007d429049ef Mon Sep 17 00:00:00 2001 From: Sergey Kanzhelev Date: Mon, 15 May 2023 13:39:35 -0700 Subject: [PATCH] Clarify pod scheduling during node graceful termination (#41061) * clarify the pods scheduling during graceful termination: * Update content/en/docs/concepts/architecture/nodes.md Co-authored-by: Qiming Teng --------- Co-authored-by: Qiming Teng --- .../en/docs/concepts/architecture/nodes.md | 27 ++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/content/en/docs/concepts/architecture/nodes.md b/content/en/docs/concepts/architecture/nodes.md index e963f4fc0e6..fedadc40ef7 100644 --- a/content/en/docs/concepts/architecture/nodes.md +++ b/content/en/docs/concepts/architecture/nodes.md @@ -396,7 +396,8 @@ The kubelet attempts to detect node system shutdown and terminates pods running Kubelet ensures that pods follow the normal [pod termination process](/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination) -during the node shutdown. +during the node shutdown. During node shutdown, the kubelet does not accept new +Pods (even if those Pods are already bound to the node). The Graceful node shutdown feature depends on systemd since it takes advantage of [systemd inhibitor locks](https://www.freedesktop.org/wiki/Software/systemd/inhibit/) to @@ -412,6 +413,20 @@ thus not activating the graceful node shutdown functionality. To activate the feature, the two kubelet config settings should be configured appropriately and set to non-zero values. +Once systemd detects or notifies node shutdown, the kubelet sets a `NotReady` condition on +the Node, with the `reason` set to `"node is shutting down"`. The kube-scheduler honors this condition +and does not schedule any Pods onto the affected node; other third-party schedulers are +expected to follow the same logic. This means that new Pods won't be scheduled onto that node +and therefore none will start. + +The kubelet **also** rejects Pods during the `PodAdmission` phase if an ongoing +node shutdown has been detected, so that even Pods with a +{{< glossary_tooltip text="toleration" term_id="toleration" >}} for +`node.kubernetes.io/not-ready:NoSchedule` do not start there. + +At the same time when kubelet is setting that condition on its Node via the API, the kubelet also begins +terminating any Pods that are running locally. + During a graceful shutdown, kubelet terminates pods in two phases: 1. Terminate regular pods running on the node. @@ -430,6 +445,16 @@ Graceful node shutdown feature is configured with two [critical pods](/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/#marking-pod-as-critical) during a node shutdown. This value should be less than `shutdownGracePeriod`. +{{< note >}} + +There are cases when Node termination was cancelled by the system (or perhaps manually +by an administrator). In either of those situations the +Node will return to the `Ready` state. However Pods which already started the process +of termination +will not be restored by kubelet and will need to be re-scheduled. + +{{< /note >}} + For example, if `shutdownGracePeriod=30s`, and `shutdownGracePeriodCriticalPods=10s`, kubelet will delay the node shutdown by 30 seconds. During the shutdown, the first 20 (30-10) seconds would be reserved