From 3005445d343c0be945836f2d60c096846a177b72 Mon Sep 17 00:00:00 2001
From: Ashutosh Kumar
Date: Tue, 12 Apr 2022 19:57:24 +0530
Subject: [PATCH] document non graceful node shutdown feature (#32406)

Signed-off-by: Ashutosh Kumar
---
 .../en/docs/concepts/architecture/nodes.md | 50 +++++++++++++++++++
 .../labels-annotations-taints/_index.md    | 15 ++++++
 2 files changed, 65 insertions(+)

diff --git a/content/en/docs/concepts/architecture/nodes.md b/content/en/docs/concepts/architecture/nodes.md
index 4d3534492e..48fa4a782f 100644
--- a/content/en/docs/concepts/architecture/nodes.md
+++ b/content/en/docs/concepts/architecture/nodes.md
@@ -450,6 +450,56 @@ Reason: Terminated
 Message: Pod was terminated in response to imminent node shutdown.
 ```
+{{< /note >}}
+
+## Non-graceful node shutdown {#non-graceful-node-shutdown}
+
+{{< feature-state state="alpha" for_k8s_version="v1.24" >}}
+
+A node shutdown action may not be detected by the kubelet's Node Shutdown Manager,
+either because the command does not trigger the inhibitor locks mechanism used by
+the kubelet, or because of a user error, for example, `ShutdownGracePeriod` and
+`ShutdownGracePeriodCriticalPods` are not configured properly. Refer to the
+[Graceful Node Shutdown](#graceful-node-shutdown) section above for more details.
+
+When a node is shut down but not detected by the kubelet's Node Shutdown Manager,
+the pods that are part of a StatefulSet will be stuck in terminating status on
+the shutdown node and cannot move to a new running node. This is because the
+kubelet on the shutdown node is not available to delete the pods, so the
+StatefulSet cannot create a new pod with the same name. If there are volumes
+used by the pods, the VolumeAttachments will not be deleted from the original
+shutdown node, so the volumes used by these pods cannot be attached to a new
+running node. As a result, the application running on the StatefulSet cannot
+function properly. If the original shutdown node comes up, the pods will be
+deleted by the kubelet and new pods will be created on a different running node.
+If the original shutdown node does not come up, these pods will be stuck in
+terminating status on the shutdown node forever.
+
+To mitigate the above situation, a user can manually add the taint
+`node.kubernetes.io/out-of-service` with either the `NoExecute` or `NoSchedule`
+effect to a Node, marking it out-of-service, as in the example below.
+If the `NodeOutOfServiceVolumeDetach`
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+is enabled on `kube-controller-manager`, and a Node is marked out-of-service
+with this taint, the pods on the node will be forcefully deleted if they have
+no matching tolerations, and volume detach operations for the pods terminating
+on the node will happen immediately. This allows the Pods on the out-of-service
+node to recover quickly on a different node.
+
+During a non-graceful shutdown, Pods are terminated in two phases:
+
+1. Force delete the Pods that do not have matching `out-of-service` tolerations.
+2. Immediately perform detach volume operations for such pods.
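+
+For example, assuming the shutdown or power-off of a node named `node-1` (a
+placeholder name) has already been verified, the taint could be applied and
+later removed as follows; the taint value (`nodeshutdown` here) is an
+arbitrary illustration:
+
+```shell
+# Mark the shut-down node as out-of-service so that its pods are force
+# deleted and their volume detach operations happen immediately.
+kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
+
+# After the pods have moved to a new node and node-1 has recovered,
+# remove the taint manually (note the trailing `-`).
+kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
+```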
+
+{{< note >}}
+- Before adding the taint `node.kubernetes.io/out-of-service`, verify that the
+node is already in a shutdown or powered-off state (not in the middle of
+restarting).
+- The user who originally added the taint is required to remove it manually,
+after checking that the pods have moved to a new node and that the shutdown
+node has been recovered.
+
 {{< /note >}}
 
 ### Pod Priority based graceful node shutdown {#pod-priority-graceful-node-shutdown}

diff --git a/content/en/docs/reference/labels-annotations-taints/_index.md b/content/en/docs/reference/labels-annotations-taints/_index.md
index dcb3e5f13d..bae7f3f14c 100644
--- a/content/en/docs/reference/labels-annotations-taints/_index.md
+++ b/content/en/docs/reference/labels-annotations-taints/_index.md
@@ -381,6 +381,21 @@ Example: `node.kubernetes.io/pid-pressure:NoSchedule`
 
 The kubelet checks D-value of the size of `/proc/sys/kernel/pid_max` and the PIDs consumed by Kubernetes on a node to get the number of available PIDs that referred to as the `pid.available` metric. The metric is then compared to the corresponding threshold that can be set on the kubelet to determine if the node condition and taint should be added/removed.
 
+### node.kubernetes.io/out-of-service
+
+Example: `node.kubernetes.io/out-of-service:NoExecute`
+
+A user can manually add the taint to a Node, marking it out-of-service. If the
+`NodeOutOfServiceVolumeDetach`
+[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
+is enabled on `kube-controller-manager`, and a Node is marked out-of-service
+with this taint, the pods on the node will be forcefully deleted if they have
+no matching tolerations, and volume detach operations for the pods terminating
+on the node will happen immediately. This allows the Pods on the
+out-of-service node to recover quickly on a different node.
+
+{{< caution >}}
+Refer to
+[Non-graceful node shutdown](/docs/concepts/architecture/nodes/#non-graceful-node-shutdown)
+for further details about when and how to use this taint.
+{{< /caution >}}
+
 ### node.cloudprovider.kubernetes.io/uninitialized
 
 Example: `node.cloudprovider.kubernetes.io/uninitialized:NoSchedule`
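+
+The behavior tied to `node.kubernetes.io/out-of-service` only applies while the
+`NodeOutOfServiceVolumeDetach` feature gate is enabled; as an alpha gate in
+v1.24 it is off by default. A minimal sketch of enabling it on the
+`kube-controller-manager` command line (all other required flags omitted for
+brevity):
+
+```shell
+# Enable the alpha feature gate on kube-controller-manager; without it,
+# the out-of-service taint triggers no force deletion or immediate detach.
+kube-controller-manager --feature-gates=NodeOutOfServiceVolumeDetach=true
+```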