document non graceful node shutdown feature (#32406)

Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>
pull/32888/head
Ashutosh Kumar 2022-04-12 19:57:24 +05:30 committed by GitHub
parent 15e978d8db
commit 3005445d34
2 changed files with 65 additions and 0 deletions


@@ -450,6 +450,56 @@ Reason: Terminated
Message: Pod was terminated in response to imminent node shutdown.
```
{{< /note >}}
## Non-graceful node shutdown {#non-graceful-node-shutdown}

{{< feature-state state="alpha" for_k8s_version="v1.24" >}}

A node shutdown action may not be detected by kubelet's Node Shutdown Manager,
either because the command does not trigger the inhibitor locks mechanism used by
the kubelet, or because of a user error, for example, when `ShutdownGracePeriod` and
`ShutdownGracePeriodCriticalPods` are not configured properly. Refer to the section
[Graceful Node Shutdown](#graceful-node-shutdown) above for more details.
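
For reference, both settings live in the kubelet configuration file; a minimal
sketch follows (the durations are illustrative, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Graceful node shutdown is active only when both values are non-zero
shutdownGracePeriod: "30s"
shutdownGracePeriodCriticalPods: "10s"
```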

When a node is shut down but not detected by kubelet's Node Shutdown Manager, the
pods that are part of a StatefulSet will be stuck in terminating status on the
shutdown node and cannot move to a new running node. This is because the kubelet
on the shutdown node is not available to delete the pods, so the StatefulSet
cannot create a new pod with the same name. If there are volumes used by the pods,
the VolumeAttachments will not be deleted from the original shutdown node, so the
volumes used by these pods cannot be attached to a new running node. As a result,
the application running on the StatefulSet cannot function properly. If the
original shutdown node comes up, the pods will be deleted by the kubelet and new
pods will be created on a different running node. If the original shutdown node
does not come up, these pods will be stuck in terminating status on the shutdown
node forever.
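
You can observe this state with commands like the following (illustrative;
`my-node` is a placeholder for the shut-down node's name):

```shell
# Pods stuck in Terminating on the shut-down node
kubectl get pods --all-namespaces --field-selector spec.nodeName=my-node

# VolumeAttachments still bound to the shut-down node
kubectl get volumeattachments
```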

To mitigate the above situation, a user can manually add the taint
`node.kubernetes.io/out-of-service` with either `NoExecute` or `NoSchedule` effect
to a Node, marking it out-of-service.
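
For example (`my-node` is a placeholder; the taint value shown is one possible
choice, since tolerating the taint can be done on the key and effect alone):

```shell
kubectl taint nodes my-node node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
```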

If the `NodeOutOfServiceVolumeDetach`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is
enabled on `kube-controller-manager`, and a Node is marked out-of-service with this
taint, the pods on the node will be forcefully deleted if there are no matching
tolerations on them, and volume detach operations for the pods terminating on the
node will happen immediately. This allows the Pods on the out-of-service node to
recover quickly on a different node.
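
As this is an alpha feature in v1.24, the gate is off by default. Enabling it is a
matter of passing the `--feature-gates` flag to `kube-controller-manager`; how you
set that flag depends on how your control plane is deployed (for example, in a
static Pod manifest):

```shell
kube-controller-manager --feature-gates=NodeOutOfServiceVolumeDetach=true
```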

During a non-graceful shutdown, Pods are terminated in two phases:

1. Force delete the Pods that do not have matching `out-of-service` tolerations
   (see the toleration sketch after this list).
2. Immediately perform detach volume operation for such pods.
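
As a sketch, a Pod that should survive phase 1 would need a toleration matching the
taint, for example (hypothetical Pod; only the `tolerations` stanza is essential):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: out-of-service-tolerant   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  # Matches the out-of-service taint regardless of its value,
  # so this Pod is not force deleted
  - key: node.kubernetes.io/out-of-service
    operator: "Exists"
    effect: "NoExecute"
```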
{{< note >}}
- Before adding the taint `node.kubernetes.io/out-of-service`, verify that the node
  is already in a shutdown or powered-off state (not in the middle of restarting).
- The user who added the out-of-service taint is also required to remove it
  manually after the pods have moved to a new node and the user has confirmed that
  the shutdown node has been recovered (see the example command after this note).
{{< /note >}}
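
Removing a taint uses the same `kubectl taint` syntax with a trailing `-`
(placeholder node name and taint value as in the earlier example):

```shell
kubectl taint nodes my-node node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
```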
### Pod Priority based graceful node shutdown {#pod-priority-graceful-node-shutdown}


@@ -381,6 +381,21 @@ Example: `node.kubernetes.io/pid-pressure:NoSchedule`
The kubelet checks the D-value (difference) between the size of `/proc/sys/kernel/pid_max` and the PIDs consumed by Kubernetes on a node to get the number of available PIDs, reported as the `pid.available` metric. The metric is then compared to the corresponding threshold that can be set on the kubelet, to determine if the node condition and taint should be added or removed.
### node.kubernetes.io/out-of-service

Example: `node.kubernetes.io/out-of-service:NoExecute`

A user can manually add the taint to a Node marking it out-of-service. If the
`NodeOutOfServiceVolumeDetach`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/) is
enabled on `kube-controller-manager`, and a Node is marked out-of-service with this
taint, the pods on the node will be forcefully deleted if there are no matching
tolerations on them, and volume detach operations for the pods terminating on the
node will happen immediately. This allows the Pods on the out-of-service node to
recover quickly on a different node.
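
To confirm whether a node currently carries this taint, one option is (placeholder
node name):

```shell
kubectl get node my-node -o jsonpath='{.spec.taints}'
```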
{{< caution >}}
Refer to
[Non-graceful node shutdown](/docs/concepts/architecture/nodes/#non-graceful-node-shutdown)
for further details about when and how to use this taint.
{{< /caution >}}
### node.cloudprovider.kubernetes.io/uninitialized
Example: `node.cloudprovider.kubernetes.io/uninitialized:NoSchedule`