---
layout: blog
title: 'Kubernetes 1.24: Maximum Unavailable Replicas for StatefulSet'
date: 2022-05-27
slug: maxunavailable-for-statefulset
---

**Author:** Mayank Kumar (Salesforce)

Kubernetes [StatefulSets](/docs/concepts/workloads/controllers/statefulset/), introduced in 1.5 and made stable in 1.9, have been widely used to run stateful applications. They provide stable pod identity, persistent per-pod storage, and ordered, graceful deployment, scaling and rolling updates. You can think of a StatefulSet as the atomic building block for running complex stateful applications. As the use of Kubernetes has grown, so has the number of scenarios requiring StatefulSets. Many of these scenarios require faster rolling updates than the currently supported one-pod-at-a-time updates, in the case where you're using the `OrderedReady` Pod management policy for a StatefulSet.

Here are some examples:

- I am using a StatefulSet to orchestrate a multi-instance, cache-based application where the size of the cache is large. The cache starts cold and requires a significant amount of time before the container can start, and there may be more initial startup tasks beyond that. A RollingUpdate on this StatefulSet would take a lot of time before the application is fully updated. If the StatefulSet supported updating more than one pod at a time, the update would complete much faster.

- My stateful application is composed of leaders and followers, or of one writer and multiple readers. I have multiple readers or followers, and my application can tolerate multiple pods going down at the same time. I want to update this application more than one pod at a time so that I get the new updates rolled out quickly, especially if the number of instances of my application is large. Note that my application still requires a unique identity per pod.

In order to support such scenarios, Kubernetes 1.24 includes a new alpha feature to help. Before you can use the new feature you must enable the `MaxUnavailableStatefulSet` feature flag. Once you enable that, you can specify a new field called `maxUnavailable`, part of the `spec` for a StatefulSet. For example:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  podManagementPolicy: OrderedReady  # you must set OrderedReady
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: k8s.gcr.io/nginx-slim:0.8
        imagePullPolicy: IfNotPresent
        name: nginx
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2  # this is the new alpha field, whose default value is 1
      partition: 0
    type: RollingUpdate
```
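
Because this is an alpha feature in Kubernetes 1.24, the `MaxUnavailableStatefulSet` feature gate has to be enabled on the control plane (the kube-apiserver and kube-controller-manager) before the field above is honored. As a minimal sketch, assuming you use [kind](https://kind.sigs.k8s.io/) for a local test cluster (any other way of setting `--feature-gates=MaxUnavailableStatefulSet=true` on those components works too; the file and cluster names below are just examples):

```yaml
# kind-maxunavailable.yaml - local test cluster with the alpha feature gate enabled
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  MaxUnavailableStatefulSet: true
nodes:
  - role: control-plane
  - role: worker
```

```shell
# Create a Kubernetes 1.24 cluster from that config
kind create cluster --name maxunavailable-demo --config kind-maxunavailable.yaml --image kindest/node:v1.24.0
```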
|
If you enable the new feature and you don't specify a value for `maxUnavailable` in a StatefulSet, Kubernetes applies a default `maxUnavailable: 1`. This matches the behavior you would see if you don't enable the new feature.

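If you want to double-check what the API server defaulted, you can read the field back; a quick sketch, assuming the `web` StatefulSet from the example above has already been created:

```shell
# Prints the persisted value of the new field; with the gate on and the field left unset, the default is 1
kubectl get statefulset web -o jsonpath='{.spec.updateStrategy.rollingUpdate.maxUnavailable}{"\n"}'
```
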
I'll run through a scenario based on that example manifest to demonstrate how this feature works. I will deploy a StatefulSet that has 5 replicas, with `maxUnavailable` set to 2 and `partition` set to 0.

I can trigger a rolling update by changing the image to `k8s.gcr.io/nginx-slim:0.9`. Once I initiate the rolling update, I can watch the pods update 2 at a time, as the current value of `maxUnavailable` is 2. The output below shows a span of time and is not complete. The `maxUnavailable` value can be an absolute number (for example, 2) or a percentage of desired Pods (for example, 10%); the absolute number is calculated from the percentage by rounding down. For instance, with the 5 replicas in this example, `maxUnavailable: 40%` also allows 2 Pods to be unavailable at a time.
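
For example, you can change the image in place to kick off the rolling update (a sketch; editing the manifest and re-applying it works just as well):

```shell
# Update the image of the "nginx" container in the "web" StatefulSet
kubectl set image statefulset/web nginx=k8s.gcr.io/nginx-slim:0.9
```
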
```shell
kubectl get pods --watch
```

```
NAME    READY   STATUS              RESTARTS   AGE
web-0   1/1     Running             0          85s
web-1   1/1     Running             0          2m6s
web-2   1/1     Running             0          106s
web-3   1/1     Running             0          2m47s
web-4   1/1     Running             0          2m27s
web-4   1/1     Terminating         0          5m43s   ----> start terminating 4
web-3   1/1     Terminating         0          6m3s    ----> start terminating 3
web-3   0/1     Terminating         0          6m7s
web-3   0/1     Pending             0          0s
web-3   0/1     Pending             0          0s
web-4   0/1     Terminating         0          5m48s
web-4   0/1     Terminating         0          5m48s
web-3   0/1     ContainerCreating   0          2s
web-3   1/1     Running             0          2s
web-4   0/1     Pending             0          0s
web-4   0/1     Pending             0          0s
web-4   0/1     ContainerCreating   0          0s
web-4   1/1     Running             0          1s
web-2   1/1     Terminating         0          5m46s   ----> start terminating 2 (only after both 4 and 3 are running)
web-1   1/1     Terminating         0          6m6s    ----> start terminating 1
web-2   0/1     Terminating         0          5m47s
web-1   0/1     Terminating         0          6m7s
web-1   0/1     Pending             0          0s
web-1   0/1     Pending             0          0s
web-1   0/1     ContainerCreating   0          1s
web-1   1/1     Running             0          2s
web-2   0/1     Pending             0          0s
web-2   0/1     Pending             0          0s
web-2   0/1     ContainerCreating   0          0s
web-2   1/1     Running             0          1s
web-0   1/1     Terminating         0          6m6s    ----> start terminating 0 (only after 2 and 1 are running)
web-0   0/1     Terminating         0          6m7s
web-0   0/1     Pending             0          0s
web-0   0/1     Pending             0          0s
web-0   0/1     ContainerCreating   0          0s
web-0   1/1     Running             0          1s
```

Note that as soon as the rolling update starts, both 4 and 3 (the two highest ordinal pods) start terminating at the same time. Pods with ordinal 4 and 3 may become ready at their own pace. As soon as both pods 4 and 3 are ready, pods 2 and 1 start terminating at the same time. When pods 2 and 1 are both running and ready, pod 0 starts terminating.

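As an aside, if you would rather block until the whole update completes instead of watching individual Pods, `kubectl rollout status` works for StatefulSets too:

```shell
# Waits until every pod in the StatefulSet has been updated and is ready
kubectl rollout status statefulset/web
```
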
In Kubernetes, updates to StatefulSets follow a strict ordering when updating Pods. In this example, the update starts at replica 4, then replica 3, then replica 2, and so on, one pod at a time. When going one pod at a time, it's not possible for replica 3 to be running and ready before replica 4. When `maxUnavailable` is more than 1 (in the example scenario I set `maxUnavailable` to 2), it is possible that replica 3 becomes ready and running before replica 4 is ready, and that is ok. If you're a developer and you set `maxUnavailable` to more than 1, you should know that this outcome is possible and you must ensure that your application is able to handle such ordering issues, if any occur. When you set `maxUnavailable` greater than 1, the ordering is still guaranteed between each batch of pods being updated. That guarantee means that pods in the second update batch (replicas 2 and 1) cannot start updating until the pods from the first batch (replicas 4 and 3) are ready.

Although Kubernetes refers to these as _replicas_, your stateful application may have a different view, and each pod of the StatefulSet may be holding completely different data than the other pods. The important thing here is that updates to StatefulSets happen in batches, and you can now have a batch size larger than 1 (as an alpha feature).

Also note that the above behavior applies with `podManagementPolicy: OrderedReady`. If you defined a StatefulSet with `podManagementPolicy: Parallel`, not only are `maxUnavailable` replicas terminated at the same time; `maxUnavailable` replicas also start in the `ContainerCreating` phase at the same time. This is called bursting.

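For reference, here is roughly what that variant's configuration looks like; this is a sketch showing only the fields that differ from the earlier manifest (note that `podManagementPolicy` can only be set when the StatefulSet is created):

```yaml
spec:
  podManagementPolicy: Parallel   # enables bursting during updates when maxUnavailable > 1
  replicas: 5
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
      partition: 0
```
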
So, you may now have a lot of questions, such as:

- What is the behavior when you set `podManagementPolicy: Parallel`?
- What is the behavior when you set `partition` to a value other than `0`?

It might be better to try it out and see for yourself; one way to experiment is sketched below.

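Assuming the `web` StatefulSet from earlier (the particular values below are arbitrary), you can change the rolling update parameters in place and then watch how the next update behaves:

```shell
# Allow up to 50% of the desired Pods to be unavailable during updates,
# and only update Pods whose ordinal is greater than or equal to 2
kubectl patch statefulset web -p \
  '{"spec":{"updateStrategy":{"rollingUpdate":{"maxUnavailable":"50%","partition":2}}}}'

# Trigger another rolling update to observe the effect
kubectl set image statefulset/web nginx=k8s.gcr.io/nginx-slim:0.8
```
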
This is an alpha feature, and the Kubernetes contributors are looking for feedback on it. Did this feature help you achieve your stateful scenarios? Did you find a bug, or do you think the behavior as implemented is not intuitive or can break applications or catch them by surprise? Please [open an issue](https://github.com/kubernetes/kubernetes/issues) to let us know.

Keep an eye on this space for more blogs that dissect the behavior of this feature in the coming months.

## Further reading and next steps {#next-steps}

- [Maximum unavailable Pods](/docs/concepts/workloads/controllers/statefulset/#maximum-unavailable-pods)
- [KEP for MaxUnavailable for StatefulSet](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/961-maxunavailable-for-statefulset)
- [Implementation](https://github.com/kubernetes/kubernetes/pull/82162/files)
- [Enhancement Tracking Issue](https://github.com/kubernetes/enhancements/issues/961)