---
layout: blog
title: 'Kubernetes 1.24: Maximum Unavailable Replicas for StatefulSet'
date: 2022-05-27
slug: maxunavailable-for-statefulset
---

**Author:** Mayank Kumar (Salesforce)

Kubernetes [StatefulSets](/docs/concepts/workloads/controllers/statefulset/), since their introduction in 1.5 and becoming stable in 1.9, have been widely used to run stateful applications. They provide stable pod identity, persistent per-pod storage, and ordered, graceful deployment, scaling, and rolling updates. You can think of a StatefulSet as the atomic building block for running complex stateful applications. As the use of Kubernetes has grown, so has the number of scenarios requiring StatefulSets. Many of these scenarios require faster rolling updates than the currently supported one-pod-at-a-time updates when you're using the `OrderedReady` Pod management policy for a StatefulSet.

Here are some examples:

- I am using a StatefulSet to orchestrate a multi-instance, cache-based application where the size of the cache is large. The cache starts cold and requires a significant amount of time before the container can start, and there could be additional initial startup tasks required. A RollingUpdate on this StatefulSet would take a lot of time before the application is fully updated. If the StatefulSet supported updating more than one pod at a time, the update would complete much faster.

- My stateful application is composed of leaders and followers, or one writer and multiple readers. I have multiple readers or followers, and my application can tolerate multiple pods going down at the same time. I want to update this application more than one pod at a time so that I get the new updates rolled out quickly, especially if the number of instances of my application is large. Note that my application still requires a unique identity per pod.

In order to support such scenarios, Kubernetes 1.24 includes a new alpha feature to help. Before you can use the new feature you must enable the `MaxUnavailableStatefulSet` feature gate. Once you enable that, you can specify a new field called `maxUnavailable` as part of the StatefulSet `spec`, under `.spec.updateStrategy.rollingUpdate`. For example:

```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  namespace: default
spec:
  podManagementPolicy: OrderedReady  # you must set OrderedReady
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        # image changed since publication (previously used registry "k8s.gcr.io")
        - image: registry.k8s.io/nginx-slim:0.8
          imagePullPolicy: IfNotPresent
          name: nginx
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 2 # this is the new alpha field, whose default value is 1
      partition: 0
    type: RollingUpdate
```

If you enable the new feature and you don't specify a value for `maxUnavailable` in a StatefulSet, Kubernetes applies a default `maxUnavailable: 1`. This matches the behavior you would see if you don't enable the new feature.

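Because this is an alpha feature, the `MaxUnavailableStatefulSet` feature gate has to be switched on for the control plane before the new field is accepted. As a rough sketch of one way to try it locally (using a kind cluster here is my own choice for illustration, not something the post prescribes; any way of setting control plane feature gates works):

```
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
# turn on the alpha feature gate for all control plane components in this test cluster
featureGates:
  MaxUnavailableStatefulSet: true
```

On clusters you run yourself, the equivalent is passing `--feature-gates=MaxUnavailableStatefulSet=true` to the kube-apiserver and kube-controller-manager.
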
I'll run through a scenario based on that example manifest to demonstrate how this feature works. I will deploy a StatefulSet that has 5 replicas, with `maxUnavailable` set to 2 and `partition` set to 0.

I can trigger a rolling update by changing the image to `registry.k8s.io/nginx-slim:0.9`. Once I initiate the rolling update, I can watch the pods update 2 at a time, as the current value of `maxUnavailable` is 2. The output below shows a span of time and is not complete. `maxUnavailable` can be an absolute number (for example, 2) or a percentage of desired Pods (for example, 10%); the absolute number is calculated from the percentage by rounding up to the nearest integer (for example, 30% of 5 replicas rounds up to 2).

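One way to make that image change (a sketch; any method of updating the Pod template works) is `kubectl set image`, using the StatefulSet and container names from the manifest above:

```
kubectl set image statefulset/web nginx=registry.k8s.io/nginx-slim:0.9
```
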
```
kubectl get pods --watch
```

```
NAME    READY   STATUS              RESTARTS   AGE
web-0   1/1     Running             0          85s
web-1   1/1     Running             0          2m6s
web-2   1/1     Running             0          106s
web-3   1/1     Running             0          2m47s
web-4   1/1     Running             0          2m27s
web-4   1/1     Terminating         0          5m43s   ----> start terminating 4
web-3   1/1     Terminating         0          6m3s    ----> start terminating 3
web-3   0/1     Terminating         0          6m7s
web-3   0/1     Pending             0          0s
web-3   0/1     Pending             0          0s
web-4   0/1     Terminating         0          5m48s
web-4   0/1     Terminating         0          5m48s
web-3   0/1     ContainerCreating   0          2s
web-3   1/1     Running             0          2s
web-4   0/1     Pending             0          0s
web-4   0/1     Pending             0          0s
web-4   0/1     ContainerCreating   0          0s
web-4   1/1     Running             0          1s
web-2   1/1     Terminating         0          5m46s   ----> start terminating 2 (only after both 4 and 3 are running)
web-1   1/1     Terminating         0          6m6s    ----> start terminating 1
web-2   0/1     Terminating         0          5m47s
web-1   0/1     Terminating         0          6m7s
web-1   0/1     Pending             0          0s
web-1   0/1     Pending             0          0s
web-1   0/1     ContainerCreating   0          1s
web-1   1/1     Running             0          2s
web-2   0/1     Pending             0          0s
web-2   0/1     Pending             0          0s
web-2   0/1     ContainerCreating   0          0s
web-2   1/1     Running             0          1s
web-0   1/1     Terminating         0          6m6s    ----> start terminating 0 (only after 2 and 1 are running)
web-0   0/1     Terminating         0          6m7s
web-0   0/1     Pending             0          0s
web-0   0/1     Pending             0          0s
web-0   0/1     ContainerCreating   0          0s
web-0   1/1     Running             0          1s
```

Note that as soon as the rolling update starts, both pod 4 and pod 3 (the two highest-ordinal pods) start terminating at the same time. Pods with ordinals 4 and 3 may become ready at their own pace. As soon as both pods 4 and 3 are ready, pods 2 and 1 start terminating at the same time. When pods 2 and 1 are both running and ready, pod 0 starts terminating.

In Kubernetes, updates to StatefulSets follow a strict ordering when updating Pods. In this example, the update starts at replica 4, then replica 3, then replica 2, and so on, one pod at a time. When going one pod at a time, it's not possible for 3 to be running and ready before 4. When `maxUnavailable` is more than 1 (in the example scenario I set `maxUnavailable` to 2), it is possible that replica 3 becomes ready and running before replica 4 is ready, and that is ok. If you're a developer and you set `maxUnavailable` to more than 1, you should know that this outcome is possible and you must ensure that your application is able to handle such ordering issues, if any occur. When you set `maxUnavailable` greater than 1, the ordering is still guaranteed between each batch of pods being updated. That guarantee means that pods in the second update batch (replicas 2 and 1) cannot start updating until the pods from the first batch (replicas 4 and 3) are ready.

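If you want to verify which pods have already moved to the new revision while the batches roll out, one option (a sketch, not part of the original walkthrough) is to list the pods along with the `controller-revision-hash` label that the StatefulSet controller stamps on each pod:

```
kubectl get pods -l app=nginx -L controller-revision-hash
```
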
Although Kubernetes refers to these as _replicas_, your stateful application may have a different view, and each pod of the StatefulSet may be holding completely different data than the other pods. The important thing here is that updates to StatefulSets happen in batches, and you can now have a batch size larger than 1 (as an alpha feature).

Also note that the behavior above applies with `podManagementPolicy: OrderedReady`. If you defined a StatefulSet with `podManagementPolicy: Parallel`, not only are `maxUnavailable` replicas terminated at the same time, but `maxUnavailable` replicas also start in the `ContainerCreating` phase at the same time. This is called bursting.

So, now you may have questions such as:

- What is the behavior when you set `podManagementPolicy: Parallel`?
- What is the behavior when you set `partition` to a value other than `0`?

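One way to explore both questions is to experiment yourself. Here is an illustrative fragment of a StatefulSet `spec` combining both settings; the particular values are my own, and keep in mind that `podManagementPolicy` can only be set when the StatefulSet is created:

```
spec:
  podManagementPolicy: Parallel   # immutable; set it when you create the StatefulSet
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2
      partition: 3   # only pods with an ordinal >= 3 are updated
```
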
It might be better to try it and see for yourself. This is an alpha feature, and the Kubernetes contributors are looking for feedback on it. Did this help you achieve your stateful scenarios? Did you find a bug, or do you think the behavior as implemented is not intuitive or can break applications or catch them by surprise? Please [open an issue](https://github.com/kubernetes/kubernetes/issues) to let us know.

## Further reading and next steps {#next-steps}

- [Maximum unavailable Pods](/docs/concepts/workloads/controllers/statefulset/#maximum-unavailable-pods)
- [KEP for MaxUnavailable for StatefulSet](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/961-maxunavailable-for-statefulset)
- [Implementation](https://github.com/kubernetes/kubernetes/pull/82162/files)
- [Enhancement Tracking Issue](https://github.com/kubernetes/enhancements/issues/961)