---
layout: blog
title: "Kubernetes 1.27: StatefulSet Start Ordinal Simplifies Migration"
date: 2023-04-28
slug: statefulset-start-ordinal
author: >
  Peter Schuurman (Google)
---

Kubernetes v1.26 introduced a new, alpha-level feature for
[StatefulSets](/docs/concepts/workloads/controllers/statefulset/) that controls
the ordinal numbering of Pod replicas. As of Kubernetes v1.27, this feature is
now beta. Ordinals can start from arbitrary non-negative numbers. This blog post
will discuss how this feature can be used.

## Background

StatefulSet ordinals provide sequential identities for Pod replicas. When using
[`OrderedReady` Pod management](/docs/tutorials/stateful-application/basic-stateful-set/#orderedready-pod-management),
Pods are created from ordinal index `0` up to `N-1`.

With Kubernetes today, orchestrating a StatefulSet migration across clusters is
challenging. Backup and restore solutions exist, but these require the
application to be scaled down to zero replicas prior to migration. In today's
fully connected world, even planned application downtime may not allow you to
meet your business goals. You could use
[Cascading Delete](/docs/tutorials/stateful-application/basic-stateful-set/#cascading-delete)
or
[On Delete](/docs/tutorials/stateful-application/basic-stateful-set/#on-delete)
to migrate individual Pods, but this is error-prone and tedious to manage. You
lose the self-healing benefit of the StatefulSet controller when your Pods fail
or are evicted.

Kubernetes v1.26 enables a StatefulSet to be responsible for a range of ordinals
within {0..N-1} (the ordinals 0, 1, ... up to N-1). With it, you can scale down
the range {0..k-1} in a source cluster, and scale up the complementary range
{k..N-1} in a destination cluster, while maintaining application availability.
This enables you to retain *at most one* semantics (meaning there is at most one
Pod with a given identity running in a StatefulSet) and
[Rolling Update](/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update)
behavior when orchestrating a migration across clusters.

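As a rough sketch of that idea (the StatefulSet name `my-app` and the split
point k=5 are made up for illustration; a full walkthrough with Redis follows
below), splitting a 6-replica StatefulSet across two clusters could look like:

```
# In the source cluster: keep ordinals {0..4} by scaling down to 5 replicas.
kubectl patch sts my-app -p '{"spec": {"replicas": 5}}'

# In the destination cluster: take over the complementary range {5..5} by
# starting ordinals at 5 with a single replica.
kubectl patch sts my-app -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
```
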
## Why would I want to use this feature?

Say you're running your StatefulSet in one cluster, and need to migrate it out
to a different cluster. There are many reasons why you would need to do this:

* **Scalability**: Your StatefulSet has scaled too large for your cluster, and
  has started to disrupt the quality of service for other workloads in your
  cluster.
* **Isolation**: You're running a StatefulSet in a cluster that is accessed
  by multiple users, and namespace isolation isn't sufficient.
* **Cluster Configuration**: You want to move your StatefulSet to a different
  cluster to use some environment that is not available on your current
  cluster.
* **Control Plane Upgrades**: You want to move your StatefulSet to a cluster
  running an upgraded control plane, and can't handle the risk or downtime of
  in-place control plane upgrades.

## How do I use it?

Enable the `StatefulSetStartOrdinal` feature gate on a cluster, and create a
StatefulSet with a customized `.spec.ordinals.start`.

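As a minimal sketch (the name `my-app` and the pause container here are
placeholders, not part of the feature), a StatefulSet whose Pods start at
ordinal `5` could be created like this:

```
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app
spec:
  serviceName: my-app
  replicas: 2
  ordinals:
    start: 5   # Pods are named my-app-5 and my-app-6 instead of my-app-0, my-app-1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
EOF
```
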
## Try it out

In this demo, I'll use the new mechanism to migrate a
StatefulSet from one Kubernetes cluster to another. The
[redis-cluster](https://github.com/bitnami/charts/tree/main/bitnami/redis-cluster)
Bitnami Helm chart will be used to install Redis.

Tools Required:

* [yq](https://github.com/mikefarah/yq)
* [helm](https://helm.sh/docs/helm/helm_install/)

### Pre-requisites {#demo-pre-requisites}

To do this, I need two Kubernetes clusters that can both access common
networking and storage; I've named my clusters `source` and `destination`.
Specifically, I need:

* The `StatefulSetStartOrdinal` feature gate enabled on both clusters.
* Client configuration for `kubectl` that lets me access both clusters as an
  administrator (see the sketch after this list).
* The same `StorageClass` installed on both clusters, and set as the default
  StorageClass for both clusters. This `StorageClass` should provision
  underlying storage that is accessible from either or both clusters.
* A flat network topology that allows for Pods to send and receive packets to
  and from Pods in either cluster. If you are creating clusters on a cloud
  provider, this configuration may be called private cloud or private network.

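The demo commands below don't repeat the cluster or namespace on every call. One
way to manage that with `kubectl` contexts (assuming your kubeconfig names the
contexts `source` and `destination`; adjust to whatever yours are called):

```
# Work against the source cluster, defaulting to the demo namespace.
kubectl config use-context source
kubectl config set-context --current --namespace=kep-3335

# Switch to the destination cluster the same way when a step calls for it.
kubectl config use-context destination
kubectl config set-context --current --namespace=kep-3335
```
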
1. Create a demo namespace on both clusters:

   ```
   kubectl create ns kep-3335
   ```

2. Deploy a Redis cluster with six replicas in the source cluster:

   ```
   helm repo add bitnami https://charts.bitnami.com/bitnami
   helm install redis --namespace kep-3335 \
     bitnami/redis-cluster \
     --set persistence.size=1Gi \
     --set cluster.nodes=6
   ```

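   It can take a few minutes for all six Pods to come up. One optional way to
   wait for the StatefulSet created by the chart (named `redis-redis-cluster`,
   as used in the later steps) to finish rolling out:

   ```
   kubectl rollout status statefulset/redis-redis-cluster --namespace kep-3335
   ```
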
3. Check the replication status in the source cluster:

   ```
   kubectl exec -it redis-redis-cluster-0 -- /bin/bash -c \
     "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
   ```

   ```
   2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 myself,master - 0 1669764411000 3 connected 10923-16383
   7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669764410000 3 connected
   961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669764411000 1 connected
   7136e37d8864db983f334b85d2b094be47c830e5 10.104.0.15:6379@16379 slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669764412595 2 connected
   a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669764411592 1 connected 0-5460
   2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669764410000 2 connected 5461-10922
   ```

4. Deploy a Redis cluster with zero replicas in the destination cluster:

   ```
   helm install redis --namespace kep-3335 \
     bitnami/redis-cluster \
     --set persistence.size=1Gi \
     --set cluster.nodes=0 \
     --set redis.extraEnvVars\[0\].name=REDIS_NODES,redis.extraEnvVars\[0\].value="redis-redis-cluster-headless.kep-3335.svc.cluster.local" \
     --set existingSecret=redis-redis-cluster
   ```

5. Scale down the `redis-redis-cluster` StatefulSet in the source cluster by 1,
   to remove the replica `redis-redis-cluster-5`:

   ```
   kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 5}}'
   ```

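   As an optional sanity check before moving on, you can confirm that the Pod is
   gone from the source cluster while its PersistentVolumeClaim (used in the
   next step) is left behind:

   ```
   # The Pod should report NotFound; the PVC should still exist.
   kubectl get pod redis-redis-cluster-5
   kubectl get pvc redis-data-redis-redis-cluster-5
   ```
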
6. Migrate dependencies from the source cluster to the destination cluster:

   The following commands copy resources from `source` to `destination`. Details
   that are not relevant in the `destination` cluster are removed (e.g. `uid`,
   `resourceVersion`, `status`).

   **Steps for the source cluster**

   Note: If using a `StorageClass` with `reclaimPolicy: Delete` configured, you
   should patch the PVs in `source` with `reclaimPolicy: Retain` prior to
   deletion to retain the underlying storage used in `destination`. See
   [Change the Reclaim Policy of a PersistentVolume](/docs/tasks/administer-cluster/change-pv-reclaim-policy/)
   for more details.

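   As a sketch of that patch for the replica being migrated (the `PV_NAME`
   variable is just for illustration):

   ```
   # Look up the PV bound to the replica's PVC and mark it Retain, so that
   # removing the PVC in the source cluster doesn't delete the underlying storage.
   PV_NAME=$(kubectl get pvc redis-data-redis-redis-cluster-5 -o jsonpath='{.spec.volumeName}')
   kubectl patch pv "$PV_NAME" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
   ```

   With that done, export the PVC, PV and Secret from `source`:
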
   ```
   kubectl get pvc redis-data-redis-redis-cluster-5 -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .status)' > /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
   kubectl get pv $(yq '.spec.volumeName' /tmp/pvc-redis-data-redis-redis-cluster-5.yaml) -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .spec.claimRef, .status)' > /tmp/pv-redis-data-redis-redis-cluster-5.yaml
   kubectl get secret redis-redis-cluster -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion)' > /tmp/secret-redis-redis-cluster.yaml
   ```

   **Steps for the destination cluster**

   Note: For the PV/PVC, this procedure only works if the underlying storage system
   that your PVs use can support being copied into `destination`. Storage
   that is associated with a specific node or topology may not be supported.
   Additionally, some storage systems may store additional metadata about
   volumes outside of a PV object, and may require a more specialized
   sequence to import a volume.

   ```
   kubectl create -f /tmp/pv-redis-data-redis-redis-cluster-5.yaml
   kubectl create -f /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
   kubectl create -f /tmp/secret-redis-redis-cluster.yaml
   ```

7. Scale up the `redis-redis-cluster` StatefulSet in the destination cluster by
   1, with a start ordinal of 5:

   ```
   kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
   ```

8. Check the replication status in the destination cluster:

   ```
   kubectl exec -it redis-redis-cluster-5 -- /bin/bash -c \
     "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
   ```

   I should see that the new replica (labeled `myself`) has joined the Redis
   cluster (the IP address belongs to a different CIDR block than the
   replicas in the source cluster).

   ```
   2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669766684000 2 connected 5461-10922
   7136e37d8864db983f334b85d2b094be47c830e5 10.108.0.22:6379@16379 myself,slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669766685609 2 connected
   2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 master - 0 1669766684000 3 connected 10923-16383
   961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669766683600 1 connected
   a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669766685000 1 connected 0-5460
   7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669766686613 3 connected
   ```

9. Repeat steps #5 to #7 for the remainder of the replicas, until the
   Redis StatefulSet in the source cluster is scaled to 0, and the Redis
   StatefulSet in the destination cluster is healthy with 6 total replicas.

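   For example, the next iteration (moving ordinal 4) would look roughly like
   the following sketch; the PVC and PV copy from step #6 also has to be
   repeated for `redis-data-redis-redis-cluster-4` (the Secret only needs to be
   copied once):

   ```
   # Source cluster: scale down to 4 replicas, removing redis-redis-cluster-4.
   kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 4}}'

   # Destination cluster: extend the owned range down to ordinal 4, i.e. {4..5}.
   kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 4}, "replicas": 2}}'
   ```
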
## What's Next?

This feature provides a building block for a StatefulSet to be split up across
clusters, but does not prescribe the mechanism as to how the StatefulSet should
be migrated. Migration requires coordination of StatefulSet replicas, along with
orchestration of the storage and network layer. This is dependent on the storage
and connectivity requirements of the application installed by the StatefulSet.
Additionally, many StatefulSets are managed by
[operators](/docs/concepts/extend-kubernetes/operator/), which adds another
layer of complexity to migration.

If you're interested in building enhancements to make these processes easier,
get involved with
[SIG Multicluster](https://github.com/kubernetes/community/blob/master/sig-multicluster)
to contribute!