---
layout: blog
title: "Kubernetes 1.27: StatefulSet Start Ordinal Simplifies Migration"
date: 2023-04-28
slug: statefulset-start-ordinal
author: >
  Peter Schuurman (Google)
---

Kubernetes v1.26 introduced a new, alpha-level feature for
[StatefulSets](/docs/concepts/workloads/controllers/statefulset/) that controls
the ordinal numbering of Pod replicas. As of Kubernetes v1.27, this feature is
now beta. Ordinals can start from arbitrary non-negative numbers. This blog post
will discuss how this feature can be used.

## Background

StatefulSet ordinals provide sequential identities for Pod replicas. When using
[`OrderedReady` Pod management](/docs/tutorials/stateful-application/basic-stateful-set/#orderedready-pod-management),
Pods are created from ordinal index `0` up to `N-1`.

With Kubernetes today, orchestrating a StatefulSet migration across clusters is
challenging. Backup and restore solutions exist, but these require the
application to be scaled down to zero replicas prior to migration. In today's
fully connected world, even planned application downtime may not allow you to
meet your business goals. You could use
[Cascading Delete](/docs/tutorials/stateful-application/basic-stateful-set/#cascading-delete)
or
[On Delete](/docs/tutorials/stateful-application/basic-stateful-set/#on-delete)
to migrate individual Pods, but this is error-prone and tedious to manage. You
lose the self-healing benefit of the StatefulSet controller when your Pods fail
or are evicted.

Kubernetes v1.26 enables a StatefulSet to be responsible for a range of ordinals
within {0..N-1} (the ordinals 0, 1, ... up to N-1). With it, you can scale down
the range {0..k-1} in a source cluster, and scale up the complementary range
{k..N-1} in a destination cluster, while maintaining application availability.
This enables you to retain *at most one* semantics (meaning there is at most one
Pod with a given identity running in a StatefulSet) and
[Rolling Update](/docs/tutorials/stateful-application/basic-stateful-set/#rolling-update)
behavior when orchestrating a migration across clusters.

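As a rough sketch of that idea (the StatefulSet name `my-app` and the split
point k=5 are made up for illustration; a full walkthrough with Redis follows
below), splitting a 6-replica StatefulSet across two clusters could look like:

```
# In the source cluster: keep ordinals {0..4} by scaling down to 5 replicas.
kubectl patch sts my-app -p '{"spec": {"replicas": 5}}'

# In the destination cluster: take over the complementary range {5..5} by
# starting ordinals at 5 with a single replica.
kubectl patch sts my-app -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
```
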
## Why would I want to use this feature?

Say you're running your StatefulSet in one cluster, and need to migrate it out
to a different cluster. There are many reasons why you would need to do this:

* **Scalability**: Your StatefulSet has scaled too large for your cluster, and
  has started to disrupt the quality of service for other workloads in your
  cluster.
* **Isolation**: You're running a StatefulSet in a cluster that is accessed
  by multiple users, and namespace isolation isn't sufficient.
* **Cluster Configuration**: You want to move your StatefulSet to a different
  cluster to use some environment that is not available on your current
  cluster.
* **Control Plane Upgrades**: You want to move your StatefulSet to a cluster
  running an upgraded control plane, and can't handle the risk or downtime of
  in-place control plane upgrades.

## How do I use it?

Enable the `StatefulSetStartOrdinal` feature gate on a cluster, and create a
StatefulSet with a customized `.spec.ordinals.start`.

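As a minimal sketch (the name `my-app` and the pause container here are
placeholders, not part of the feature), a StatefulSet whose Pods start at
ordinal `5` could be created like this:

```
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app
spec:
  serviceName: my-app
  replicas: 2
  ordinals:
    start: 5   # Pods are named my-app-5 and my-app-6 instead of my-app-0, my-app-1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
EOF
```
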
## Try it out

In this demo, I'll use the new mechanism to migrate a
StatefulSet from one Kubernetes cluster to another. The
[redis-cluster](https://github.com/bitnami/charts/tree/main/bitnami/redis-cluster)
Bitnami Helm chart will be used to install Redis.

Tools Required:

* [yq](https://github.com/mikefarah/yq)
* [helm](https://helm.sh/docs/helm/helm_install/)

### Pre-requisites {#demo-pre-requisites}

To do this, I need two Kubernetes clusters that can both access common
networking and storage; I've named my clusters `source` and `destination`.
Specifically, I need:

* The `StatefulSetStartOrdinal` feature gate enabled on both clusters.
* Client configuration for `kubectl` that lets me access both clusters as an
  administrator (see the sketch after this list).
* The same `StorageClass` installed on both clusters, and set as the default
  StorageClass for both clusters. This `StorageClass` should provision
  underlying storage that is accessible from either or both clusters.
* A flat network topology that allows for Pods to send and receive packets to
  and from Pods in either cluster. If you are creating clusters on a cloud
  provider, this configuration may be called private cloud or private network.

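The demo commands below don't repeat the cluster or namespace on every call. One
way to manage that with `kubectl` contexts (assuming your kubeconfig names the
contexts `source` and `destination`; adjust to whatever yours are called):

```
# Work against the source cluster, defaulting to the demo namespace.
kubectl config use-context source
kubectl config set-context --current --namespace=kep-3335

# Switch to the destination cluster the same way when a step calls for it.
kubectl config use-context destination
kubectl config set-context --current --namespace=kep-3335
```
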
1. Create a demo namespace on both clusters:

   ```
   kubectl create ns kep-3335
   ```

2. Deploy a Redis cluster with six replicas in the source cluster:

   ```
   helm repo add bitnami https://charts.bitnami.com/bitnami
   helm install redis --namespace kep-3335 \
     bitnami/redis-cluster \
     --set persistence.size=1Gi \
     --set cluster.nodes=6
   ```

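   It can take a few minutes for all six Pods to come up. One optional way to
   wait for the StatefulSet created by the chart (named `redis-redis-cluster`,
   as used in the later steps) to finish rolling out:

   ```
   kubectl rollout status statefulset/redis-redis-cluster --namespace kep-3335
   ```
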
3. Check the replication status in the source cluster:

   ```
   kubectl exec -it redis-redis-cluster-0 -- /bin/bash -c \
     "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
   ```

   ```
   2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 myself,master - 0 1669764411000 3 connected 10923-16383
   7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669764410000 3 connected
   961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669764411000 1 connected
   7136e37d8864db983f334b85d2b094be47c830e5 10.104.0.15:6379@16379 slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669764412595 2 connected
   a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669764411592 1 connected 0-5460
   2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669764410000 2 connected 5461-10922
   ```

4. Deploy a Redis cluster with zero replicas in the destination cluster:

   ```
   helm install redis --namespace kep-3335 \
     bitnami/redis-cluster \
     --set persistence.size=1Gi \
     --set cluster.nodes=0 \
     --set redis.extraEnvVars\[0\].name=REDIS_NODES,redis.extraEnvVars\[0\].value="redis-redis-cluster-headless.kep-3335.svc.cluster.local" \
     --set existingSecret=redis-redis-cluster
   ```

5. Scale down the `redis-redis-cluster` StatefulSet in the source cluster by 1,
   to remove the replica `redis-redis-cluster-5`:

   ```
   kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 5}}'
   ```

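   As an optional sanity check before moving on, you can confirm that the Pod is
   gone from the source cluster while its PersistentVolumeClaim (used in the
   next step) is left behind:

   ```
   # The Pod should report NotFound; the PVC should still exist.
   kubectl get pod redis-redis-cluster-5
   kubectl get pvc redis-data-redis-redis-cluster-5
   ```
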
6. Migrate dependencies from the source cluster to the destination cluster:

   The following commands copy resources from `source` to `destination`. Details
   that are not relevant in the `destination` cluster are removed (e.g. `uid`,
   `resourceVersion`, `status`).

   **Steps for the source cluster**

   Note: If using a `StorageClass` with `reclaimPolicy: Delete` configured, you
   should patch the PVs in `source` with `reclaimPolicy: Retain` prior to
   deletion to retain the underlying storage used in `destination`. See
   [Change the Reclaim Policy of a PersistentVolume](/docs/tasks/administer-cluster/change-pv-reclaim-policy/)
   for more details.

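   As a sketch of that patch for the replica being migrated (the `PV_NAME`
   variable is just for illustration):

   ```
   # Look up the PV bound to the replica's PVC and mark it Retain, so that
   # removing the PVC in the source cluster doesn't delete the underlying storage.
   PV_NAME=$(kubectl get pvc redis-data-redis-redis-cluster-5 -o jsonpath='{.spec.volumeName}')
   kubectl patch pv "$PV_NAME" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
   ```

   With that done, export the PVC, PV and Secret from `source`:
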
   ```
   kubectl get pvc redis-data-redis-redis-cluster-5 -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .status)' > /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
   kubectl get pv $(yq '.spec.volumeName' /tmp/pvc-redis-data-redis-redis-cluster-5.yaml) -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.annotations, .metadata.finalizers, .spec.claimRef, .status)' > /tmp/pv-redis-data-redis-redis-cluster-5.yaml
   kubectl get secret redis-redis-cluster -o yaml | yq 'del(.metadata.uid, .metadata.resourceVersion)' > /tmp/secret-redis-redis-cluster.yaml
   ```

   **Steps for the destination cluster**

   Note: For the PV/PVC, this procedure only works if the underlying storage system
   that your PVs use can support being copied into `destination`. Storage
   that is associated with a specific node or topology may not be supported.
   Additionally, some storage systems may store additional metadata about
   volumes outside of a PV object, and may require a more specialized
   sequence to import a volume.

   ```
   kubectl create -f /tmp/pv-redis-data-redis-redis-cluster-5.yaml
   kubectl create -f /tmp/pvc-redis-data-redis-redis-cluster-5.yaml
   kubectl create -f /tmp/secret-redis-redis-cluster.yaml
   ```

7. Scale up the `redis-redis-cluster` StatefulSet in the destination cluster by
   1, with a start ordinal of 5:

   ```
   kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 5}, "replicas": 1}}'
   ```

8. Check the replication status in the destination cluster:

   ```
   kubectl exec -it redis-redis-cluster-5 -- /bin/bash -c \
     "redis-cli -c -h redis-redis-cluster -a $(kubectl get secret redis-redis-cluster -o jsonpath="{.data.redis-password}" | base64 -d) CLUSTER NODES;"
   ```

   I should see that the new replica (labeled `myself`) has joined the Redis
   cluster (the IP address belongs to a different CIDR block than the
   replicas in the source cluster).

   ```
   2cff613d763b22c180cd40668da8e452edef3fc8 10.104.0.17:6379@16379 master - 0 1669766684000 2 connected 5461-10922
   7136e37d8864db983f334b85d2b094be47c830e5 10.108.0.22:6379@16379 myself,slave 2cff613d763b22c180cd40668da8e452edef3fc8 0 1669766685609 2 connected
   2ce30362c188aabc06f3eee5d92892d95b1da5c3 10.104.0.14:6379@16379 master - 0 1669766684000 3 connected 10923-16383
   961f35e37c4eea507cfe12f96e3bfd694b9c21d4 10.104.0.18:6379@16379 slave a8765caed08f3e185cef22bd09edf409dc2bcc61 0 1669766683600 1 connected
   a8765caed08f3e185cef22bd09edf409dc2bcc61 10.104.0.19:6379@16379 master - 0 1669766685000 1 connected 0-5460
   7743661f60b6b17b5c71d083260419588b4f2451 10.104.0.16:6379@16379 slave 2ce30362c188aabc06f3eee5d92892d95b1da5c3 0 1669766686613 3 connected
   ```

9. Repeat steps #5 to #7 for the remainder of the replicas, until the
   Redis StatefulSet in the source cluster is scaled to 0, and the Redis
   StatefulSet in the destination cluster is healthy with 6 total replicas.

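   For example, the next iteration (moving ordinal 4) would look roughly like
   the following sketch; the PVC and PV copy from step #6 also has to be
   repeated for `redis-data-redis-redis-cluster-4` (the Secret only needs to be
   copied once):

   ```
   # Source cluster: scale down to 4 replicas, removing redis-redis-cluster-4.
   kubectl patch sts redis-redis-cluster -p '{"spec": {"replicas": 4}}'

   # Destination cluster: extend the owned range down to ordinal 4, i.e. {4..5}.
   kubectl patch sts redis-redis-cluster -p '{"spec": {"ordinals": {"start": 4}, "replicas": 2}}'
   ```
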
## What's Next?

This feature provides a building block for a StatefulSet to be split up across
clusters, but does not prescribe the mechanism as to how the StatefulSet should
be migrated. Migration requires coordination of StatefulSet replicas, along with
orchestration of the storage and network layer. This is dependent on the storage
and connectivity requirements of the application installed by the StatefulSet.
Additionally, many StatefulSets are managed by
[operators](/docs/concepts/extend-kubernetes/operator/), which adds another
layer of complexity to migration.

If you're interested in building enhancements to make these processes easier,
get involved with
[SIG Multicluster](https://github.com/kubernetes/community/blob/master/sig-multicluster)
to contribute!