From 6c6ace0a33a8d59c46cbbf85c7bec60f0d8d6e13 Mon Sep 17 00:00:00 2001
From: Dave Chen
Date: Mon, 13 Nov 2023 16:00:59 +0800
Subject: [PATCH] Update troubleshooting to include the issue of etcd upgrade

For the issue which was reported recently: https://github.com/kubernetes/kubeadm/issues/2957
We should provide some tips to work around this known issue.

Signed-off-by: Dave Chen
---
 .../tools/kubeadm/troubleshooting-kubeadm.md  | 79 +++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/content/en/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm.md b/content/en/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm.md
index fe3f08e578..23b3c978f6 100644
--- a/content/en/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm.md
+++ b/content/en/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm.md
@@ -431,3 +431,82 @@ See [Enabling signed kubelet serving certificates](/docs/tasks/administer-cluste
 to understand how to configure the kubelets in a kubeadm cluster to have properly signed serving certificates.
 
 Also see [How to run the metrics-server securely](https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md#how-to-run-metrics-server-securely).
+
+## Upgrade fails due to etcd hash not changing
+
+This issue only applies when upgrading a control plane node with a kubeadm binary v1.28.3 or later,
+where the node is currently managed by kubeadm versions v1.28.0, v1.28.1 or v1.28.2.
+
+Here is the error message you may encounter:
+```
+[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
+[upgrade/etcd] Waiting for previous etcd to become available
+I0907 10:10:09.109104 3704 etcd.go:588] [etcd] attempting to see if all cluster endpoints ([https://172.17.0.6:2379/ https://172.17.0.4:2379/ https://172.17.0.3:2379/]) are available 1/10
+[upgrade/etcd] Etcd was rolled back and is now available
+static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
+couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
+k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests
+	cmd/kubeadm/app/phases/upgrade/staticpods.go:525
+k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.upgradeComponent
+	cmd/kubeadm/app/phases/upgrade/staticpods.go:254
+k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade
+	cmd/kubeadm/app/phases/upgrade/staticpods.go:338
+...
+```
+
+The reason for this failure is that the affected versions generate an etcd manifest file with unwanted defaults in the PodSpec.
+This results in a diff during the manifest comparison, so kubeadm expects the Pod hash to change, but the kubelet never updates the hash.
+
+There are two ways to work around this issue if you see it in your cluster:
+- The etcd upgrade can be skipped between the affected versions and v1.28.3 (or later) by using:
+```shell
+kubeadm upgrade {apply|node} [version] --etcd-upgrade=false
+```
+
+This is not recommended in case a new etcd version was introduced by a later v1.28 patch version.
+
+- Before upgrading, patch the manifest for the etcd static pod to remove the problematic defaulted attributes:
+
+  ```patch
+  diff --git a/etc/kubernetes/manifests/etcd_defaults.yaml b/etc/kubernetes/manifests/etcd_origin.yaml
+  index d807ccbe0aa..46b35f00e15 100644
+  --- a/etc/kubernetes/manifests/etcd_defaults.yaml
+  +++ b/etc/kubernetes/manifests/etcd_origin.yaml
+  @@ -43,7 +43,6 @@ spec:
+           scheme: HTTP
+         initialDelaySeconds: 10
+         periodSeconds: 10
+  -      successThreshold: 1
+         timeoutSeconds: 15
+       name: etcd
+       resources:
+  @@ -59,26 +58,18 @@ spec:
+           scheme: HTTP
+         initialDelaySeconds: 10
+         periodSeconds: 10
+  -      successThreshold: 1
+         timeoutSeconds: 15
+  -    terminationMessagePath: /dev/termination-log
+  -    terminationMessagePolicy: File
+       volumeMounts:
+       - mountPath: /var/lib/etcd
+         name: etcd-data
+       - mountPath: /etc/kubernetes/pki/etcd
+         name: etcd-certs
+  -  dnsPolicy: ClusterFirst
+  -  enableServiceLinks: true
+     hostNetwork: true
+     priority: 2000001000
+     priorityClassName: system-node-critical
+  -  restartPolicy: Always
+  -  schedulerName: default-scheduler
+     securityContext:
+       seccompProfile:
+         type: RuntimeDefault
+  -  terminationGracePeriodSeconds: 30
+     volumes:
+     - hostPath:
+         path: /etc/kubernetes/pki/etcd
+  ```
+
+More information can be found in the [tracking issue](https://github.com/kubernetes/kubeadm/issues/2927) for this bug.
\ No newline at end of file