Update troubleshooting to include the etcd upgrade issue

For the recently reported issue https://github.com/kubernetes/kubeadm/issues/2957, we should provide some tips to work around this known issue.

Signed-off-by: Dave Chen <dave.chen@arm.com>
pull/44058/head
parent 0082100f78
commit 6c6ace0a33
@@ -431,3 +431,82 @@ See [Enabling signed kubelet serving certificates](/docs/tasks/administer-cluste
to understand how to configure the kubelets in a kubeadm cluster to have properly signed serving certificates.

Also see [How to run the metrics-server securely](https://github.com/kubernetes-sigs/metrics-server/blob/master/FAQ.md#how-to-run-metrics-server-securely).

## Upgrade fails due to etcd hash not changing

This issue only applies when upgrading a control plane node with a kubeadm binary v1.28.3 or later,
where the node is currently managed by kubeadm v1.28.0, v1.28.1 or v1.28.2.

Here is the error message you may encounter:

```
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
[upgrade/etcd] Waiting for previous etcd to become available
I0907 10:10:09.109104 3704 etcd.go:588] [etcd] attempting to see if all cluster endpoints ([https://172.17.0.6:2379/ https://172.17.0.4:2379/ https://172.17.0.3:2379/]) are available 1/10
[upgrade/etcd] Etcd was rolled back and is now available
static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.rollbackOldManifests
cmd/kubeadm/app/phases/upgrade/staticpods.go:525
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.upgradeComponent
cmd/kubeadm/app/phases/upgrade/staticpods.go:254
k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade
cmd/kubeadm/app/phases/upgrade/staticpods.go:338
...
```

The reason for this failure is that the affected versions generate an etcd manifest file with unwanted defaults in the PodSpec.
As a result, the manifest comparison reports a diff and kubeadm expects the Pod hash to change, but the kubelet never updates the hash.
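
To check whether a node is likely affected, you could look for the defaulted fields in the etcd manifest. The following is a minimal sketch, not part of kubeadm: `has_etcd_defaults` is a hypothetical helper name, the default path assumes the usual kubeadm manifest location, and the list of fields is taken from the patch shown later on this page.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not a kubeadm command): print any lines of the etcd
# static Pod manifest that match the defaulted fields listed in the patch on
# this page. The default path is an assumption about a standard kubeadm node.
has_etcd_defaults() {
  local manifest="${1:-/etc/kubernetes/manifests/etcd.yaml}"
  local pattern='^[[:space:]]*(successThreshold: 1|terminationMessagePath: /dev/termination-log|terminationMessagePolicy: File|dnsPolicy: ClusterFirst|enableServiceLinks: true|restartPolicy: Always|schedulerName: default-scheduler|terminationGracePeriodSeconds: 30)[[:space:]]*$'

  # grep exits with 0 when at least one defaulted field is present,
  # and with 1 when none is found.
  grep -nE "$pattern" "$manifest"
}
```

Running `has_etcd_defaults` on a control plane node prints the matching lines with their line numbers. An empty result suggests the manifest does not carry these defaults, although a match can also come from a field you set on purpose.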

There are two ways to work around this issue if you see it in your cluster:
- The etcd upgrade can be skipped between the affected versions and v1.28.3 (or later) by using:
  ```shell
  kubeadm upgrade {apply|node} [version] --etcd-upgrade=false
  ```

  This is not recommended if a later v1.28 patch version introduced a new etcd version.

- Before upgrading, patch the manifest for the etcd static Pod to remove the problematic defaulted attributes:

  ```patch
  diff --git a/etc/kubernetes/manifests/etcd_defaults.yaml b/etc/kubernetes/manifests/etcd_origin.yaml
  index d807ccbe0aa..46b35f00e15 100644
  --- a/etc/kubernetes/manifests/etcd_defaults.yaml
  +++ b/etc/kubernetes/manifests/etcd_origin.yaml
  @@ -43,7 +43,6 @@ spec:
           scheme: HTTP
         initialDelaySeconds: 10
         periodSeconds: 10
  -      successThreshold: 1
         timeoutSeconds: 15
       name: etcd
       resources:
  @@ -59,26 +58,18 @@ spec:
           scheme: HTTP
         initialDelaySeconds: 10
         periodSeconds: 10
  -      successThreshold: 1
         timeoutSeconds: 15
  -    terminationMessagePath: /dev/termination-log
  -    terminationMessagePolicy: File
       volumeMounts:
       - mountPath: /var/lib/etcd
         name: etcd-data
       - mountPath: /etc/kubernetes/pki/etcd
         name: etcd-certs
  -  dnsPolicy: ClusterFirst
  -  enableServiceLinks: true
     hostNetwork: true
     priority: 2000001000
     priorityClassName: system-node-critical
  -  restartPolicy: Always
  -  schedulerName: default-scheduler
     securityContext:
       seccompProfile:
         type: RuntimeDefault
  -  terminationGracePeriodSeconds: 30
     volumes:
     - hostPath:
         path: /etc/kubernetes/pki/etcd
  ```

More information can be found in the [tracking issue](https://github.com/kubernetes/kubeadm/issues/2927) for this bug.
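
The manual patch from the second workaround could be scripted roughly as follows. This is only a sketch under stated assumptions: `strip_etcd_defaults` is a hypothetical helper, not a kubeadm feature, and both the manifest path and the backup directory are passed in by you. The backup is written outside the manifests directory because the kubelet treats files in that directory as static Pod definitions. The script removes only lines that exactly match the key and value pairs deleted by the patch above.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not a kubeadm command): remove the defaulted fields
# listed in the patch above from an etcd static Pod manifest.
#   $1: path to the manifest, for example /etc/kubernetes/manifests/etcd.yaml
#   $2: backup directory, which should be outside the manifests directory
strip_etcd_defaults() {
  local manifest="$1" backup_dir="$2" tmp
  local pattern='^[[:space:]]*(successThreshold: 1|terminationMessagePath: /dev/termination-log|terminationMessagePolicy: File|dnsPolicy: ClusterFirst|enableServiceLinks: true|restartPolicy: Always|schedulerName: default-scheduler|terminationGracePeriodSeconds: 30)[[:space:]]*$'

  # Keep a copy of the original manifest so the change can be reverted.
  mkdir -p "$backup_dir" || return 1
  cp -p "$manifest" "$backup_dir/$(basename "$manifest").bak" || return 1

  # Filter into a temporary file outside the manifests directory. The regex
  # contains slashes, so '#' is used as the sed address delimiter.
  tmp="$(mktemp)" || return 1
  sed -E "\#${pattern}#d" "$manifest" > "$tmp" || { rm -f "$tmp"; return 1; }

  # Write the result back in place, which keeps the file's owner and mode.
  cat "$tmp" > "$manifest" || { rm -f "$tmp"; return 1; }
  rm -f "$tmp"
}
```

For example, `strip_etcd_defaults /etc/kubernetes/manifests/etcd.yaml /root/etcd-manifest-backup` would back up and clean the manifest, where the backup location is an arbitrary choice. Expect the kubelet to restart the etcd Pod once the manifest changes, and compare the result against the backup with `diff` before running the upgrade.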