---
reviewers:
- davidopp
title: Troubleshoot Clusters
content_type: concept
---

<!-- overview -->
This doc is about cluster troubleshooting; we assume you have already ruled out your application as the root cause of the problem you are experiencing. See the [application troubleshooting guide](/docs/tasks/debug-application-cluster/debug-application) for tips on application debugging. You may also visit the [troubleshooting document](/docs/troubleshooting/) for more information.

<!-- body -->
## Listing your cluster

The first thing to debug in your cluster is if your nodes are all registered correctly.

Run:

```shell
kubectl get nodes
```

Verify that all of the nodes you expect to see are present and that they are all in the `Ready` state.
To get detailed information about the overall health of your cluster, you can run:

```shell
kubectl cluster-info dump
```
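If a node is missing from the list or is not in the `Ready` state, you can dig into it further with `kubectl describe` (substitute a node name from your own cluster for the placeholder):

```shell
# Show conditions, capacity, allocated resources, and recent events for one node
kubectl describe node <node-name>
```

The `Conditions` section of the output is usually the quickest pointer to why a node is unhealthy.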
## Looking at logs

For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations of the relevant log files. (Note that on systemd-based systems, you may need to use `journalctl` instead.)

### Master

* `/var/log/kube-apiserver.log` - API Server, responsible for serving the API
* `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
* `/var/log/kube-controller-manager.log` - Controller manager, responsible for running the controllers that regulate cluster state (for example, replication controllers)

### Worker Nodes

* `/var/log/kubelet.log` - Kubelet, responsible for running containers on the node
* `/var/log/kube-proxy.log` - Kube Proxy, responsible for service load balancing
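On systemd-based systems, for example, the kubelet's logs are typically read from the journal rather than from a file (run these on the node in question):

```shell
# Read kubelet logs from the systemd journal
journalctl -u kubelet
# Limit the output to recent entries
journalctl -u kubelet --since "1 hour ago"
```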
## A general overview of cluster failure modes

This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.

### Root causes:
- VM(s) shutdown
- Network partition within cluster, or between cluster and users
- Crashes in Kubernetes software
- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
- Operator error, for example misconfigured Kubernetes software or application software

### Specific scenarios:
- Apiserver VM shutdown or apiserver crashing
  - Results
    - unable to stop, update, or start new pods, services, replication controllers
    - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
- Apiserver backing storage lost
  - Results
    - apiserver should fail to come up
    - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
    - manual recovery or recreation of apiserver state necessary before apiserver is restarted
- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
  - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
  - in future, these will be replicated as well and may not be co-located
  - they do not have their own persistent state
- Individual node (VM or physical machine) shuts down
  - Results
    - pods on that Node stop running
- Network partition
  - Results
    - partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
- Kubelet software fault
  - Results
    - crashing kubelet cannot start new pods on the node
    - kubelet might delete the pods or not
    - node marked unhealthy
    - replication controllers start new pods elsewhere
- Cluster operator error
  - Results
    - loss of pods, services, etc
    - loss of apiserver backing store
    - users unable to read API
    - etc.
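When working through scenarios like these, it can help to first confirm whether the apiserver is reachable and healthy from your vantage point; one way is to query its `/healthz` endpoint through `kubectl`:

```shell
# Ask the apiserver for a verbose health report, check by check
kubectl get --raw='/healthz?verbose'
```

If this call fails outright, you are likely looking at an apiserver outage or a network partition between you and the control plane rather than a node-level problem.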
### Mitigations:
- Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
  - Mitigates: Apiserver VM shutdown or apiserver crashing
  - Mitigates: Supporting services VM shutdown or crashes

- Action: Use IaaS provider's reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd
  - Mitigates: Apiserver backing storage lost
- Action: Use [high-availability](/docs/admin/high-availability) configuration
  - Mitigates: Control plane node shutdown or control plane components (scheduler, API server, controller-manager) crashing
    - Will tolerate one or more simultaneous node or component failures
  - Mitigates: API server backing storage (i.e., etcd's data directory) lost
    - Assumes HA (highly-available) etcd configuration
- Action: Snapshot apiserver PDs/EBS-volumes periodically
  - Mitigates: Apiserver backing storage lost
  - Mitigates: Some cases of operator error
  - Mitigates: Some cases of Kubernetes software fault
- Action: Use replication controllers and services in front of pods
  - Mitigates: Node shutdown
  - Mitigates: Kubelet software fault

- Action: Design applications (containers) to tolerate unexpected restarts
  - Mitigates: Node shutdown
  - Mitigates: Kubelet software fault
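For the snapshot mitigation above: if you manage etcd yourself, you can also snapshot etcd's data directly. A minimal sketch, assuming the etcd v3 API and kubeadm-style certificate paths (both are assumptions; adjust the endpoints, certificate locations, and output path for your environment):

```shell
# Take a point-in-time snapshot of etcd's keyspace (run where etcdctl can reach etcd)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-snapshot.db
```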
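The controller-plus-service mitigation above can be sketched with a Deployment (the modern successor to replication controllers); the `hostnames` name, image, and port here are purely illustrative:

```shell
# Run pods under a controller so they are rescheduled if a node fails
kubectl create deployment hostnames --image=registry.k8s.io/serve_hostname
kubectl scale deployment hostnames --replicas=3
# Put a Service in front so clients are not tied to any single pod
kubectl expose deployment hostnames --port=9376
```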