---
reviewers:
- davidopp
title: Troubleshoot Clusters
content_type: concept
---

<!-- overview -->
This doc is about cluster troubleshooting; we assume you have already ruled out your application as the root cause of the problem you are experiencing. See the [application troubleshooting guide](/docs/tasks/debug-application-cluster/debug-application) for tips on application debugging. You may also visit the [troubleshooting document](/docs/troubleshooting/) for more information.

<!-- body -->
## Listing your cluster

The first thing to debug in your cluster is if your nodes are all registered correctly.

Run:

```shell
kubectl get nodes
```

Verify that all of the nodes you expect to see are present and that they are all in the `Ready` state.
To get detailed information about the overall health of your cluster, you can run:

```shell
kubectl cluster-info dump
```
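If a node is missing from the list or is not in the `Ready` state, you can dig into it further with `kubectl describe` (substitute a node name from your own cluster for the placeholder):

```shell
# Show conditions, capacity, allocated resources, and recent events for one node
kubectl describe node <node-name>
```

The `Conditions` section of the output is usually the quickest pointer to why a node is unhealthy.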
## Looking at logs

For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locations of the relevant log files. (Note that on systemd-based systems, you may need to use `journalctl` instead.)

### Master

* `/var/log/kube-apiserver.log` - API Server, responsible for serving the API
* `/var/log/kube-scheduler.log` - Scheduler, responsible for making scheduling decisions
* `/var/log/kube-controller-manager.log` - Controller manager, responsible for running the controllers that regulate cluster state (for example, replication controllers)

### Worker Nodes

* `/var/log/kubelet.log` - Kubelet, responsible for running containers on the node
* `/var/log/kube-proxy.log` - Kube Proxy, responsible for service load balancing
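On systemd-based systems, for example, the kubelet's logs are typically read from the journal rather than from a file (run these on the node in question):

```shell
# Read kubelet logs from the systemd journal
journalctl -u kubelet
# Limit the output to recent entries
journalctl -u kubelet --since "1 hour ago"
```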
## A general overview of cluster failure modes

This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.

### Root causes:
- VM(s) shutdown
- Network partition within cluster, or between cluster and users
- Crashes in Kubernetes software
- Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
- Operator error, for example misconfigured Kubernetes software or application software

### Specific scenarios:
- Apiserver VM shutdown or apiserver crashing
  - Results
    - unable to stop, update, or start new pods, services, replication controllers
    - existing pods and services should continue to work normally, unless they depend on the Kubernetes API
- Apiserver backing storage lost
  - Results
    - apiserver should fail to come up
    - kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
    - manual recovery or recreation of apiserver state necessary before apiserver is restarted
- Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
  - currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
  - in future, these will be replicated as well and may not be co-located
  - they do not have their own persistent state
- Individual node (VM or physical machine) shuts down
  - Results
    - pods on that Node stop running
- Network partition
  - Results
    - partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down. (Assuming the master VM ends up in partition A.)
- Kubelet software fault
  - Results
    - crashing kubelet cannot start new pods on the node
    - kubelet might delete the pods or not
    - node marked unhealthy
    - replication controllers start new pods elsewhere
- Cluster operator error
  - Results
    - loss of pods, services, etc
    - loss of apiserver backing store
    - users unable to read API
    - etc.
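When working through scenarios like these, it can help to first confirm whether the apiserver is reachable and healthy from your vantage point; one way is to query its `/healthz` endpoint through `kubectl`:

```shell
# Ask the apiserver for a verbose health report, check by check
kubectl get --raw='/healthz?verbose'
```

If this call fails outright, you are likely looking at an apiserver outage or a network partition between you and the control plane rather than a node-level problem.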
### Mitigations:
- Action: Use IaaS provider's automatic VM restarting feature for IaaS VMs
  - Mitigates: Apiserver VM shutdown or apiserver crashing
  - Mitigates: Supporting services VM shutdown or crashes

- Action: Use IaaS provider's reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd
  - Mitigates: Apiserver backing storage lost
- Action: Use [high-availability](/docs/admin/high-availability) configuration
  - Mitigates: Control plane node shutdown or control plane components (scheduler, API server, controller-manager) crashing
    - Will tolerate one or more simultaneous node or component failures
  - Mitigates: API server backing storage (i.e., etcd's data directory) lost
    - Assumes HA (highly-available) etcd configuration
- Action: Snapshot apiserver PDs/EBS-volumes periodically
  - Mitigates: Apiserver backing storage lost
  - Mitigates: Some cases of operator error
  - Mitigates: Some cases of Kubernetes software fault
- Action: Use replication controllers and services in front of pods
  - Mitigates: Node shutdown
  - Mitigates: Kubelet software fault

- Action: Design applications (containers) to tolerate unexpected restarts
  - Mitigates: Node shutdown
  - Mitigates: Kubelet software fault
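For the snapshot mitigation above: if you manage etcd yourself, you can also snapshot etcd's data directly. A minimal sketch, assuming the etcd v3 API and kubeadm-style certificate paths (both are assumptions; adjust the endpoints, certificate locations, and output path for your environment):

```shell
# Take a point-in-time snapshot of etcd's keyspace (run where etcdctl can reach etcd)
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-snapshot.db
```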
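The controller-plus-service mitigation above can be sketched with a Deployment (the modern successor to replication controllers); the `hostnames` name, image, and port here are purely illustrative:

```shell
# Run pods under a controller so they are rescheduled if a node fails
kubectl create deployment hostnames --image=registry.k8s.io/serve_hostname
kubectl scale deployment hostnames --replicas=3
# Put a Service in front so clients are not tied to any single pod
kubectl expose deployment hostnames --port=9376
```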