---
assignees:
- Random-Liu
- dchen1107
title: Monitoring Node Health
---
* TOC
{:toc}

## Node Problem Detector
*Node problem detector* is a [DaemonSet](/docs/admin/daemons/) that monitors
node health. It collects node problems from various daemons and reports them
to the apiserver as [NodeCondition](/docs/admin/node/#node-condition) and
[Event](/docs/api-reference/v1/definitions/#_v1_event).

It currently detects some known kernel issues, and will detect more and more
node problems over time.

Currently Kubernetes does not take any action on the node conditions and events
generated by node problem detector. In the future, a remedy system could be
introduced to deal with node problems.

See more information in the
[node-problem-detector repository](https://github.com/kubernetes/node-problem-detector).
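Once the detector is running, the conditions and events it reports show up on the
node object. For example, a quick way to check (replace `<node-name>` with one of
your nodes):

```shell
# Show the node's conditions and recent events, including any reported by
# node problem detector
kubectl describe node <node-name>
```
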
## Limitations

* The kernel issue detection of node problem detector only supports file-based
kernel logs now. It does not support log tools such as journald.
* The kernel issue detection of node problem detector makes assumptions about the
kernel log format, and currently it only works on Ubuntu and Debian. However, it is
easy to extend it to [support other log formats](/docs/admin/node-problem/#support-other-log-format).
## Enable/Disable in GCE cluster

Node problem detector is [running as a cluster addon](/docs/admin/cluster-large/#addon-resources)
enabled by default in GCE clusters.

You can enable or disable it by setting the environment variable
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before running `kube-up.sh`.
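For example, to bring up a cluster with the addon explicitly disabled (a sketch;
the path to `kube-up.sh` depends on where you run it from):

```shell
# Disable node problem detector when bringing up the cluster
KUBE_ENABLE_NODE_PROBLEM_DETECTOR=false ./cluster/kube-up.sh

# Or enable it explicitly (this is already the default)
KUBE_ENABLE_NODE_PROBLEM_DETECTOR=true ./cluster/kube-up.sh
```
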
## Use in Other Environment

To enable node problem detector in environments outside of GCE, you can use
either `kubectl` or an addon pod.

### Kubectl

This is the recommended way to start node problem detector outside of GCE. It
provides more flexible management, such as overwriting the default
configuration to fit your environment or to detect customized node problems.

* **Step 1:** Create `node-problem-detector.yaml`:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
```
***Notice that you should make sure the system log directory is right for your
OS distro.***

* **Step 2:** Start node problem detector with `kubectl`:

```shell
kubectl create -f node-problem-detector.yaml
```
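You can then check that the DaemonSet pods are running on each node (a quick
verification; the label selector matches the manifest above):

```shell
kubectl get pods --namespace=kube-system -l k8s-app=node-problem-detector
```
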
### Addon Pod

This is for users who have their own cluster bootstrap solution and do not need
to overwrite the default configuration. They can leverage the addon pod to
further automate the deployment.

Just create `node-problem-detector.yaml`, and put it under the addon pods directory
`/etc/kubernetes/addons/node-problem-detector` on the master node.
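For example (a sketch, run on the master node; the directory is the one given above):

```shell
# On the master node: place the manifest where the addon manager picks it up
mkdir -p /etc/kubernetes/addons/node-problem-detector
cp node-problem-detector.yaml /etc/kubernetes/addons/node-problem-detector/
```
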
## Overwrite the Configuration

The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when building the Docker image of node problem detector.

However, you can use a [ConfigMap](/docs/user-guide/configmap/) to overwrite it
by following these steps:

* **Step 1:** Change the config files in `config/`.
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with
`kubectl create configmap node-problem-detector-config --from-file=config/`.
* **Step 3:** Change `node-problem-detector.yaml` to use the ConfigMap:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config
```
* **Step 4:** Re-create the node problem detector with the new yaml file:

```shell
kubectl delete -f node-problem-detector.yaml # If you have a node-problem-detector running
kubectl create -f node-problem-detector.yaml
```

***Notice that this approach only applies to node problem detector started with `kubectl`.***

For node problem detector running as a cluster addon, configuration overwriting is not
supported yet, because the addon manager does not support ConfigMap.
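After recreating the DaemonSet, you can confirm the ConfigMap exists and is referenced
(a quick verification; the names match the manifest above):

```shell
kubectl get configmap node-problem-detector-config --namespace=kube-system
kubectl describe daemonset node-problem-detector-v0.1 --namespace=kube-system
```
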
## Kernel Monitor

*Kernel Monitor* is a problem daemon in node problem detector. It monitors the kernel log
and detects known kernel issues following predefined rules.

The Kernel Monitor matches kernel issues against a predefined rule list in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible, and you can always extend it by overwriting the
configuration.
### Add New NodeConditions

To support new node conditions, you can extend the `conditions` field in
`config/kernel-monitor.json` with a new condition definition:

```json
{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
```
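For instance, a filled-in condition might look like the following (an illustrative
example, not part of the default configuration):

```json
{
  "type": "ExampleProblem",
  "reason": "ExampleProblemNotHappening",
  "message": "the example problem has not been detected"
}
```
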
### Detect New Problems

To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
with a new rule definition:

```json
{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
```
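A concrete permanent rule tied to the condition sketched above might look like this
(illustrative only; the field names follow the template above):

```json
{
  "type": "permanent",
  "condition": "ExampleProblem",
  "reason": "ExampleProblemHappening",
  "message": "example kernel error: .*"
}
```
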
### Change Log Path

The kernel log may live in a different path on different OS distros. The `log`
field in `config/kernel-monitor.json` is the log path inside the container.
You can always configure it to match your OS distro.
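For example, if your distro's kernel log is `/var/log/kern.log` on the host (mounted
at `/log` inside the container, as in the manifests above), the field could be set as
follows (a sketch showing only the relevant field):

```json
{
  "log": "/log/kern.log"
}
```
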
### Support Other Log Format

Kernel monitor uses the [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate the kernel log into the internal data structure. It is easy to
implement a new translator for a new log format.
## Caveats

It is recommended to run the node problem detector in your cluster to monitor
node health. However, you should be aware that this will introduce extra
resource overhead on each node. Usually this is fine, because:

* The kernel log is generated relatively slowly.
* A resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable
(see the [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629)).