---
assignees:
- Random-Liu
- dchen1107
title: Monitoring Node Health
---

* TOC
{:toc}

## Node Problem Detector

*Node problem detector* is a [DaemonSet](/docs/admin/daemons/) that monitors
node health. It collects node problems from various daemons and reports them
to the apiserver as [NodeCondition](/docs/admin/node/#node-condition) and
[Event](/docs/api-reference/v1/definitions/#_v1_event).

It currently detects a set of known kernel issues, and will detect more
node problems over time.

Currently, Kubernetes does not take any action on the node conditions and events
generated by the node problem detector. In the future, a remedy system could be
introduced to deal with node problems.

See the [node-problem-detector repository](https://github.com/kubernetes/node-problem-detector)
for more information.

## Limitations

* The kernel issue detection of the node problem detector only supports
file-based kernel logs. It does not support log tools such as journald.

* The kernel issue detection makes assumptions about the kernel log format, and
currently only works on Ubuntu and Debian. However, it is easy to extend it to
[support other log formats](/docs/admin/node-problem/#support-other-log-format).

## Enable/Disable in GCE cluster

Node problem detector is [running as a cluster addon](/docs/admin/cluster-large/#addon-resources) enabled by default in GCE
clusters.

You can enable or disable it by setting the environment variable
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before running `kube-up.sh`.
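
For example, to bring up a cluster with the addon disabled, you could export the variable first. This is a sketch: the `./cluster/kube-up.sh` path assumes you run it from a Kubernetes release checkout, and the invocation itself is left commented out here:

```shell
# Disable the node problem detector addon for the next cluster bring-up.
# Set the variable to "true" (the default) to enable it instead.
export KUBE_ENABLE_NODE_PROBLEM_DETECTOR=false

# Then bring up the cluster as usual, e.g.:
# ./cluster/kube-up.sh

echo "KUBE_ENABLE_NODE_PROBLEM_DETECTOR=${KUBE_ENABLE_NODE_PROBLEM_DETECTOR}"
```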

## Use in Other Environment

To enable the node problem detector in environments outside of GCE, you can use
either `kubectl` or an addon pod.

### Kubectl

This is the recommended way to start the node problem detector outside of GCE. It
provides more flexible management, such as overwriting the default
configuration to fit your environment or to detect customized node problems.

* **Step 1:** Create `node-problem-detector.yaml`:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
```

***Notice that you should make sure the system log directory is right for your OS distro.***

* **Step 2:** Start the node problem detector with `kubectl`:

```shell
kubectl create -f node-problem-detector.yaml
```

### Addon Pod

This is for users who have their own cluster bootstrap solution and don't need
to overwrite the default configuration. They can leverage the addon pod to
further automate the deployment.

Just create `node-problem-detector.yaml`, and put it under the addon pods directory
`/etc/kubernetes/addons/node-problem-detector` on the master node.

## Overwrite the Configuration

The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when building the Docker image of the node problem detector.

However, you can use a [ConfigMap](/docs/user-guide/configmap/) to overwrite it
by following these steps:

* **Step 1:** Change the config files in `config/`.

* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap node-problem-detector-config --from-file=config/`.

* **Step 3:** Change `node-problem-detector.yaml` to use the ConfigMap:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with the ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config
```

* **Step 4:** Re-create the node problem detector with the new yaml file:

```shell
kubectl delete -f node-problem-detector.yaml # If you have a node-problem-detector running
kubectl create -f node-problem-detector.yaml
```

***Notice that this approach only applies to a node problem detector started with `kubectl`.***

For a node problem detector running as a cluster addon, overwriting the
configuration is not supported, because the addon manager does not support
ConfigMap.

## Kernel Monitor

*Kernel Monitor* is a problem daemon in the node problem detector. It monitors the
kernel log and detects known kernel issues following predefined rules.

The Kernel Monitor matches kernel issues against a predefined rule list in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible, and you can always extend it by overwriting the
configuration.

### Add New NodeConditions

To support a new node condition, you can extend the `conditions` field in
`config/kernel-monitor.json` with a new condition definition:

```json
{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
```
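
For instance, a filled-in condition could look like the following. The `NTPProblem` condition and its default reason and message are purely illustrative, not part of the default configuration:

```json
{
  "type": "NTPProblem",
  "reason": "NTPIsUp",
  "message": "ntp service is up"
}
```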

### Detect New Problems

To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
with a new rule definition:

```json
{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
```
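
Since the `message` field is a regular expression, it can help to sanity-check a candidate pattern against a sample log line before deploying it. A minimal sketch using `grep -E` follows; the log line and pattern are fabricated examples, not taken from the default rule list, and note that the node problem detector is written in Go, so its regexp dialect may differ slightly from grep's:

```shell
# A fabricated hung-task kernel log line:
sample='INFO: task docker:20744 blocked for more than 120 seconds.'

# Candidate regexp for a new rule's "message" field:
pattern='task [[:alnum:]:]+ blocked for more than [0-9]+ seconds'

# grep -E exits 0 when the pattern matches, so this prints "rule matches".
echo "$sample" | grep -E -q "$pattern" && echo "rule matches"
```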

### Change Log Path

The kernel log may live at different paths in different OS distros. The `log`
field in `config/kernel-monitor.json` is the log path inside the container.
You can always configure it to match your OS distro.
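
For example, if your distro writes kernel messages to `/var/log/messages` rather than `/var/log/kern.log` (a path you should verify on your own hosts), and the host's `/var/log/` is mounted at `/log` as in the manifests above, the `log` field would be set like this (fragment only):

```json
{
  "log": "/log/messages"
}
```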

### Support Other Log Format

Kernel monitor uses the [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate the kernel log into its internal data structure. It is easy to
implement a new translator for a new log format.

## Caveats

It is recommended to run the node problem detector in your cluster to monitor
node health. However, you should be aware that this introduces extra resource
overhead on each node. Usually this is fine, because:

* The kernel log grows relatively slowly.
* A resource limit is set for the node problem detector.
* Even under high load, the resource usage is acceptable (see the
  [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629)).