clean up node problem detector task page
parent: 34e8b55faf
commit: d3f374c0d8
|
@ -1,133 +1,117 @@
|
||||||
---
|
---
|
||||||
|
title: Monitor Node Health
|
||||||
|
content_type: task
|
||||||
reviewers:
|
reviewers:
|
||||||
- Random-Liu
|
- Random-Liu
|
||||||
- dchen1107
|
- dchen1107
|
||||||
content_type: task
|
|
||||||
title: Monitor Node Health
|
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- overview -->
|
<!-- overview -->
|
||||||
|
|
||||||
*Node problem detector* is a [DaemonSet](/docs/concepts/workloads/controllers/daemonset/) monitoring the
|
*Node problem detector* is a daemon for monitoring and reporting about a node's health.
|
||||||
node health. It collects node problems from various daemons and reports them
|
You can run node problem detector as a `DaemonSet`
|
||||||
to the apiserver as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
|
or as a standalone daemon. Node problem detector collects information about node problems from various daemons
|
||||||
|
and reports these conditions to the API server as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
|
||||||
and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
|
and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
|
||||||
|
|
||||||
It supports some known kernel issue detection now, and will detect more and
|
To learn how to install and use the node problem detector, see the
|
||||||
more node problems over time.
|
[Node problem detector project documentation](https://github.com/kubernetes/node-problem-detector).
|
||||||
|
|
||||||
Currently Kubernetes won't take any action on the node conditions and events
|
|
||||||
generated by node problem detector. In the future, a remedy system could be
|
|
||||||
introduced to deal with node problems.
|
|
||||||
|
|
||||||
See more information
|
|
||||||
[here](https://github.com/kubernetes/node-problem-detector).
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## {{% heading "prerequisites" %}}
|
## {{% heading "prerequisites" %}}
|
||||||
|
|
||||||
|
{{< include "task-tutorial-prereqs.md" >}}
|
||||||
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
<!-- steps -->
|
<!-- steps -->
|
||||||
|
|
||||||
## Limitations
|
## Limitations
|
||||||
|
|
||||||
* The kernel issue detection of node problem detector only supports file based
|
* Node problem detector only supports file-based kernel logs.
|
||||||
kernel log now. It doesn't support log tools like journald.
|
Log tools such as `journald` are not supported.
|
||||||
|
|
||||||
* The kernel issue detection of node problem detector has assumption on kernel
|
* Node problem detector uses the kernel log format for reporting kernel issues.
|
||||||
log format, and now it only works on Ubuntu and Debian. However, it is easy to extend
|
To learn how to extend the kernel log format, see [Add support for another log format](#support-other-log-format).
|
||||||
it to [support other log format](/docs/tasks/debug-application-cluster/monitor-node-health/#support-other-log-format).
|
|
||||||
|
|
||||||
## Enable/Disable in GCE cluster
|
## Enabling node problem detector
|
||||||
|
|
||||||
Node problem detector is [running as a cluster addon](/docs/setup/best-practices/cluster-large/#addon-resources) enabled by default in the
|
Some cloud providers enable node problem detector as an {{< glossary_tooltip text="Addon" term_id="addons" >}}.
|
||||||
gce cluster.
|
You can also enable node problem detector with `kubectl` or by creating an Addon pod.
|
||||||
|
|
||||||
You can enable/disable it by setting the environment variable
|
### Using kubectl to enable node problem detector {#using-kubectl}
|
||||||
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before `kube-up.sh`.
|
|
||||||
|
|
||||||
## Use in Other Environment
|
`kubectl` provides the most flexible management of node problem detector.
|
||||||
|
You can overwrite the default configuration to fit it into your environment or
|
||||||
|
to detect customized node problems. For example:
|
||||||
|
|
||||||
To enable node problem detector in other environment outside of GCE, you can use
|
1. Create a node problem detector configuration similar to `node-problem-detector.yaml`:
|
||||||
either `kubectl` or addon pod.
|
|
||||||
|
|
||||||
### Kubectl
|
|
||||||
|
|
||||||
This is the recommended way to start node problem detector outside of GCE. It
|
|
||||||
provides more flexible management, such as overwriting the default
|
|
||||||
configuration to fit it into your environment or detect
|
|
||||||
customized node problems.
|
|
||||||
|
|
||||||
* **Step 1:** `node-problem-detector.yaml`:
|
|
||||||
|
|
||||||
{{< codenew file="debug/node-problem-detector.yaml" >}}
|
{{< codenew file="debug/node-problem-detector.yaml" >}}
|
||||||
|
|
||||||
|
{{< note >}}
|
||||||
|
You should verify that the system log directory is right for your operating system distribution.
|
||||||
|
{{< /note >}}
|
||||||
|
|
||||||
***Notice that you should make sure the system log directory is right for your
|
1. Start node problem detector with `kubectl`:
|
||||||
OS distro.***
|
|
||||||
|
|
||||||
* **Step 2:** Start node problem detector with `kubectl`:
|
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
|
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
|
||||||
```
|
```
|
||||||
|
|
||||||
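After applying the manifest, it can help to confirm that the DaemonSet rolled out and that nodes report conditions. The following is a minimal sketch and is not part of this commit. The `kube-system` namespace and the `k8s-app=node-problem-detector` label are assumptions about the example manifest, so adjust them to match your own `node-problem-detector.yaml`. The `KUBECTL`, `NPD_NAMESPACE`, and `NPD_SELECTOR` variables and both function names are hypothetical helpers introduced for illustration.

```shell
# Sketch: inspect a node problem detector deployment.
# Assumptions (not taken from the page): namespace "kube-system" and the
# label "k8s-app=node-problem-detector". Override them as needed.
KUBECTL="${KUBECTL:-kubectl}"
NPD_NAMESPACE="${NPD_NAMESPACE:-kube-system}"
NPD_SELECTOR="${NPD_SELECTOR:-k8s-app=node-problem-detector}"

npd_status() {
  # List the DaemonSet and the pods it created, with the node each pod runs on.
  "$KUBECTL" get daemonset --namespace "$NPD_NAMESPACE" --selector "$NPD_SELECTOR"
  "$KUBECTL" get pods --namespace "$NPD_NAMESPACE" --selector "$NPD_SELECTOR" --output wide
}

npd_node_conditions() {
  # $1: node name. Shows the conditions and events recorded for that node.
  "$KUBECTL" describe node "$1"
}
```

Run `npd_status` first, then `npd_node_conditions <node-name>` for a node of interest and look at the `Conditions` and `Events` sections of the output.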
### Addon Pod
|
### Using an Addon pod to enable node problem detector {#using-addon-pod}
|
||||||
|
|
||||||
This is for those who have their own cluster bootstrap solution, and don't need
|
If you are using a custom cluster bootstrap solution and don't need
|
||||||
to overwrite the default configuration. They could leverage the addon pod to
|
to overwrite the default configuration, you can leverage the Addon pod to
|
||||||
further automate the deployment.
|
further automate the deployment.
|
||||||
|
|
||||||
Just create `node-problem-detector.yaml`, and put it under the addon pods directory
|
Create `node-problem-detector.yaml`, and save the configuration in the Addon pod's
|
||||||
`/etc/kubernetes/addons/node-problem-detector` on master node.
|
directory `/etc/kubernetes/addons/node-problem-detector` on a control plane node.
|
||||||
|
|
||||||
## Overwrite the Configuration
|
## Overwrite the configuration
|
||||||
|
|
||||||
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
|
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
|
||||||
is embedded when building the Docker image of node problem detector.
|
is embedded when building the Docker image of node problem detector.
|
||||||
|
|
||||||
However, you can use [ConfigMap](/docs/tasks/configure-pod-container/configure-pod-configmap/) to overwrite it
|
However, you can use a [`ConfigMap`](/docs/tasks/configure-pod-container/configure-pod-configmap/)
|
||||||
following the steps:
|
to overwrite the configuration:
|
||||||
|
|
||||||
* **Step 1:** Change the config files in `config/`.
|
1. Change the configuration files in `config/`.
|
||||||
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
|
1. Create the `ConfigMap` `node-problem-detector-config`:
|
||||||
node-problem-detector-config --from-file=config/`.
|
|
||||||
* **Step 3:** Change the `node-problem-detector.yaml` to use the ConfigMap:
|
```shell
|
||||||
|
kubectl create configmap node-problem-detector-config --from-file=config/
|
||||||
|
```
|
||||||
|
|
||||||
|
1. Change the `node-problem-detector.yaml` to use the `ConfigMap`:
|
||||||
|
|
||||||
{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}
|
{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}
|
||||||
|
|
||||||
|
1. Recreate the node problem detector with the new configuration file:
|
||||||
* **Step 4:** Re-create the node problem detector with the new yaml file:
|
|
||||||
|
|
||||||
```shell
|
```shell
|
||||||
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml # If you have a node-problem-detector running
|
# If you have a node-problem-detector running, delete before recreating
|
||||||
|
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
|
||||||
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
|
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
|
||||||
```
|
```
|
||||||
|
|
||||||
***Notice that this approach only applies to node problem detector started with `kubectl`.***
|
{{< note >}}
|
||||||
|
This approach only applies to a node problem detector started with `kubectl`.
|
||||||
|
{{< /note >}}
|
||||||
|
|
||||||
For node problem detector running as cluster addon, because addon manager doesn't support
|
Overwriting a configuration is not supported if a node problem detector runs as a cluster Addon.
|
||||||
ConfigMap, configuration overwriting is not supported now.
|
The Addon manager does not support `ConfigMap`.
|
||||||
|
|
||||||
## Kernel Monitor
|
## Kernel Monitor
|
||||||
|
|
||||||
*Kernel Monitor* is a problem daemon in node problem detector. It monitors kernel log
|
*Kernel Monitor* is a system log monitor daemon supported in the node problem detector.
|
||||||
and detects known kernel issues following predefined rules.
|
Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.
|
||||||
|
|
||||||
The Kernel Monitor matches kernel issues according to a predefined list of rules in
|
The Kernel Monitor matches kernel issues according to a predefined list of rules in
|
||||||
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
|
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json). The rule list is extensible. You can extend the rule list by overwriting the
|
||||||
The rule list is extensible, and you can always extend it by overwriting the
|
|
||||||
configuration.
|
configuration.
|
||||||
|
|
||||||
### Add New NodeConditions
|
### Add new NodeConditions
|
||||||
|
|
||||||
To support new node conditions, you can extend the `conditions` field in
|
To support a new `NodeCondition`, you can extend the `conditions` field in
|
||||||
`config/kernel-monitor.json` with new condition definition:
|
`config/kernel-monitor.json` with a new condition definition such as:
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
|
@ -137,10 +121,10 @@ To support new node conditions, you can extend the `conditions` field in
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
### Detect New Problems
|
### Detect new problems
|
||||||
|
|
||||||
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
|
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
|
||||||
with new rule definition:
|
with a new rule definition:
|
||||||
|
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
|
@ -151,31 +135,28 @@ with new rule definition:
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
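Before adding a rule, it may be useful to try the pattern against a sample kernel log line. The sketch below assumes that a rule identifies a problem with a regular expression matched against kernel log lines. The rule body is elided in this diff, so treat that as an assumption. The `matches_rule` helper, the sample line, and the pattern are hypothetical, and `grep -E` uses POSIX extended regular expressions, which may differ from the dialect node problem detector uses.

```shell
# Sketch: check whether a candidate pattern matches a kernel log line.
# grep -E uses POSIX extended regular expressions, which may differ from
# the regular expression dialect used by node problem detector.
matches_rule() {
  # $1: candidate pattern, $2: sample kernel log line
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Hypothetical sample line and pattern, for illustration only.
sample='task docker:20744 blocked for more than 120 seconds.'
pattern='task [^ ]+:[0-9]+ blocked for more than [0-9]+ seconds\.'

if matches_rule "$pattern" "$sample"; then
  echo "pattern matches"
else
  echo "pattern does not match"
fi
```

Testing patterns this way only catches obvious mistakes. The final check is still to deploy the updated configuration and watch for the expected condition or event on a node.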
### Change Log Path
|
### Configure path for the kernel log device {#kernel-log-device-path}
|
||||||
|
|
||||||
Kernel log in different OS distros may locate in different path. The `log`
|
Check your kernel log path location in your operating system (OS) distribution.
|
||||||
field in `config/kernel-monitor.json` is the log path inside the container.
|
The Linux kernel [log device](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg) is usually presented as `/dev/kmsg`. However, the log path location varies by OS distribution.
|
||||||
You can always configure it to match your OS distro.
|
The `log` field in `config/kernel-monitor.json` represents the log path inside the container.
|
||||||
|
You can configure the `log` field to match the device path as seen by the node problem detector.
|
||||||
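To decide what to put in the `log` field, you can first check which kernel log paths exist on a node. This is a minimal sketch: `find_kernel_logs` is a hypothetical helper, and the default candidate paths other than `/dev/kmsg` are assumptions about common distributions. Whatever path you find on the host must also be mounted into the container, because the `log` field refers to the path inside the container.

```shell
# Sketch: print which candidate kernel log paths exist on this node.
# The default candidates are assumptions; pass your own paths as arguments.
find_kernel_logs() {
  if [ "$#" -eq 0 ]; then
    set -- /dev/kmsg /var/log/kern.log /var/log/messages
  fi
  for path in "$@"; do
    if [ -e "$path" ]; then
      printf '%s\n' "$path"
    fi
  done
}
```

Call `find_kernel_logs` with no arguments on the node to try the default candidates, or pass explicit paths such as `find_kernel_logs /var/log/kern.log`.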
### Support Other Log Format
|
|
||||||
|
|
||||||
Kernel monitor uses [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
|
|
||||||
plugin to translate kernel log the internal data structure. It is easy to
|
|
||||||
implement a new translator for a new log format.
|
|
||||||
|
|
||||||
|
### Add support for another log format {#support-other-log-format}
|
||||||
|
|
||||||
|
Kernel monitor uses the
|
||||||
|
[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go) plugin to translate the kernel log into its internal data structure.
|
||||||
|
You can implement a new translator for a new log format.
|
||||||
|
|
||||||
<!-- discussion -->
|
<!-- discussion -->
|
||||||
|
|
||||||
## Caveats
|
## Recommendations and restrictions
|
||||||
|
|
||||||
It is recommended to run the node problem detector in your cluster to monitor
|
|
||||||
the node health. However, you should be aware that this will introduce extra
|
|
||||||
resource overhead on each node. Usually this is fine, because:
|
|
||||||
|
|
||||||
* The kernel log is generated relatively slowly.
|
|
||||||
* Resource limit is set for node problem detector.
|
|
||||||
* Even under high load, the resource usage is acceptable.
|
|
||||||
(see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))
|
|
||||||
|
|
||||||
|
It is recommended to run the node problem detector in your cluster to monitor node health.
|
||||||
|
When running the node problem detector, you can expect extra resource overhead on each node.
|
||||||
|
Usually this is fine, because:
|
||||||
|
|
||||||
|
* The kernel log grows relatively slowly.
|
||||||
|
* A resource limit is set for the node problem detector.
|
||||||
|
* Even under high load, the resource usage is acceptable. For more information, see the node problem detector
|
||||||
|
[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
|
||||||
|
|