clean up node problem detector task page
parent
34e8b55faf
commit
d3f374c0d8
|
@ -1,133 +1,117 @@
|
|||
---
|
||||
title: Monitor Node Health
|
||||
content_type: task
|
||||
reviewers:
|
||||
- Random-Liu
|
||||
- dchen1107
|
||||
content_type: task
|
||||
title: Monitor Node Health
|
||||
---
|
||||
|
||||
<!-- overview -->
|
||||
|
||||
*Node problem detector* is a [DaemonSet](/docs/concepts/workloads/controllers/daemonset/) monitoring the
|
||||
node health. It collects node problems from various daemons and reports them
|
||||
to the apiserver as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
|
||||
*Node problem detector* is a daemon that monitors and reports on a node's health.
|
||||
You can run node problem detector as a `DaemonSet`
|
||||
or as a standalone daemon. Node problem detector collects information about node problems from various daemons
|
||||
and reports these conditions to the API server as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
|
||||
and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
|
||||
|
||||
It supports some known kernel issue detection now, and will detect more and
|
||||
more node problems over time.
|
||||
|
||||
Currently Kubernetes won't take any action on the node conditions and events
|
||||
generated by node problem detector. In the future, a remedy system could be
|
||||
introduced to deal with node problems.
|
||||
|
||||
See more information
|
||||
[here](https://github.com/kubernetes/node-problem-detector).
|
||||
|
||||
|
||||
To learn how to install and use the node problem detector, see the
|
||||
[Node problem detector project documentation](https://github.com/kubernetes/node-problem-detector).
|
||||
|
||||
## {{% heading "prerequisites" %}}
|
||||
|
||||
|
||||
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
|
||||
|
||||
|
||||
{{< include "task-tutorial-prereqs.md" >}}
|
||||
|
||||
<!-- steps -->
|
||||
|
||||
## Limitations
|
||||
|
||||
* The kernel issue detection of node problem detector only supports file based
|
||||
kernel log now. It doesn't support log tools like journald.
|
||||
* Node problem detector only supports file-based kernel logs.
|
||||
Log tools such as `journald` are not supported.
|
||||
|
||||
* The kernel issue detection of node problem detector has assumption on kernel
|
||||
log format, and now it only works on Ubuntu and Debian. However, it is easy to extend
|
||||
it to [support other log format](/docs/tasks/debug-application-cluster/monitor-node-health/#support-other-log-format).
|
||||
* Node problem detector uses the kernel log format for reporting kernel issues.
|
||||
To learn how to extend the kernel log format, see [Add support for another log format](#support-other-log-format).
|
||||
|
||||
## Enable/Disable in GCE cluster
|
||||
## Enabling node problem detector
|
||||
|
||||
Node problem detector is [running as a cluster addon](/docs/setup/best-practices/cluster-large/#addon-resources) enabled by default in the
|
||||
gce cluster.
|
||||
Some cloud providers enable node problem detector as an {{< glossary_tooltip text="Addon" term_id="addons" >}}.
|
||||
You can also enable node problem detector with `kubectl` or by creating an Addon pod.
|
||||
|
||||
You can enable/disable it by setting the environment variable
|
||||
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before `kube-up.sh`.
|
||||
### Using kubectl to enable node problem detector {#using-kubectl}
|
||||
|
||||
## Use in Other Environment
|
||||
`kubectl` provides the most flexible management of node problem detector.
|
||||
You can overwrite the default configuration to fit it into your environment or
|
||||
to detect customized node problems. For example:
|
||||
|
||||
To enable node problem detector in other environment outside of GCE, you can use
|
||||
either `kubectl` or addon pod.
|
||||
1. Create a node problem detector configuration similar to `node-problem-detector.yaml`:
|
||||
|
||||
### Kubectl
|
||||
{{< codenew file="debug/node-problem-detector.yaml" >}}
|
||||
|
||||
This is the recommended way to start node problem detector outside of GCE. It
|
||||
provides more flexible management, such as overwriting the default
|
||||
configuration to fit it into your environment or detect
|
||||
customized node problems.
|
||||
{{< note >}}
|
||||
You should verify that the system log directory is right for your operating system distribution.
|
||||
{{< /note >}}
|
||||
|
||||
* **Step 1:** `node-problem-detector.yaml`:
|
||||
1. Start node problem detector with `kubectl`:
|
||||
|
||||
{{< codenew file="debug/node-problem-detector.yaml" >}}
|
||||
```shell
|
||||
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
|
||||
```
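After applying the manifest, you may want to confirm that the Pods are running on every node. The following is a minimal sketch, not part of the project's documentation. It assumes the manifest creates its objects in the `kube-system` namespace and labels its Pods `k8s-app=node-problem-detector`. Both values are assumptions, so check them against your copy of `node-problem-detector.yaml`. The helper name `check_npd` is hypothetical.

```shell
# Assumptions: the namespace and label selector below match your manifest.
# Override them through the environment if they do not.
NPD_NAMESPACE="${NPD_NAMESPACE:-kube-system}"
NPD_SELECTOR="${NPD_SELECTOR:-k8s-app=node-problem-detector}"

# check_npd: list the DaemonSets in the namespace, then the matching Pods
# together with the node each Pod is scheduled on.
check_npd() {
  kubectl get daemonsets --namespace "$NPD_NAMESPACE" &&
    kubectl get pods --namespace "$NPD_NAMESPACE" \
      --selector "$NPD_SELECTOR" --output wide
}
```

Call `check_npd` from a shell that has `kubectl` configured for your cluster. If the DaemonSet is healthy, you would expect one Pod per schedulable node.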
|
||||
|
||||
### Using an Addon pod to enable node problem detector {#using-addon-pod}
|
||||
|
||||
***Notice that you should make sure the system log directory is right for your
|
||||
OS distro.***
|
||||
|
||||
* **Step 2:** Start node problem detector with `kubectl`:
|
||||
|
||||
```shell
|
||||
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
|
||||
```
|
||||
|
||||
### Addon Pod
|
||||
|
||||
This is for those who have their own cluster bootstrap solution, and don't need
|
||||
to overwrite the default configuration. They could leverage the addon pod to
|
||||
If you are using a custom cluster bootstrap solution and don't need
|
||||
to overwrite the default configuration, you can leverage the Addon pod to
|
||||
further automate the deployment.
|
||||
|
||||
Just create `node-problem-detector.yaml`, and put it under the addon pods directory
|
||||
`/etc/kubernetes/addons/node-problem-detector` on master node.
|
||||
Create `node-problem-detector.yaml`, and save the configuration in the Addon pod's
|
||||
directory `/etc/kubernetes/addons/node-problem-detector` on a control plane node.
|
||||
|
||||
## Overwrite the Configuration
|
||||
## Overwrite the configuration
|
||||
|
||||
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
|
||||
is embedded when building the Docker image of node problem detector.
|
||||
|
||||
However, you can use [ConfigMap](/docs/tasks/configure-pod-container/configure-pod-configmap/) to overwrite it
|
||||
following the steps:
|
||||
However, you can use a [`ConfigMap`](/docs/tasks/configure-pod-container/configure-pod-configmap/)
|
||||
to overwrite the configuration:
|
||||
|
||||
* **Step 1:** Change the config files in `config/`.
|
||||
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
|
||||
node-problem-detector-config --from-file=config/`.
|
||||
* **Step 3:** Change the `node-problem-detector.yaml` to use the ConfigMap:
|
||||
1. Change the configuration files in `config/`.
|
||||
1. Create the `ConfigMap` `node-problem-detector-config`:
|
||||
|
||||
{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}
|
||||
```shell
|
||||
kubectl create configmap node-problem-detector-config --from-file=config/
|
||||
```
|
||||
|
||||
1. Change the `node-problem-detector.yaml` to use the `ConfigMap`:
|
||||
|
||||
* **Step 4:** Re-create the node problem detector with the new yaml file:
|
||||
{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}
|
||||
|
||||
```shell
|
||||
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml # If you have a node-problem-detector running
|
||||
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
|
||||
```
|
||||
1. Recreate the node problem detector with the new configuration file:
|
||||
|
||||
***Notice that this approach only applies to node problem detector started with `kubectl`.***
|
||||
```shell
|
||||
# If you have a node-problem-detector running, delete before recreating
|
||||
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
|
||||
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
|
||||
```
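When `kubectl create configmap` is given a directory with `--from-file`, each regular file in that directory becomes one key in the `ConfigMap`, and subdirectories are skipped. Before running the steps above, it can help to preview which keys would be created. This is a small sketch for that purpose. The helper name `list_configmap_keys` is hypothetical and is not part of `kubectl`.

```shell
# list_configmap_keys DIR: print the base name of each regular file in DIR.
# These names approximate the keys that --from-file=DIR/ would create.
list_configmap_keys() {
  for f in "$1"/*; do
    if [ -f "$f" ]; then
      basename "$f"
    fi
  done
}
```

For example, `list_configmap_keys config` prints one file name per line for the `config/` directory used above.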
|
||||
|
||||
For node problem detector running as cluster addon, because addon manager doesn't support
|
||||
ConfigMap, configuration overwriting is not supported now.
|
||||
{{< note >}}
|
||||
This approach only applies to a node problem detector started with `kubectl`.
|
||||
{{< /note >}}
|
||||
|
||||
Overwriting a configuration is not supported if a node problem detector runs as a cluster Addon.
|
||||
The Addon manager does not support `ConfigMap`.
|
||||
|
||||
## Kernel Monitor
|
||||
|
||||
*Kernel Monitor* is a problem daemon in node problem detector. It monitors kernel log
|
||||
and detects known kernel issues following predefined rules.
|
||||
*Kernel Monitor* is a system log monitor daemon supported in the node problem detector.
|
||||
Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.
|
||||
|
||||
The Kernel Monitor matches kernel issues according to a set of predefined rule list in
|
||||
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
|
||||
The rule list is extensible, and you can always extend it by overwriting the
|
||||
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json). The rule list is extensible. You can extend the rule list by overwriting the
|
||||
configuration.
|
||||
|
||||
### Add New NodeConditions
|
||||
### Add new NodeConditions
|
||||
|
||||
To support new node conditions, you can extend the `conditions` field in
|
||||
`config/kernel-monitor.json` with new condition definition:
|
||||
To support a new `NodeCondition`, you can extend the `conditions` field in
|
||||
`config/kernel-monitor.json` with a new condition definition such as:
|
||||
|
||||
```json
|
||||
{
|
||||
|
@ -137,10 +121,10 @@ To support new node conditions, you can extend the `conditions` field in
|
|||
}
|
||||
```
|
||||
|
||||
### Detect New Problems
|
||||
### Detect new problems
|
||||
|
||||
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
|
||||
with new rule definition:
|
||||
with a new rule definition:
|
||||
|
||||
```json
|
||||
{
|
||||
|
@ -151,31 +135,28 @@ with new rule definition:
|
|||
}
|
||||
```
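Before adding a rule, you can try its regular expression against a sample kernel log line. The sketch below assumes that a rule identifies a problem by a regular expression matched against log lines. The pattern, the sample line, and the helper name `matches_rule` are made up for illustration. Note that `grep -E` uses POSIX extended regular expressions, which may differ in detail from the syntax node problem detector accepts, so treat this as a rough check only.

```shell
# matches_rule PATTERN LINE: succeed when LINE matches the extended
# regular expression PATTERN.
matches_rule() {
  printf '%s\n' "$2" | grep -E -q -- "$1"
}

# Hypothetical pattern and sample kernel log line, for illustration only.
pattern='task [^ ]+ blocked for more than [0-9]+ seconds'
sample='kernel: INFO: task docker:20744 blocked for more than 120 seconds.'

if matches_rule "$pattern" "$sample"; then
  echo "pattern matches"
else
  echo "pattern does not match"
fi
```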
|
||||
|
||||
### Change Log Path
|
||||
### Configure path for the kernel log device {#kernel-log-device-path}
|
||||
|
||||
Kernel logs in different OS distros may be located in different paths. The `log`
|
||||
field in `config/kernel-monitor.json` is the log path inside the container.
|
||||
You can always configure it to match your OS distro.
|
||||
|
||||
### Support Other Log Format
|
||||
|
||||
Kernel monitor uses [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
|
||||
plugin to translate the kernel log into the internal data structure. It is easy to
|
||||
implement a new translator for a new log format.
|
||||
Check your kernel log path location in your operating system (OS) distribution.
|
||||
The Linux kernel [log device](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg) is usually presented as `/dev/kmsg`. However, the log path location varies by OS distribution.
|
||||
The `log` field in `config/kernel-monitor.json` represents the log path inside the container.
|
||||
You can configure the `log` field to match the device path as seen by the node problem detector.
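A quick way to check a candidate path is to test whether it exists and is readable. The helper below is a minimal sketch and its name, `check_log_path`, is hypothetical. It checks the path as seen by the shell that runs it, so a result on the node does not guarantee the same path is visible inside the container. That depends on how the volume is mounted in your manifest.

```shell
# check_log_path PATH: succeed when PATH exists and is readable by the
# current user; otherwise print a message to stderr and fail.
check_log_path() {
  log_path="$1"
  if [ -r "$log_path" ]; then
    echo "readable: $log_path"
  else
    echo "not readable: $log_path" >&2
    return 1
  fi
}
```

For example, run `check_log_path /dev/kmsg` on a node, or pass the file path your distribution uses for its kernel log.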
|
||||
|
||||
### Add support for another log format {#support-other-log-format}
|
||||
|
||||
Kernel monitor uses the
|
||||
[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go) plugin to translate the kernel log into its internal data structure.
|
||||
You can implement a new translator for a new log format.
|
||||
|
||||
<!-- discussion -->
|
||||
|
||||
## Caveats
|
||||
|
||||
It is recommended to run the node problem detector in your cluster to monitor
|
||||
the node health. However, you should be aware that this will introduce extra
|
||||
resource overhead on each node. Usually this is fine, because:
|
||||
|
||||
* The kernel log is generated relatively slowly.
|
||||
* Resource limit is set for node problem detector.
|
||||
* Even under high load, the resource usage is acceptable.
|
||||
(see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))
|
||||
## Recommendations and restrictions
|
||||
|
||||
It is recommended to run the node problem detector in your cluster to monitor node health.
|
||||
When running the node problem detector, you can expect extra resource overhead on each node.
|
||||
Usually this is fine, because:
|
||||
|
||||
* The kernel log grows relatively slowly.
|
||||
* A resource limit is set for the node problem detector.
|
||||
* Even under high load, the resource usage is acceptable. For more information, see the node problem detector
|
||||
[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
|
||||
|
|