[zh] Resync tasks files[3]

pull/27887/head
howieyuen 2021-05-06 11:59:12 +08:00
parent 3316bae823
commit 3ca030c235
1 changed files with 181 additions and 179 deletions


@ -3,159 +3,131 @@ content_type: task
title: 节点健康监测
---
<!--
title: Monitor Node Health
content_type: task
reviewers:
- Random-Liu
- dchen1107
content_type: task
title: Monitor Node Health
-->
<!-- overview -->
<!--
*Node problem detector* is a [DaemonSet](/docs/concepts/workloads/controllers/daemonset/) monitoring the
node health. It collects node problems from various daemons and reports them
to the apiserver as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
*Node Problem Detector* is a daemon for monitoring and reporting about a node's health.
You can run Node Problem Detector as a `DaemonSet` or as a standalone daemon.
Node Problem Detector collects information about node problems from various daemons
and reports these conditions to the API server as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
To learn how to install and use Node Problem Detector, see
[Node Problem Detector project documentation](https://github.com/kubernetes/node-problem-detector).
-->
*节点问题探测器* 是一个 [DaemonSet](/zh/docs/concepts/workloads/controllers/daemonset/)
用来监控节点健康。它从各种守护进程收集节点问题,并以
*节点问题检测器(Node Problem Detector)* 是一个守护程序,用于监视和报告节点的健康状况。
你可以将节点问题检测器以 `DaemonSet` 或独立守护程序的方式运行。
节点问题检测器从各种守护进程收集节点问题,并以
[NodeCondition](/zh/docs/concepts/architecture/nodes/#condition) 和
[Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core)
的形式报告给 API 服务器。
<!--
It supports some known kernel issue detection now, and will detect more and
more node problems over time.
-->
它现在支持一些已知的内核问题检测,并将随着时间的推移,检测更多节点问题。
<!--
Currently Kubernetes won't take any action on the node conditions and events
generated by node problem detector. In the future, a remedy system could be
introduced to deal with node problems.
-->
目前,Kubernetes 不会对节点问题检测器监测到的节点状态和事件采取任何操作。
将来可能会引入一个补救系统来处理这些节点问题。
<!--
See more information
[here](https://github.com/kubernetes/node-problem-detector).
-->
更多信息请参阅 [这里](https://github.com/kubernetes/node-problem-detector)。
要了解如何安装和使用节点问题检测器,请参阅
[节点问题检测器项目文档](https://github.com/kubernetes/node-problem-detector)。
## {{% heading "prerequisites" %}}
{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}
{{< include "task-tutorial-prereqs.md" >}}
<!-- steps -->
<!--
## Limitations
* The kernel issue detection of node problem detector only supports file based
kernel log now. It doesn't support log tools like journald.
* Node Problem Detector only supports file based kernel log.
Log tools such as `journald` are not supported.
* Node Problem Detector uses the kernel log format for reporting kernel issues.
To learn how to extend the kernel log format, see [Add support for another log format](#support-other-log-format).
-->
## 局限性 {#limitations}
* 节点问题检测器的内核问题检测现在只支持基于文件类型的内核日志。
* 节点问题检测器只支持基于文件类型的内核日志。
不支持像 `journald` 这样的日志工具。
* 节点问题检测器使用内核日志格式来报告内核问题。
要了解如何扩展内核日志格式,请参阅[添加对另一个日志格式的支持](#support-other-log-format)。
<!--
* The kernel issue detection of node problem detector has assumption on kernel
log format, and now it only works on Ubuntu and Debian. However, it is easy to extend
it to [support other log format](/docs/tasks/debug-application-cluster/monitor-node-health/#support-other-log-format).
## Enabling Node Problem Detector
Some cloud providers enable Node Problem Detector as an {{< glossary_tooltip text="Addon" term_id="addons" >}}.
You can also enable Node Problem Detector with `kubectl` or by creating an Addon pod.
-->
* 节点问题检测器的内核问题检测对内核日志格式有一定要求,现在它只适用于 Ubuntu 和 Debian。
不过将其扩展为[支持其它日志格式](#support-other-log-format) 也很容易。
## 启用节点问题检测器
一些云供应商将节点问题检测器以{{< glossary_tooltip text="插件" term_id="addons" >}}形式启用。
你还可以使用 `kubectl` 或创建插件 Pod 来启用节点问题检测器。
<!--
## Enable/Disable in GCE cluster
## Using kubectl to enable Node Problem Detector {#using-kubectl}
Node problem detector is [running as a cluster addon](/docs/setup/cluster-large/#addon-resources) enabled by default in the
gce cluster.
`kubectl` provides the most flexible management of Node Problem Detector.
You can overwrite the default configuration to fit it into your environment or
to detect customized node problems. For example:
-->
## 在 GCE 集群中启用/禁用
## 使用 kubectl 启用节点问题检测器 {#using-kubectl}
节点问题检测器在 gce 集群中以
[集群插件的形式](/zh/docs/setup/best-practices/cluster-large/#addon-resources)
默认启用。
`kubectl` 提供了管理节点问题检测器的最灵活方式。
你可以覆盖默认配置使其适合你的环境或检测自定义节点问题。例如:
<!--
You can enable/disable it by setting the environment variable
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before `kube-up.sh`.
1. Create a Node Problem Detector configuration similar to `node-problem-detector.yaml`:
{{< codenew file="debug/node-problem-detector.yaml" >}}
{{< note >}}
You should verify that the system log directory is right for your operating system distribution.
{{< /note >}}
1. Start node problem detector with `kubectl`:
```shell
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
```
-->
你可以在运行 `kube-up.sh` 之前,以设置环境变量 `KUBE_ENABLE_NODE_PROBLEM_DETECTOR` 的形式启用/禁用它。
1. 创建类似于 `node-problem-detector.yaml` 的节点问题检测器配置:
{{< codenew file="debug/node-problem-detector.yaml" >}}
{{< note >}}
你应该检查系统日志目录是否适用于操作系统发行版本。
{{< /note >}}
1. 使用 `kubectl` 启动节点问题检测器:
```shell
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
```
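After applying the manifest, a quick check can confirm that the detector is running and that node conditions and events are visible. The following is a minimal sketch, assuming `kubectl` is already configured for the target cluster. The function name `npd_check` is hypothetical, and the commands do not assume any particular DaemonSet name or namespace.

```shell
#!/usr/bin/env bash
# Sketch: confirm that Node Problem Detector is deployed and reporting.
# Assumption: kubectl is configured for the cluster you applied the manifest to.

npd_check() {
  # List DaemonSets in every namespace; the one created from the manifest
  # should report as many READY pods as DESIRED pods.
  kubectl get daemonsets --all-namespaces

  # Print each node name followed by its conditions as TYPE=STATUS pairs.
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[*]}{.type}={.status}{" "}{end}{"\n"}{end}'

  # Show events whose subject is a Node object.
  kubectl get events --all-namespaces --field-selector involvedObject.kind=Node
}
```

Call `npd_check` after the rollout. Any conditions added by the detector appear next to the built-in ones, such as `Ready`.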
<!--
## Use in Other Environment
### Using an Addon pod to enable Node Problem Detector {#using-addon-pod}
To enable node problem detector in other environment outside of GCE, you can use
either `kubectl` or addon pod.
If you are using a custom cluster bootstrap solution and don't need
to overwrite the default configuration, you can leverage the Addon pod to
further automate the deployment.
Create `node-problem-detector.yaml`, and save the configuration in the Addon pod's
directory `/etc/kubernetes/addons/node-problem-detector` on a control plane node.
-->
## 在其它环境中使用 {#use-in-other-environment}
### 使用插件 pod 启用节点问题检测器 {#using-addon-pod}
要在 GCE 之外的其他环境中启用节点问题检测器,你可以使用 `kubectl` 或插件 pod。
如果你使用的是自定义集群引导解决方案,不需要覆盖默认配置,
可以利用插件 Pod 进一步自动化部署。
<!--
### Kubectl
This is the recommended way to start node problem detector outside of GCE. It
provides more flexible management, such as overwriting the default
configuration to fit it into your environment or detect
customized node problems.
-->
### Kubectl
这是在 GCE 之外启动节点问题检测器的推荐方法。
它的管理更加灵活,例如覆盖默认配置以使其适合你的环境或检测自定义节点问题。
<!--
* **Step 1:** `node-problem-detector.yaml`:
-->
* **步骤 1:** `node-problem-detector.yaml`:
{{< codenew file="debug/node-problem-detector.yaml" >}}
<!--
***Notice that you should make sure the system log directory is right for your
OS distro.***
-->
***请注意保证你的系统日志路径与你的 OS 发行版相对应。***
<!--
* **Step 2:** Start node problem detector with `kubectl`:
-->
* **步骤 2:** 执行 `kubectl` 来启动节点问题检测器:
```shell
kubectl create -f https://k8s.io/examples/debug/node-problem-detector.yaml
```
<!--
### Addon Pod
This is for those who have their own cluster bootstrap solution, and don't need
to overwrite the default configuration. They could leverage the addon pod to
further automate the deployment.
-->
### 插件 Pod {#addon-pod}
这适用于拥有自己的集群引导程序解决方案的用户,并且不需要覆盖默认配置。
他们可以利用插件 Pod 进一步自动化部署。
<!--
Just create `node-problem-detector.yaml`, and put it under the addon pods directory
`/etc/kubernetes/addons/node-problem-detector` on master node.
-->
只需创建 `node-problem-detector.yaml`,并将其放在主节点上的插件 pod 目录
`/etc/kubernetes/addons/node-problem-detector` 下。
创建 `node-problem-detector.yaml`,并将配置保存到控制平面节点上插件 Pod 的目录
`/etc/kubernetes/addons/node-problem-detector` 中。
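The step above amounts to copying a file. The following is a minimal sketch, assuming you run it on a control plane node as a user who can write to the addon directory. The function name `install_npd_addon` and its optional second argument are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: place the manifest in the addon directory on a control plane node.
# $1 = path to your node-problem-detector.yaml
# $2 = addon directory (defaults to the path named in this page)

install_npd_addon() {
  local manifest="$1"
  local addon_dir="${2:-/etc/kubernetes/addons/node-problem-detector}"
  # Create the directory if it is missing, then copy the manifest into it.
  mkdir -p "$addon_dir" && cp "$manifest" "$addon_dir/node-problem-detector.yaml"
}
```

For example, `install_npd_addon ./node-problem-detector.yaml` uses the default directory. Whether and when the file is picked up depends on your bootstrap solution.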
<!--
## Overwrite the Configuration
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when building the docker image of node problem detector.
is embedded when building the Docker image of Node Problem Detector.
-->
## 覆盖配置文件
@ -163,73 +135,97 @@ is embedded when building the docker image of node problem detector.
[默认配置](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)。
<!--
However, you can use [ConfigMap](/docs/tasks/configure-pod-container/configure-pod-configmap/) to overwrite it
following the steps:
However, you can use a [`ConfigMap`](/docs/tasks/configure-pod-container/configure-pod-configmap/)
to overwrite the configuration:
-->
不过,你可以像下面这样使用 [ConfigMap](/zh/docs/tasks/configure-pod-container/configure-pod-configmap/)
不过,你可以像下面这样使用 [`ConfigMap`](/zh/docs/tasks/configure-pod-container/configure-pod-configmap/)
将其覆盖:
<!--
* **Step 1:** Change the config files in `config/`.
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
node-problem-detector-config --from-file=config/`.
* **Step 3:** Change the `node-problem-detector.yaml` to use the ConfigMap:
1. Change the configuration files in `config/`
1. Create the `ConfigMap` `node-problem-detector-config`:
```shell
kubectl create configmap node-problem-detector-config --from-file=config/
```
1. Change the `node-problem-detector.yaml` to use the `ConfigMap`:
{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}
1. Recreate the Node Problem Detector with the new configuration file:
```shell
# If you have a node-problem-detector running, delete before recreating
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
```
-->
* **步骤 1:** 在 `config/` 中更改配置文件。
* **步骤 2:** 使用 `kubectl create configmap node-problem-detector-config --from-file=config/` 创建 `node-problem-detector-config`
* **步骤 3:** 更改 `node-problem-detector.yaml` 以使用 ConfigMap:
1. 更改 `config/` 中的配置文件
1. 创建 `ConfigMap` `node-problem-detector-config`:
```shell
kubectl create configmap node-problem-detector-config --from-file=config/
```
{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}
1. 更改 `node-problem-detector.yaml` 以使用 ConfigMap:
{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}
<!--
* **Step 4:** Re-create the node problem detector with the new yaml file:
1. 使用新的配置文件重新创建节点问题检测器:
```shell
# 如果你正在运行节点问题检测器,请先删除,然后再重新创建
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
```
<!--
{{< note >}}
This approach only applies to a Node Problem Detector started with `kubectl`.
{{< /note >}}
Overwriting a configuration is not supported if a Node Problem Detector runs as a cluster Addon.
The Addon manager does not support `ConfigMap`.
-->
* **步骤 4:** 使用新的 yaml 文件重新创建节点问题检测器:
{{< note >}}
此方法仅适用于通过 `kubectl` 启动的节点问题检测器。
{{< /note >}}
```shell
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml # If you have a node-problem-detector running
kubectl create -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
```
<!--
***Notice that this approach only applies to node problem detector started with `kubectl`.***
-->
***请注意,此方法仅适用于通过 `kubectl` 启动的节点问题检测器。***
<!--
For node problem detector running as cluster addon, because addon manager doesn't support
ConfigMap, configuration overwriting is not supported now.
-->
由于插件管理器不支持 ConfigMap,因此现在不支持对作为集群插件运行的节点问题检测器的配置进行覆盖。
如果节点问题检测器作为集群插件运行,则不支持覆盖配置。
插件管理器不支持 `ConfigMap`。
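The overwrite steps above start from edited files in `config/`. As a precaution that is not part of the original procedure, you may want to confirm that every JSON file still parses before creating the `ConfigMap`. The following is a minimal sketch, assuming `python3` is installed on the machine where you edit the files. The function name `validate_config_dir` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that each *.json file in a directory is syntactically valid.
# Assumption: python3 is on PATH; its standard json.tool module does the parsing.

validate_config_dir() {
  local dir="${1:-config}"
  local status=0
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue          # the glob matched nothing
    if python3 -m json.tool "$f" >/dev/null 2>&1; then
      echo "ok: $f"
    else
      echo "invalid JSON: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

One way to use it is `validate_config_dir config/ && kubectl create configmap node-problem-detector-config --from-file=config/`, so that the `ConfigMap` is only created when all files parse.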
<!--
## Kernel Monitor
*Kernel Monitor* is a problem daemon in node problem detector. It monitors kernel log
and detects known kernel issues following predefined rules.
*Kernel Monitor* is a system log monitor daemon supported in the Node Problem Detector.
Kernel monitor watches the kernel log and detects known kernel issues following predefined rules.
-->
## 内核监视器
*内核监视器* 是节点问题检测器中的问题守护进程。它监视内核日志并按照预定义规则检测已知内核问题。
*内核监视器(Kernel Monitor)* 是节点问题检测器中支持的系统日志监视器守护进程。
内核监视器观察内核日志并根据预定义规则检测已知的内核问题。
<!--
The Kernel Monitor matches kernel issues according to a set of predefined rule list in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible, and you can always extend it by overwriting the
configuration.
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json). The rule list is extensible. You can expand the rule list by overwriting the
configuration.
-->
内核监视器根据 [`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json) 中的一组预定义规则列表匹配内核问题。
内核监视器根据 [`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json)
中的一组预定义规则列表匹配内核问题。
规则列表是可扩展的,你始终可以通过覆盖配置来扩展它。
<!--
### Add New NodeConditions
### Add new NodeConditions
To support new node conditions, you can extend the `conditions` field in
`config/kernel-monitor.json` with new condition definition:
To support a new `NodeCondition`, create a condition definition within the `conditions` field in
`config/kernel-monitor.json`, for example:
```
-->
### 添加新的 NodeCondition
你可以使用新的状态描述来扩展 `config/kernel-monitor.json` 中的 `conditions` 字段以支持新的节点状态。
要支持新的 `NodeCondition`,请在 `config/kernel-monitor.json` 中的
`conditions` 字段中创建一个条件定义:
```json
{
@ -240,14 +236,14 @@ To support new node conditions, you can extend the `conditions` field in
```
<!--
### Detect New Problems
### Detect new problems
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
with new rule definition:
with a new rule definition:
-->
### 检测新的问题
你可以使用新的规则描述来扩展 `config/kernel-monitor.json` 中的 `rules` 字段以检测新问题
你可以使用新的规则描述来扩展 `config/kernel-monitor.json` 中的 `rules` 字段以检测新问题:
```json
{
@ -259,51 +255,57 @@ with new rule definition:
```
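Before adding a rule, it can be useful to try its pattern against a sample kernel log line. The following is a minimal sketch, assuming a rule's pattern is a regular expression matched against log lines. The pattern and the sample line are hypothetical illustrations, not the rule shown on this page. `grep -E` uses POSIX extended syntax, which may differ from the regular-expression dialect that Node Problem Detector itself uses, so treat a match here as a rough check only.

```shell
#!/usr/bin/env bash
# Sketch: try a candidate rule pattern against a sample kernel log line.
# Both values below are hypothetical and exist only for illustration.
pattern='task [^ ]+:[[:alnum:]_]+ blocked for more than [0-9]+ seconds\.'
sample='INFO: task docker:20744 blocked for more than 120 seconds.'

matches_rule() {
  # $1 = extended regular expression, $2 = log line
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

if matches_rule "$pattern" "$sample"; then
  echo "pattern matches"
else
  echo "pattern does not match"
fi
```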
<!--
### Change Log Path
### Configure path for the kernel log device {#kernel-log-device-path}
Kernel log in different OS distros may locate in different path. The `log`
field in `config/kernel-monitor.json` is the log path inside the container.
You can always configure it to match your OS distro.
Check your kernel log path location in your operating system (OS) distribution.
The Linux kernel [log device](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg) is usually presented as `/dev/kmsg`. However, the log path location varies by OS distribution.
The `log` field in `config/kernel-monitor.json` represents the log path inside the container.
You can configure the `log` field to match the device path as seen by the Node Problem Detector.
-->
### 更改日志路径
### 配置内核日志设备的路径 {#kernel-log-device-path}
不同操作系统发行版的内核日志路径可能不同。`config/kernel-monitor.json` 中的 `log` 字段是容器内的日志路径。你始终可以修改配置使其与你的 OS 发行版匹配。
检查你的操作系统(OS)发行版本中的内核日志路径位置。
Linux 内核[日志设备](https://www.kernel.org/doc/Documentation/ABI/testing/dev-kmsg)
通常呈现为 `/dev/kmsg`。
但是,日志路径位置因 OS 发行版本而异。
`config/kernel-monitor.json` 中的 `log` 字段表示容器内的日志路径。
你可以配置 `log` 字段以匹配节点问题检测器所看到的设备路径。
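To find out which kernel log location exists on a given node, you can probe a few candidates. The following is a minimal sketch, assuming you run it directly on the node. The helper name `first_readable` is hypothetical. `/dev/kmsg` comes from the text above, while `/var/log/kern.log` and `/var/log/messages` are common locations on some distributions and are listed here only as assumptions.

```shell
#!/usr/bin/env bash
# Sketch: print the first readable path from a list of candidate kernel logs.

first_readable() {
  for p in "$@"; do
    if [ -r "$p" ]; then
      printf '%s\n' "$p"
      return 0
    fi
  done
  return 1
}

# The candidate list is an assumption; adjust it for your distribution.
first_readable /dev/kmsg /var/log/kern.log /var/log/messages \
  || echo "no candidate kernel log found" >&2
```

The path printed is the one on the host. The `log` field must hold the path as it appears inside the container, which depends on how the manifest mounts it.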
<!--
### Support Other Log Format
### Add support for another log format {#support-other-log-format}
Kernel monitor uses [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate kernel log the internal data structure. It is easy to
implement a new translator for a new log format.
Kernel monitor uses the
[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go) plugin to translate the internal data structure of the kernel log.
You can implement a new translator for a new log format.
-->
### 支持其它日志格式 {#support-other-log-format}
### 添加对其它日志格式的支持 {#support-other-log-format}
内核监视器使用 [`Translator`] 插件将内核日志转换为内部数据结构。
我们可以很容易为新的日志格式实现新的翻译器。
内核监视器使用
[`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
插件转换内核日志的内部数据结构。
你可以为新的日志格式实现新的转换器。
<!-- discussion -->
<!--
## Caveats
## Recommendations and restrictions
It is recommended to run the node problem detector in your cluster to monitor
the node health. However, you should be aware that this will introduce extra
resource overhead on each node. Usually this is fine, because:
It is recommended to run the Node Problem Detector in your cluster to monitor node health.
When running the Node Problem Detector, you can expect extra resource overhead on each node.
Usually this is fine, because:
* The kernel log grows relatively slowly.
* A resource limit is set for the Node Problem Detector.
* Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector
[benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629).
-->
## 注意事项 {#caveats}
我们建议在集群中运行节点问题检测器来监视节点运行状况。
但是,你应该知道这将在每个节点上引入额外的资源开销。一般情况下没有影响,因为:
<!--
* The kernel log is generated relatively slowly.
* Resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable.
(see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))
-->
* 内核日志生成相对较慢。
* 节点问题检测器有资源限制。
* 即使在高负载下,资源使用也是可以接受的。
(参阅 [基准测试结果](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))
## 建议和限制
建议在集群中运行节点问题检测器以监控节点运行状况。
运行节点问题检测器时,你可以预期每个节点上的额外资源开销。
通常这是可接受的,因为:
* 内核日志增长相对缓慢。
* 已经为节点问题检测器设置了资源限制。
* 即使在高负载下,资源使用也是可接受的。有关更多信息,请参阅节点问题检测器
[基准结果](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629)。