---
content_type: task
title: Monitor Node Health
---
<!--
reviewers:
- Random-Liu
- dchen1107
content_type: task
title: Monitor Node Health
-->

<!-- overview -->
<!--
*Node problem detector* is a [DaemonSet](/docs/concepts/workloads/controllers/daemonset/) monitoring the
node health. It collects node problems from various daemons and reports them
to the apiserver as [NodeCondition](/docs/concepts/architecture/nodes/#condition)
and [Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core).
-->

*Node problem detector* is a [DaemonSet](/zh/docs/concepts/workloads/controllers/daemonset/)
that monitors node health. It collects node problems from various daemons and reports them
to the API server as
[NodeCondition](/zh/docs/concepts/architecture/nodes/#condition) and
[Event](/docs/reference/generated/kubernetes-api/{{< param "version" >}}/#event-v1-core)
objects.

<!--
It supports some known kernel issue detection now, and will detect more and
more node problems over time.
-->
It currently detects some known kernel issues, and will detect more kinds of node problems over time.

<!--
Currently Kubernetes won't take any action on the node conditions and events
generated by node problem detector. In the future, a remedy system could be
introduced to deal with node problems.
-->
Currently, Kubernetes takes no action on the node conditions and events that node problem detector generates.
In the future, a remedy system could be introduced to deal with these node problems.

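Because node problem detector reports problems as node conditions, one way to see them is to read `status.conditions` from a Node object. The following is a minimal sketch, assuming you have saved the output of `kubectl get node <node-name> -o json`. The helper name `summarize_conditions` and the sample data (including the `KernelDeadlock` entry) are illustrative only, not output captured from a real cluster.

```python
import json


def summarize_conditions(node: dict) -> list[tuple[str, str, str]]:
    """Return (type, status, reason) for each condition on a Node object."""
    conditions = node.get("status", {}).get("conditions", [])
    return [
        (c.get("type", ""), c.get("status", ""), c.get("reason", ""))
        for c in conditions
    ]


# Illustrative sample, trimmed to the fields used here. A real Node object
# returned by the API server carries many more fields.
sample_node = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True", "reason": "KubeletReady"},
      {"type": "KernelDeadlock", "status": "False", "reason": "KernelHasNoDeadlock"}
    ]
  }
}
""")

for cond_type, status, reason in summarize_conditions(sample_node):
    print(f"{cond_type}: {status} ({reason})")
```

Any condition type that is not one of the built-in kubelet conditions would, under this assumption, have been added by a component such as node problem detector.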
<!--
See more information
[here](https://github.com/kubernetes/node-problem-detector).
-->
For more information, see the [node-problem-detector repository](https://github.com/kubernetes/node-problem-detector).

## {{% heading "prerequisites" %}}

{{< include "task-tutorial-prereqs.md" >}} {{< version-check >}}

<!-- steps -->

<!--
## Limitations

* The kernel issue detection of node problem detector only supports file based
kernel log now. It doesn't support log tools like journald.
-->
## Limitations {#limitations}

* The kernel issue detection in node problem detector currently supports only file-based kernel logs.
  It does not support log tools such as journald.

<!--
* The kernel issue detection of node problem detector has assumption on kernel
log format, and now it only works on Ubuntu and Debian. However, it is easy to extend
it to [support other log format](/docs/tasks/debug-application-cluster/monitor-node-health/#support-other-log-format).
-->
* The kernel issue detection makes assumptions about the kernel log format, and currently works only on Ubuntu and Debian.
  However, it is easy to extend it to [support other log formats](#support-other-log-format).

<!--
## Enable/Disable in GCE cluster

Node problem detector is [running as a cluster addon](/docs/setup/cluster-large/#addon-resources) enabled by default in the
gce cluster.
-->
## Enable/Disable in GCE cluster

In GCE clusters, node problem detector runs as a
[cluster addon](/zh/docs/setup/best-practices/cluster-large/#addon-resources)
and is enabled by default.

<!--
You can enable/disable it by setting the environment variable
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before `kube-up.sh`.
-->
You can enable or disable it by setting the environment variable `KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before running `kube-up.sh`.

<!--
## Use in Other Environment

To enable node problem detector in other environment outside of GCE, you can use
either `kubectl` or addon pod.
-->
## Use in Other Environments {#use-in-other-environment}

To enable node problem detector in environments other than GCE, you can use either `kubectl` or an addon Pod.

<!--
### Kubectl

This is the recommended way to start node problem detector outside of GCE. It
provides more flexible management, such as overwriting the default
configuration to fit it into your environment or detect
customized node problems.
-->
### Kubectl

This is the recommended way to start node problem detector outside of GCE.
It offers more flexible management: for example, you can overwrite the default configuration to fit your environment or to detect custom node problems.

<!--
* **Step 1:** `node-problem-detector.yaml`:
-->
* **Step 1:** `node-problem-detector.yaml`:

{{< codenew file="debug/node-problem-detector.yaml" >}}

<!--
***Notice that you should make sure the system log directory is right for your
OS distro.***
-->
***Make sure that the system log directory is correct for your OS distribution.***

<!--
* **Step 2:** Start node problem detector with `kubectl`:
-->
* **Step 2:** Start node problem detector with `kubectl`:

```shell
kubectl create -f https://k8s.io/examples/debug/node-problem-detector.yaml
```

<!--
### Addon Pod

This is for those who have their own cluster bootstrap solution, and don't need
to overwrite the default configuration. They could leverage the addon pod to
further automate the deployment.
-->
### Addon Pod {#addon-pod}

This option is for users who have their own cluster bootstrap solution and do not need to overwrite the default configuration.
They can use the addon Pod to further automate the deployment.

<!--
Just create `node-problem-detector.yaml`, and put it under the addon pods directory
`/etc/kubernetes/addons/node-problem-detector` on master node.
-->
Create `node-problem-detector.yaml` and put it in the addon Pods directory
`/etc/kubernetes/addons/node-problem-detector` on the master node.

<!--
## Overwrite the Configuration

The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when building the docker image of node problem detector.
-->
## Overwrite the Configuration

The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when the Docker image of node problem detector is built.

<!--
However, you can use [ConfigMap](/docs/tasks/configure-pod-container/configure-pod-configmap/) to overwrite it
following the steps:
-->
However, you can use a [ConfigMap](/zh/docs/tasks/configure-pod-container/configure-pod-configmap/)
to overwrite it by following these steps:

<!--
* **Step 1:** Change the config files in `config/`.
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
node-problem-detector-config --from-file=config/`.
* **Step 3:** Change the `node-problem-detector.yaml` to use the ConfigMap:
-->
* **Step 1:** Change the config files in `config/`.
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap node-problem-detector-config --from-file=config/`.
* **Step 3:** Change `node-problem-detector.yaml` to use the ConfigMap:

{{< codenew file="debug/node-problem-detector-configmap.yaml" >}}

<!--
* **Step 4:** Re-create the node problem detector with the new yaml file:
-->
* **Step 4:** Re-create node problem detector with the new YAML file:

```shell
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml # If you have a node-problem-detector running
kubectl create -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml
```

<!--
***Notice that this approach only applies to node problem detector started with `kubectl`.***
-->
***This approach applies only to node problem detector started with `kubectl`.***

<!--
For node problem detector running as cluster addon, because addon manager doesn't support
ConfigMap, configuration overwriting is not supported now.
-->
Overwriting the configuration is currently not supported for node problem detector running as a cluster addon, because the addon manager does not support ConfigMap.

<!--
## Kernel Monitor

*Kernel Monitor* is a problem daemon in node problem detector. It monitors kernel log
and detects known kernel issues following predefined rules.
-->
## Kernel Monitor

*Kernel Monitor* is a problem daemon in node problem detector. It monitors the kernel log and detects known kernel issues by following predefined rules.

<!--
The Kernel Monitor matches kernel issues according to a set of predefined rule list in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible, and you can always extend it by overwriting the
configuration.
-->
Kernel Monitor matches kernel issues against the predefined rule list in [`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible: you can extend it by overwriting the configuration.

<!--
### Add New NodeConditions

To support new node conditions, you can extend the `conditions` field in
`config/kernel-monitor.json` with new condition definition:
-->
### Add New NodeConditions

To support new node conditions, extend the `conditions` field in `config/kernel-monitor.json` with a new condition definition:

```json
{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
```

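As a rough illustration of extending the `conditions` list, the sketch below appends a new condition to a configuration held in memory. It is a minimal example under stated assumptions: the helper `add_condition` and the `Example...` values are hypothetical, and the field names simply follow the JSON shown on this page, so check them against the configuration shipped with your version.

```python
import json


def add_condition(config: dict, cond_type: str, reason: str, message: str) -> dict:
    """Return a copy of config with a condition appended, skipping duplicate types."""
    updated = dict(config)
    conditions = list(config.get("conditions", []))
    if all(c.get("type") != cond_type for c in conditions):
        conditions.append({"type": cond_type, "reason": reason, "message": message})
    updated["conditions"] = conditions
    return updated


# Hypothetical starting configuration and condition values.
config = {"conditions": [], "rules": []}
config = add_condition(
    config,
    "ExampleDiskIssue",
    "ExampleDiskIsHealthy",
    "example: disk is healthy",
)
print(json.dumps(config, indent=2))
```

Returning a copy instead of mutating the input keeps the original configuration available for comparison before you write the file back.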
<!--
### Detect New Problems

To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
with new rule definition:
-->
### Detect New Problems

To detect new problems, extend the `rules` field in `config/kernel-monitor.json` with a new rule definition:

```json
{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
```

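Before deploying a new rule, it can help to check that its regular expression matches the log lines you expect. The sketch below does this with Python's `re` module. It is only an approximation: node problem detector itself is written in Go, so regexp syntax may differ in corner cases, and the rules, the helper `match_rules`, and the sample log line are all made up for illustration.

```python
import re


def match_rules(rules: list[dict], log_line: str) -> list[dict]:
    """Return every rule whose `message` regexp matches the log line."""
    return [rule for rule in rules if re.search(rule["message"], log_line)]


# Hypothetical rules that use the field names shown on this page.
rules = [
    {
        "type": "temporary",
        "reason": "ExampleTaskHung",
        "message": r"task \S+ blocked for more than \d+ seconds",
    },
    {
        "type": "permanent",
        "condition": "ExampleKernelIssue",
        "reason": "ExampleFatalError",
        "message": r"example fatal error: .+",
    },
]

# Made-up kernel log line used only to exercise the first rule.
line = "kernel: INFO: task docker:1234 blocked for more than 120 seconds."
for rule in match_rules(rules, line):
    print(rule["type"], rule["reason"])
```

Testing against both matching and non-matching lines is worthwhile, since an overly broad pattern would report problems that did not occur.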
<!--
### Change Log Path

Kernel log in different OS distros may locate in different path. The `log`
field in `config/kernel-monitor.json` is the log path inside the container.
You can always configure it to match your OS distro.
-->
### Change Log Path

The kernel log may be located at a different path in different OS distributions. The `log` field in `config/kernel-monitor.json` is the log path inside the container. You can configure it to match your OS distribution.

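The change described above amounts to editing a single field in the JSON file. Here is a minimal sketch, assuming the field is named `log` as stated on this page. The helper `set_log_path` and both paths are placeholders, so substitute the path at which your kernel log is mounted inside the container.

```python
import json


def set_log_path(config_text: str, log_path: str) -> str:
    """Return the config JSON with its `log` field set to log_path."""
    config = json.loads(config_text)
    config["log"] = log_path
    return json.dumps(config, indent=2)


# Placeholder configuration and paths, for illustration only.
original = '{"log": "/log/kern.log", "conditions": [], "rules": []}'
print(set_log_path(original, "/log/messages"))
```

Remember that the path is resolved inside the container, so the volume mount in `node-problem-detector.yaml` must expose the host log directory at the matching location.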
<!--
### Support Other Log Format

Kernel monitor uses [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate kernel log the internal data structure. It is easy to
implement a new translator for a new log format.
-->
### Support Other Log Formats {#support-other-log-format}

Kernel Monitor uses the [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate the kernel log into its internal data structure.
It is easy to implement a new translator for a new log format.

<!-- discussion -->

<!--
## Caveats

It is recommended to run the node problem detector in your cluster to monitor
the node health. However, you should be aware that this will introduce extra
resource overhead on each node. Usually this is fine, because:
-->
## Caveats {#caveats}

It is recommended to run node problem detector in your cluster to monitor node health.
However, be aware that this introduces extra resource overhead on each node. Usually this is fine, because:

<!--
* The kernel log is generated relatively slowly.
* Resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable.
(see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))
-->
* The kernel log is generated relatively slowly.
* A resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable
  (see the [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629)).