Document control plane monitoring (#17578)
* Document control plane monitoring
* Update content/en/docs/concepts/cluster-administration/monitoring.md
  Co-Authored-By: Tim Bannister <tim@scalefactory.com>
* Update content/en/docs/concepts/cluster-administration/monitoring.md
  Co-Authored-By: Tim Bannister <tim@scalefactory.com>
* Merge controller-metrics.md into monitoring.md

Co-authored-by: Tim Bannister <tim@scalefactory.com>

pull/19058/head
parent d43ac11196
commit 26827a909e

@@ -1,50 +0,0 @@
---
title: Controller manager metrics
content_template: templates/concept
weight: 100
---

{{% capture overview %}}
Controller manager metrics provide important insight into the performance and health of
the controller manager.

{{% /capture %}}

{{% capture body %}}
## What are controller manager metrics

Controller manager metrics provide important insight into the performance and health of the controller manager.
These metrics include common Go language runtime metrics such as goroutine count and controller-specific metrics such as
etcd request latencies or cloud provider (AWS, GCE, OpenStack) API latencies that can be used
to gauge the health of a cluster.

Starting from Kubernetes 1.7, detailed cloud provider metrics are available for storage operations for GCE, AWS, vSphere and OpenStack.
These metrics can be used to monitor the health of persistent volume operations.

For example, for GCE these metrics are called:

```
cloudprovider_gce_api_request_duration_seconds{request="instance_list"}
cloudprovider_gce_api_request_duration_seconds{request="disk_insert"}
cloudprovider_gce_api_request_duration_seconds{request="disk_delete"}
cloudprovider_gce_api_request_duration_seconds{request="attach_disk"}
cloudprovider_gce_api_request_duration_seconds{request="detach_disk"}
cloudprovider_gce_api_request_duration_seconds{request="list_disk"}
```

## Configuration

In a cluster, controller-manager metrics are available from `http://localhost:10252/metrics`
from the host where the controller-manager is running.

The metrics are emitted in [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) and are human readable.

In a production environment you may want to configure Prometheus or some other metrics scraper
to periodically gather these metrics and make them available in some kind of time series database.

{{% /capture %}}

@@ -0,0 +1,132 @@
---
title: Metrics For The Kubernetes Control Plane
reviewers:
- brancz
- logicalhan
- RainbowMango
content_template: templates/concept
weight: 60
aliases:
- controller-metrics.md
---

{{% capture overview %}}

System component metrics give better insight into what is happening inside those components. Metrics are particularly useful for building dashboards and alerts.

Metrics in the Kubernetes control plane are emitted in [Prometheus format](https://prometheus.io/docs/instrumenting/exposition_formats/) and are human readable.

{{% /capture %}}

{{% capture body %}}

## Metrics in Kubernetes

In most cases, metrics are available on the `/metrics` endpoint of the HTTP server. For components that don't expose the endpoint by default, it can be enabled using the `--bind-address` flag.

Examples of those components:
* {{< glossary_tooltip term_id="kube-controller-manager" text="kube-controller-manager" >}}
* {{< glossary_tooltip term_id="kube-proxy" text="kube-proxy" >}}
* {{< glossary_tooltip term_id="kube-apiserver" text="kube-apiserver" >}}
* {{< glossary_tooltip term_id="kube-scheduler" text="kube-scheduler" >}}
* {{< glossary_tooltip term_id="kubelet" text="kubelet" >}}
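
Because the output is plain text, it is easy to inspect from a script. The following is a minimal Python sketch, not a complete implementation of the exposition format: `parse_metrics` and `fetch_metrics` are hypothetical helper names, and `fetch_metrics` assumes the endpoint is reachable over plain HTTP without credentials, which is often not true (many components serve `/metrics` over HTTPS, and with RBAC an unauthenticated request is typically rejected).

```python
import re
import urllib.request

# A sample line is: metric name, optional {label="value",...}, then the value.
# An optional trailing timestamp is ignored.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*=\s*"((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text format into (name, labels, value) tuples.

    Simplified sketch: skips comment lines (# HELP, # TYPE), does not unescape
    label values, and assumes no '}' appears inside a label value.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group("value"))  # also accepts +Inf, -Inf, NaN
        except ValueError:
            continue  # not a sample line this sketch understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, value))
    return samples


def fetch_metrics(url, timeout=5.0):
    """Fetch a metrics endpoint (no client TLS configuration, no authentication)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A typical call is `parse_metrics(fetch_metrics(url))`, where `url` is the metrics address of the component you want to inspect. A production scraper such as Prometheus handles the full format, TLS and authentication for you.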

In a production environment you may want to configure [Prometheus Server](https://prometheus.io/) or some other metrics scraper
to periodically gather these metrics and make them available in some kind of time series database.

Note that {{< glossary_tooltip term_id="kubelet" text="kubelet" >}} also exposes metrics on the `/metrics/cadvisor`, `/metrics/resource` and `/metrics/probes` endpoints. Those metrics do not have the same lifecycle.

If your cluster uses {{< glossary_tooltip term_id="rbac" text="RBAC" >}}, reading metrics requires authorization via a user, group or ServiceAccount with a ClusterRole that allows accessing `/metrics`.
For example:
```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
```
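
A ClusterRole grants nothing until it is bound to a subject. The following minimal Python sketch builds a ClusterRoleBinding for the `prometheus` ClusterRole above and prints it as JSON, which `kubectl apply -f -` accepts on standard input. The ServiceAccount name `prometheus`, the namespace `monitoring` and the binding name pattern are hypothetical placeholders, not names defined on this page.

```python
import json


def cluster_role_binding(role_name, sa_name, sa_namespace):
    """Build a ClusterRoleBinding manifest granting role_name to one ServiceAccount."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        # Hypothetical naming scheme; any unique name works.
        "metadata": {"name": f"{role_name}-{sa_namespace}-{sa_name}"},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": sa_name, "namespace": sa_namespace},
        ],
    }


if __name__ == "__main__":
    # "prometheus" / "monitoring" are placeholder names for illustration.
    print(json.dumps(cluster_role_binding("prometheus", "prometheus", "monitoring"), indent=2))
```

Saved as a script, its output can be piped straight into `kubectl apply -f -`. Binding a ServiceAccount (rather than a user) fits a scraper that runs as a Pod inside the cluster.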

## Metric lifecycle

Alpha metric → Stable metric → Deprecated metric → Hidden metric → Deletion

Alpha metrics have no stability guarantees; as such, they can be modified or deleted at any time.

Stable metrics are guaranteed not to change. Specifically, stability means:

* the metric itself will not be deleted (or renamed)
* the type of metric will not be modified
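
One way to check these two guarantees is to compare the `# TYPE` lines of two scrapes, for example taken before and after an upgrade. The following is a hypothetical Python sketch: the function names are illustrative, and the set of stable metric names has to come from elsewhere, such as the stable metrics list linked at the end of this page.

```python
def metric_types(text):
    """Map metric family name -> type, read from '# TYPE <name> <type>' lines."""
    types = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


def stability_violations(before, after, stable_names):
    """List stable metrics that disappeared or changed type between two scrapes."""
    old, new = metric_types(before), metric_types(after)
    problems = []
    for name in sorted(stable_names):
        if name not in old:
            continue  # not exposed before, nothing to compare
        if name not in new:
            problems.append(f"{name}: missing (deleted or renamed)")
        elif old[name] != new[name]:
            problems.append(f"{name}: type changed {old[name]} -> {new[name]}")
    return problems
```

A rename shows up as a missing metric, because the old family name no longer appears in the second scrape.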

Deprecated metrics signal that the metric will eventually be deleted. To find out in which version a metric was deprecated, check the annotation in its help text, which states the Kubernetes version from which the metric is considered deprecated.

Before deprecation:

```
# HELP some_counter this counts things
# TYPE some_counter counter
some_counter 0
```

After deprecation:

```
# HELP some_counter (Deprecated since 1.15.0) this counts things
# TYPE some_counter counter
some_counter 0
```
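
Because the deprecation is recorded in the HELP text, a scrape can be searched for it. The following minimal Python sketch assumes the `(Deprecated since X.Y.Z)` marker appears exactly as in the example above, at the start of the help string; the function name `deprecated_metrics` is hypothetical.

```python
import re

# "# HELP <name> (Deprecated since <version>) <help text>"
DEPRECATED_RE = re.compile(
    r"^# HELP (\S+) \(Deprecated since ([0-9]+\.[0-9]+(?:\.[0-9]+)?)\)"
)


def deprecated_metrics(text):
    """Return {metric_name: version} for metrics whose HELP text is marked deprecated."""
    found = {}
    for line in text.splitlines():
        match = DEPRECATED_RE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found
```

Running this against each component's output gives a list of dependencies to migrate before the metrics become hidden.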

Once a metric is hidden, it is not published for scraping by default. To use a hidden metric, you need to override the configuration for the relevant cluster component.

Once a metric is deleted, the metric is not published. You cannot change this using an override.

## Show hidden metrics

As described above, admins can enable hidden metrics through a command-line flag on a specific binary. This is intended as an escape hatch for admins who missed migrating away from metrics deprecated in the last release.

The flag `show-hidden-metrics-for-version` takes the version for which you want to show metrics deprecated in that release. The version is expressed as x.y, where x is the major version and y is the minor version. The patch version is not needed: even though a metric can be deprecated in a patch release, the metrics deprecation policy runs against the minor release.

The flag can only take the previous minor version as its value. If admins set `show-hidden-metrics-for-version` to the previous version, all metrics that were deprecated in that version, and are therefore hidden in the current one, are emitted. Older versions are not allowed because that would violate the metrics deprecation policy.
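
The rule in the two paragraphs above can be expressed as a small validation function. This is a hypothetical Python sketch of the rule as described here, not the validation code that Kubernetes components actually run; the function names are illustrative.

```python
def parse_minor_version(version):
    """Parse an 'x.y' string into (major, minor); a patch component is rejected."""
    major, minor = version.split(".")  # ValueError unless there is exactly one dot
    return int(major), int(minor)


def validate_show_hidden_version(flag_value, running_version):
    """Raise ValueError unless flag_value is the minor release just before running_version.

    running_version may carry a patch component (for example "1.13.4"); only its
    major and minor parts matter, because the policy works per minor release.
    """
    run_major, run_minor = (int(part) for part in running_version.split(".")[:2])
    major, minor = parse_minor_version(flag_value)
    if (major, minor) != (run_major, run_minor - 1):
        raise ValueError(
            f"show-hidden-metrics-for-version={flag_value} is not allowed on "
            f"{running_version}; only {run_major}.{run_minor - 1} is accepted"
        )
```

On a `1.13.x` binary, `"1.12"` passes, while `"1.11"` (too old) and `"1.12.3"` (patch version given) are rejected.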

Take metric `A` as an example, and assume that `A` is deprecated in `1.n`. According to the metrics deprecation policy:

* In release `1.n`, the metric is deprecated, and it is still emitted by default.
* In release `1.n+1`, the metric is hidden by default, and it can be emitted with the command-line flag `--show-hidden-metrics-for-version=1.n`.
* In release `1.n+2`, the metric should be removed from the codebase. There is no escape hatch anymore.

If you're upgrading from release `1.12` to `1.13` but still depend on a metric `A` deprecated in `1.12`, set hidden metrics via the command line: `--show-hidden-metrics-for-version=1.12`, and remember to remove this metric dependency before upgrading to `1.14`.
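
The three cases for metric `A` can be summarized as a single decision function. This is a hypothetical Python sketch of the policy as stated above; it assumes all versions are `x.y` strings with the same major version, and `show_hidden_for` stands for the value of `show-hidden-metrics-for-version` (or `None` when the flag is unset).

```python
def is_emitted(deprecated_in, running, show_hidden_for=None):
    """Return True if metric A, deprecated in `deprecated_in`, is emitted on `running`."""
    dep_minor = int(deprecated_in.split(".")[1])
    run_minor = int(running.split(".")[1])
    if run_minor <= dep_minor:
        # Up to and including release 1.n: still emitted by default.
        return True
    if run_minor == dep_minor + 1:
        # Release 1.n+1: hidden unless the flag names release 1.n.
        return show_hidden_for == deprecated_in
    # Release 1.n+2 and later: removed from the codebase, no escape hatch.
    return False
```

For the upgrade example: `is_emitted("1.12", "1.13")` is `False`, `is_emitted("1.12", "1.13", show_hidden_for="1.12")` is `True`, and `is_emitted("1.12", "1.14", show_hidden_for="1.12")` is `False`, which is why the dependency has to go before the upgrade to `1.14`.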

## Component metrics

### kube-controller-manager metrics

Controller manager metrics provide important insight into the performance and health of the controller manager.
These metrics include common Go language runtime metrics such as goroutine count and controller-specific metrics such as
etcd request latencies or cloud provider (AWS, GCE, OpenStack) API latencies that can be used
to gauge the health of a cluster.

Starting from Kubernetes 1.7, detailed cloud provider metrics are available for storage operations for GCE, AWS, vSphere and OpenStack.
These metrics can be used to monitor the health of persistent volume operations.

For example, for GCE these metrics are called:

```
cloudprovider_gce_api_request_duration_seconds{request="instance_list"}
cloudprovider_gce_api_request_duration_seconds{request="disk_insert"}
cloudprovider_gce_api_request_duration_seconds{request="disk_delete"}
cloudprovider_gce_api_request_duration_seconds{request="attach_disk"}
cloudprovider_gce_api_request_duration_seconds{request="detach_disk"}
cloudprovider_gce_api_request_duration_seconds{request="list_disk"}
```
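
To turn these series into a per-operation latency figure, you can divide the total observed time by the number of requests. The following Python sketch assumes the metric is exposed as a histogram, with `_sum` and `_count` series that carry the `request` label shown above; the function name is hypothetical. The values are cumulative since the process started, so the result is a long-run average; for recent latency, let your monitoring system compute the ratio of the per-second rates of `_sum` and `_count` instead.

```python
from collections import defaultdict


def mean_latency_by_request(samples, family="cloudprovider_gce_api_request_duration_seconds"):
    """Average latency in seconds per `request` label, from a histogram's _sum and _count.

    `samples` is an iterable of (name, labels, value) tuples, e.g. as produced by a
    Prometheus text-format parser. Bucket series and other labels are ignored, so
    series that differ only in other labels are added together.
    """
    sums, counts = defaultdict(float), defaultdict(float)
    for name, labels, value in samples:
        request = labels.get("request")
        if request is None:
            continue
        if name == family + "_sum":
            sums[request] += value
        elif name == family + "_count":
            counts[request] += value
    # Skip operations that have not been called yet (count of zero).
    return {req: sums[req] / counts[req] for req in counts if counts[req] > 0}
```

Aggregating `_sum` and `_count` separately before dividing keeps the result a true average over all requests, rather than an average of averages.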

{{% /capture %}}

{{% capture whatsnext %}}
* Read about the [Prometheus text format](https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#text-based-format) for metrics
* See the list of [stable Kubernetes metrics](https://github.com/kubernetes/kubernetes/blob/master/test/instrumentation/testdata/stable-metrics-list.yaml)
* Read about the [Kubernetes deprecation policy](https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-feature-or-behavior)
{{% /capture %}}

@@ -116,13 +116,13 @@ toc:
 - docs/concepts/cluster-administration/networking.md
 - docs/concepts/cluster-administration/network-plugins.md
 - docs/concepts/cluster-administration/logging.md
+- docs/concepts/cluster-administration/monitoring.md
 - docs/concepts/cluster-administration/kubelet-garbage-collection.md
 - docs/concepts/cluster-administration/federation.md
 - docs/concepts/cluster-administration/sysctl-cluster.md
 - docs/concepts/cluster-administration/authenticate-across-clusters-kubeconfig.md
 - docs/concepts/cluster-administration/master-node-communication.md
 - docs/concepts/cluster-administration/proxies.md
-- docs/concepts/cluster-administration/controller-metrics.md
 - docs/concepts/cluster-administration/device-plugins.md
 - title: Policies
 section: