---
title: Monitor your cluster
seotitle: Monitor your InfluxDB Cloud Dedicated cluster
description: >
  Use the Grafana dashboard provided by InfluxData to monitor your
  InfluxDB Cloud Dedicated cluster.
menu:
  influxdb_cloud_dedicated:
    parent: Administer InfluxDB Cloud
weight: 104
---

Use the Grafana dashboard provided by InfluxData to monitor your
{{< product-name >}} cluster.

{{% note %}}
#### Not available for all clusters

{{< product-name >}} monitoring dashboards are not available for all clusters.
For questions about availability, [contact InfluxData support](https://support.influxdata.com).
{{% /note %}}

- [Access your monitoring dashboard](#access-your-monitoring-dashboard)
- [Dashboard sections and cells](#dashboard-sections-and-cells)

{{< img-hd src="/img/influxdb/clustered-admin-monitoring-dashboard.png" alt="InfluxDB Cloud Dedicated monitoring dashboard" />}}

## Access your monitoring dashboard

To access your {{< product-name >}} monitoring dashboard, visit the
`/observability` endpoint of your {{< product-name >}} cluster in your browser:

<pre>
<a href="https://{{< influxdb/host >}}/observability">https://{{< influxdb/host >}}/observability</a>
</pre>

Use the credentials provided by InfluxData to log into your cluster monitoring dashboard.
If you do not have login credentials, [contact InfluxData support](https://support.influxdata.com).

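If you want to script a quick connectivity check before opening the dashboard in a browser, a sketch like the following is enough; it only confirms that the `/observability` endpoint answers over HTTPS. The host below is a placeholder for your cluster URL, and the check does not log in or replace the credentials above.

```python
# Sketch: confirm the monitoring dashboard endpoint is reachable.
# Replace the placeholder host with your cluster URL; this does not
# authenticate. Logging in still requires the credentials above.
import urllib.request

CLUSTER_HOST = "cluster-id.a.influxdb.io"  # placeholder cluster host
url = f"https://{CLUSTER_HOST}/observability"

try:
    with urllib.request.urlopen(url, timeout=10) as response:
        print(f"{url} responded with HTTP {response.status}")
except Exception as err:
    print(f"Could not reach {url}: {err}")
```
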
## Dashboard sections and cells

The dashboard is divided into the following sections that visualize metrics
related to the health of components in your {{< product-name >}} cluster:

- [Query Tier Cpu/Mem](#query-tier-cpumem)
- [Query Tier](#query-tier)
- [Ingest Tier Cpu/Mem](#ingest-tier-cpumem)
- [Ingest Tier](#ingest-tier)
- [Compaction Tier Cpu/Mem](#compaction-tier-cpumem)
- [Compactor](#compactor)
- [Ingestor Catalog Operations](#ingestor-catalog-operations)
- [Catalog Operations Overview](#catalog-operations-overview)

### Query Tier Cpu/Mem

The **Query Tier Cpu/Mem** section displays the CPU and memory usage of query
pods as reported by Kubernetes.
[Queriers](/influxdb/cloud-dedicated/reference/internals/storage-engine/#querier)
handle query requests and return query results.

- [CPU Utilization (k8s)](#cpu-utilization-k8s)
- [Memory Usage (k8s)](#memory-usage-k8s)

#### CPU Utilization (k8s)

The CPU utilization of query pods as reported by Kubernetes container usage.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
The CPU limit is represented by the top line in the visualization.

#### Memory Usage (k8s)

The memory usage of query pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
The memory limit is represented by the top line in the visualization.

---

### Query Tier

The **Query Tier** section displays metrics reported from the InfluxDB gRPC
query API.
[Queriers](/influxdb/cloud-dedicated/reference/internals/storage-engine/#querier)
handle query requests and return query results.

- [gRPC Requests (ok)](#grpc-requests-ok)
- [gRPC Requests (not ok)](#grpc-requests-not-ok)
- [Request Duration (flight DoGet) (ok + !ok)](#request-duration-flight-doget-ok--ok)
- [Successful Request Duration (flight DoGet)](#successful-request-duration-flight-doget)
- [Acquire Duration](#acquire-duration)

#### gRPC Requests (ok)

The rate of gRPC requests for different endpoints that returned the `OK` status code,
summed across all Queriers.
Request rate is reported in requests per second.

#### gRPC Requests (not ok)

The rate of gRPC requests for all endpoints that returned a status code other
than `OK`, summed across all Queriers.
Request rate is reported in requests per second.

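Both cells report per-second rates derived from cumulative request counters, summed across Queriers and grouped by status. The following sketch illustrates that arithmetic only; the counter samples, pod names, and status labels are hypothetical and not the dashboard's actual query.

```python
# Sketch: derive a requests-per-second rate from cumulative gRPC request
# counters sampled 30 seconds apart, summed across all Queriers.
# The sample data below is hypothetical; the dashboard computes this from
# its own metrics backend.
INTERVAL_SECONDS = 30

# (querier, status) -> cumulative request count at the previous and current sample
previous = {("querier-0", "OK"): 10_000, ("querier-1", "OK"): 9_500,
            ("querier-0", "Internal"): 12, ("querier-1", "Internal"): 8}
current = {("querier-0", "OK"): 10_900, ("querier-1", "OK"): 10_250,
           ("querier-0", "Internal"): 15, ("querier-1", "Internal"): 9}

rates = {}  # status -> requests per second summed across Queriers
for (querier, status), count in current.items():
    delta = count - previous[(querier, status)]
    rates[status] = rates.get(status, 0) + delta / INTERVAL_SECONDS

ok_rate = rates.get("OK", 0)
not_ok_rate = sum(rate for status, rate in rates.items() if status != "OK")
print(f"gRPC Requests (ok): {ok_rate:.1f} req/s")
print(f"gRPC Requests (not ok): {not_ok_rate:.1f} req/s")
```
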
#### Request Duration (flight DoGet) (ok + !ok)

A gRPC request duration heatmap for all requests to the `DoGet` endpoint,
regardless of request status.

The heatmap shows how many requests occurred in each duration "bucket" per time
interval and provides insight into how long a typical query request takes.
It also shows, at a glance, the predominant latency range as well as the
minimum and maximum durations of all query requests.

The color scheme is an indicator of the value of each cell relative to the
_currently displayed data_.

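To make the bucketing concrete: each request duration is counted into a fixed duration bucket for its time interval, so one heatmap column is essentially a histogram like the one below. The bucket boundaries and sample durations are made up for illustration.

```python
# Sketch: bucket request durations (in milliseconds) into a histogram, the way
# one column of a duration heatmap is built for a single time interval.
# Bucket boundaries and durations below are made-up illustration values.
import bisect

BUCKET_UPPER_BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1000, float("inf")]

durations_ms = [8, 14, 42, 47, 61, 95, 120, 430, 44, 38]  # hypothetical sample

counts = [0] * len(BUCKET_UPPER_BOUNDS_MS)
for duration in durations_ms:
    counts[bisect.bisect_left(BUCKET_UPPER_BOUNDS_MS, duration)] += 1

for upper, count in zip(BUCKET_UPPER_BOUNDS_MS, counts):
    label = f"<= {upper} ms" if upper != float("inf") else "> 1000 ms"
    print(f"{label:>12}: {count}")
```
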
#### Successful Request Duration (flight DoGet)

A gRPC request duration heatmap for successful requests to the `DoGet` endpoint.

The heatmap shows how many requests occurred in each duration "bucket" per time
interval and provides insight into how long a typical successful query request takes.
It also shows, at a glance, the predominant latency range as well as the
minimum and maximum durations of successful query requests.

The color scheme is an indicator of the value of each cell relative to the
_currently displayed data_.

#### Acquire Duration

A heatmap of how long a query waits to pass the query _semaphore_, a mechanism
that limits the number of concurrent query requests that can be processed and
protects against Out of Memory (OOM) errors that can be caused by unaccounted-for
data structures that may occur during query planning and execution.
This cell only provides information about the time queries spend waiting for
the semaphore, not the time spent holding it.

This cell can be used to gauge how much query latency is added due to a high
cluster load.

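The sketch below is a toy model of that behavior: a bounded semaphore limits query concurrency, and only the time spent waiting to acquire it is recorded. The concurrency limit and simulated workload are invented for illustration and are not {{< product-name >}} defaults.

```python
# Toy sketch: measure how long tasks wait to acquire a concurrency-limiting
# semaphore (the "acquire duration"), not how long they hold it.
# The limit and simulated work are illustrative, not InfluxDB's actual values.
import asyncio
import random
import time

QUERY_CONCURRENCY_LIMIT = 4  # hypothetical limit
semaphore = asyncio.Semaphore(QUERY_CONCURRENCY_LIMIT)

async def run_query() -> float:
    start_wait = time.monotonic()
    async with semaphore:  # time spent waiting here is the "acquire duration"
        acquire_duration = time.monotonic() - start_wait
        await asyncio.sleep(random.uniform(0.05, 0.2))  # simulated query execution
    return acquire_duration

async def main() -> None:
    waits = await asyncio.gather(*(run_query() for _ in range(16)))
    print(f"max acquire wait: {max(waits) * 1000:.0f} ms")

asyncio.run(main())
```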
---

### Ingest Tier Cpu/Mem

The **Ingest Tier Cpu/Mem** section displays the CPU and memory usage of Ingester
pods as reported by Kubernetes.
[Ingesters](/influxdb/cloud-dedicated/reference/internals/storage-engine/#ingester)
process line protocol submitted in write requests and persist time series data
to the [Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).

- [CPU Utilization Ingesters (k8s)](#cpu-utilization-ingesters-k8s)
- [Memory Usage Ingesters (k8s)](#memory-usage-ingesters-k8s)
- [CPU Utilization Routers (k8s)](#cpu-utilization-routers-k8s)
- [Memory Usage Routers (k8s)](#memory-usage-routers-k8s)

#### CPU Utilization Ingesters (k8s)

The CPU utilization of Ingester pods as reported by Kubernetes container usage.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
The CPU limit is represented by the top line in the visualization.

#### Memory Usage Ingesters (k8s)

The memory usage of Ingester pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
The memory limit is represented by the top line in the visualization.

#### CPU Utilization Routers (k8s)

The CPU utilization of Ingester router pods as reported by Kubernetes container usage.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.

#### Memory Usage Routers (k8s)

The memory usage of Ingester router pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.

---

### Ingest Tier

The **Ingest Tier** section displays metrics reported from the InfluxDB gRPC
and HTTP write APIs.
[Ingesters](/influxdb/cloud-dedicated/reference/internals/storage-engine/#ingester)
process line protocol submitted in write requests and persist time series data
to the [Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).

- [Write Requests (at router)](#write-requests-at-router)
- [LP Ingest (at router)](#lp-ingest-at-router-lines)
  <em class="op50">(lines)</em>
- [LP Ingest (at router)](#lp-ingest-at-router-bytes)
  <em class="op50">(bytes)</em>
- [HTTP request error rate (server's POV at Router)](#http-request-error-rate-servers-pov-at-router)
- [Healthy Upstream Ingesters per Router](#healthy-upstream-ingesters-per-router)
- [Persist Queue Depth](#persist-queue-depth)
- [Persist Task Queue Duration](#persist-task-queue-duration)
- [Ingester Disk Data Directory Usage](#ingester-disk-data-directory-usage)
- [Ingest Blocked Time (24h)](#ingest-blocked-time-24h)
- [Max Persist Queue Depth](#max-persist-queue-depth)
- [Write Logs (10 examples)](#write-logs-10-examples)

#### Write Requests (at router)

The number of write operations completed across all Ingester routers.
Requests are grouped by state (success or error).
Request rate is reported in requests per second.

#### LP Ingest (at router) {#lp-ingest-at-router-lines metadata="lines"}

The rate of line protocol lines received by each router and across all
Ingester routers.
Ingest rate is reported in lines per second.

#### LP Ingest (at router) {#lp-ingest-at-router-bytes metadata="bytes"}

The rate of line protocol bytes received by each router and across all
Ingester routers.
Ingest rate is reported in bytes per second.

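For context on what these two rates count: each write request body is line protocol, and the routers accumulate both the number of lines and the number of bytes they receive. The sketch below shows that accounting over a one-second window; the payload is an arbitrary example, not data from your cluster.

```python
# Sketch: count the lines and bytes of a line protocol payload the way the
# "(lines)" and "(bytes)" ingest rates are accumulated at the routers.
# The payload below is an arbitrary example.
payload = (
    "home,room=kitchen temp=22.3,hum=41i 1698159600000000000\n"
    "home,room=office temp=21.1,hum=39i 1698159600000000000\n"
    "home,room=garage temp=18.7,hum=44i 1698159600000000000\n"
)

lines = [line for line in payload.splitlines() if line.strip()]
line_count = len(lines)
byte_count = len(payload.encode("utf-8"))

# If this payload arrived within a one-second window, its contribution to the
# two cells would be:
print(f"LP Ingest (lines): {line_count} lines/s")
print(f"LP Ingest (bytes): {byte_count} bytes/s")
```
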
#### HTTP request error rate (server's POV at Router)

The HTTP request error rate reported by the InfluxDB v3 HTTP request handler.
Error rate is represented as the percentage of total requests that return a
non-2xx response code.

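The underlying arithmetic is simply error responses divided by total requests; the sketch below shows it with made-up status code counts.

```python
# Sketch: compute the HTTP request error rate as the percentage of requests
# returning a non-2xx status code. The counts are made-up example values.
status_counts = {204: 9_650, 400: 120, 422: 25, 500: 5}  # status code -> requests

total = sum(status_counts.values())
errors = sum(count for code, count in status_counts.items() if not 200 <= code < 300)

error_rate_percent = 100 * errors / total
print(f"HTTP request error rate: {error_rate_percent:.2f}%")  # ~1.53%
```
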
#### Healthy Upstream Ingesters per Router

The number of healthy upstream Ingesters each router detects.
This reflects the router's RPC request balancer or circuit breaker state.

This can indicate when routers can't connect to Ingesters because of
issues in the ingest pipeline such as network issues or Ingester availability.

The persist queue is the queue of jobs that persist new Parquet files by
saving them to the Object store.

#### Persist Queue Depth

The number of queued persist jobs that have not started.
Each persist job consists of taking data from the Write Ahead Log (WAL),
storing it in a Parquet file, and saving the Parquet file to the
[Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).

If the persist queue is growing, Ingesters are not keeping up with the
incoming write load, which may result in Ingester failure.

#### Persist Task Queue Duration

A heatmap that shows the time persist jobs spend in the queue before being executed.

Longer queue times indicate slower persist job execution, which may be due
to network or internal resource constraints, or an increasing
[queue depth](#persist-queue-depth).

#### Ingester Disk Data Directory Usage

The per-pod disk usage as a percentage of the Ingester data directory capacity.
The WAL is stored on a disk attached to the Ingesters.
As the WAL grows, more disk space is used.
If Ingesters run out of disk space, the WAL stops functioning.

#### Ingest Blocked Time (24h)

The amount of time the ingest pipeline has been marked as saturated and has
rejected write requests.

#### Max Persist Queue Depth

The queue depth as a percentage of the configured maximum queue depth.
This shows the saturation level of the most saturated Ingester.
Once the maximum queue depth is reached, writes are rejected.

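Put differently, the cell divides each Ingester's queue depth by the configured maximum and displays the worst of those ratios. A minimal sketch of that calculation follows; the queue depths and maximum are hypothetical values, not your cluster's configuration.

```python
# Sketch: express each Ingester's persist queue depth as a percentage of the
# configured maximum, then report the most saturated Ingester.
# Queue depths and the maximum below are hypothetical.
MAX_QUEUE_DEPTH = 250_000

queue_depths = {"ingester-0": 12_000, "ingester-1": 187_500, "ingester-2": 43_000}

saturation = {pod: 100 * depth / MAX_QUEUE_DEPTH for pod, depth in queue_depths.items()}
worst_pod = max(saturation, key=saturation.get)

print(f"Max Persist Queue Depth: {saturation[worst_pod]:.0f}% ({worst_pod})")
# At 100%, writes are rejected until the queue drains.
```
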
#### Write Logs (10 examples)

A sample of 10 write logs from the displayed time period.
_These do not represent the most recent logs._

---

### Compaction Tier Cpu/Mem

The **Compaction Tier Cpu/Mem** section displays the CPU and memory usage of
Compactor pods as reported by Kubernetes.
[Compactors](/influxdb/cloud-dedicated/reference/internals/storage-engine/#compactor)
process and compress Parquet files in the
[Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store)
to continually optimize storage.

- [CPU Utilization (k8s)](#compaction-cpu-utilization)
- [Memory Usage (k8s)](#compaction-memory-usage)

#### CPU Utilization (k8s) {#compaction-cpu-utilization}

The CPU utilization of Compactor pods as reported by Kubernetes container usage.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
The CPU limit is represented by the top line in the visualization.

#### Memory Usage (k8s) {#compaction-memory-usage}

The memory usage of Compactor pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
The memory limit is represented by the top line in the visualization.

---

### Compactor

The **Compactor** section displays metrics related to the compaction of Parquet
files in the [Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).
[Compactors](/influxdb/cloud-dedicated/reference/internals/storage-engine/#compactor)
process and compress Parquet files to continually optimize storage.

- [Compactor: L0 File Counts (5m bucket width)](#compactor-l0-file-counts-5m-bucket-width)

#### Compactor: L0 File Counts (5m bucket width)

A histogram of the number of L0 files compacted in each compaction event.

Ingesters create Parquet files at compaction level 0 (L0).
As Compactors process and compact Parquet files over time, they do so at the
following levels:

- **L0**: Uncompacted
- **L1**: 4 L0 files compacted together
- **L2**: 4 L1 files compacted together
- **L3**: 4 L2 files compacted together

Parquet files store data partitioned by time and, optionally, by tags
_(see [Manage data partitioning](/influxdb/cloud-dedicated/admin/custom-partitions/))_.
After four L0 files accumulate for a partition, they are eligible for compaction.
If the Compactor is keeping up with the incoming write load, all compaction
events have exactly four files. If the number of L0 files compacted begins
to increase, the Compactor is not keeping up.

This histogram helps determine whether the Compactor is starting compactions as
soon as it can.

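The sketch below is a toy model of why "exactly four files per event" is the healthy signal: a Compactor that runs as soon as four L0 files exist always compacts four files, while a Compactor that falls behind compacts larger and larger batches. The production rate and scheduling in the model are invented for illustration.

```python
# Toy model of L0 compaction for one partition: once at least four L0
# (uncompacted) files exist, a compaction event merges all accumulated L0
# files into one L1 file. A Compactor that keeps up starts each event as soon
# as four files exist, so every event has exactly four files; a backlog shows
# up as events with more than four files, which this histogram surfaces.
FILES_PER_TRIGGER = 4

def compaction_event_sizes(compact_every_n_files: int) -> list[int]:
    """Return the number of L0 files included in each compaction event."""
    pending_l0 = 0
    event_sizes = []
    for produced in range(1, 25):  # Ingesters keep producing L0 files
        pending_l0 += 1
        # The Compactor only gets a chance to run every `compact_every_n_files`
        # files (a stand-in for how busy it is).
        if produced % compact_every_n_files == 0 and pending_l0 >= FILES_PER_TRIGGER:
            event_sizes.append(pending_l0)  # all pending L0 files compact together
            pending_l0 = 0
    return event_sizes

print("keeping up:    ", compaction_event_sizes(compact_every_n_files=4))  # [4, 4, 4, 4, 4, 4]
print("falling behind:", compaction_event_sizes(compact_every_n_files=8))  # [8, 8, 8]
```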
---

### Ingestor Catalog Operations

The **Ingestor Catalog Operations** section displays metrics related to
Catalog operations requested by Ingesters.
The [Catalog](/influxdb/cloud-dedicated/reference/internals/storage-engine/#catalog)
is a relational database that stores metadata related to your time series data,
including schema information and physical locations of partitions in the
[Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).

- [Catalog Ops - success](#catalog-ops---success)
- [Catalog Ops - error](#catalog-ops---error)
- [Catalog Op Latency (P90)](#catalog-op-latency-p90)

#### Catalog Ops - success

The rate of successful Catalog operations per second requested by Ingesters.
Higher rates of successful Catalog operations requested by Ingesters indicate
a high write load.

#### Catalog Ops - error

The rate of erred Catalog operations per second requested by Ingesters.
Higher rates of erred Catalog operations requested by Ingesters indicate
that the Catalog may be overloaded or unresponsive.

#### Catalog Op Latency (P90)

The 90th percentile (P90) of query latency against the Catalog service per operation.
A high P90 value indicates that the Catalog may be overloaded.

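As a reminder of what a P90 conveys: 90% of Catalog operations in the window completed at or below the reported latency. The sketch below uses a simple nearest-rank percentile; the operation names and latency samples are made up for illustration.

```python
# Sketch: compute a P90 latency per Catalog operation from raw samples.
# The operation names and latencies (in milliseconds) are made-up examples.
def p90(samples: list[float]) -> float:
    ordered = sorted(samples)
    index = max(0, int(0.9 * len(ordered)) - 1)  # nearest-rank style percentile
    return ordered[index]

latencies_ms = {
    "partition_get_by_id": [2.1, 2.4, 2.2, 3.0, 2.8, 9.5, 2.3, 2.5, 2.6, 2.7],
    "tables_create":       [5.2, 5.8, 6.1, 5.5, 40.0, 5.9, 6.3, 5.7, 6.0, 5.4],
}

for operation, samples in latencies_ms.items():
    print(f"{operation}: P90 = {p90(samples):.1f} ms")
```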
---

### Catalog Operations Overview

The **Catalog Operations Overview** section displays metrics related to
Catalog operations requested by all components of your {{< product-name >}} cluster.

- [Requests per Operation - success](#requests-per-operation---success)
- [Requests per Operation - error](#requests-per-operation---error)

#### Requests per Operation - success

The rate of successful Catalog requests per second by operation.

#### Requests per Operation - error

The rate of erred Catalog requests per second by operation.
Higher rates of erred Catalog operations indicate that the Catalog may be
overloaded or unresponsive.