Monitor your Dedicated cluster (#5409)

* scaffolding for clustered grafana dashboard

* add descriptions of dedicated monitoring dashboard cells

* Apply suggestions from code review

Co-authored-by: Jason Stirnaman <stirnamanj@gmail.com>

* Update content/influxdb/cloud-dedicated/admin/monitor-your-cluster.md

---------

Co-authored-by: Jason Stirnaman <stirnamanj@gmail.com>
pull/5402/head^2
Scott Anderson 2024-04-05 11:36:06 -06:00 committed by GitHub
parent 2aac3cc2e8
commit aa1732d0d2
2 changed files with 390 additions and 0 deletions


@@ -0,0 +1,390 @@
---
title: Monitor your cluster
seotitle: Monitor your InfluxDB Cloud Dedicated cluster
description: >
  Use the Grafana dashboard provided by InfluxData to monitor your
  InfluxDB Cloud Dedicated cluster.
menu:
  influxdb_cloud_dedicated:
    parent: Administer InfluxDB Cloud
weight: 104
---
Use the Grafana dashboard provided by InfluxData to monitor your
{{< product-name >}} cluster.
{{% note %}}
#### Not available for all clusters
{{< product-name >}} monitoring dashboards are not available for all clusters.
For questions about availability, [contact InfluxData support](https://support.influxdata.com).
{{% /note %}}
- [Access your monitoring dashboard](#access-your-monitoring-dashboard)
- [Dashboard sections and cells](#dashboard-sections-and-cells)
{{< img-hd src="/img/influxdb/clustered-admin-monitoring-dashboard.png" alt="InfluxDB Cloud Dedicated monitoring dashboard" />}}
## Access your monitoring dashboard
To access your {{< product-name >}} monitoring dashboard, visit the
`/observability` endpoint of your {{< product-name >}} cluster in your browser:
<pre>
<a href="https://{{< influxdb/host >}}/observability">https://{{< influxdb/host >}}/observability</a>
</pre>
Use the credentials provided by InfluxData to log into your cluster monitoring dashboard.
If you do not have login credentials, [contact InfluxData support](https://support.influxdata.com).
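If you want to script a quick reachability check before opening the dashboard in a browser, a minimal sketch like the following works. The hostname is a placeholder, and the exact response depends on how authentication is configured for your cluster:
```python
import requests  # third-party HTTP client; install with `pip install requests`

# Placeholder host -- replace with your cluster's hostname.
CLUSTER_HOST = "cluster-id.a.influxdb.io"

# A 200 response (typically the login page) indicates the /observability
# endpoint is reachable; connection errors or a 404 suggest monitoring
# dashboards are not enabled for this cluster.
response = requests.get(f"https://{CLUSTER_HOST}/observability", timeout=10)
print(response.status_code)
```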
## Dashboard sections and cells
The dashboard is divided into the following sections that visualize metrics
related to the health of components in your {{< product-name >}} cluster:
- [Query Tier Cpu/Mem](#query-tier-cpumem)
- [Query Tier](#query-tier)
- [Ingest Tier Cpu/Mem](#ingest-tier-cpumem)
- [Ingest Tier](#ingest-tier)
- [Compaction Tier Cpu/Mem](#compaction-tier-cpumem)
- [Compactor](#compactor)
- [Ingestor Catalog Operations](#ingestor-catalog-operations)
- [Catalog Operations Overview](#catalog-operations-overview)
### Query Tier Cpu/Mem
The **Query Tier Cpu/Mem** section displays the CPU and memory usage of query
pods as reported by Kubernetes.
[Queriers](/influxdb/cloud-dedicated/reference/internals/storage-engine/#querier)
handle query requests and return query results.
- [CPU Utilization (k8s)](#cpu-utilization-k8s)
- [Memory Usage (k8s)](#memory-usage-k8s)
#### CPU Utilization (k8s)
The CPU utilization of query pods as reported by Kubernetes container usage metrics.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
The CPU limit is represented by the top line in the visualization.
#### Memory Usage (k8s)
The memory usage of the query pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
The memory limit is represented by the top line in the visualization.
---
### Query Tier
The **Query Tier** section displays metrics reported from the InfluxDB gRPC
query API.
[Queriers](/influxdb/cloud-dedicated/reference/internals/storage-engine/#querier)
handle query requests and return query results.
- [gRPC Requests (ok)](#grpc-requests-ok)
- [gRPC Requests (not ok)](#grpc-requests-not-ok)
- [Request Duration (flight DoGet) (ok + !ok)](#request-duration-flight-doget-ok--ok)
- [Successful Request Duration (flight DoGet)](#successful-request-duration-flight-doget)
- [Acquire Duration](#acquire-duration)
#### gRPC Requests (ok)
The rate of gRPC requests for different endpoints that returned the `OK` status code,
summed across all queriers.
Request rate is reported in requests per second.
#### gRPC Requests (not ok)
The rate of gRPC requests for all endpoints that returned a status code other
than `OK`, summed across all queriers.
Request rate is reported in requests per second.
#### Request Duration (flight DoGet) (ok + !ok)
A gRPC request duration heatmap for all requests to the `DoGet` endpoint
regardless of request status.
The heatmap shows how many requests occurred in each duration "bucket" per time
interval and provides insight into how long a typical query request takes.
It also shows, at a glance, the predominant latency range as well as the
minimum and maximum durations of all query requests.
The color scheme is an indicator of the value of each cell relative to the
_currently displayed data_.
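To make the duration buckets concrete, the following sketch (with made-up durations and bucket bounds) shows how requests from one time interval are counted into buckets; each heatmap column is a set of counts like this:
```python
import bisect

# Hypothetical request durations (in seconds) observed during one interval.
durations = [0.02, 0.05, 0.07, 0.12, 0.35, 0.8, 1.4, 2.2]

# Hypothetical bucket upper bounds -- each heatmap row is one bucket.
bucket_bounds = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5]

# Count how many requests fall at or below each bound (last slot is overflow).
counts = [0] * (len(bucket_bounds) + 1)
for d in durations:
    counts[bisect.bisect_left(bucket_bounds, d)] += 1

print(counts)  # [2, 1, 1, 1, 1, 2, 0]
```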
#### Successful Request Duration (flight DoGet)
A gRPC request duration heatmap for successful requests to the `DoGet` endpoint.
The heatmap shows how many requests occurred in each duration "bucket" per time
interval and provides insight into how long a typical successful query request takes.
It also shows, at a glance, the predominant latency range as well as the
minimum and maximum durations of successful query requests.
The color scheme is an indicator of the value of each cell relative to the
_currently displayed data_.
#### Acquire Duration
A heatmap of how long a query waits to pass the query _semaphore_, a mechanism
that limits the number of concurrent query requests that can be processed and
protects against Out of Memory (OOM) errors caused by unaccounted-for
data structures that may occur during query planning and execution.
This cell only provides information about queries waiting for the semaphore,
not the time spent holding it.
Use this cell to gauge how much query latency is added by high cluster load.
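The semaphore behavior is conceptually similar to the following sketch (an illustration only, not the Querier's implementation): a fixed number of permits caps concurrent queries, and the time spent waiting to acquire a permit is what this heatmap reports.
```python
import asyncio
import time

# Conceptual sketch only -- not the Querier's actual implementation.
MAX_CONCURRENT_QUERIES = 4  # hypothetical concurrency limit

async def run_query(semaphore: asyncio.Semaphore, query_id: int) -> None:
    start = time.monotonic()
    async with semaphore:  # queries queue here when the cluster is busy
        acquire_duration = time.monotonic() - start
        print(f"query {query_id} waited {acquire_duration:.3f}s for the semaphore")
        await asyncio.sleep(0.1)  # stand-in for query planning and execution

async def main() -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_QUERIES)
    await asyncio.gather(*(run_query(semaphore, i) for i in range(10)))

asyncio.run(main())
```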
---
### Ingest Tier Cpu/Mem
The **Ingest Tier Cpu/Mem** section displays the CPU and memory usage of Ingester
pods as reported by Kubernetes.
[Ingesters](/influxdb/cloud-dedicated/reference/internals/storage-engine/#ingester)
process line protocol submitted in write requests and persist time series data
to the [Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).
- [CPU Utilization Ingesters (k8s)](#cpu-utilization-ingesters-k8s)
- [Memory Usage Ingesters (k8s)](#memory-usage-ingesters-k8s)
- [CPU Utilization Routers (k8s)](#cpu-utilization-routers-k8s)
- [Memory Usage Routers (k8s)](#memory-usage-routers-k8s)
#### CPU Utilization Ingesters (k8s)
CPU utilization of Ingester pods as reported by Kubernetes container usage metrics.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
The CPU limit is represented by the top line in the visualization.
#### Memory Usage Ingesters (k8s)
Memory usage of the Ingester pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
The memory limit is represented by the top line in the visualization.
#### CPU Utilization Routers (k8s)
CPU utilization of Ingester router pods as reported by Kubernetes container usage metrics.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
#### Memory Usage Routers (k8s)
Memory usage of the Ingester router pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
---
### Ingest Tier
The **Ingest Tier** section displays metrics reported from the InfluxDB gRPC
and HTTP write APIs.
[Ingesters](/influxdb/cloud-dedicated/reference/internals/storage-engine/#ingester)
process line protocol submitted in write requests and persist time series data
to the [Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).
- [Write Requests (at router)](#write-requests-at-router)
- [LP Ingest (at router)](#lp-ingest-at-router-lines)
  <em class="op50">(lines)</em>
- [LP Ingest (at router)](#lp-ingest-at-router-bytes)
  <em class="op50">(bytes)</em>
- [HTTP request error rate (server's POV at Router)](#http-request-error-rate-servers-pov-at-router)
- [Healthy Upstream Ingesters per Router](#healthy-upstream-ingesters-per-router)
- [Persist Queue Depth](#persist-queue-depth)
- [Persist Task Queue Duration](#persist-task-queue-duration)
- [Ingester Disk Data Directory Usage](#ingester-disk-data-directory-usage)
- [Ingest Blocked Time (24h)](#ingest-blocked-time-24h)
- [Max Persist Queue Depth](#max-persist-queue-depth)
- [Write Logs (10 examples)](#write-logs-10-examples)
#### Write Requests (at router)
Number of write operations completed across all Ingester routers.
Requests are grouped by state (success or error).
Request rate is reported in requests per second.
#### LP Ingest (at router) {#lp-ingest-at-router-lines metadata="lines"}
Rate of lines of line protocol being received by each router and across all
Ingester routers.
Request rate is reported in lines per second.
#### LP Ingest (at router) {#lp-ingest-at-router-bytes metadata="bytes"}
Rate of bytes of line protocol being received by each router and across all
Ingester routers.
Request rate is reported in bytes per second.
#### HTTP request error rate (server's POV at Router)
HTTP request error rate reported by the InfluxDB v3 HTTP request handler.
The error rate is represented as the percentage of total requests that return a
non-2xx response code.
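For example, with made-up request counts, the percentage shown in this cell is simply the share of requests that returned a non-2xx status code:
```python
# Hypothetical counts used only for illustration.
total_requests = 12_000
non_2xx_responses = 180

error_rate_percent = non_2xx_responses / total_requests * 100
print(f"error rate: {error_rate_percent:.2f}%")  # error rate: 1.50%
```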
#### Healthy Upstream Ingesters per Router
The number of healthy upstream Ingesters each router detects.
This reflects the router's RPC request balancer or circuit breaker state.
This can indicate when routers can't connect to Ingesters because of
issues in the ingest pipeline, such as network problems or Ingester availability.
#### Persist Queue Depth
The number of queued persist jobs that have not started.
The persist queue is the queue for persisting, or saving to the Object store,
new Parquet files.
Each persist job consists of taking data from the Write Ahead Log (WAL),
storing it in a Parquet file, and saving the Parquet file to the
[Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).
If the persist queue is growing, Ingesters are not keeping up with the
incoming write load, which may result in Ingester failure.
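As a back-of-the-envelope illustration with made-up rates, the queue grows whenever persist jobs are created faster than they complete:
```python
# Hypothetical rates -- actual values depend on write load and Ingester capacity.
jobs_created_per_second = 100
jobs_persisted_per_second = 80

growth_per_second = jobs_created_per_second - jobs_persisted_per_second
print(f"queue depth grows by {growth_per_second} jobs/s")  # Ingesters are falling behind
```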
#### Persist Task Queue Duration
A heatmap that shows the time persist jobs spend in the queue before being executed.
Longer queue times indicate slower persist job execution, which may be due
to network or internal resource constraints, or an increasing
[queue depth](#persist-queue-depth).
#### Ingester Disk Data Directory Usage
The per-pod disk usage of the Ingester data directory, as a percentage of disk capacity.
The WAL is stored on a disk attached to the Ingesters.
As the WAL grows, more disk space is used.
If Ingesters run out of disk, the WAL stops functioning.
#### Ingest Blocked Time (24h)
The amount of time the ingest pipeline has been marked as saturated and
rejected write requests.
#### Max Persist Queue Depth
The queue depth as a percentage of the configured maximum queue depth.
This shows the saturation level of the most saturated Ingester.
Once the maximum queue depth is reached, writes are rejected.
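As a rough illustration with made-up numbers, the value in this cell is the current queue depth divided by the configured maximum; at 100%, writes are rejected:
```python
# Hypothetical values -- the maximum queue depth is an Ingester configuration setting.
max_queue_depth = 250_000
current_queue_depth = 200_000

saturation_percent = current_queue_depth / max_queue_depth * 100
print(f"persist queue saturation: {saturation_percent:.0f}%")  # 80%

if current_queue_depth >= max_queue_depth:
    print("queue is full: new writes are rejected")
```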
#### Write Logs (10 examples)
A sample of 10 write logs from the displayed time period.
_These do not represent the most recent logs._
---
### Compaction Tier Cpu/Mem
The **Compaction Tier Cpu/Mem** section displays the CPU and memory usage of
Compactor pods as reported by Kubernetes.
[Compactors](/influxdb/cloud-dedicated/reference/internals/storage-engine/#compactor)
process and compress parquet files in the
[Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store)
to continually optimize storage.
- [CPU Utilization (k8s)](#compaction-cpu-utilization)
- [Memory Usage (k8s)](#compaction-memory-usage)
#### CPU Utilization (k8s) {#compaction-cpu-utilization}
The CPU utilization of Compactor pods as reported by Kubernetes container usage metrics.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
The CPU limit is represented by the top line in the visualization.
#### Memory Usage (k8s) {#compaction-memory-usage}
The memory usage of compactor pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
The memory limit is represented by the top line in the visualization.
---
### Compactor
The **Compactor** section displays metrics related to the compaction of Parquet
files in the [Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).
[Compactors](/influxdb/cloud-dedicated/reference/internals/storage-engine/#compactor)
process and compress Parquet files to continually optimize storage.
- [Compactor: L0 File Counts (5m bucket width)](#compactor-l0-file-counts-5m-bucket-width)
#### Compactor: L0 File Counts (5m bucket width)
A histogram of the number of L0 files included in each compaction event.
Ingesters create Parquet files using L0 (level zero) compaction.
As Compactors process and compact Parquet files over time, they do so in the
following levels:
- **L0**: Uncompacted
- **L1**: 4 L0 files compacted together
- **L2**: 4 L1 files compacted together
- **L3**: 4 L2 files compacted together
Parquet files store data partitioned by time and optionally tags
_(see [Manage data partition](https://docs.influxdata.com/influxdb/cloud-dedicated/admin/custom-partitions/))_.
After four L0 files accumulate for a partition, they are eligible for compaction.
If the Compactor is keeping up with the incoming write load, all compaction
events will have exactly four files. If the number of L0 files compacted begins
to increase, it indicates the Compactor is not keeping up.
This histogram helps to determine if the Compactor is starting compactions as
soon as it can.
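As a toy sketch of the level structure described above (not how the Compactor actually schedules work), each compaction event combines four files from one level into a single file at the next level:
```python
# Start with 16 hypothetical L0 files for one partition.
FAN_IN = 4
files = {"L0": 16, "L1": 0, "L2": 0, "L3": 0}

for lower, upper in (("L0", "L1"), ("L1", "L2"), ("L2", "L3")):
    while files[lower] >= FAN_IN:
        files[lower] -= FAN_IN  # four files from the lower level are consumed...
        files[upper] += 1       # ...and produce one file at the next level

print(files)  # {'L0': 0, 'L1': 0, 'L2': 1, 'L3': 0}
```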
---
### Ingestor Catalog Operations
The **Ingestor Catalog Operations** section displays metrics related to
Catalog operations requested by Ingesters.
The [Catalog](/influxdb/cloud-dedicated/reference/internals/storage-engine/#catalog)
is a relational database that stores metadata related to your time series data
including schema information and physical locations of partitions in the
[Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).
- [Catalog Ops - success](#catalog-ops---success)
- [Catalog Ops - error](#catalog-ops---error)
- [Catalog Op Latency (P90)](#catalog-op-latency-p90)
#### Catalog Ops - success
The rate of successful Catalog operations per second requested by Ingesters.
Higher rates of successful Catalog operations requested by Ingesters indicate
a high write load.
#### Catalog Ops - error
The rate of failed Catalog operations per second requested by Ingesters.
Higher rates of failed Catalog operations requested by Ingesters indicate
that the Catalog may be overloaded or unresponsive.
#### Catalog Op Latency (P90)
The 90th percentile (P90) of query latency against the Catalog service per operation.
A high P90 value indicates that the Catalog may be overloaded.
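The dashboard derives this from the underlying latency histograms; as a back-of-the-envelope illustration of what a P90 means, the nearest-rank method on made-up latencies looks like this:
```python
# Hypothetical latencies (in milliseconds) for one Catalog operation.
latencies_ms = [4, 5, 5, 6, 7, 7, 8, 9, 12, 40]

# Nearest-rank P90: 90% of requests completed at or below this value.
latencies_ms.sort()
rank = max(1, round(0.9 * len(latencies_ms)))  # 9th of 10 values
p90 = latencies_ms[rank - 1]
print(f"P90 latency: {p90} ms")  # P90 latency: 12 ms
```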
---
### Catalog Operations Overview
The **Catalog Operations Overview** section displays metrics related to
Catalog operations requested by all components of your {{< product-name >}} cluster.
- [Requests per Operation - success](#requests-per-operation---success)
- [Requests per Operation - error](#requests-per-operation---error)
#### Requests per Operation - success
The rate of successful Catalog requests per second by operation.
#### Requests per Operation - error
The rate of failed Catalog requests per second by operation.
Higher rates of failed Catalog operations indicate that the Catalog may be
overloaded or unresponsive.

Binary file not shown.
