---
title: Monitor your cluster
seotitle: Monitor your InfluxDB Cloud Dedicated cluster
description: >
  Use the Grafana dashboard provided by InfluxData to monitor your
  InfluxDB Cloud Dedicated cluster.
menu:
  influxdb_cloud_dedicated:
    parent: Administer InfluxDB Cloud
weight: 104
---

Use the Grafana dashboard provided by InfluxData to monitor your
{{< product-name >}} cluster.

{{% note %}}
#### Not available for all clusters

{{< product-name >}} monitoring dashboards are not available for all clusters.
For questions about availability, [contact InfluxData support](https://support.influxdata.com).
{{% /note %}}

- [Access your monitoring dashboard](#access-your-monitoring-dashboard)
- [Dashboard sections and cells](#dashboard-sections-and-cells)

{{< img-hd src="/img/influxdb/clustered-admin-monitoring-dashboard.png" alt="InfluxDB Cloud Dedicated monitoring dashboard" />}}

## Access your monitoring dashboard

To access your {{< product-name >}} monitoring dashboard, visit the
`/observability` endpoint of your {{< product-name >}} cluster in your browser:

<pre>
<a href="https://{{< influxdb/host >}}/observability">https://{{< influxdb/host >}}/observability</a>
</pre>

Use the credentials provided by InfluxData to log into your cluster monitoring dashboard.
If you do not have login credentials, [contact InfluxData support](https://support.influxdata.com).

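If you want to script a quick connectivity check before opening the dashboard in a browser, a sketch like the following is enough; it only confirms that the `/observability` endpoint answers over HTTPS. The host below is a placeholder for your cluster URL, and the check does not log in or replace the credentials above.

```python
# Sketch: confirm the monitoring dashboard endpoint is reachable.
# Replace the placeholder host with your cluster URL; this does not
# authenticate. Logging in still requires the credentials above.
import urllib.request

CLUSTER_HOST = "cluster-id.a.influxdb.io"  # placeholder cluster host
url = f"https://{CLUSTER_HOST}/observability"

try:
    with urllib.request.urlopen(url, timeout=10) as response:
        print(f"{url} responded with HTTP {response.status}")
except Exception as err:
    print(f"Could not reach {url}: {err}")
```
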
## Dashboard sections and cells

The dashboard is divided into the following sections that visualize metrics
related to the health of components in your {{< product-name >}} cluster:

- [Query Tier Cpu/Mem](#query-tier-cpumem)
- [Query Tier](#query-tier)
- [Ingest Tier Cpu/Mem](#ingest-tier-cpumem)
- [Ingest Tier](#ingest-tier)
- [Compaction Tier Cpu/Mem](#compaction-tier-cpumem)
- [Compactor](#compactor)
- [Ingestor Catalog Operations](#ingestor-catalog-operations)
- [Catalog Operations Overview](#catalog-operations-overview)

### Query Tier Cpu/Mem

The **Query Tier Cpu/Mem** section displays the CPU and memory usage of query
pods as reported by Kubernetes.
[Queriers](/influxdb/cloud-dedicated/reference/internals/storage-engine/#querier)
handle query requests and return query results.

- [CPU Utilization (k8s)](#cpu-utilization-k8s)
- [Memory Usage (k8s)](#memory-usage-k8s)

#### CPU Utilization (k8s)

The CPU utilization of query pods as reported by Kubernetes container usage.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
The CPU limit is represented by the top line in the visualization.

#### Memory Usage (k8s)

The memory usage of query pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
The memory limit is represented by the top line in the visualization.

---

### Query Tier

The **Query Tier** section displays metrics reported from the InfluxDB gRPC
query API.
[Queriers](/influxdb/cloud-dedicated/reference/internals/storage-engine/#querier)
handle query requests and return query results.

- [gRPC Requests (ok)](#grpc-requests-ok)
- [gRPC Requests (not ok)](#grpc-requests-not-ok)
- [Request Duration (flight DoGet) (ok + !ok)](#request-duration-flight-doget-ok--ok)
- [Successful Request Duration (flight DoGet)](#successful-request-duration-flight-doget)
- [Acquire Duration](#acquire-duration)

#### gRPC Requests (ok)

The rate of gRPC requests for different endpoints that returned the `OK` status code,
summed across all Queriers.
Request rate is reported in requests per second.

#### gRPC Requests (not ok)

The rate of gRPC requests for all endpoints that returned a status code other
than `OK`, summed across all Queriers.
Request rate is reported in requests per second.

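Both cells report per-second rates derived from cumulative request counters, summed across Queriers and grouped by status. The following sketch illustrates that arithmetic only; the counter samples, pod names, and status labels are hypothetical and not the dashboard's actual query.

```python
# Sketch: derive a requests-per-second rate from cumulative gRPC request
# counters sampled 30 seconds apart, summed across all Queriers.
# The sample data below is hypothetical; the dashboard computes this from
# its own metrics backend.
INTERVAL_SECONDS = 30

# (querier, status) -> cumulative request count at the previous and current sample
previous = {("querier-0", "OK"): 10_000, ("querier-1", "OK"): 9_500,
            ("querier-0", "Internal"): 12, ("querier-1", "Internal"): 8}
current = {("querier-0", "OK"): 10_900, ("querier-1", "OK"): 10_250,
           ("querier-0", "Internal"): 15, ("querier-1", "Internal"): 9}

rates = {}  # status -> requests per second summed across Queriers
for (querier, status), count in current.items():
    delta = count - previous[(querier, status)]
    rates[status] = rates.get(status, 0) + delta / INTERVAL_SECONDS

ok_rate = rates.get("OK", 0)
not_ok_rate = sum(rate for status, rate in rates.items() if status != "OK")
print(f"gRPC Requests (ok): {ok_rate:.1f} req/s")
print(f"gRPC Requests (not ok): {not_ok_rate:.1f} req/s")
```
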
#### Request Duration (flight DoGet) (ok + !ok)

A gRPC request duration heatmap for all requests to the `DoGet` endpoint,
regardless of request status.

The heatmap shows how many requests occurred in each duration "bucket" per time
interval and provides insight into how long a typical query request takes.
It also shows, at a glance, the predominant latency range as well as the
minimum and maximum durations of all query requests.

The color scheme is an indicator of the value of each cell relative to the
_currently displayed data_.

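To make the bucketing concrete: each request duration is counted into a fixed duration bucket for its time interval, so one heatmap column is essentially a histogram like the one below. The bucket boundaries and sample durations are made up for illustration.

```python
# Sketch: bucket request durations (in milliseconds) into a histogram, the way
# one column of a duration heatmap is built for a single time interval.
# Bucket boundaries and durations below are made-up illustration values.
import bisect

BUCKET_UPPER_BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1000, float("inf")]

durations_ms = [8, 14, 42, 47, 61, 95, 120, 430, 44, 38]  # hypothetical sample

counts = [0] * len(BUCKET_UPPER_BOUNDS_MS)
for duration in durations_ms:
    counts[bisect.bisect_left(BUCKET_UPPER_BOUNDS_MS, duration)] += 1

for upper, count in zip(BUCKET_UPPER_BOUNDS_MS, counts):
    label = f"<= {upper} ms" if upper != float("inf") else "> 1000 ms"
    print(f"{label:>12}: {count}")
```
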
#### Successful Request Duration (flight DoGet)

A gRPC request duration heatmap for successful requests to the `DoGet` endpoint.

The heatmap shows how many requests occurred in each duration "bucket" per time
interval and provides insight into how long a typical successful query request takes.
It also shows, at a glance, the predominant latency range as well as the
minimum and maximum durations of successful query requests.

The color scheme is an indicator of the value of each cell relative to the
_currently displayed data_.

#### Acquire Duration

A heatmap of how long a query waits to pass the query _semaphore_, a mechanism
that limits the number of concurrent query requests that can be processed and
protects against Out of Memory (OOM) errors that can be caused by unaccounted-for
data structures that may occur during query planning and execution.
This cell only provides information about the time queries spend waiting for
the semaphore, not the time spent holding it.

This cell can be used to gauge how much query latency is added due to a high
cluster load.

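The sketch below is a toy model of that behavior: a bounded semaphore limits query concurrency, and only the time spent waiting to acquire it is recorded. The concurrency limit and simulated workload are invented for illustration and are not {{< product-name >}} defaults.

```python
# Toy sketch: measure how long tasks wait to acquire a concurrency-limiting
# semaphore (the "acquire duration"), not how long they hold it.
# The limit and simulated work are illustrative, not InfluxDB's actual values.
import asyncio
import random
import time

QUERY_CONCURRENCY_LIMIT = 4  # hypothetical limit
semaphore = asyncio.Semaphore(QUERY_CONCURRENCY_LIMIT)

async def run_query() -> float:
    start_wait = time.monotonic()
    async with semaphore:  # time spent waiting here is the "acquire duration"
        acquire_duration = time.monotonic() - start_wait
        await asyncio.sleep(random.uniform(0.05, 0.2))  # simulated query execution
    return acquire_duration

async def main() -> None:
    waits = await asyncio.gather(*(run_query() for _ in range(16)))
    print(f"max acquire wait: {max(waits) * 1000:.0f} ms")

asyncio.run(main())
```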
---

### Ingest Tier Cpu/Mem

The **Ingest Tier Cpu/Mem** section displays the CPU and memory usage of Ingester
pods as reported by Kubernetes.
[Ingesters](/influxdb/cloud-dedicated/reference/internals/storage-engine/#ingester)
process line protocol submitted in write requests and persist time series data
to the [Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).

- [CPU Utilization Ingesters (k8s)](#cpu-utilization-ingesters-k8s)
- [Memory Usage Ingesters (k8s)](#memory-usage-ingesters-k8s)
- [CPU Utilization Routers (k8s)](#cpu-utilization-routers-k8s)
- [Memory Usage Routers (k8s)](#memory-usage-routers-k8s)

#### CPU Utilization Ingesters (k8s)

The CPU utilization of Ingester pods as reported by Kubernetes container usage.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
The CPU limit is represented by the top line in the visualization.

#### Memory Usage Ingesters (k8s)

The memory usage of Ingester pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
The memory limit is represented by the top line in the visualization.

#### CPU Utilization Routers (k8s)

The CPU utilization of Ingester router pods as reported by Kubernetes container usage.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.

#### Memory Usage Routers (k8s)

The memory usage of Ingester router pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.

---

### Ingest Tier

The **Ingest Tier** section displays metrics reported from the InfluxDB gRPC
and HTTP write APIs.
[Ingesters](/influxdb/cloud-dedicated/reference/internals/storage-engine/#ingester)
process line protocol submitted in write requests and persist time series data
to the [Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).

- [Write Requests (at router)](#write-requests-at-router)
- [LP Ingest (at router)](#lp-ingest-at-router-lines)
  <em class="op50">(lines)</em>
- [LP Ingest (at router)](#lp-ingest-at-router-bytes)
  <em class="op50">(bytes)</em>
- [HTTP request error rate (server's POV at Router)](#http-request-error-rate-servers-pov-at-router)
- [Healthy Upstream Ingesters per Router](#healthy-upstream-ingesters-per-router)
- [Persist Queue Depth](#persist-queue-depth)
- [Persist Task Queue Duration](#persist-task-queue-duration)
- [Ingester Disk Data Directory Usage](#ingester-disk-data-directory-usage)
- [Ingest Blocked Time (24h)](#ingest-blocked-time-24h)
- [Max Persist Queue Depth](#max-persist-queue-depth)
- [Write Logs (10 examples)](#write-logs-10-examples)

#### Write Requests (at router)

The number of write operations completed across all Ingester routers.
Requests are grouped by state (success or error).
Request rate is reported in requests per second.

#### LP Ingest (at router) {#lp-ingest-at-router-lines metadata="lines"}

The rate of line protocol lines received by each router and across all
Ingester routers.
Ingest rate is reported in lines per second.

#### LP Ingest (at router) {#lp-ingest-at-router-bytes metadata="bytes"}

The rate of line protocol bytes received by each router and across all
Ingester routers.
Ingest rate is reported in bytes per second.

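For context on what these two rates count: each write request body is line protocol, and the routers accumulate both the number of lines and the number of bytes they receive. The sketch below shows that accounting over a one-second window; the payload is an arbitrary example, not data from your cluster.

```python
# Sketch: count the lines and bytes of a line protocol payload the way the
# "(lines)" and "(bytes)" ingest rates are accumulated at the routers.
# The payload below is an arbitrary example.
payload = (
    "home,room=kitchen temp=22.3,hum=41i 1698159600000000000\n"
    "home,room=office temp=21.1,hum=39i 1698159600000000000\n"
    "home,room=garage temp=18.7,hum=44i 1698159600000000000\n"
)

lines = [line for line in payload.splitlines() if line.strip()]
line_count = len(lines)
byte_count = len(payload.encode("utf-8"))

# If this payload arrived within a one-second window, its contribution to the
# two cells would be:
print(f"LP Ingest (lines): {line_count} lines/s")
print(f"LP Ingest (bytes): {byte_count} bytes/s")
```
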
#### HTTP request error rate (server's POV at Router)

The HTTP request error rate reported by the InfluxDB v3 HTTP request handler.
Error rate is represented as the percentage of total requests that return a
non-2xx response code.

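The underlying arithmetic is simply error responses divided by total requests; the sketch below shows it with made-up status code counts.

```python
# Sketch: compute the HTTP request error rate as the percentage of requests
# returning a non-2xx status code. The counts are made-up example values.
status_counts = {204: 9_650, 400: 120, 422: 25, 500: 5}  # status code -> requests

total = sum(status_counts.values())
errors = sum(count for code, count in status_counts.items() if not 200 <= code < 300)

error_rate_percent = 100 * errors / total
print(f"HTTP request error rate: {error_rate_percent:.2f}%")  # ~1.53%
```
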
#### Healthy Upstream Ingesters per Router

The number of healthy upstream Ingesters each router detects.
This reflects the router's RPC request balancer or circuit breaker state.

This can indicate when routers can't connect to Ingesters because of
issues in the ingest pipeline such as network issues or Ingester availability.

The persist queue is the queue of jobs that persist new Parquet files by
saving them to the Object store.

#### Persist Queue Depth

The number of queued persist jobs that have not started.
Each persist job consists of taking data from the Write Ahead Log (WAL),
storing it in a Parquet file, and saving the Parquet file to the
[Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).

If the persist queue is growing, Ingesters are not keeping up with the
incoming write load, which may result in Ingester failure.

#### Persist Task Queue Duration

A heatmap that shows the time persist jobs spend in the queue before being executed.

Longer queue times indicate slower persist job execution, which may be due
to network or internal resource constraints, or an increasing
[queue depth](#persist-queue-depth).

#### Ingester Disk Data Directory Usage

The per-pod disk usage as a percentage of the Ingester data directory capacity.
The WAL is stored on a disk attached to the Ingesters.
As the WAL grows, more disk space is used.
If Ingesters run out of disk space, the WAL stops functioning.

#### Ingest Blocked Time (24h)

The amount of time the ingest pipeline has been marked as saturated and has
rejected write requests.

#### Max Persist Queue Depth

The queue depth as a percentage of the configured maximum queue depth.
This shows the saturation level of the most saturated Ingester.
Once the maximum queue depth is reached, writes are rejected.

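Put differently, the cell divides each Ingester's queue depth by the configured maximum and displays the worst of those ratios. A minimal sketch of that calculation follows; the queue depths and maximum are hypothetical values, not your cluster's configuration.

```python
# Sketch: express each Ingester's persist queue depth as a percentage of the
# configured maximum, then report the most saturated Ingester.
# Queue depths and the maximum below are hypothetical.
MAX_QUEUE_DEPTH = 250_000

queue_depths = {"ingester-0": 12_000, "ingester-1": 187_500, "ingester-2": 43_000}

saturation = {pod: 100 * depth / MAX_QUEUE_DEPTH for pod, depth in queue_depths.items()}
worst_pod = max(saturation, key=saturation.get)

print(f"Max Persist Queue Depth: {saturation[worst_pod]:.0f}% ({worst_pod})")
# At 100%, writes are rejected until the queue drains.
```
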
#### Write Logs (10 examples)

A sample of 10 write logs from the displayed time period.
_These do not represent the most recent logs._

---

### Compaction Tier Cpu/Mem

The **Compaction Tier Cpu/Mem** section displays the CPU and memory usage of
Compactor pods as reported by Kubernetes.
[Compactors](/influxdb/cloud-dedicated/reference/internals/storage-engine/#compactor)
process and compress Parquet files in the
[Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store)
to continually optimize storage.

- [CPU Utilization (k8s)](#compaction-cpu-utilization)
- [Memory Usage (k8s)](#compaction-memory-usage)

#### CPU Utilization (k8s) {#compaction-cpu-utilization}

The CPU utilization of Compactor pods as reported by Kubernetes container usage.
Usage is reported as the number of CPU cores used by pods, including
fractional cores.
The CPU limit is represented by the top line in the visualization.

#### Memory Usage (k8s) {#compaction-memory-usage}

The memory usage of Compactor pod containers per cgroup as reported by Kubernetes.
Usage is reported in bytes.
The memory limit is represented by the top line in the visualization.

---

### Compactor

The **Compactor** section displays metrics related to the compaction of Parquet
files in the [Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).
[Compactors](/influxdb/cloud-dedicated/reference/internals/storage-engine/#compactor)
process and compress Parquet files to continually optimize storage.

- [Compactor: L0 File Counts (5m bucket width)](#compactor-l0-file-counts-5m-bucket-width)

#### Compactor: L0 File Counts (5m bucket width)

A histogram of the number of L0 files compacted in each compaction event.

Ingesters create Parquet files at compaction level 0 (L0).
As Compactors process and compact Parquet files over time, they do so at the
following levels:

- **L0**: Uncompacted
- **L1**: 4 L0 files compacted together
- **L2**: 4 L1 files compacted together
- **L3**: 4 L2 files compacted together

Parquet files store data partitioned by time and, optionally, by tags
_(see [Manage data partitioning](/influxdb/cloud-dedicated/admin/custom-partitions/))_.
After four L0 files accumulate for a partition, they are eligible for compaction.
If the Compactor is keeping up with the incoming write load, all compaction
events have exactly four files. If the number of L0 files compacted begins
to increase, the Compactor is not keeping up.

This histogram helps determine whether the Compactor is starting compactions as
soon as it can.

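The sketch below is a toy model of why "exactly four files per event" is the healthy signal: a Compactor that runs as soon as four L0 files exist always compacts four files, while a Compactor that falls behind compacts larger and larger batches. The production rate and scheduling in the model are invented for illustration.

```python
# Toy model of L0 compaction for one partition: once at least four L0
# (uncompacted) files exist, a compaction event merges all accumulated L0
# files into one L1 file. A Compactor that keeps up starts each event as soon
# as four files exist, so every event has exactly four files; a backlog shows
# up as events with more than four files, which this histogram surfaces.
FILES_PER_TRIGGER = 4

def compaction_event_sizes(compact_every_n_files: int) -> list[int]:
    """Return the number of L0 files included in each compaction event."""
    pending_l0 = 0
    event_sizes = []
    for produced in range(1, 25):  # Ingesters keep producing L0 files
        pending_l0 += 1
        # The Compactor only gets a chance to run every `compact_every_n_files`
        # files (a stand-in for how busy it is).
        if produced % compact_every_n_files == 0 and pending_l0 >= FILES_PER_TRIGGER:
            event_sizes.append(pending_l0)  # all pending L0 files compact together
            pending_l0 = 0
    return event_sizes

print("keeping up:    ", compaction_event_sizes(compact_every_n_files=4))  # [4, 4, 4, 4, 4, 4]
print("falling behind:", compaction_event_sizes(compact_every_n_files=8))  # [8, 8, 8]
```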
---

### Ingestor Catalog Operations

The **Ingestor Catalog Operations** section displays metrics related to
Catalog operations requested by Ingesters.
The [Catalog](/influxdb/cloud-dedicated/reference/internals/storage-engine/#catalog)
is a relational database that stores metadata related to your time series data,
including schema information and physical locations of partitions in the
[Object store](/influxdb/cloud-dedicated/reference/internals/storage-engine/#object-store).

- [Catalog Ops - success](#catalog-ops---success)
- [Catalog Ops - error](#catalog-ops---error)
- [Catalog Op Latency (P90)](#catalog-op-latency-p90)

#### Catalog Ops - success

The rate of successful Catalog operations per second requested by Ingesters.
Higher rates of successful Catalog operations requested by Ingesters indicate
a high write load.

#### Catalog Ops - error

The rate of erred Catalog operations per second requested by Ingesters.
Higher rates of erred Catalog operations requested by Ingesters indicate
that the Catalog may be overloaded or unresponsive.

#### Catalog Op Latency (P90)

The 90th percentile (P90) of query latency against the Catalog service per operation.
A high P90 value indicates that the Catalog may be overloaded.

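As a reminder of what a P90 conveys: 90% of Catalog operations in the window completed at or below the reported latency. The sketch below uses a simple nearest-rank percentile; the operation names and latency samples are made up for illustration.

```python
# Sketch: compute a P90 latency per Catalog operation from raw samples.
# The operation names and latencies (in milliseconds) are made-up examples.
def p90(samples: list[float]) -> float:
    ordered = sorted(samples)
    index = max(0, int(0.9 * len(ordered)) - 1)  # nearest-rank style percentile
    return ordered[index]

latencies_ms = {
    "partition_get_by_id": [2.1, 2.4, 2.2, 3.0, 2.8, 9.5, 2.3, 2.5, 2.6, 2.7],
    "tables_create":       [5.2, 5.8, 6.1, 5.5, 40.0, 5.9, 6.3, 5.7, 6.0, 5.4],
}

for operation, samples in latencies_ms.items():
    print(f"{operation}: P90 = {p90(samples):.1f} ms")
```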
---

### Catalog Operations Overview

The **Catalog Operations Overview** section displays metrics related to
Catalog operations requested by all components of your {{< product-name >}} cluster.

- [Requests per Operation - success](#requests-per-operation---success)
- [Requests per Operation - error](#requests-per-operation---error)

#### Requests per Operation - success

The rate of successful Catalog requests per second by operation.

#### Requests per Operation - error

The rate of erred Catalog requests per second by operation.
Higher rates of erred Catalog operations indicate that the Catalog may be
overloaded or unresponsive.