promql highlighting

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
pull/51818/head
Laura Lorenz 2025-08-06 19:36:32 +00:00
parent 75d0139f49
commit 51d3214096
1 changed files with 27 additions and 20 deletions


@@ -116,16 +116,16 @@ your deployment by monitoring the following metrics.
 The following metrics look closely at the internal ResourceClaim controller
 managed by the `kube-controller-manager` component.
-* Workqueue Add Rate: Monitor
-`sum(rate(workqueue_adds_total{name="resource_claim"}[5m]))` to gauge how
-quickly items are added to the ResourceClaim controller.
+* Workqueue Add Rate: Monitor {{< highlight promql
+>}}sum(rate(workqueue_adds_total{name="resource_claim"}[5m])){{< /highlight
+>}} to gauge how quickly items are added to the ResourceClaim controller.
 * Workqueue Depth: Track
-`sum(workqueue_depth{endpoint="kube-controller-manager",
-name="resource_claim"})` to identify any backlogs in the ResourceClaim
+{{< highlight promql >}}sum(workqueue_depth{endpoint="kube-controller-manager",
+name="resource_claim"}){{< /highlight >}} to identify any backlogs in the ResourceClaim
 controller.
-* Workqueue Work Duration: Observe `histogram_quantile(0.99,
+* Workqueue Work Duration: Observe {{< highlight promql >}}histogram_quantile(0.99,
 sum(rate(workqueue_work_duration_seconds_bucket{name="resource_claim"}[5m]))
-by (le))` to understand the speed at which the ResourceClaim controller
+by (le)){{< /highlight >}} to understand the speed at which the ResourceClaim controller
 processes work.
 If you are experiencing low Workqueue Add Rate, high Workqueue Depth, and/or
@@ -148,12 +148,14 @@ that the end-to-end metrics are ultimately influenced by the
 `kube-controller-manager`'s performance in creating ResourceClaims from
 ResourceClaimTemplates in deployments that heavily use ResourceClaimTemplates.
-* Scheduler End-to-End Duration: Monitor `histogram_quantile(0.99,
+* Scheduler End-to-End Duration: Monitor {{< highlight promql
+>}}histogram_quantile(0.99,
 sum(increase(scheduler_pod_scheduling_sli_duration_seconds_bucket[5m])) by
-(le))`.
+(le)){{< /highlight >}}.
-* Scheduler Algorithm Latency: Track `histogram_quantile(0.99,
+* Scheduler Algorithm Latency: Track {{< highlight promql
+>}}histogram_quantile(0.99,
 sum(increase(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by
-(le))`.
+(le)){{< /highlight >}}.
 ### `kubelet` metrics
@@ -162,12 +164,14 @@ the `NodePrepareResources` and `NodeUnprepareResources` methods of the DRA
 driver. You can observe this behavior from the kubelet's point of view with the
 following metrics.
-* Kubelet NodePrepareResources: Monitor `histogram_quantile(0.99,
+* Kubelet NodePrepareResources: Monitor {{< highlight promql
+>}}histogram_quantile(0.99,
 sum(rate(dra_operations_duration_seconds_bucket{operation_name="PrepareResources"}[5m]))
-by (le))`.
+by (le)){{< /highlight >}}.
-* Kubelet NodeUnprepareResources: Track `histogram_quantile(0.99,
+* Kubelet NodeUnprepareResources: Track {{< highlight promql
+>}}histogram_quantile(0.99,
 sum(rate(dra_operations_duration_seconds_bucket{operation_name="UnprepareResources"}[5m]))
-by (le))`.
+by (le)){{< /highlight >}}.
 ### DRA kubeletplugin operations
@@ -178,14 +182,17 @@ which surfaces its own metric for the underlying gRPC operation
 behavior from the point of view of the internal kubeletplugin with the following
 metrics.
-* DRA kubeletplugin gRPC NodePrepareResources operation: Observe `histogram_quantile(0.99,
+* DRA kubeletplugin gRPC NodePrepareResources operation: Observe {{< highlight
+promql >}}histogram_quantile(0.99,
 sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodePrepareResources"}[5m]))
-by (le))`
+by (le)){{< /highlight >}} .
-* DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe `histogram_quantile(0.99,
+* DRA kubeletplugin gRPC NodeUnprepareResources operation: Observe {{< highlight
+promql >}}histogram_quantile(0.99,
 sum(rate(dra_grpc_operations_duration_seconds_bucket{method_name=~".*NodeUnprepareResources"}[5m]))
-by (le))`.
+by (le)){{< /highlight >}}.
 ## {{% heading "whatsnext" %}}
-* [Learn more about DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
+* [Learn more about
+DRA](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
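
Reviewer note: most of the queries touched by this change share the shape `histogram_quantile(0.99, sum(...) by (le))`. As a rough illustration of what that expression estimates, the sketch below reimplements the quantile step in Python. It is a simplified approximation and not Prometheus's actual implementation: it assumes classic cumulative buckets sorted by upper bound, a final `+Inf` bucket, and a lower bound of 0 for the first bucket. The function name and the sample bucket values are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with a (math.inf, total_count) pair. This mirrors
    the per-`le` series that `sum(...) by (le)` produces (an assumption
    made for this sketch).
    """
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket, so the best
                # available answer is the highest finite upper bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket data: 100 observations, latencies in seconds.
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.5, sample))
print(histogram_quantile(0.95, sample))
```

Because the result is interpolated within a bucket, its precision is bounded by the bucket boundaries the component exports. A p99 that lands in the `+Inf` bucket is reported as the largest finite boundary, not the true latency.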