From f18aa950ed7539b99aa2ad3af83fdea76bb2445e Mon Sep 17 00:00:00 2001 From: Scott Anderson Date: Sun, 22 Mar 2026 08:35:37 -0600 Subject: [PATCH] Tc heartbeat status (#6964) * feat(tc-cel): add CEL expressions reference documentation Add multi-page CEL expression reference under Telegraf Controller reference docs with variables, functions, operators, and examples. Co-Authored-By: Claude Opus 4.6 (1M context) * feat(tc-status): expand agent status page with CEL examples and configuration Replace stub status page with practical guide including status values, evaluation overview, three inline config examples, and links to CEL reference. Co-Authored-By: Claude Opus 4.6 (1M context) * chore(tc-heartbeat-status): update status evaluation docs * chore(tc-heartbeat-status): remove impl plans --------- Co-authored-by: Claude Opus 4.6 (1M context) --- content/telegraf/controller/agents/status.md | 121 ++++++++- .../reference/agent-status-eval/_index.md | 97 +++++++ .../reference/agent-status-eval/examples.md | 257 ++++++++++++++++++ .../reference/agent-status-eval/functions.md | 120 ++++++++ .../reference/agent-status-eval/variables.md | 150 ++++++++++ 5 files changed, 736 insertions(+), 9 deletions(-) create mode 100644 content/telegraf/controller/reference/agent-status-eval/_index.md create mode 100644 content/telegraf/controller/reference/agent-status-eval/examples.md create mode 100644 content/telegraf/controller/reference/agent-status-eval/functions.md create mode 100644 content/telegraf/controller/reference/agent-status-eval/variables.md diff --git a/content/telegraf/controller/agents/status.md b/content/telegraf/controller/agents/status.md index eeba479f8..06f3ab617 100644 --- a/content/telegraf/controller/agents/status.md +++ b/content/telegraf/controller/agents/status.md @@ -1,24 +1,127 @@ --- title: Set agent statuses description: > - Understand how {{% product-name %}} receives and displays agent statuses from - the heartbeat output plugin. 
+ Configure agent status evaluation using CEL expressions in the Telegraf + heartbeat output plugin and view statuses in {{% product-name %}}. menu: telegraf_controller: name: Set agent statuses parent: Manage agents weight: 104 +related: + - /telegraf/controller/reference/agent-status-eval/, Agent status evaluation reference + - /telegraf/controller/agents/reporting-rules/ + - /telegraf/v1/output-plugins/heartbeat/, Heartbeat output plugin --- -Agent statuses come from the Telegraf heartbeat output plugin and are sent with -each heartbeat request. -The plugin reports an `ok` status. +Agent statuses reflect the health of a Telegraf instance based on runtime data. +The Telegraf [heartbeat output plugin](/telegraf/v1/output-plugins/heartbeat/) +evaluates [Common Expression Language (CEL)](/telegraf/controller/reference/agent-status-eval/) +expressions against agent metrics, error counts, and plugin statistics to +determine the status sent with each heartbeat. > [!Note] -> A future Telegraf release will let you configure logic that sets the status value. -{{% product-name %}} also applies reporting rules to detect stale agents. -If an agent does not send a heartbeat within the rule's threshold, Controller -marks the agent as **Not Reporting** until it resumes sending heartbeats. +> #### Requires Telegraf v1.38.2+ +> +> Agent status evaluation in the heartbeat output plugin requires Telegraf +> v1.38.2+. + +## Status values + +{{% product-name %}} displays the following agent statuses: + +| Status | Source | Description | +| :---------------- | :------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| **Ok** | Heartbeat plugin | The agent is healthy. Set when the `ok` CEL expression evaluates to `true`. | +| **Warn** | Heartbeat plugin | The agent has a potential issue. Set when the `warn` CEL expression evaluates to `true`. 
| +| **Fail** | Heartbeat plugin | The agent has a critical problem. Set when the `fail` CEL expression evaluates to `true`. | +| **Undefined** | Heartbeat plugin | No expression matched and the `default` is set to `undefined`, or the `initial` status is `undefined`. | +| **Not Reporting** | {{% product-name %}} | The agent has not sent a heartbeat within the [reporting rule](/telegraf/controller/agents/reporting-rules/) threshold. {{% product-name %}} applies this status automatically. | + +## How status evaluation works + +You define CEL expressions for `ok`, `warn`, and `fail` in the +`[outputs.heartbeat.status]` section of your heartbeat plugin configuration. +Telegraf evaluates expressions in a configurable order and assigns the status +of the first expression that evaluates to `true`. + +For full details on evaluation flow, configuration options, and available +variables and functions, see the +[Agent status evaluation reference](/telegraf/controller/reference/agent-status-eval/). + +## Configure agent statuses + +To configure status evaluation, add `"status"` to the `include` list in your +heartbeat plugin configuration and define CEL expressions in the +`[outputs.heartbeat.status]` section. + +### Example: Basic health check + +Report `ok` when metrics are flowing. +If no metrics arrive, fall back to the `fail` status. + +{{% telegraf/dynamic-values %}} +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "&{agent_id}" + token = "${INFLUX_TOKEN}" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "metrics > 0" + default = "fail" +``` +{{% /telegraf/dynamic-values %}} + +### Example: Error-based status + +Warn when errors are logged, fail when the error count is high. 
+ +{{% telegraf/dynamic-values %}} +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "&{agent_id}" + token = "${INFLUX_TOKEN}" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "log_errors == 0 && log_warnings == 0" + warn = "log_errors > 0" + fail = "log_errors > 10" + order = ["fail", "warn", "ok"] + default = "ok" +``` +{{% /telegraf/dynamic-values %}} + +### Example: Composite condition + +Combine error count and buffer pressure signals. + +{{% telegraf/dynamic-values %}} +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "&{agent_id}" + token = "${INFLUX_TOKEN}" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "metrics > 0 && log_errors == 0" + warn = "log_errors > 0 || (has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8))" + fail = "log_errors > 5 && has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.9)" + order = ["fail", "warn", "ok"] + default = "ok" +``` +{{% /telegraf/dynamic-values %}} + +For more examples including buffer health, plugin-specific checks, and +time-based expressions, see +[CEL expression examples](/telegraf/controller/reference/agent-status-eval/examples/). ## View an agent's status diff --git a/content/telegraf/controller/reference/agent-status-eval/_index.md b/content/telegraf/controller/reference/agent-status-eval/_index.md new file mode 100644 index 000000000..bef40c47f --- /dev/null +++ b/content/telegraf/controller/reference/agent-status-eval/_index.md @@ -0,0 +1,97 @@ +--- +title: Agent status evaluation +description: > + Reference documentation for Common Expression Language (CEL) expressions used + to evaluate Telegraf agent status. 
+menu: + telegraf_controller: + name: Agent status evaluation + parent: Reference +weight: 107 +related: + - /telegraf/controller/agents/status/ + - /telegraf/v1/output-plugins/heartbeat/ +--- + +The Telegraf [heartbeat output plugin](/telegraf/v1/output-plugins/heartbeat/) +uses CEL expressions to evaluate agent status based on runtime data such as +metric counts, error rates, and plugin statistics. +[CEL (Common Expression Language)](https://cel.dev) is a lightweight expression +language designed for evaluating simple conditions. + +## How status evaluation works + +You define CEL expressions for three status levels in the +`[outputs.heartbeat.status]` section of your Telegraf configuration: + +- **ok** — The agent is healthy. +- **warn** — The agent has a potential issue. +- **fail** — The agent has a critical problem. + +Each expression is a CEL program that returns a boolean value. +Telegraf evaluates expressions in a configurable order (default: +`ok`, `warn`, `fail`) and assigns the status of the **first expression that +evaluates to `true`**. + +If no expression evaluates to `true`, the `default` status is used +(default: `"ok"`). + +### Initial status + +Use the `initial` setting to define a status before the first Telegraf flush +cycle. +If `initial` is not set or is empty, Telegraf evaluates the status expressions +immediately, even before the first flush. + +### Evaluation order + +The `order` setting controls which expressions are evaluated and in what +sequence. + +> [!Note] +> If you omit a status from the `order` list, its expression is **not +> evaluated**. + +## Configuration reference + +Configure status evaluation in the `[outputs.heartbeat.status]` section of the +heartbeat output plugin. +You must include `"status"` in the `include` list for status evaluation to take +effect. 
+ +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "agent-123" + interval = "1m" + include = ["hostname", "statistics", "status"] + + [outputs.heartbeat.status] + ## CEL expressions that return a boolean. + ## The first expression that evaluates to true sets the status. + ok = "metrics > 0" + warn = "log_errors > 0" + fail = "log_errors > 10" + + ## Evaluation order (default: ["ok", "warn", "fail"]) + order = ["ok", "warn", "fail"] + + ## Default status when no expression matches + ## Options: "ok", "warn", "fail", "undefined" + default = "ok" + + ## Initial status before the first flush cycle + ## Options: "ok", "warn", "fail", "undefined", "" + # initial = "" +``` + +| Option | Type | Default | Description | +|:-------|:-----|:--------|:------------| +| `ok` | string (CEL) | `"false"` | Expression that, when `true`, sets status to **ok**. | +| `warn` | string (CEL) | `"false"` | Expression that, when `true`, sets status to **warn**. | +| `fail` | string (CEL) | `"false"` | Expression that, when `true`, sets status to **fail**. | +| `order` | list of strings | `["ok", "warn", "fail"]` | Order in which expressions are evaluated. | +| `default` | string | `"ok"` | Status used when no expression evaluates to `true`. Options: `ok`, `warn`, `fail`, `undefined`. | +| `initial` | string | `""` | Status before the first flush. Options: `ok`, `warn`, `fail`, `undefined`, `""` (empty = evaluate expressions). | + +{{< children hlevel="h2" >}} diff --git a/content/telegraf/controller/reference/agent-status-eval/examples.md b/content/telegraf/controller/reference/agent-status-eval/examples.md new file mode 100644 index 000000000..355eb2764 --- /dev/null +++ b/content/telegraf/controller/reference/agent-status-eval/examples.md @@ -0,0 +1,257 @@ +--- +title: CEL expression examples +description: > + Real-world examples of CEL expressions for evaluating Telegraf agent status. 
+menu: + telegraf_controller: + name: Examples + parent: Agent status evaluation +weight: 203 +related: + - /telegraf/controller/agents/status/ + - /telegraf/controller/reference/agent-status-eval/variables/ + - /telegraf/controller/reference/agent-status-eval/functions/ +--- + +Each example includes a scenario description, the CEL expression, a full +heartbeat plugin configuration block, and an explanation. + +For the full list of available variables and functions, see: + +- [CEL variables](/telegraf/controller/reference/agent-status-eval/variables/) +- [CEL functions and operators](/telegraf/controller/reference/agent-status-eval/functions/) + +## Basic health check + +**Scenario:** Report `ok` when Telegraf is actively processing metrics. +Fall back to a `default` status of `fail` when no expression matches, so the +agent is reported as healthy only while metrics are flowing. + +**Expression:** + +```js +ok = "metrics > 0" +``` + +**Configuration:** + +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "agent-123" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "metrics > 0" + default = "fail" +``` + +**How it works:** If the heartbeat plugin received metrics since the last +heartbeat, the status is `ok`. +If no metrics arrived, no expression matches and the `default` status of `fail` +is used, indicating the agent is not processing data. + +## Error rate monitoring + +**Scenario:** Warn when any errors are logged and fail when the error count is +high. 
+ +**Expressions:** + +```js +warn = "log_errors > 0" +fail = "log_errors > 10" +``` + +**Configuration:** + +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "agent-123" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "log_errors == 0 && log_warnings == 0" + warn = "log_errors > 0" + fail = "log_errors > 10" + order = ["fail", "warn", "ok"] + default = "ok" +``` + +**How it works:** Expressions are evaluated in `fail`, `warn`, `ok` order. +If more than 10 errors occurred since the last heartbeat, the status is `fail`. +If 1-10 errors occurred, the status is `warn`. +If no errors occurred, the status is `ok`: the `ok` expression matches when there are also no warnings, and the `default` applies otherwise. + +## Buffer health + +**Scenario:** Warn when the buffer of any `influxdb_v2` output instance exceeds 80% fullness, +indicating potential data backpressure, and fail above 95%. + +**Expressions:** + +```js +warn = "outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8)" +fail = "outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.95)" +``` + +**Configuration:** + +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "agent-123" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "metrics > 0" + warn = "outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8)" + fail = "outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.95)" + order = ["fail", "warn", "ok"] + default = "ok" +``` + +**How it works:** The `influxdb_v2` key of the `outputs` map holds a list of all +`influxdb_v2` output plugin instances. +The `exists()` function iterates over all instances and returns `true` if any +instance's `buffer_fullness` exceeds the threshold. +Above 95% fullness, the status is `fail`; above 80%, `warn`; otherwise `ok`. 
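
The buffer health expressions above access `outputs.influxdb_v2` directly. As a hedged variant, the sketch below adds the `has()` guard that the plugin-specific and composite examples on this page use, so the expressions do not depend on an `influxdb_v2` output being configured. The URL, `instance_id`, and thresholds are the same illustrative values used elsewhere on this page, not recommendations.

```toml
[[outputs.heartbeat]]
  url = "http://telegraf_controller.example.com/agents/heartbeat"
  instance_id = "agent-123"
  interval = "1m"
  include = ["hostname", "statistics", "configs", "logs", "status"]

  [outputs.heartbeat.status]
    ok = "metrics > 0"
    ## Guard each list access with has() so a missing plugin type yields false
    warn = "has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8)"
    fail = "has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.95)"
    order = ["fail", "warn", "ok"]
    default = "ok"
```

Because `order` lists `fail` first, the stricter 95% threshold is checked before the 80% threshold.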
+ +## Plugin-specific checks + +**Scenario:** Monitor a specific input plugin for collection errors and use +safe access patterns to avoid errors when the plugin is not configured. + +**Expression:** + +```js +warn = "has(inputs.cpu) && inputs.cpu.exists(i, i.errors > 0)" +fail = "has(inputs.cpu) && inputs.cpu.exists(i, i.startup_errors > 0)" +``` + +**Configuration:** + +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "agent-123" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "metrics > 0" + warn = "has(inputs.cpu) && inputs.cpu.exists(i, i.errors > 0)" + fail = "has(inputs.cpu) && inputs.cpu.exists(i, i.startup_errors > 0)" + order = ["fail", "warn", "ok"] + default = "ok" +``` + +**How it works:** The `has()` function checks if the `cpu` key exists in the +`inputs` map before attempting to access it. +This prevents evaluation errors when the plugin is not configured. +If the plugin has startup errors, the status is `fail`. +If it has collection errors, the status is `warn`. + +## Composite conditions + +**Scenario:** Combine multiple signals to detect a degraded agent — high error +count combined with output buffer pressure. 
+ +**Expression:** + +```js +fail = "log_errors > 5 && has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.9)" +``` + +**Configuration:** + +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "agent-123" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "metrics > 0 && log_errors == 0" + warn = "log_errors > 0 || (has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8))" + fail = "log_errors > 5 && has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.9)" + order = ["fail", "warn", "ok"] + default = "ok" +``` + +**How it works:** The `fail` expression requires **both** a high error count +**and** buffer pressure to trigger. +The `warn` expression uses `||` to trigger on **either** condition independently. +This layered approach avoids false alarms from transient spikes in a single +metric. + +## Time-based expressions + +**Scenario:** Warn when the time since the last successful heartbeat exceeds a +threshold, indicating potential connectivity or performance issues. + +**Expression:** + +```js +warn = "now() - last_update > duration('10m')" +fail = "now() - last_update > duration('30m')" +``` + +**Configuration:** + +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "agent-123" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "metrics > 0" + warn = "now() - last_update > duration('10m')" + fail = "now() - last_update > duration('30m')" + order = ["fail", "warn", "ok"] + default = "undefined" + initial = "undefined" +``` + +**How it works:** The `now()` function returns the current time and +`last_update` is the timestamp of the last successful heartbeat. 
+Subtracting them produces a duration that can be compared against a threshold. +The `initial` status is set to `undefined` so new agents don't immediately show +a stale-data warning before their first successful heartbeat. + +## Custom evaluation order + +**Scenario:** Use fail-first evaluation to prioritize detecting critical issues +before checking for healthy status. + +**Configuration:** + +```toml +[[outputs.heartbeat]] + url = "http://telegraf_controller.example.com/agents/heartbeat" + instance_id = "agent-123" + interval = "1m" + include = ["hostname", "statistics", "configs", "logs", "status"] + + [outputs.heartbeat.status] + ok = "metrics > 0 && log_errors == 0" + warn = "log_errors > 0" + fail = "log_errors > 10 || agent.metrics_dropped > 100" + order = ["fail", "warn", "ok"] + default = "undefined" +``` + +**How it works:** By setting `order = ["fail", "warn", "ok"]`, the most severe +conditions are checked first. +If the agent has more than 10 logged errors or has dropped more than 100 +metrics, the status is `fail` — regardless of whether the `ok` or `warn` +expression would also match. +This is the recommended order for production deployments where early detection +of critical issues is important. diff --git a/content/telegraf/controller/reference/agent-status-eval/functions.md b/content/telegraf/controller/reference/agent-status-eval/functions.md new file mode 100644 index 000000000..c5bfcf112 --- /dev/null +++ b/content/telegraf/controller/reference/agent-status-eval/functions.md @@ -0,0 +1,120 @@ +--- +title: CEL functions and operators +description: > + Reference for functions and operators available in CEL expressions used to + evaluate Telegraf agent status. +menu: + telegraf_controller: + name: Functions + parent: Agent status evaluation +weight: 202 +--- + +CEL expressions for agent status evaluation support built-in CEL operators and +the following function libraries. + +## Time functions + +### `now()` + +Returns the current time. 
+Use with `last_update` to calculate durations or detect stale data. + +```js +// True if more than 10 minutes since last heartbeat +now() - last_update > duration('10m') +``` + +```js +// True if more than 5 minutes since last heartbeat +now() - last_update > duration('5m') +``` + +## Math functions + +Math functions from the +[CEL math library](https://github.com/google/cel-go/blob/master/ext/README.md#math) +are available for numeric calculations. + +### Commonly used functions + +| Function | Description | Example | +|:---------|:------------|:--------| +| `math.greatest(a, b, ...)` | Returns the greatest value. | `math.greatest(log_errors, log_warnings)` | +| `math.least(a, b, ...)` | Returns the least value. | `math.least(agent.metrics_gathered, 1000)` | + +### Example + +```js +// Warn if either errors or warnings exceed a threshold +math.greatest(log_errors, log_warnings) > 5 +``` + +## String functions + +String functions from the +[CEL strings library](https://github.com/google/cel-go/blob/master/ext/README.md#strings) +are available for string operations. +These are useful when checking plugin `alias` or `id` fields. + +### Example + +```js +// Check if any input plugin has an alias containing "critical" +inputs.cpu.exists(i, has(i.alias) && i.alias.contains("critical")) +``` + +## Encoding functions + +Encoding functions from the +[CEL encoder library](https://github.com/google/cel-go/blob/master/ext/README.md#encoders) +are available for encoding and decoding values. + +## Operators + +CEL supports standard operators for building expressions. 
+ +### Comparison operators + +| Operator | Description | Example | +|:---------|:------------|:--------| +| `==` | Equal | `metrics == 0` | +| `!=` | Not equal | `log_errors != 0` | +| `<` | Less than | `agent.metrics_gathered < 100` | +| `<=` | Less than or equal | `buffer_fullness <= 0.5` | +| `>` | Greater than | `log_errors > 10` | +| `>=` | Greater than or equal | `metrics >= 1000` | + +### Logical operators + +| Operator | Description | Example | +|:---------|:------------|:--------| +| `&&` | Logical AND | `log_errors > 0 && metrics == 0` | +| `\|\|` | Logical OR | `log_errors > 10 \|\| log_warnings > 50` | +| `!` | Logical NOT | `!(metrics > 0)` | + +### Arithmetic operators + +| Operator | Description | Example | +|:---------|:------------|:--------| +| `+` | Addition | `log_errors + log_warnings` | +| `-` | Subtraction | `agent.metrics_gathered - agent.metrics_dropped` | +| `*` | Multiplication | `log_errors * 2` | +| `/` | Division | `agent.metrics_dropped / agent.metrics_gathered` | +| `%` | Modulo | `metrics % 100` | + +### Ternary operator + +```js +// Conditional expression +log_errors > 10 ? true : false +``` + +### List operations + +| Function | Description | Example | +|:---------|:------------|:--------| +| `exists(var, condition)` | True if any element matches. | `inputs.cpu.exists(i, i.errors > 0)` | +| `all(var, condition)` | True if all elements match. | `outputs.influxdb_v2.all(o, o.errors == 0)` | +| `size()` | Number of elements. | `inputs.cpu.size() > 0` | +| `has()` | True if a field or key exists. 
| `has(inputs.cpu)` | diff --git a/content/telegraf/controller/reference/agent-status-eval/variables.md b/content/telegraf/controller/reference/agent-status-eval/variables.md new file mode 100644 index 000000000..8861d2126 --- /dev/null +++ b/content/telegraf/controller/reference/agent-status-eval/variables.md @@ -0,0 +1,150 @@ +--- +title: CEL variables +description: > + Reference for variables available in CEL expressions used to evaluate + Telegraf agent status in {{% product-name %}}. +menu: + telegraf_controller: + name: Variables + parent: Agent status evaluation +weight: 201 +--- + +CEL expressions for agent status evaluation have access to variables that +represent data collected by Telegraf since the last successful heartbeat message +(unless noted otherwise). + +## Top-level variables + +| Variable | Type | Description | +| :------------- | :--- | :---------------------------------------------------------------------------------------------------- | +| `metrics` | int | Number of metrics arriving at the heartbeat output plugin. | +| `log_errors` | int | Number of errors logged by the Telegraf instance. | +| `log_warnings` | int | Number of warnings logged by the Telegraf instance. | +| `last_update` | time | Timestamp of the last successful heartbeat message. Use with `now()` to calculate durations or rates. | +| `agent` | map | Agent-level statistics. See [Agent statistics](#agent-statistics). | +| `inputs` | map | Input plugin statistics. See [Input plugin statistics](#input-plugin-statistics-inputs). | +| `outputs` | map | Output plugin statistics. See [Output plugin statistics](#output-plugin-statistics-outputs). | + +## Agent statistics + +The `agent` variable is a map containing aggregate statistics for the entire +Telegraf instance. +These fields correspond to the `internal_agent` metric from the +Telegraf [internal input plugin](/telegraf/v1/plugins/#input-internal). 
+ +| Field | Type | Description | +| :----------------------- | :--- | :-------------------------------------------------- | +| `agent.metrics_written` | int | Total metrics written by all output plugins. | +| `agent.metrics_rejected` | int | Total metrics rejected by all output plugins. | +| `agent.metrics_dropped` | int | Total metrics dropped by all output plugins. | +| `agent.metrics_gathered` | int | Total metrics collected by all input plugins. | +| `agent.gather_errors` | int | Total collection errors across all input plugins. | +| `agent.gather_timeouts` | int | Total collection timeouts across all input plugins. | + +### Example + +```js +agent.gather_errors > 0 +``` + +## Input plugin statistics (`inputs`) + +The `inputs` variable is a map where each key is a plugin type (for example, +`cpu` for `inputs.cpu`) and the value is a **list** of plugin instances. +Each entry in the list represents one configured instance of that plugin type. + +These fields correspond to the `internal_gather` metric from the Telegraf +[internal input plugin](/telegraf/v1/plugins/#input-internal). + +| Field | Type | Description | +| :----------------- | :----- | :---------------------------------------------------------------------------------------- | +| `id` | string | Unique plugin identifier. | +| `alias` | string | Alias set for the plugin. Only exists if an alias is defined in the plugin configuration. | +| `errors` | int | Collection errors for this plugin instance. | +| `metrics_gathered` | int | Number of metrics collected by this instance. | +| `gather_time_ns` | int | Time spent gathering metrics, in nanoseconds. | +| `gather_timeouts` | int | Number of timeouts during metric collection. | +| `startup_errors` | int | Number of times the plugin failed to start. 
| + +### Access patterns + +Access a specific plugin type and iterate over its instances: + +```js +// Check if any cpu input instance has errors +inputs.cpu.exists(i, i.errors > 0) +``` + +```js +// Access the first instance of the cpu input +inputs.cpu[0].metrics_gathered +``` + +Use `has()` to safely check if a plugin type exists before accessing it: + +```js +// Safe access — returns false if no cpu input is configured +has(inputs.cpu) && inputs.cpu.exists(i, i.errors > 0) +``` + +## Output plugin statistics (`outputs`) + +The `outputs` variable is a map with the same structure as `inputs`. +Each key is a plugin type (for example, `influxdb_v3` for `outputs.influxdb_v3`) +and the value is a list of plugin instances. + +These fields correspond to the `internal_write` metric from the Telegraf +[internal input plugin](/telegraf/v1/plugins/#input-internal). + +| Field | Type | Description | +| :----------------- | :----- | :------------------------------------------------------------------------------------------------------- | +| `id` | string | Unique plugin identifier. | +| `alias` | string | Alias set for the plugin. Only exists if an alias is defined in the plugin configuration. | +| `errors` | int | Write errors for this plugin instance. | +| `metrics_filtered` | int | Number of metrics filtered by the output. | +| `write_time_ns` | int | Time spent writing metrics, in nanoseconds. | +| `startup_errors` | int | Number of times the plugin failed to start. | +| `metrics_added` | int | Number of metrics added to the output buffer. | +| `metrics_written` | int | Number of metrics written to the output destination. | +| `metrics_rejected` | int | Number of metrics rejected by the service or serialization. | +| `metrics_dropped` | int | Number of metrics dropped (for example, due to buffer fullness). | +| `buffer_size` | int | Current number of metrics in the output buffer. | +| `buffer_limit` | int | Capacity of the output buffer. 
Irrelevant for disk-based buffers. | +| `buffer_fullness` | float | Ratio of metrics in the buffer to capacity. Can exceed `1.0` (greater than 100%) for disk-based buffers. | + +### Access patterns + +```js +// Access the first instance of the InfluxDB v3 output plugin +outputs.influxdb_v3[0].metrics_written +``` + +```js +// Check if any InfluxDB v3 output has write errors +outputs.influxdb_v3.exists(o, o.errors > 0) +``` + +```js +// Check buffer fullness across all instances of an output +outputs.influxdb_v3.exists(o, o.buffer_fullness > 0.8) +``` + +Use `has()` to safely check if a plugin type exists before accessing it: + +```js +// Safe access: returns false if no influxdb_v3 output is configured +has(outputs.influxdb_v3) && outputs.influxdb_v3.exists(o, o.errors > 0) +``` + +## Accumulation behavior + +Unless noted otherwise, all variable values are **accumulated since the last +successful heartbeat message**. +Use the `last_update` variable with `now()` to calculate rates. For example: + +```js +// True if the error rate exceeds 1 error per minute +// Integer division: matches once log_errors is at least twice the whole minutes elapsed +log_errors > 0 && (now() - last_update).getMinutes() > 0 + && log_errors / (now() - last_update).getMinutes() > 1 +```
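
As a sketch of how a rate check might be wired into the heartbeat configuration, the block below places a per-minute error-rate expression in `warn`. It assumes the elapsed time is read with CEL's member-call form, `(now() - last_update).getMinutes()`, and it is a division-free variant: comparing `log_errors` to the whole minutes elapsed avoids both integer division and a divide-by-zero guard. The URL and `instance_id` are the illustrative values from the examples page, and the threshold is arbitrary.

```toml
[[outputs.heartbeat]]
  url = "http://telegraf_controller.example.com/agents/heartbeat"
  instance_id = "agent-123"
  interval = "1m"
  include = ["hostname", "statistics", "configs", "logs", "status"]

  [outputs.heartbeat.status]
    ok = "log_errors == 0"
    ## More errors than whole minutes elapsed, that is, above 1 error per minute
    warn = "log_errors > (now() - last_update).getMinutes()"
    ## fail is omitted from order, so no fail expression is evaluated
    order = ["warn", "ok"]
    default = "ok"
```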