chore(tc-heartbeat-status): remove impl plans

pull/6964/head
Scott Anderson 2026-03-18 16:50:55 -06:00
parent 07f29b2cf9
commit c519072c55
2 changed files with 0 additions and 1103 deletions

@@ -1,972 +0,0 @@
# Telegraf Controller: Agent Status & CEL Reference Implementation Plan
> **For agentic workers:** REQUIRED: Use superpowers:subagent-driven-development (if subagents available) or superpowers:executing-plans to implement this plan. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add comprehensive agent status configuration docs and a multi-page CEL expression reference to the Telegraf Controller documentation.
**Architecture:** Update the existing status stub page with practical examples and create four new pages under `reference/cel/`. All content is documentation-only (Markdown). The CEL reference is self-contained and does not depend on the heartbeat plugin docs.
**Tech Stack:** Hugo, Markdown, TOML (config examples), CEL (expression examples)
**Spec:** `docs/superpowers/specs/2026-03-17-tc-cel-status-design.md`
***
## File Map
| Action | File | Responsibility |
| ------ | -------------------------------------------------------- | --------------------------------------------------------- |
| Modify | `content/telegraf/controller/agents/status.md` | Practical guide: status values, config examples, UI steps |
| Create | `content/telegraf/controller/reference/cel/_index.md` | CEL overview, evaluation flow, config reference |
| Create | `content/telegraf/controller/reference/cel/variables.md` | All CEL variables: top-level, agent, inputs, outputs |
| Create | `content/telegraf/controller/reference/cel/functions.md` | CEL functions, operators, quick reference |
| Create | `content/telegraf/controller/reference/cel/examples.md` | Real-world CEL expression examples by scenario |
### Conventions (from existing TC docs)
- **Menu:** All TC pages use `menu: telegraf_controller:`. Child pages use `parent:` matching the parent's `name`.
- **Reference children:** Existing reference pages use `parent: Reference` with weights 101-110. The CEL section uses `parent: Reference` on `_index.md` with weight 107 (after authorization at 106, before EULA at 110). CEL child pages use `parent: CEL expressions`.
- **Product name shortcode:** Use `{{% product-name %}}` for "Telegraf Controller" and `{{% product-name "short" %}}` for "Controller".
- **Dynamic values shortcode:** Wrap TOML configs containing `&{...}` parameters with `{{% telegraf/dynamic-values %}}...{{% /telegraf/dynamic-values %}}`.
- **Callouts:** Use `> [!Note]`, `> [!Important]`, `> [!Warning]` syntax.
- **Semantic line feeds:** One sentence per line.
***
## Task 1: Create CEL reference index page
**Files:**
- Create: `content/telegraf/controller/reference/cel/_index.md`
- [ ] **Step 1: Create the CEL reference index page**
Create `content/telegraf/controller/reference/cel/_index.md` with the following content:
````markdown
---
title: CEL expressions
description: >
  Reference documentation for Common Expression Language (CEL) expressions used
  to evaluate Telegraf agent status in {{% product-name %}}.
menu:
  telegraf_controller:
    name: CEL expressions
    parent: Reference
weight: 107
related:
  - /telegraf/controller/agents/status/
  - /telegraf/v1/output-plugins/heartbeat/
---
[Common Expression Language (CEL)](https://cel.dev) is a lightweight expression
language designed for evaluating simple conditions.
{{% product-name %}} uses CEL expressions in the Telegraf
[heartbeat output plugin](/telegraf/v1/output-plugins/heartbeat/) to evaluate
agent status based on runtime data such as metric counts, error rates, and
plugin statistics.
## How status evaluation works
You define CEL expressions for three status levels in the
`[outputs.heartbeat.status]` section of your Telegraf configuration:
- **`ok`** — The agent is healthy.
- **`warn`** — The agent has a potential issue.
- **`fail`** — The agent has a critical problem.
Each expression is a CEL program that returns a boolean value.
Telegraf evaluates expressions in a configurable order (default:
`ok`, `warn`, `fail`) and assigns the status of the **first expression that
evaluates to `true`**.
If no expression evaluates to `true`, the `default` status is used
(default: `"ok"`).
### Initial status
Use the `initial` setting to define a status before the first Telegraf flush
cycle.
If `initial` is not set or is empty, Telegraf evaluates the status expressions
immediately, even before the first flush.
### Evaluation order
The `order` setting controls which expressions are evaluated and in what
sequence.
> [!Note]
> If you omit a status from the `order` list, its expression is **not
> evaluated**.
## Configuration reference
Configure status evaluation in the `[outputs.heartbeat.status]` section of the
heartbeat output plugin.
You must include `"status"` in the `include` list for status evaluation to take
effect.
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "agent-123"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
## CEL expressions that return a boolean.
## The first expression that evaluates to true sets the status.
ok = "metrics > 0"
warn = "log_errors > 0"
fail = "log_errors > 10"
## Evaluation order (default: ["ok", "warn", "fail"])
order = ["ok", "warn", "fail"]
## Default status when no expression matches
## Options: "ok", "warn", "fail", "undefined"
default = "ok"
## Initial status before the first flush cycle
## Options: "ok", "warn", "fail", "undefined", ""
# initial = ""
```
| Option | Type | Default | Description |
| :-------- | :-------------- | :----------------------- | :-------------------------------------------------------------------------------------------------------------- |
| `ok` | string (CEL) | `"false"` | Expression that, when `true`, sets status to **ok**. |
| `warn` | string (CEL) | `"false"` | Expression that, when `true`, sets status to **warn**. |
| `fail` | string (CEL) | `"false"` | Expression that, when `true`, sets status to **fail**. |
| `order` | list of strings | `["ok", "warn", "fail"]` | Order in which expressions are evaluated. |
| `default` | string | `"ok"` | Status used when no expression evaluates to `true`. Options: `ok`, `warn`, `fail`, `undefined`. |
| `initial` | string | `""` | Status before the first flush. Options: `ok`, `warn`, `fail`, `undefined`, `""` (empty = evaluate expressions). |
{{< children hlevel="h2" >}}
````
- [ ] **Step 2: Verify the file renders correctly**
Run: `npx hugo server` and navigate to the CEL expressions reference page.
Verify: page renders, navigation shows "CEL expressions" under "Reference", child page links appear.
- [ ] **Step 3: Commit**
```bash
git add content/telegraf/controller/reference/cel/_index.md
git commit -m "feat(tc-cel): add CEL expressions reference index page"
````
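The first-match rule described in the index page above is easy to misread, so a small model may help when reviewing the sample configuration. The sketch below is an illustration only, not Telegraf's implementation: `evaluate_status`, the lambda stand-ins for CEL expressions, and the sample values are all hypothetical.

```python
from typing import Callable, Mapping, Sequence

Expression = Callable[[Mapping[str, int]], bool]


def evaluate_status(
    expressions: Mapping[str, Expression],
    variables: Mapping[str, int],
    order: Sequence[str] = ("ok", "warn", "fail"),
    default: str = "ok",
) -> str:
    """Return the first status in `order` whose expression is true (hypothetical model)."""
    for status in order:
        expression = expressions.get(status)
        # A status without an expression behaves like the documented "false" default.
        if expression is not None and expression(variables):
            return status
    # No expression matched, so the configured default applies.
    return default


# Stand-ins for: ok = "metrics > 0", warn = "log_errors > 0", fail = "log_errors > 10"
expressions = {
    "ok": lambda v: v["metrics"] > 0,
    "warn": lambda v: v["log_errors"] > 0,
    "fail": lambda v: v["log_errors"] > 10,
}
variables = {"metrics": 50, "log_errors": 20}  # invented sample values

default_order = evaluate_status(expressions, variables)
fail_first = evaluate_status(expressions, variables, order=("fail", "warn", "ok"))
print(default_order, fail_first)
```

With the default order, `ok` matches first and hides the 20 logged errors; with `order = ["fail", "warn", "ok"]` the same data yields `fail`. This is why the later example pages use the fail-first order.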
***
## Task 2: Create CEL variables reference page
**Files:**
- Create: `content/telegraf/controller/reference/cel/variables.md`
- [ ] **Step 1: Create the variables reference page**
Create `content/telegraf/controller/reference/cel/variables.md` with the following content:
````markdown
---
title: CEL variables
description: >
  Reference for variables available in CEL expressions used to evaluate
  Telegraf agent status in {{% product-name %}}.
menu:
  telegraf_controller:
    name: Variables
    parent: CEL expressions
weight: 201
---
CEL expressions for agent status evaluation have access to variables that
represent data collected by Telegraf since the last successful heartbeat message
(unless noted otherwise).
## Top-level variables
| Variable | Type | Description |
|:---------|:-----|:------------|
| `metrics` | int | Number of metrics arriving at the heartbeat output plugin. |
| `log_errors` | int | Number of errors logged by the Telegraf instance. |
| `log_warnings` | int | Number of warnings logged by the Telegraf instance. |
| `last_update` | time | Timestamp of the last successful heartbeat message. Use with `now()` to calculate durations or rates. |
| `agent` | map | Agent-level statistics. See [Agent statistics](#agent-statistics). |
| `inputs` | map | Input plugin statistics. See [Input plugin statistics](#input-plugin-statistics-inputs). |
| `outputs` | map | Output plugin statistics. See [Output plugin statistics](#output-plugin-statistics-outputs). |
## Agent statistics
The `agent` variable is a map containing aggregate statistics for the entire
Telegraf instance.
These fields correspond to the `internal_agent` metric from the
Telegraf [internal input plugin](/telegraf/v1/plugins/#input-internal).
| Field | Type | Description |
|:------|:-----|:------------|
| `agent.metrics_written` | int | Total metrics written by all output plugins. |
| `agent.metrics_rejected` | int | Total metrics rejected by all output plugins. |
| `agent.metrics_dropped` | int | Total metrics dropped by all output plugins. |
| `agent.metrics_gathered` | int | Total metrics collected by all input plugins. |
| `agent.gather_errors` | int | Total collection errors across all input plugins. |
| `agent.gather_timeouts` | int | Total collection timeouts across all input plugins. |
### Example
```cel
agent.gather_errors > 0
```
## Input plugin statistics (`inputs`)
The `inputs` variable is a map where each key is a plugin type (for example,
`cpu` for `inputs.cpu`) and the value is a **list** of plugin instances.
Each entry in the list represents one configured instance of that plugin type.
These fields correspond to the `internal_gather` metric from the Telegraf
[internal input plugin](/telegraf/v1/plugins/#input-internal).
| Field | Type | Description |
| :----------------- | :----- | :---------------------------------------------------------------------------------------- |
| `id` | string | Unique plugin identifier. |
| `alias` | string | Alias set for the plugin. Only exists if an alias is defined in the plugin configuration. |
| `errors` | int | Collection errors for this plugin instance. |
| `metrics_gathered` | int | Number of metrics collected by this instance. |
| `gather_time_ns` | int | Time spent gathering metrics, in nanoseconds. |
| `gather_timeouts` | int | Number of timeouts during metric collection. |
| `startup_errors` | int | Number of times the plugin failed to start. |
### Access patterns
Access a specific plugin type and iterate over its instances:
```cel
// Check if any cpu input instance has errors
inputs.cpu.exists(i, i.errors > 0)
```
```cel
// Access the first instance of the cpu input
inputs.cpu[0].metrics_gathered
```
Use `has()` to safely check if a plugin type exists before accessing it:
```cel
// Safe access — returns false if no cpu input is configured
has(inputs.cpu) && inputs.cpu.exists(i, i.errors > 0)
```
## Output plugin statistics (`outputs`)
The `outputs` variable is a map with the same structure as `inputs`.
Each key is a plugin type (for example, `influxdb_v2` for `outputs.influxdb_v2`)
and the value is a list of plugin instances.
These fields correspond to the `internal_write` metric from the Telegraf
[internal input plugin](/telegraf/v1/plugins/#input-internal).
| Field | Type | Description |
| :----------------- | :----- | :------------------------------------------------------------------------------------------------------- |
| `id` | string | Unique plugin identifier. |
| `alias` | string | Alias set for the plugin. Only exists if an alias is defined in the plugin configuration. |
| `errors` | int | Write errors for this plugin instance. |
| `metrics_filtered` | int | Number of metrics filtered by the output. |
| `write_time_ns` | int | Time spent writing metrics, in nanoseconds. |
| `startup_errors` | int | Number of times the plugin failed to start. |
| `metrics_added` | int | Number of metrics added to the output buffer. |
| `metrics_written` | int | Number of metrics written to the output destination. |
| `metrics_rejected` | int | Number of metrics rejected by the service or serialization. |
| `metrics_dropped` | int | Number of metrics dropped (for example, due to buffer fullness). |
| `buffer_size` | int | Current number of metrics in the output buffer. |
| `buffer_limit` | int | Capacity of the output buffer. Irrelevant for disk-based buffers. |
| `buffer_fullness` | float | Ratio of metrics in the buffer to capacity. Can exceed `1.0` (greater than 100%) for disk-based buffers. |
### Access patterns
```cel
// Check if any InfluxDB v2 output has write errors
outputs.influxdb_v2.exists(o, o.errors > 0)
```
```cel
// Check buffer fullness across all instances of an output
outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8)
```
## Accumulation behavior
Unless noted otherwise, all variable values are **accumulated since the last
successful heartbeat message**.
Use the `last_update` variable with `now()` to calculate rates — for example:
```cel
// True if the error rate exceeds 1 error per minute
log_errors > 0 && (now() - last_update).getMinutes() > 0
  && log_errors / (now() - last_update).getMinutes() > 1
```
````
- [ ] **Step 2: Verify the file renders correctly**
Run: `npx hugo server` and navigate to the Variables page under CEL expressions.
Verify: page renders, tables display correctly, code blocks have proper syntax highlighting, navigation shows "Variables" under "CEL expressions".
- [ ] **Step 3: Commit**
```bash
git add content/telegraf/controller/reference/cel/variables.md
git commit -m "feat(tc-cel): add CEL variables reference page"
````
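When reviewing the access patterns on the variables page, it may help to picture `inputs` and `outputs` as plain maps from plugin type to a list of instance records. The sketch below models that shape; the `id` values, counts, and the `any_instance` helper are hypothetical and only mirror the documented `has(...) && ....exists(...)` pattern.

```python
# Hypothetical snapshot of the `inputs` variable: plugin type -> list of instances.
inputs = {
    "cpu": [
        {"id": "a1", "errors": 0, "metrics_gathered": 120},
        {"id": "b2", "alias": "critical-cpu", "errors": 3, "metrics_gathered": 40},
    ],
}


def any_instance(plugins, plugin_type, predicate):
    """Mirror `has(plugins.<type>) && plugins.<type>.exists(i, <predicate>)`."""
    return plugin_type in plugins and any(predicate(i) for i in plugins[plugin_type])


# Equivalent of: has(inputs.cpu) && inputs.cpu.exists(i, i.errors > 0)
cpu_has_errors = any_instance(inputs, "cpu", lambda i: i["errors"] > 0)
# A plugin type that is not configured yields False instead of a lookup error.
mem_has_errors = any_instance(inputs, "mem", lambda i: i["errors"] > 0)
# Equivalent of: inputs.cpu[0].metrics_gathered
first_gathered = inputs["cpu"][0]["metrics_gathered"]
print(cpu_has_errors, mem_has_errors, first_gathered)
```

The second record carries an `alias` key and the first does not, matching the note that `alias` only exists when it is set in the plugin configuration.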
***
## Task 3: Create CEL functions reference page
**Files:**
- Create: `content/telegraf/controller/reference/cel/functions.md`
- [ ] **Step 1: Create the functions reference page**
Create `content/telegraf/controller/reference/cel/functions.md` with the following content:
````markdown
---
title: CEL functions and operators
description: >
  Reference for functions and operators available in CEL expressions used to
  evaluate Telegraf agent status in {{% product-name %}}.
menu:
  telegraf_controller:
    name: Functions
    parent: CEL expressions
weight: 202
---
CEL expressions for agent status evaluation support built-in CEL operators and
the following function libraries.
## Time functions
### `now()`
Returns the current time.
Use with `last_update` to calculate durations or detect stale data.
```cel
// True if more than 10 minutes since last heartbeat
now() - last_update > duration('10m')
```
```cel
// True if more than 5 minutes since last heartbeat
now() - last_update > duration('5m')
```
## Math functions
Math functions from the
[CEL math library](https://github.com/google/cel-go/blob/master/ext/README.md#math)
are available for numeric calculations.
### Commonly used functions
| Function | Description | Example |
| :------------------------- | :-------------------------- | :----------------------------------------- |
| `math.greatest(a, b, ...)` | Returns the greatest value. | `math.greatest(log_errors, log_warnings)` |
| `math.least(a, b, ...)` | Returns the least value. | `math.least(agent.metrics_gathered, 1000)` |
### Example
```cel
// Warn if either errors or warnings exceed a threshold
math.greatest(log_errors, log_warnings) > 5
```
## String functions
String functions from the
[CEL strings library](https://github.com/google/cel-go/blob/master/ext/README.md#strings)
are available for string operations.
These are useful when checking plugin `alias` or `id` fields.
### Example
```cel
// Check if any cpu input instance has an alias containing "critical"
inputs.cpu.exists(i, has(i.alias) && i.alias.contains("critical"))
```
## Encoding functions
Encoding functions from the
[CEL encoder library](https://github.com/google/cel-go/blob/master/ext/README.md#encoders)
are available for encoding and decoding values.
## Operators
CEL supports standard operators for building expressions.
### Comparison operators
| Operator | Description | Example |
| :------- | :-------------------- | :----------------------------- |
| `==` | Equal | `metrics == 0` |
| `!=` | Not equal | `log_errors != 0` |
| `<` | Less than | `agent.metrics_gathered < 100` |
| `<=` | Less than or equal | `outputs.influxdb_v2.all(o, o.buffer_fullness <= 0.5)` |
| `>` | Greater than | `log_errors > 10` |
| `>=` | Greater than or equal | `metrics >= 1000` |
### Logical operators
| Operator | Description | Example |
| :------- | :---------- | :--------------------------------------- |
| `&&` | Logical AND | `log_errors > 0 && metrics == 0` |
| `\|\|` | Logical OR | `log_errors > 10 \|\| log_warnings > 50` |
| `!` | Logical NOT | `!(metrics > 0)` |
### Arithmetic operators
| Operator | Description | Example |
| :------- | :------------- | :----------------------------------------------- |
| `+` | Addition | `log_errors + log_warnings` |
| `-` | Subtraction | `agent.metrics_gathered - agent.metrics_dropped` |
| `*` | Multiplication | `log_errors * 2` |
| `/` | Division | `agent.metrics_dropped / agent.metrics_gathered` |
| `%` | Modulo | `metrics % 100` |
### Ternary operator
```cel
// Conditional expression
log_errors > 10 ? true : false
```
### List operations
| Function | Description | Example |
| :----------------------- | :----------------------------- | :------------------------------------------ |
| `exists(var, condition)` | True if any element matches. | `inputs.cpu.exists(i, i.errors > 0)` |
| `all(var, condition)` | True if all elements match. | `outputs.influxdb_v2.all(o, o.errors == 0)` |
| `size()` | Number of elements. | `inputs.cpu.size() > 0` |
| `has()` | True if a field or key exists. | `has(inputs.cpu)` |
````
- [ ] **Step 2: Verify the file renders correctly**
Run: `npx hugo server` and navigate to the Functions page under CEL expressions.
Verify: page renders, tables display correctly, pipe characters in logical operators table render properly, navigation shows "Functions" under "CEL expressions".
- [ ] **Step 3: Commit**
```bash
git add content/telegraf/controller/reference/cel/functions.md
git commit -m "feat(tc-cel): add CEL functions and operators reference page"
````
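One detail worth checking before publishing the arithmetic operators table: in CEL, `/` applied to two `int` values is integer division. Assuming the `agent` fields are ints as the variables page states, an expression such as `agent.metrics_dropped / agent.metrics_gathered` evaluates to `0` whenever fewer metrics are dropped than gathered. The sketch below reproduces the effect with invented counts; Python's `//` stands in for CEL integer division (the two agree for non-negative values), and the float conversion stands in for wrapping both operands in CEL's `double()`.

```python
# Invented sample counts for illustration.
dropped, gathered = 25, 1000

# Integer division, as CEL performs for int / int.
int_ratio = dropped // gathered

# Converting both operands first, analogous to double(a) / double(b) in CEL.
float_ratio = float(dropped) / float(gathered)

# A 1% drop-rate threshold only works on the floating-point ratio.
int_check = int_ratio > 0
float_check = float_ratio > 0.01
print(int_ratio, float_ratio, int_check, float_check)
```

If ratio-style examples stay in the table, comparing cross-multiplied integers (for example, `dropped * 100 > gathered`) is another way to avoid the truncation.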
***
## Task 4: Create CEL examples page
**Files:**
- Create: `content/telegraf/controller/reference/cel/examples.md`
- [ ] **Step 1: Create the examples page**
Create `content/telegraf/controller/reference/cel/examples.md` with the following content:
````markdown
---
title: CEL expression examples
description: >
  Real-world examples of CEL expressions for evaluating Telegraf agent status
  in {{% product-name %}}.
menu:
  telegraf_controller:
    name: Examples
    parent: CEL expressions
weight: 203
related:
  - /telegraf/controller/agents/status/
  - /telegraf/controller/reference/cel/variables/
  - /telegraf/controller/reference/cel/functions/
---
Each example includes a scenario description, the CEL expression, a full
heartbeat plugin configuration block, and an explanation.
For the full list of available variables and functions, see:
- [CEL variables](/telegraf/controller/reference/cel/variables/)
- [CEL functions and operators](/telegraf/controller/reference/cel/functions/)
## Basic health check
**Scenario:** Report `ok` when Telegraf is actively processing metrics.
Fall back to the `fail` status when no expression matches, so the agent is
reported as healthy only while metrics are flowing.
**Expression:**
```cel
ok = "metrics > 0"
```
**Configuration:**
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "agent-123"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "metrics > 0"
default = "fail"
```
**How it works:** If the heartbeat plugin received metrics since the last
heartbeat, the status is `ok`.
If no metrics arrived, no expression matches and the `default` status of `fail`
is used, indicating the agent is not processing data.
## Error rate monitoring
**Scenario:** Warn when any errors are logged and fail when the error count is
high.
**Expressions:**
```cel
warn = "log_errors > 0"
fail = "log_errors > 10"
```
**Configuration:**
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "agent-123"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "log_errors == 0 && log_warnings == 0"
warn = "log_errors > 0"
fail = "log_errors > 10"
order = ["fail", "warn", "ok"]
default = "ok"
```
**How it works:** Expressions are evaluated in `fail`, `warn`, `ok` order.
If more than 10 errors occurred since the last heartbeat, the status is `fail`.
If 1-10 errors occurred, the status is `warn`.
If no errors or warnings occurred, the status is `ok`.
If only warnings occurred, no expression matches and the `default` status
(`ok`) is used.
## Buffer health
**Scenario:** Warn when the buffer of any `influxdb_v2` output instance exceeds
80% fullness, indicating potential data backpressure, and fail above 95%.
**Expressions:**
```cel
warn = "outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8)"
fail = "outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.95)"
```
**Configuration:**
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "agent-123"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "metrics > 0"
warn = "outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8)"
fail = "outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.95)"
order = ["fail", "warn", "ok"]
default = "ok"
```
**How it works:** The `outputs.influxdb_v2` entry is a list of all
`influxdb_v2` output plugin instances.
The `exists()` function iterates over all instances and returns `true` if any
instance's `buffer_fullness` exceeds the threshold.
Above 95% fullness, the status is `fail`; above 80%, `warn`; otherwise `ok`.
## Plugin-specific checks
**Scenario:** Monitor a specific input plugin for collection errors and use
safe access patterns to avoid errors when the plugin is not configured.
**Expressions:**
```cel
warn = "has(inputs.cpu) && inputs.cpu.exists(i, i.errors > 0)"
fail = "has(inputs.cpu) && inputs.cpu.exists(i, i.startup_errors > 0)"
```
**Configuration:**
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "agent-123"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "metrics > 0"
warn = "has(inputs.cpu) && inputs.cpu.exists(i, i.errors > 0)"
fail = "has(inputs.cpu) && inputs.cpu.exists(i, i.startup_errors > 0)"
order = ["fail", "warn", "ok"]
default = "ok"
```
**How it works:** The `has()` function checks if the `cpu` key exists in the
`inputs` map before attempting to access it.
This prevents evaluation errors when the plugin is not configured.
If the plugin has startup errors, the status is `fail`.
If it has collection errors, the status is `warn`.
## Composite conditions
**Scenario:** Combine multiple signals to detect a degraded agent — high error
count combined with output buffer pressure.
**Expression:**
```cel
fail = "log_errors > 5 && has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.9)"
```
**Configuration:**
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "agent-123"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "metrics > 0 && log_errors == 0"
warn = "log_errors > 0 || (has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8))"
fail = "log_errors > 5 && has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.9)"
order = ["fail", "warn", "ok"]
default = "ok"
```
**How it works:** The `fail` expression requires **both** a high error count
**and** buffer pressure to trigger.
The `warn` expression uses `||` to trigger on **either** condition independently.
This layered approach avoids false alarms from transient spikes in a single
metric.
## Time-based expressions
**Scenario:** Warn when the time since the last successful heartbeat exceeds a
threshold, indicating potential connectivity or performance issues.
**Expressions:**
```cel
warn = "now() - last_update > duration('10m')"
fail = "now() - last_update > duration('30m')"
```
**Configuration:**
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "agent-123"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "metrics > 0"
warn = "now() - last_update > duration('10m')"
fail = "now() - last_update > duration('30m')"
order = ["fail", "warn", "ok"]
default = "undefined"
initial = "undefined"
```
**How it works:** The `now()` function returns the current time and
`last_update` is the timestamp of the last successful heartbeat.
Subtracting them produces a duration that can be compared against a threshold.
The `initial` status is set to `undefined` so new agents don't immediately show
a stale-data warning before their first successful heartbeat.
## Custom evaluation order
**Scenario:** Use fail-first evaluation to prioritize detecting critical issues
before checking for healthy status.
**Configuration:**
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "agent-123"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "metrics > 0 && log_errors == 0"
warn = "log_errors > 0"
fail = "log_errors > 10 || agent.metrics_dropped > 100"
order = ["fail", "warn", "ok"]
default = "undefined"
```
**How it works:** By setting `order = ["fail", "warn", "ok"]`, the most severe
conditions are checked first.
If the agent has more than 10 logged errors or has dropped more than 100
metrics, the status is `fail` — regardless of whether the `ok` or `warn`
expression would also match.
This is the recommended order for production deployments where early detection
of critical issues is important.
````
- [ ] **Step 2: Verify the file renders correctly**
Run: `npx hugo server` and navigate to the Examples page under CEL expressions.
Verify: page renders, all seven example sections display with correct TOML syntax highlighting, navigation shows "Examples" under "CEL expressions".
- [ ] **Step 3: Commit**
```bash
git add content/telegraf/controller/reference/cel/examples.md
git commit -m "feat(tc-cel): add CEL expression examples page"
````
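The composite example on the examples page combines three expressions with a fail-first order, which makes the outcome for edge cases hard to read off the configuration. The sketch below hand-translates that one configuration into Python as a review aid; `composite_status` and every input value are hypothetical, and the function is not how Telegraf evaluates CEL.

```python
def composite_status(metrics, log_errors, buffer_fullness):
    """Mirror the composite example with order = ["fail", "warn", "ok"]."""
    over_80 = any(f > 0.8 for f in buffer_fullness)
    over_90 = any(f > 0.9 for f in buffer_fullness)
    # fail: high error count AND buffer pressure
    if log_errors > 5 and over_90:
        return "fail"
    # warn: either signal on its own
    if log_errors > 0 or over_80:
        return "warn"
    # ok: metrics flowing and no errors
    if metrics > 0 and log_errors == 0:
        return "ok"
    # No expression matched: default = "ok"
    return "ok"


# (metrics, log_errors, buffer_fullness per influxdb_v2 instance), invented values
cases = [
    (100, 6, [0.95]),  # both signals present
    (100, 6, [0.50]),  # many errors, healthy buffer
    (100, 0, [0.85]),  # no errors, buffer above 80%
    (100, 0, [0.10]),  # healthy
    (0, 0, [0.10]),    # idle agent, nothing matches
]
results = [composite_status(*case) for case in cases]
print(results)
```

The last case shows a side effect of `default = "ok"`: an agent that processes no metrics and logs no errors is still reported as `ok`. If that is not intended, the example may need a different `default`.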
***
## Task 5: Update the agent status page
**Files:**
- Modify: `content/telegraf/controller/agents/status.md`
- [ ] **Step 1: Replace the status page content**
Replace the full content of `content/telegraf/controller/agents/status.md` with the following:
````markdown
---
title: Set agent statuses
description: >
  Configure agent status evaluation using CEL expressions in the Telegraf
  heartbeat output plugin and view statuses in {{% product-name %}}.
menu:
  telegraf_controller:
    name: Set agent statuses
    parent: Manage agents
weight: 104
related:
  - /telegraf/controller/reference/cel/
  - /telegraf/controller/agents/reporting-rules/
  - /telegraf/v1/output-plugins/heartbeat/
---
Agent statuses reflect the health of a Telegraf instance based on runtime data.
The Telegraf [heartbeat output plugin](/telegraf/v1/output-plugins/heartbeat/)
evaluates [Common Expression Language (CEL)](/telegraf/controller/reference/cel/)
expressions against agent metrics, error counts, and plugin statistics to
determine the status sent with each heartbeat.
## Status values
{{% product-name %}} displays the following agent statuses:
| Status | Source | Description |
|:-------|:-------|:------------|
| **Ok** | Heartbeat plugin | The agent is healthy. Set when the `ok` CEL expression evaluates to `true`. |
| **Warn** | Heartbeat plugin | The agent has a potential issue. Set when the `warn` CEL expression evaluates to `true`. |
| **Fail** | Heartbeat plugin | The agent has a critical problem. Set when the `fail` CEL expression evaluates to `true`. |
| **Undefined** | Heartbeat plugin | No expression matched and the `default` is set to `undefined`, or the `initial` status is `undefined`. |
| **Not Reporting** | {{% product-name "short" %}} | The agent has not sent a heartbeat within the [reporting rule](/telegraf/controller/agents/reporting-rules/) threshold. {{% product-name "short" %}} applies this status automatically. |
## How status evaluation works
You define CEL expressions for `ok`, `warn`, and `fail` in the
`[outputs.heartbeat.status]` section of your heartbeat plugin configuration.
Telegraf evaluates expressions in a configurable order and assigns the status
of the first expression that evaluates to `true`.
For full details on evaluation flow, configuration options, and available
variables and functions, see the
[CEL expressions reference](/telegraf/controller/reference/cel/).
## Configure agent statuses
To configure status evaluation, add `"status"` to the `include` list in your
heartbeat plugin configuration and define CEL expressions in the
`[outputs.heartbeat.status]` section.
### Example: Basic health check
Report `ok` when metrics are flowing.
If no metrics arrive, fall back to the `fail` status.
{{% telegraf/dynamic-values %}}
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "&{agent_id}"
token = "${INFLUX_TOKEN}"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "metrics > 0"
default = "fail"
```
{{% /telegraf/dynamic-values %}}
### Example: Error-based status
Warn when errors are logged, fail when the error count is high.
{{% telegraf/dynamic-values %}}
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "&{agent_id}"
token = "${INFLUX_TOKEN}"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "log_errors == 0 && log_warnings == 0"
warn = "log_errors > 0"
fail = "log_errors > 10"
order = ["fail", "warn", "ok"]
default = "ok"
```
{{% /telegraf/dynamic-values %}}
### Example: Composite condition
Combine error count and buffer pressure signals.
{{% telegraf/dynamic-values %}}
```toml
[[outputs.heartbeat]]
url = "http://telegraf_controller.example.com/agents/heartbeat"
instance_id = "&{agent_id}"
token = "${INFLUX_TOKEN}"
interval = "1m"
include = ["hostname", "statistics", "status"]
[outputs.heartbeat.status]
ok = "metrics > 0 && log_errors == 0"
warn = "log_errors > 0 || (has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.8))"
fail = "log_errors > 5 && has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.9)"
order = ["fail", "warn", "ok"]
default = "ok"
```
{{% /telegraf/dynamic-values %}}
For more examples including buffer health, plugin-specific checks, and
time-based expressions, see
[CEL expression examples](/telegraf/controller/reference/cel/examples/).
## View an agent's status
1. In {{% product-name %}}, go to **Agents**.
2. Check the **Status** column for each agent.
3. To see more details, click the **More button ({{% icon "tc-more" %}})** and
select **View Details**.
4. The details page shows the reported status, reporting rule assignment, and
the time of the last heartbeat.
## Learn more
- [CEL expressions reference](/telegraf/controller/reference/cel/) — Full
reference for CEL evaluation flow, configuration, variables, functions, and
examples.
- [Heartbeat output plugin](/telegraf/v1/output-plugins/heartbeat/) — Plugin
configuration reference.
- [Define reporting rules](/telegraf/controller/agents/reporting-rules/) — Configure
thresholds for the **Not Reporting** status.
````
- [ ] **Step 2: Verify the file renders correctly**
Run: `npx hugo server` and navigate to the "Set agent statuses" page under "Manage agents".
Verify: page renders, status table displays correctly, all three example config blocks render with TOML syntax highlighting, cross-links resolve correctly, the "View an agent's status" section is preserved.
- [ ] **Step 3: Commit**
```bash
git add content/telegraf/controller/agents/status.md
git commit -m "feat(tc-status): expand agent status page with CEL examples and configuration"
````
***
## Task 6: Cross-link verification and final review
**Files:**
- All files from Tasks 1-5
- [ ] **Step 1: Verify all cross-links**
Run: `npx hugo server` and verify the following links resolve:
1. Status page → CEL reference index: `/telegraf/controller/reference/cel/`
2. Status page → Heartbeat plugin: `/telegraf/v1/output-plugins/heartbeat/`
3. Status page → Reporting rules: `/telegraf/controller/agents/reporting-rules/`
4. Status page → CEL examples: `/telegraf/controller/reference/cel/examples/`
5. CEL index → Heartbeat plugin: `/telegraf/v1/output-plugins/heartbeat/`
6. CEL examples → Variables: `/telegraf/controller/reference/cel/variables/`
7. CEL examples → Functions: `/telegraf/controller/reference/cel/functions/`
8. CEL examples → Status page: `/telegraf/controller/agents/status/`
9. CEL variables → Internal input plugin: `/telegraf/v1/plugins/#input-internal`
- [ ] **Step 2: Verify navigation structure**
In the left nav, confirm:
- "CEL expressions" appears under "Reference"
- "Variables", "Functions", and "Examples" appear as children of "CEL expressions"
- "Set agent statuses" remains under "Manage agents"
- [ ] **Step 3: Run Vale linting**
Run: `.ci/vale/vale.sh content/telegraf/controller/agents/status.md content/telegraf/controller/reference/cel/`
Fix any errors or warnings. Suggestions can be evaluated but are not blocking.
- [ ] **Step 4: Commit any linting fixes**
```bash
git add content/telegraf/controller/agents/status.md content/telegraf/controller/reference/cel/
git commit -m "style(tc-cel): fix Vale linting issues"
```

@ -1,131 +0,0 @@
# Telegraf Controller: Agent Status & CEL Expression Reference
**Date:** 2026-03-17
**Status:** Approved
**Scope:** Documentation content (no code changes)
## Summary
Add comprehensive agent status configuration documentation to Telegraf Controller docs. This includes updating the existing status page with practical examples and creating a new multi-page CEL expression reference in the reference section.
## Deliverables
### 1. Update existing status page
**File:** `content/telegraf/controller/agents/status.md`
Expand from the current stub into a practical guide with the following structure:
1. **Intro** — What agent statuses are, the four status values (`ok`, `warn`, `fail`, `undefined`) plus the Controller-applied `Not Reporting` state.
2. **How status evaluation works** — Brief explanation of CEL expressions, evaluation order, defaults, and initial status. Links to the CEL reference for full details.
3. **Configure agent statuses** — Example heartbeat plugin config with `include = ["status"]` and the `[outputs.heartbeat.status]` section. 2-3 practical inline examples:
- Basic health check (ok when metrics are flowing)
- Error-based warning/failure
- Composite condition
4. **View an agent's status** — Keep existing UI steps as-is.
5. **Link to CEL reference** — Points users to the full reference for all variables, functions, and more examples.
### 2. Create CEL expression reference (multi-page)
New section under `content/telegraf/controller/reference/cel/`.
#### `_index.md` — CEL Overview
1. **Intro** — What CEL is (Common Expression Language), how Telegraf Controller uses it to evaluate agent status from heartbeat data.
2. **How status evaluation works** — Detailed evaluation flow:
- Expressions are defined for `ok`, `warn`, `fail` — each is a CEL program returning a boolean.
- Evaluation order is configurable via `order` (default: `["ok", "warn", "fail"]`).
- First expression evaluating to `true` sets the status.
- If none match, `default` status is used (default: `"ok"`).
- `initial` status can be set for the period before the first flush.
3. **Configuration reference** — The `[outputs.heartbeat.status]` config block with all options: `ok`, `warn`, `fail`, `order`, `default`, `initial`.
4. **Child page links** — Variables, Functions, Examples.
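A minimal sketch of the configuration block described above, with all six options set. The expressions and the `initial` value are illustrative assumptions, not values taken from the plugin documentation; `order` and `default` are set to the defaults named in this outline.

```toml
# Illustrative sketch only; belongs under the heartbeat output configuration.
[outputs.heartbeat.status]
  # Each expression is a CEL program that must return a boolean.
  ok = "metrics > 0 && log_errors == 0"
  warn = "log_errors > 0"
  fail = "metrics == 0"
  # Expressions are tried in this order; the first one that is true wins.
  order = ["ok", "warn", "fail"]
  # Used only when no expression evaluates to true.
  default = "ok"
  # Assumed value: the status reported before the first flush.
  initial = "undefined"
```

Under this order, an agent with `metrics = 5` and `log_errors = 2` fails the `ok` check and reports `warn`; an agent with `metrics = 0` and `log_errors = 0` falls through to `fail`.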
#### `variables.md` — Variables Reference
1. **Intro** — Variables represent data collected by Telegraf since the last successful heartbeat (unless noted otherwise).
2. **Top-level variables** — Table or definition list:
- `metrics` (int) — metrics arriving at the heartbeat plugin
- `log_errors` (int) — errors logged
- `log_warnings` (int) — warnings logged
- `last_update` (time) — time of last successful heartbeat
- `agent` (map) — agent-level statistics
- `inputs` (map) — input plugin statistics
- `outputs` (map) — output plugin statistics
3. **Agent statistics (`agent`)** — Map fields:
- `metrics_written`, `metrics_rejected`, `metrics_dropped`, `metrics_gathered`, `gather_errors`, `gather_timeouts`
4. **Input plugin statistics (`inputs`)** — Map structure: key = plugin type (e.g., `cpu`), value = list of instances. Fields per instance:
- `id`, `alias`, `errors`, `metrics_gathered`, `gather_time_ns`, `gather_timeouts`, `startup_errors`
5. **Output plugin statistics (`outputs`)** — Same map structure. Fields per instance:
- `id`, `alias`, `errors`, `metrics_filtered`, `write_time_ns`, `startup_errors`, `metrics_added`, `metrics_written`, `metrics_rejected`, `metrics_dropped`, `buffer_size`, `buffer_limit`, `buffer_fullness`
6. **Note on accumulation** — Values accumulate since last successful heartbeat; `last_update` enables rate calculation.
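The three variable shapes listed above (plain counters, the `agent` map, and the per-plugin instance lists) could be accessed as in the following sketch. The field names come from the lists above; the choice of the `cpu` plugin and the thresholds are assumptions for illustration.

```toml
# Illustrative sketch only; belongs under the heartbeat output configuration.
[outputs.heartbeat.status]
  # Top-level counters are plain integers.
  ok = "metrics > 0 && log_errors == 0"
  # `agent` is a map of agent-wide counters.
  warn = "agent.metrics_dropped > 0 || agent.gather_errors > 0"
  # `inputs` maps a plugin type to a list of instances: guard the key with
  # has(), then iterate over the instances with exists().
  fail = "has(inputs.cpu) && inputs.cpu.exists(i, i.startup_errors > 0)"
  order = ["fail", "warn", "ok"]
```

The `has()` guard keeps the expression from erroring on agents that do not run the plugin at all.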
#### `functions.md` — Functions Reference
1. **Intro** — CEL expressions support built-in CEL operators plus additional function libraries.
2. **Time functions** — `now()` returns current time; usage with `last_update` for duration/rate calculations. Include usage example.
3. **Math functions** — Link to CEL math library. Highlight commonly useful functions (e.g., `math.greatest()`, `math.least()`). Brief examples.
4. **String functions** — Link to CEL strings library. Note usefulness for checking `alias` or `id` fields. Brief example.
5. **Encoding functions** — Link to CEL encoder library. Brief note on relevance.
6. **CEL operators reference** — Quick reference for comparison (`==`, `!=`, `<`, `<=`, `>`, `>=`), logical (`&&`, `||`, `!`), arithmetic (`+`, `-`, `*`, `/`, `%`), and ternary (`? :`) operators.
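As a rough sketch of how the function families and operators above might combine in status expressions, assuming the math and strings libraries are enabled as this outline suggests. The thresholds and the `prod` alias prefix are hypothetical.

```toml
# Illustrative sketch only; belongs under the heartbeat output configuration.
[outputs.heartbeat.status]
  # Time: subtracting two timestamps yields a duration, which can be
  # compared against duration().
  warn = "now() - last_update > duration('10m')"
  # Ternary and math: tolerate more logged errors when throughput is high,
  # and fail when either drop counter grows large.
  fail = "log_errors > (metrics > 1000 ? 10 : 0) || math.greatest(agent.metrics_dropped, agent.metrics_rejected) > 100"
  # Strings: match only instances whose alias starts with a given prefix.
  ok = "has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.alias.startsWith('prod') && o.errors == 0)"
  order = ["fail", "warn", "ok"]
```

Every status expression still has to return a boolean, so the ternary appears inside a comparison rather than as the whole expression.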
#### `examples.md` — Examples
Each example follows a consistent pattern: **scenario description → CEL expression(s) → full config block → explanation**.
1. **Basic health check** — `ok` when metrics are flowing, `fail` otherwise.
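- `ok = "metrics > 0"`, `default = "fail"` (with the default `default = "ok"`, an agent with no metrics would still report `ok`)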
2. **Error rate monitoring** — warn on logged errors, fail on high error count.
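- `warn = "log_errors > 0"`, `fail = "log_errors > 10"`, with `order = ["fail", "warn", "ok"]` so that `fail` is evaluated before the broader `warn` condition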
3. **Buffer health** — warn when any output buffer exceeds 80% fullness.
- Uses `outputs` map iteration to check `buffer_fullness` across plugin instances.
4. **Plugin-specific checks** — check a specific input or output for errors.
- Demonstrates map access like `outputs.influxdb_v2.exists(o, o.errors > 0)` and safe access with `has()`.
5. **Composite conditions** — combining multiple signals.
- `fail = "log_errors > 5 && has(outputs.influxdb_v2) && outputs.influxdb_v2.exists(o, o.buffer_fullness > 0.9)"`
6. **Time-based expressions** — using `now()` and `last_update` for staleness.
- e.g., `warn = "now() - last_update > duration('10m')"`
7. **Custom evaluation order** — shows `order = ["fail", "warn", "ok"]` for fail-first evaluation.
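The buffer health and custom evaluation order scenarios above could be combined as in the following sketch. It assumes that `exists()` over the `outputs` map iterates its keys (the plugin types), which is how CEL macros treat maps; the thresholds are illustrative.

```toml
# Illustrative sketch only; belongs under the heartbeat output configuration.
[outputs.heartbeat.status]
  # Outer exists() walks the plugin types in the map; inner exists() walks
  # the instances of each plugin type.
  warn = "outputs.exists(t, outputs[t].exists(o, o.buffer_fullness > 0.8))"
  fail = "outputs.exists(t, outputs[t].exists(o, o.buffer_fullness > 0.9))"
  ok = "metrics > 0"
  # Fail-first: the stricter threshold must be tried before the looser one,
  # because any buffer above 0.9 also satisfies the 0.8 check.
  order = ["fail", "warn", "ok"]
  # No buffer pressure and no metrics flowing.
  default = "fail"
```

Because this form does not name a specific plugin, it needs no `has()` guard and keeps working when outputs are added or removed.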
## File Structure
### New files
```
content/telegraf/controller/reference/
cel/
_index.md — CEL overview, evaluation flow, config reference
variables.md — All variables (top-level, agent, inputs, outputs)
functions.md — Functions, operators, quick reference
examples.md — Real-world examples by scenario
```
### Updated files
```
content/telegraf/controller/agents/status.md — Expand from stub to practical guide
```
## Navigation / Menu Structure
The CEL section nests under the existing `Reference` parent in the `telegraf_controller` menu:
- **Reference** (existing)
- **CEL expressions** (`_index.md`)
- **Variables** (`variables.md`)
- **Functions** (`functions.md`)
- **Examples** (`examples.md`)
## Cross-Linking Strategy
- Status page → CEL reference `_index.md` for full details
- Status page → heartbeat plugin for base config syntax
- CEL examples page → status page for UI context
- CEL variables/functions pages are **self-contained** (standalone, no dependency on heartbeat plugin docs)
## Design Decisions
1. **Standalone CEL reference** — The TC CEL reference is self-contained with its own variable and function documentation, independent of the heartbeat plugin page. Users configuring statuses in Controller shouldn't need to navigate to plugin docs for the variable reference.
2. **Status page as practical guide** — Includes 2-3 inline examples for quick start; full reference lives in the CEL section.
3. **Multi-page reference** — Keeps pages shorter and searchable. Variables, functions, and examples each get their own page. Function pages can be split further by category later if they grow large.
4. **Consistent example format** — Every example includes scenario, expression, full config block, and explanation.