diff --git a/content/telegraf/controller/reference/_index.md b/content/telegraf/controller/reference/_index.md
new file mode 100644
index 000000000..583b619e1
--- /dev/null
+++ b/content/telegraf/controller/reference/_index.md
@@ -0,0 +1,16 @@
+---
+title: Telegraf Controller reference documentation
+description: >
+  Reference documentation for Telegraf Controller, the application that
+  centralizes configuration management and provides information about the health
+  of Telegraf agent deployments.
+menu:
+  telegraf_controller:
+    name: Reference
+weight: 20
+---
+
+Use the reference docs to look up Telegraf Controller configuration options,
+APIs, and operational details.
+
+{{< children hlevel="h2" >}}
diff --git a/content/telegraf/controller/reference/architecture.md b/content/telegraf/controller/reference/architecture.md
new file mode 100644
index 000000000..2a2afa395
--- /dev/null
+++ b/content/telegraf/controller/reference/architecture.md
@@ -0,0 +1,268 @@
+---
+title: Telegraf Controller architecture
+description: >
+  Architectural overview of the {{% product-name %}} application.
+menu:
+  telegraf_controller:
+    name: Architectural overview
+    parent: Reference
+weight: 105
+---
+
+{{% product-name %}} is a standalone application that provides centralized
+management for Telegraf agents. It runs as a single binary that starts two
+separate servers: a web interface/API server and a dedicated high-performance
+heartbeat server for agent monitoring.
+
+## Runtime Architecture
+
+### Application Components
+
+When you run the Telegraf Controller binary, it starts four main subsystems:
+
+- **Web Server**: Serves the management interface (default port: `8888`)
+- **API Server**: Handles configuration management and administrative requests
+  (served on the same port as the web server)
+- **Heartbeat Server**: Dedicated high-performance server for agent heartbeats
+  (default port: `8000`)
+- **Background Scheduler**: Monitors agent health every 60 seconds
+
+### Process Model
+
+- **telegraf_controller** _(single process, multiple servers)_
+  - **Main HTTP Server** _(port `8888`)_
+    - Web UI (`/`)
+    - API Endpoints (`/api/*`)
+  - **Heartbeat Server** _(port `8000`)_
+    - `POST /agents/heartbeat` _(high-performance endpoint)_
+  - **Database Connection**
+    - SQLite or PostgreSQL
+  - **Background Tasks**
+    - Agent Status Monitor (60s interval)
+
+The dual-server architecture separates high-frequency heartbeat traffic from
+regular management operations, ensuring that the web interface remains
+responsive even under heavy agent load.
+
+## Configuration
+
+{{% product-name %}} configuration is controlled through command options and
+environment variables.
+
+| Command Option     | Environment Variable | Description                              |
+| :----------------- | :------------------- | :--------------------------------------- |
+| `--port`           | `PORT`               | API server port (default: `8888`)        |
+| `--heartbeat-port` | `HEARTBEAT_PORT`     | Heartbeat service port (default: `8000`) |
+| `--database`       | `DATABASE`           | Database filepath or URL (default: [SQLite path](/telegraf/controller/install/#default-sqlite-data-locations)) |
+| `--ssl-cert`       | `SSL_CERT`           | Path to SSL certificate                  |
+| `--ssl-key`        | `SSL_KEY`            | Path to SSL private key                  |
+
+To use environment variables, create a `.env` file in the same directory as the
+binary or export these environment variables in your terminal session.
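
The two approaches to environment variables described above can be sketched as follows. This is a minimal illustration, not the controller's documented procedure: it assumes the `.env` file uses plain `KEY=VALUE` lines, and it writes the file to a temporary directory (the hypothetical `workdir`) as a stand-in for the directory that contains the binary. The variable names and values are the documented defaults from the table.

```bash
# Assumption: the .env file uses plain KEY=VALUE lines, one per variable.
# workdir is a stand-in for the directory that holds the controller binary.
workdir="$(mktemp -d)"
cat > "$workdir/.env" <<'EOF'
PORT=8888
HEARTBEAT_PORT=8000
EOF

# Alternative: export the same variables into the current shell session.
set -a                # auto-export every variable assigned while sourcing
. "$workdir/.env"
set +a

echo "PORT=$PORT HEARTBEAT_PORT=$HEARTBEAT_PORT"
```

The same values could instead be passed as the `--port` and `--heartbeat-port` command options listed in the table.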
+
+### Database Selection
+
+{{% product-name %}} automatically selects the database type based on the
+`DATABASE` string:
+
+- **SQLite** (default): Best for development and small deployments with fewer
+  than 1,000 agents. The database file is created automatically.
+- **PostgreSQL**: Required for large deployments. Must be provisioned separately.
+
+Example PostgreSQL configuration:
+
+```bash
+DATABASE="postgresql://user:password@localhost:5432/telegraf_controller"
+```
+
+## Data Flow
+
+### Agent registration and heartbeats
+
+{{< diagram >}}
+flowchart LR
+    T["Telegraf Agents<br/>(POST heartbeats)"] --> H["Port 8000<br/>Heartbeat Server"]
+    H --Direct Write--> D[("Database")]
+    W["Web UI/API"] --> A["Port 8888<br/>API Server"] --View Agents (Read-Only)--> D
+    R["Rust Scheduler<br/>(Agent status updates)"] --> D
+
+{{< /diagram >}}
+
+1. **Agents send heartbeats**:
+
+   Telegraf agents with the heartbeat output plugin send `POST` requests to the
+   dedicated heartbeat server (port `8000` by default).
+
+2. **Heartbeat server processes the heartbeat**:
+
+   The heartbeat server is a high-performance Rust-based HTTP server that:
+
+   - Receives the `POST` request at `/agents/heartbeat`
+   - Validates the heartbeat payload
+   - Extracts agent information (ID, hostname, IP address, status, etc.)
+   - Uniquely identifies each agent using the `instance_id` in the heartbeat
+     payload
+
+3. **Heartbeat server writes directly to the database**:
+
+   The heartbeat server uses a Rust NAPI module that:
+
+   - Bypasses the application ORM (Object-Relational Mapping) layer entirely
+   - Uses `sqlx` (Rust SQL library) to write directly to the database
+   - Implements batch processing to efficiently process multiple heartbeats
+   - Provides much higher throughput than going through the API layer
+
+   The Rust module performs these operations:
+
+   - Creates a new agent if it does not already exist
+   - Adds or updates the `last_seen` timestamp
+   - Sets the agent status to the status reported in the heartbeat
+   - Updates other agent metadata (hostname, IP, etc.)
+
+4. **API layer reads agent data**:
+
+   The API layer has read-only access to agent data and exposes the following
+   endpoints:
+
+   - `GET /api/agents` - List agents
+   - `GET /api/agents/summary` - Agent status summary
+
+   The API never writes to the agents table. Only the heartbeat server and the background scheduler (step 6) do.
+
+5. **The Web UI displays updated agent data**:
+
+   The web interface polls the API endpoints to display:
+
+   - Real-time agent status
+   - Last seen timestamps
+   - Agent health metrics
+
+6. **The background scheduler evaluates agent statuses**:
+
+   Every 60 seconds, a Rust-based scheduler (also part of the NAPI module):
+
+   - Scans all agents in the database
+   - Checks `last_seen` timestamps against the agent's assigned reporting rule
+   - Updates agent statuses:
+     - `ok` → `not_reporting` (if heartbeat missed beyond threshold)
+     - `not_reporting` → `ok` (if heartbeat resumes)
+   - Auto-deletes agents that have exceeded the auto-delete threshold
+     (if enabled for the reporting rule)
+
+### Configuration distribution
+
+1. **An agent requests a configuration**:
+
+   Telegraf agents request their configuration from the main API server
+   (port `8888`):
+
+   ```bash
+   telegraf --config "http://localhost:8888/api/configs/{config-id}/toml?location=datacenter1&env=prod"
+   ```
+
+   The agent makes a `GET` request with:
+
+   - **Config ID**: Unique identifier for the configuration template
+   - **Query Parameters**: Variables for parameter substitution
+   - **Accept Header**: Can specify `text/x-toml` or `application/octet-stream`
+     for download
+
+2. **The API server receives the request**:
+
+   The API server on port `8888` handles the request at
+   `/api/configs/{id}/toml` and does the following:
+
+   - Validates the configuration ID
+   - Extracts all query parameters for substitution
+   - Checks the `Accept` header to determine response format
+
+3. **The application retrieves the configuration from the database**:
+
+   {{% product-name %}} fetches configuration data from the database:
+
+   - **Configuration TOML**: The raw configuration with parameter placeholders
+   - **Configuration name**: Used for filename if downloading
+   - **Updated timestamp**: For the `Last-Modified` header
+
+4. **{{% product-name %}} substitutes parameters**:
+
+   {{% product-name %}} processes the TOML template and replaces parameters
+   with parameter values specified in the `GET` request.
+
+5. **{{% product-name %}} sets response headers**:
+
+   - Content-Type
+   - Last-Modified
+
+   Telegraf uses the `Last-Modified` header to determine if a configuration
+   has been updated and, if so, downloads and uses the updated configuration.
+
+6. **{{% product-name %}} delivers the response**:
+
+   Based on the `Accept` header:
+
+   {{< tabs-wrapper >}}
+{{% tabs "medium" %}}
+[text/x-toml (TOML)](#)
+[application/octet-stream (Download)](#)
+{{% /tabs %}}
+{{% tab-content %}}
+
+
+```
+HTTP/1.1 200 OK
+Content-Type: text/x-toml; charset=utf-8
+Last-Modified: Sun, 05 Jan 2025 07:28:00 GMT
+
+[agent]
+  hostname = "server-01"
+  environment = "prod"
+...
+```
+
+
+{{% /tab-content %}}
+{{% tab-content %}}
+
+
+```
+HTTP/1.1 200 OK
+Content-Type: application/octet-stream
+Content-Disposition: attachment; filename="config_name.toml"
+Last-Modified: Sun, 05 Jan 2025 07:28:00 GMT
+
+[agent]
+  hostname = "server-01"
+...
+```
+
+
+{{% /tab-content %}}
+{{< /tabs-wrapper >}}
+
+7. _(Optional)_ **Telegraf regularly checks the configuration for updates**:
+
+   Telegraf agents can regularly check {{% product-name %}} for configuration
+   updates and automatically load updates when detected. When starting a
+   Telegraf agent, include the `--config-url-watch-interval` option with the
+   interval that you want the agent to use to check for updates, for example:
+
+   ```bash
+   telegraf \
+     --config http://localhost:8888/api/configs/xxxxxx/toml \
+     --config-url-watch-interval 1h
+   ```
+
+## Reporting Rules
+
+{{% product-name %}} uses reporting rules to determine when agents should be
+marked as not reporting:
+
+- **Default Rule**: Created automatically on first run
+- **Heartbeat Interval**: Expected frequency of agent heartbeats (default: 60s)
+- **Threshold Multiplier**: Number of missed intervals before an agent is marked `not_reporting` (default: 3x)
+
+Access reporting rules via:
+
+- **Web UI**: Reporting Rules
+- **API**: `GET /api/reporting-rules`
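
To make the default reporting rule concrete, the following sketch computes the not-reporting threshold from the documented defaults (60-second heartbeat interval, 3x multiplier) and classifies an agent by the time since its last heartbeat. The `agent_status` function is hypothetical and only illustrates the arithmetic. It is not the scheduler's implementation, and the choice of a strict "greater than" comparison at the threshold is an assumption.

```bash
# Documented defaults for the default reporting rule.
heartbeat_interval=60      # seconds
threshold_multiplier=3     # intervals to wait

# Hypothetical helper: classify an agent from the seconds elapsed since
# its last heartbeat. Assumes the agent is marked not_reporting only
# after the elapsed time exceeds interval * multiplier.
agent_status() {
  local elapsed=$1
  local threshold=$(( heartbeat_interval * threshold_multiplier ))
  if [ "$elapsed" -gt "$threshold" ]; then
    echo "not_reporting"
  else
    echo "ok"
  fi
}

agent_status 45     # prints "ok"
agent_status 200    # prints "not_reporting"
```

With the defaults, the threshold works out to 180 seconds, so an agent that was last seen 200 seconds ago would be treated as not reporting under these assumptions.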