docs-v2/content/telegraf/controller/reference/architecture.md

269 lines
9.5 KiB
Markdown

---
title: Telegraf Controller architecture
description: >
Architectural overview of the {{% product-name %}} application.
menu:
telegraf_controller:
name: Architectural overview
parent: Reference
weight: 105
---
{{% product-name %}} is a standalone application that provides centralized
management for Telegraf agents. It runs as a single binary that starts two
separate servers: a web interface/API server and a dedicated high-performance
heartbeat server for agent monitoring.
## Runtime Architecture
### Application Components
When you run the Telegraf Controller binary, it starts four main subsystems:
- **Web Server**: Serves the management interface (default port: `8888`)
- **API Server**: Handles configuration management and administrative requests
(served on the same port as the web server)
- **Heartbeat Server**: Dedicated high-performance server for agent heartbeats
(default port: `8000`)
- **Background Scheduler**: Monitors agent health every 60 seconds
### Process Model
- **telegraf_controller** _(single process, multiple servers)_
- **Main HTTP Server** _(port `8888`)_
- Web UI (`/`)
- API Endpoints (`/api/*`)
- **Heartbeat Server** (port `8000`)
- POST /heartbeat _(high-performance endpoint)_
- **Database Connection**
- SQLite or PostgreSQL
- **Background Tasks**
- Agent Status Monitor (60s interval)
The dual-server architecture separates high-frequency heartbeat traffic from
regular management operations, ensuring that the web interface remains
responsive even under heavy agent load.
## Configuration
{{% product-name %}} configuration is controlled through command options and
environment variables.
| Command Option | Environment Variable | Description |
| :----------------- | :------------------- | :--------------------------------------------------------------------------------------------------------------- |
| `--port` | `PORT` | API server port (default is `8888`) |
| `--heartbeat-port` | `HEARTBEAT_PORT` | Heartbeat service port (default: `8000`) |
| `--database` | `DATABASE` | Database filepath or URL (default is [SQLite path](/telegraf/controller/install/#default-sqlite-data-locations)) |
| `--ssl-cert` | `SSL_CERT` | Path to SSL certificate |
| `--ssl-key` | `SSL_KEY` | Path to SSL private key |
To use environment variables, create a `.env` file in the same directory as the
binary or export these environment variables in your terminal session.
### Database Selection
{{% product-name %}} automatically selects the database type based on the
`DATABASE` string:
- **SQLite** (default): Best for development and small deployments with less
than 1000 agents. Database file created automatically.
- **PostgreSQL**: Required for large deployments. Must be provisioned separately.
Example PostgreSQL configuration:
```bash
DATABASE="postgresql://user:password@localhost:5432/telegraf_controller"
```
## Data Flow
### Agent registration and heartbeats
{{< diagram >}}
flowchart LR
T["Telegraf Agents<br/>(POST heartbeats)"] --> H["Port 8000<br/>Heartbeat Server"]
H --Direct Write--> D[("Database")]
W["Web UI/API<br/>"] --> A["Port 8888<br/>API Server"] --View Agents (Read-Only)--> D
R["Rust Scheduler<br/>(Agent status updates)"] --> D
{{< /diagram >}}
1. **Agents send heartbeats**:
Telegraf agents with the heartbeat output plugin send `POST` requests to the
dedicated heartbeat server (port `8000` by default).
2. **Heartbeat server processes the heartbeat**:
The heartbeat server is a high-performance Rust-based HTTP server that:
- Receives the `POST` request at `/agents/heartbeat`
- Validates the heartbeat payload
- Extracts agent information (ID, hostname, IP address, status, etc.)
- Uniquely identifies each agent using the `instance_id` in the heartbeat
payload.
3. **Heartbeat server writes directly to the database**:
The heartbeat server uses a Rust NAPI module that:
- Bypasses the application ORM (Object-Relational Mapping) layer entirely
- Uses `sqlx` (Rust SQL library) to write directly to the database
- Implements batch processing to efficiently process multiple heartbeats
- Provides much higher throughput than going through the API layer
The Rust module performs these operations:
- Creates a new agent if it does not already exist
- Adds or updates the `last_seen` timestamp
- Adds or updates the agent status to the status reported in the heartbeat
- Updates other agent metadata (hostname, IP, etc.)
4. **API layer reads agent data**:
The API layer has read-only access for agent data and performs the following
actions:
- `GET /api/agents` - List agents
- `GET /api/agents/summary` - Agent status summary
The API never writes to the agents table. Only the heartbeat server does.
5. **The Web UI displays updated agent data**:
The web interface polls the API endpoints to display:
- Real-time agent status
- Last seen timestamps
- Agent health metrics
6. **The background scheduler evaluates agent statuses**:
Every 60 seconds, a Rust-based scheduler (also part of the NAPI module):
- Scans all agents in the database
- Checks `last_seen` timestamps against the agent's assigned reporting rule
- Updates agent statuses:
- ok → not_reporting (if heartbeat missed beyond threshold)
- not_reporting → ok (if heartbeat resumes)
- Auto-deletes agents that have exceeded the auto-delete threshold
(if enabled for the reporting rule)
### Configuration distribution
1. **An agent requests a configuration**:
Telegraf agents request their configuration from the main API server
(port `8888`):
```bash
telegraf --config "http://localhost:8888/api/configs/{config-id}/toml?location=datacenter1&env=prod"
```
The agent makes a `GET` request with:
- **Config ID**: Unique identifier for the configuration template
- **Query Parameters**: Variables for parameter substitution
- **Accept Header**: Can specify `text/x-toml` or `application/octet-stream`
for download
2. **The API server receives request**:
The API server on port `8888` handles the request at
`/api/configs/{id}/toml` and does the following:
- Validates the configuration ID
- Extracts all query parameters for substitution
- Checks the `Accept` header to determine response format
3. **The application retrieves the configuration from the database**:
{{% product-name %}} fetches configuration data from the database:
- **Configuration TOML**: The raw configuration with parameter placeholders
- **Configuration name**: Used for filename if downloading
- **Updated timestamp**: For the `Last-Modified` header
4. **{{% product-name %}} substitutes parameters**:
{{% product-name %}} processes the TOML template and replaces parameters
with parameter values specified in the `GET` request.
5. **{{% product-name %}} sets response headers**:
- Content-Type
- Last-Modified
Telegraf uses the `Last-Modified` header to determine if a configuration
has been updated and, if so, download and use the updated configuration.
6. **{{% product-name %}} delivers the response**:
Based on the `Accept` header:
{{< tabs-wrapper >}}
{{% tabs "medium" %}}
[text/x-toml (TOML)](#)
[application/octet-stream (Download)](#)
{{% /tabs %}}
{{% tab-content %}}
<!------------------------------- BEGIN TOML ------------------------------>
```
HTTP/1.1 200 OK
Content-Type: text/x-toml; charset=utf-8
Last-Modified: Mon, 05 Jan 2025 07:28:00 GMT
[agent]
hostname = "server-01"
environment = "prod"
...
```
<!-------------------------------- END TOML ------------------------------->
{{% /tab-content %}}
{{% tab-content %}}
<!----------------------------- BEGIN DOWNLOAD ---------------------------->
```
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="config_name.toml"
Last-Modified: Mon, 05 Jan 2025 07:28:00 GMT
[agent]
hostname = "server-01"
...
```
<!------------------------------ END DOWNLOAD ----------------------------->
{{% /tab-content %}}
{{< /tabs-wrapper >}}
7. _(Optional)_ **Telegraf regularly checks the configuration for updates**:
Telegraf agents can regularly check {{% product-name %}} for configuration
updates and automatically load updates when detected. When starting a
Telegraf agent, include the `--config-url-watch-interval` option with the
interval that you want the agent to use to check for updates—for example:
```bash
telegraf \
--config http://localhost:8888/api/configs/xxxxxx/toml \
--config-url-watch-interval 1h
```
## Reporting Rules
{{% product-name %}} uses reporting rules to determine when agents should be
marked as not reporting:
- **Default Rule**: Created automatically on first run
- **Heartbeat Interval**: Expected frequency of agent heartbeats (default: 60s)
- **Threshold Multiplier**: How many intervals to wait before marking not_reporting (default: 3x)
Access reporting rules via:
- **Web UI**: Reporting Rules
- **API**: `GET /api/reporting-rules`