---
title: Telegraf Controller architecture
description: >
  Architectural overview of the {{% product-name %}} application.
menu:
  telegraf_controller:
    name: Architectural overview
    parent: Reference
weight: 105
---

{{% product-name %}} is a standalone application that provides centralized management for Telegraf agents. It runs as a single binary that starts two separate servers: a web interface/API server and a dedicated high-performance heartbeat server for agent monitoring.

Runtime Architecture

Application Components

When you run the Telegraf Controller binary, it starts four main subsystems:

  • Web Server: Serves the management interface (default port: 8888)
  • API Server: Handles configuration management and administrative requests (served on the same port as the web server)
  • Heartbeat Server: Dedicated high-performance server for agent heartbeats (default port: 8000)
  • Background Scheduler: Monitors agent health every 60 seconds

Process Model

  • telegraf_controller (single process, multiple servers)
    • Main HTTP Server (port 8888)
      • Web UI (/)
      • API Endpoints (/api/*)
    • Heartbeat Server (port 8000)
      • POST /heartbeat (high-performance endpoint)
    • Database Connection
      • SQLite or PostgreSQL
    • Background Tasks
      • Agent Status Monitor (60s interval)

The dual-server architecture separates high-frequency heartbeat traffic from regular management operations, ensuring that the web interface remains responsive even under heavy agent load.

Configuration

{{% product-name %}} configuration is controlled through command options and environment variables.

| Command Option | Environment Variable | Description |
| :--- | :--- | :--- |
| --port | PORT | Web interface and API server port (default: 8888) |
| --heartbeat-port | HEARTBEAT_PORT | Heartbeat server port (default: 8000) |
| --database | DATABASE | Database file path or URL (default: a SQLite database file) |
| --ssl-cert | SSL_CERT | Path to the SSL certificate |
| --ssl-key | SSL_KEY | Path to the SSL private key |

To use environment variables, create a .env file in the same directory as the binary or export these environment variables in your terminal session.
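The following Python sketch models how the options and environment variables in the table could be resolved. It is illustrative only: the `resolve_settings` function is hypothetical, the precedence order (command option, then environment variable, then default) is an assumption, and the default SQLite path is left unset because this page does not state it.

```python
import argparse

# Defaults taken from the table above.
DEFAULT_PORT = 8888
DEFAULT_HEARTBEAT_PORT = 8000


def resolve_settings(argv, env):
    """Resolve settings from command options, then environment variables, then defaults.

    `env` is any mapping of environment variables, such as os.environ.
    The precedence order is an assumption made for this sketch.
    """
    parser = argparse.ArgumentParser(prog="telegraf_controller")
    parser.add_argument("--port", type=int,
                        default=int(env.get("PORT", DEFAULT_PORT)))
    parser.add_argument("--heartbeat-port", type=int,
                        default=int(env.get("HEARTBEAT_PORT", DEFAULT_HEARTBEAT_PORT)))
    # The default SQLite path is not documented here, so None stands in for it.
    parser.add_argument("--database", default=env.get("DATABASE"))
    parser.add_argument("--ssl-cert", default=env.get("SSL_CERT"))
    parser.add_argument("--ssl-key", default=env.get("SSL_KEY"))
    return vars(parser.parse_args(argv))
```

For example, `resolve_settings(["--port", "9999"], {"PORT": "7000"})["port"]` evaluates to `9999` in this sketch, because the command option overrides the environment variable.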

Database Selection

{{% product-name %}} automatically selects the database type based on the DATABASE string:

  • SQLite (default): Best for development and small deployments with fewer than 1,000 agents. The database file is created automatically.
  • PostgreSQL: Required for large deployments. You must provision the PostgreSQL database separately.

Example PostgreSQL configuration:

DATABASE="postgresql://user:password@localhost:5432/telegraf_controller"
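As a rough illustration of selecting the database type from the DATABASE string, the sketch below inspects the URL scheme. The `database_kind` function is hypothetical, and both the exact detection rule and the acceptance of the shorter `postgres` scheme are assumptions, not documented behavior.

```python
from urllib.parse import urlsplit


def database_kind(database):
    """Return "postgresql" for PostgreSQL URLs and "sqlite" for anything else."""
    scheme = urlsplit(database).scheme.lower()
    # Assumption: a PostgreSQL URL scheme selects PostgreSQL; any other value
    # (for example, a plain file path) is treated as a SQLite database file.
    if scheme in ("postgresql", "postgres"):
        return "postgresql"
    return "sqlite"
```

With this rule, the PostgreSQL URL shown above maps to `"postgresql"`, and a plain file path such as `telegraf_controller.db` maps to `"sqlite"`.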

Data Flow

Agent registration and heartbeats

{{< diagram >}}
flowchart LR
    T["Telegraf Agents<br/>(POST heartbeats)"] --> H["Port 8000<br/>Heartbeat Server"]
    H --Direct Write--> D[("Database")]
    W["Web UI/API"] --> A["Port 8888<br/>API Server"] --View Agents (Read-Only)--> D
    R["Rust Scheduler<br/>(Agent status updates)"] --> D
{{< /diagram >}}

  1. Agents send heartbeats:

    Telegraf agents with the heartbeat output plugin send POST requests to the dedicated heartbeat server (port 8000 by default).

  2. Heartbeat server processes the heartbeat:

    The heartbeat server is a high-performance Rust-based HTTP server that:

    • Receives the POST request at /agents/heartbeat
    • Validates the heartbeat payload
    • Extracts agent information (ID, hostname, IP address, status, etc.)
    • Uniquely identifies each agent by the instance_id in the heartbeat payload
  3. Heartbeat server writes directly to the database:

    The heartbeat server uses a Rust NAPI module that:

    • Bypasses the application ORM (Object-Relational Mapping) layer entirely
    • Uses sqlx (Rust SQL library) to write directly to the database
    • Implements batch processing to efficiently process multiple heartbeats
    • Provides much higher throughput than going through the API layer

    The Rust module performs these operations:

    • Creates a new agent if it does not already exist
    • Adds or updates the last_seen timestamp
    • Adds or updates the agent status to the status reported in the heartbeat
    • Updates other agent metadata (hostname, IP, etc.)
  4. API layer reads agent data:

    The API layer has read-only access to agent data and exposes the following endpoints:

    • GET /api/agents - List agents
    • GET /api/agents/summary - Agent status summary

    The API never writes to the agents table. Only the Rust NAPI module, which contains the heartbeat server and the background scheduler, writes agent records.

  5. The Web UI displays updated agent data:

    The web interface polls the API endpoints to display:

    • Real-time agent status
    • Last seen timestamps
    • Agent health metrics
  6. The background scheduler evaluates agent statuses:

    Every 60 seconds, a Rust-based scheduler (also part of the NAPI module):

    • Scans all agents in the database
    • Checks last_seen timestamps against the agent's assigned reporting rule
    • Updates agent statuses:
      • ok → not_reporting (if heartbeat missed beyond threshold)
      • not_reporting → ok (if heartbeat resumes)
    • Auto-deletes agents that have exceeded the auto-delete threshold (if enabled for the reporting rule)
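Steps 2 and 3 above amount to a batched "create or update" keyed on instance_id. The sketch below shows that pattern in Python with the standard `sqlite3` module. It is not the Rust/sqlx implementation described above; the `agents` table layout, the payload field names other than `instance_id`, and the `record_heartbeats` function are all hypothetical.

```python
import sqlite3

# Hypothetical table layout, used only for this illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agents (
    instance_id TEXT PRIMARY KEY,
    hostname    TEXT,
    ip_address  TEXT,
    status      TEXT,
    last_seen   REAL
)
"""

# Insert a new agent, or update the existing row with the same instance_id.
UPSERT = """
INSERT INTO agents (instance_id, hostname, ip_address, status, last_seen)
VALUES (:instance_id, :hostname, :ip_address, :status, :last_seen)
ON CONFLICT(instance_id) DO UPDATE SET
    hostname   = excluded.hostname,
    ip_address = excluded.ip_address,
    status     = excluded.status,
    last_seen  = excluded.last_seen
"""


def record_heartbeats(conn, heartbeats, now):
    """Upsert a batch of heartbeat payloads in one transaction; return rows written."""
    rows = [
        {
            "instance_id": hb["instance_id"],
            "hostname": hb.get("hostname"),
            "ip_address": hb.get("ip_address"),
            "status": hb.get("status"),
            "last_seen": now,
        }
        for hb in heartbeats
        if hb.get("instance_id")  # minimal validation: drop payloads without an ID
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)
    return len(rows)
```

Writing the whole batch inside a single transaction is one common way to reduce per-heartbeat overhead, which is the motivation given for batch processing above.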

Configuration distribution

  1. An agent requests a configuration:

    Telegraf agents request their configuration from the main API server (port 8888):

    telegraf --config "http://localhost:8888/api/configs/{config-id}/toml?location=datacenter1&env=prod"
    

    The agent makes a GET request with:

    • Config ID: Unique identifier for the configuration template
    • Query Parameters: Variables for parameter substitution
    • Accept Header: Can specify text/x-toml or application/octet-stream for download
  2. The API server receives the request:

    The API server on port 8888 handles the request at /api/configs/{id}/toml and does the following:

    • Validates the configuration ID
    • Extracts all query parameters for substitution
    • Checks the Accept header to determine response format
  3. The application retrieves the configuration from the database:

    {{% product-name %}} fetches configuration data from the database:

    • Configuration TOML: The raw configuration with parameter placeholders
    • Configuration name: Used for filename if downloading
    • Updated timestamp: For the Last-Modified header
  4. {{% product-name %}} substitutes parameters:

    {{% product-name %}} processes the TOML template and replaces each parameter placeholder with the value supplied in the query string of the GET request.

  5. {{% product-name %}} sets response headers:

    • Content-Type
    • Last-Modified

    Telegraf uses the Last-Modified header to determine whether a configuration has been updated. If it has, Telegraf downloads and uses the updated configuration.

  6. {{% product-name %}} delivers the response:

    Based on the Accept header:

    {{< tabs-wrapper >}} {{% tabs "medium" %}} text/x-toml (TOML) application/octet-stream (Download) {{% /tabs %}} {{% tab-content %}}

HTTP/1.1 200 OK
Content-Type: text/x-toml; charset=utf-8
Last-Modified: Sun, 05 Jan 2025 07:28:00 GMT

[agent]
  hostname = "server-01"
  environment = "prod"
...

{{% /tab-content %}} {{% tab-content %}}

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="config_name.toml"
Last-Modified: Sun, 05 Jan 2025 07:28:00 GMT

[agent]
  hostname = "server-01"
...

{{% /tab-content %}} {{< /tabs-wrapper >}}

  7. (Optional) Telegraf regularly checks the configuration for updates:

    Telegraf agents can periodically check {{% product-name %}} for configuration updates and automatically load any update they detect. When you start a Telegraf agent, include the --config-url-watch-interval option set to the interval at which the agent should check for updates. For example:

    telegraf \
      --config http://localhost:8888/api/configs/xxxxxx/toml \
      --config-url-watch-interval 1h
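To make the request and update-check steps concrete, the following Python sketch builds a configuration URL with query parameters and compares two Last-Modified values. The `config_url` and `config_changed` helpers are hypothetical, and the comparison rule (a strictly newer timestamp means the configuration changed) is an assumption; how Telegraf itself compares the values is not specified here.

```python
from email.utils import parsedate_to_datetime
from urllib.parse import urlencode


def config_url(base_url, config_id, params):
    """Build the TOML endpoint URL, appending substitution parameters as a query string."""
    url = f"{base_url}/api/configs/{config_id}/toml"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def config_changed(previous_last_modified, current_last_modified):
    """Return True if the current Last-Modified value is newer than the previous one."""
    if previous_last_modified is None:
        # Nothing has been fetched yet, so treat the configuration as new.
        return True
    previous = parsedate_to_datetime(previous_last_modified)
    current = parsedate_to_datetime(current_last_modified)
    return current > previous
```

For example, `config_url("http://localhost:8888", "xxxxxx", {"location": "datacenter1", "env": "prod"})` produces a URL of the same shape as the request shown in step 1.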
    

Reporting Rules

{{% product-name %}} uses reporting rules to determine when agents should be marked as not reporting:

  • Default Rule: Created automatically on first run
  • Heartbeat Interval: Expected frequency of agent heartbeats (default: 60s)
  • Threshold Multiplier: How many intervals to wait before marking not_reporting (default: 3x)
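With the defaults listed above, an agent would be marked not_reporting once more than 60s × 3 = 180s pass without a heartbeat. The Python sketch below expresses that check. The `evaluate_status` function is hypothetical, and treating exactly 180 seconds as still ok is an assumption about the boundary.

```python
def evaluate_status(last_seen, now, heartbeat_interval=60.0, threshold_multiplier=3.0):
    """Classify an agent from the seconds elapsed since its last heartbeat.

    `last_seen` and `now` are timestamps in seconds. The defaults mirror the
    default reporting rule described above.
    """
    allowed_silence = heartbeat_interval * threshold_multiplier
    if now - last_seen > allowed_silence:
        return "not_reporting"
    return "ok"
```

A scheduler pass could apply this function to every agent's last_seen timestamp and write back any status that changed.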

Access reporting rules via:

  • Web UI: Reporting Rules
  • API: GET /api/reporting-rules