Merge branch 'master' into chore/update-db-delete

chore/update-db-delete
Jason Stirnaman 2026-01-06 09:57:03 -06:00 committed by GitHub
commit aedb2df6f4
5 changed files with 3908 additions and 0 deletions


@ -61,6 +61,33 @@ directory. This new directory contains artifacts associated with the specified r
---
## 20251218-1946608 {date="2025-12-18"}
### Quickstart
```yaml
spec:
package:
image: us-docker.pkg.dev/influxdb2-artifacts/clustered/influxdb:20251218-1946608
```
#### Release artifacts
- [app-instance-schema.json](/downloads/clustered-release-artifacts/20251218-1946608/app-instance-schema.json)
- [example-customer.yml](/downloads/clustered-release-artifacts/20251218-1946608/example-customer.yml)
- [InfluxDB Clustered README EULA July 2024.txt](/downloads/clustered-release-artifacts/InfluxDB%20Clustered%20README%20EULA%20July%202024.txt)
### Highlights
- Fixed the garbage collector to support customers who specify the S3 bucket in `spec.package.spec.objectStore.s3.endpoint` (for example, `"https://$BUCKET.$REGION.amazonaws.com"`) and an additional prefix in `spec.package.spec.objectStore.bucket`. If you previously disabled `INFLUXDB_IOX_CREATE_CATALOG_BACKUP_DATA_SNAPSHOT_FILES` and `INFLUXDB_IOX_DELETE_USING_CATALOG_BACKUP_DATA_SNAPSHOT_FILES` to work around this bug, you can now remove those overrides.
- Added support for both `postgres` and `postgresql` URI schemes in catalog DSN parsing.
- Added support to the Management API for:
- Renaming databases
- Undeleting databases
- Renaming tables
- Deleting tables
- Undeleting tables
- Dependency updates and miscellaneous security fixes.
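With the DSN parsing change above, either URI scheme prefix works for the catalog DSN. A minimal illustration (host, credentials, and database name are placeholders, not values from this release):

```shell
# Both URI schemes are now accepted by the catalog DSN parser.
# Host, credentials, and database name are placeholders.
DSN_A="postgres://user:pass@db.example.com:5432/influxdb"
DSN_B="postgresql://user:pass@db.example.com:5432/influxdb"
echo "$DSN_A"
echo "$DSN_B"
```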
## 20250925-1878107 {date="2025-09-25"}
### Quickstart


@ -0,0 +1,16 @@
---
title: Telegraf Controller reference documentation
description: >
  Reference documentation for Telegraf Controller, the application that
  centralizes configuration management and provides information about the health
  of Telegraf agent deployments.
menu:
  telegraf_controller:
    name: Reference
    weight: 20
---
Use the reference docs to look up Telegraf Controller configuration options,
APIs, and operational details.
{{< children hlevel="h2" >}}


@ -0,0 +1,268 @@
---
title: Telegraf Controller architecture
description: >
  Architectural overview of the {{% product-name %}} application.
menu:
  telegraf_controller:
    name: Architectural overview
    parent: Reference
    weight: 105
---
{{% product-name %}} is a standalone application that provides centralized
management for Telegraf agents. It runs as a single binary that starts two
separate servers: a web interface/API server and a dedicated high-performance
heartbeat server for agent monitoring.
## Runtime Architecture
### Application Components
When you run the Telegraf Controller binary, it starts four main subsystems:
- **Web Server**: Serves the management interface (default port: `8888`)
- **API Server**: Handles configuration management and administrative requests
(served on the same port as the web server)
- **Heartbeat Server**: Dedicated high-performance server for agent heartbeats
(default port: `8000`)
- **Background Scheduler**: Monitors agent health every 60 seconds
### Process Model
- **telegraf_controller** _(single process, multiple servers)_
  - **Main HTTP Server** _(port `8888`)_
    - Web UI (`/`)
    - API Endpoints (`/api/*`)
  - **Heartbeat Server** _(port `8000`)_
    - `POST /agents/heartbeat` _(high-performance endpoint)_
  - **Database Connection**
    - SQLite or PostgreSQL
  - **Background Tasks**
    - Agent Status Monitor (60s interval)
The dual-server architecture separates high-frequency heartbeat traffic from
regular management operations, ensuring that the web interface remains
responsive even under heavy agent load.
## Configuration
{{% product-name %}} configuration is controlled through command options and
environment variables.
| Command Option | Environment Variable | Description |
| :----------------- | :------------------- | :--------------------------------------------------------------------------------------------------------------- |
| `--port`           | `PORT`               | API server port (default: `8888`)                                                                                |
| `--heartbeat-port` | `HEARTBEAT_PORT`     | Heartbeat service port (default: `8000`)                                                                         |
| `--database`       | `DATABASE`           | Database filepath or URL (default: [SQLite path](/telegraf/controller/install/#default-sqlite-data-locations))   |
| `--ssl-cert` | `SSL_CERT` | Path to SSL certificate |
| `--ssl-key` | `SSL_KEY` | Path to SSL private key |
To use environment variables, create a `.env` file in the same directory as the
binary or export these environment variables in your terminal session.
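For example, the options in the table above map to an environment-variable setup like the following (all values and paths are illustrative, not defaults you must use):

```shell
# Illustrative values; adjust for your deployment.
export PORT=8888
export HEARTBEAT_PORT=8000
export DATABASE="postgresql://user:password@localhost:5432/telegraf_controller"
export SSL_CERT=/etc/telegraf-controller/tls.crt   # optional; example path
export SSL_KEY=/etc/telegraf-controller/tls.key    # optional; example path
# Then start the binary; it reads these variables at startup:
# ./telegraf_controller
```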
### Database Selection
{{% product-name %}} automatically selects the database type based on the
`DATABASE` string:
- **SQLite** (default): Best for development and small deployments with fewer
than 1,000 agents. The database file is created automatically.
- **PostgreSQL**: Required for large deployments. Must be provisioned separately.
Example PostgreSQL configuration:
```bash
DATABASE="postgresql://user:password@localhost:5432/telegraf_controller"
```
## Data Flow
### Agent registration and heartbeats
{{< diagram >}}
flowchart LR
T["Telegraf Agents<br/>(POST heartbeats)"] --> H["Port 8000<br/>Heartbeat Server"]
H --Direct Write--> D[("Database")]
W["Web UI/API<br/>"] --> A["Port 8888<br/>API Server"] --View Agents (Read-Only)--> D
R["Rust Scheduler<br/>(Agent status updates)"] --> D
{{< /diagram >}}
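As a sketch of step 1 below, a heartbeat request might look like this. Only `instance_id` is named in this document, so the other payload field names are assumptions for illustration:

```shell
# Hypothetical heartbeat payload; field names other than instance_id are assumptions.
PAYLOAD='{"instance_id":"agent-01","hostname":"server-01","ip":"10.0.0.5","status":"ok"}'
# Send it to the dedicated heartbeat server (default port 8000):
# curl -sS -X POST "http://localhost:8000/agents/heartbeat" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
echo "$PAYLOAD"
```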
1. **Agents send heartbeats**:
Telegraf agents with the heartbeat output plugin send `POST` requests to the
dedicated heartbeat server (port `8000` by default).
2. **Heartbeat server processes the heartbeat**:
The heartbeat server is a high-performance Rust-based HTTP server that:
- Receives the `POST` request at `/agents/heartbeat`
- Validates the heartbeat payload
- Extracts agent information (ID, hostname, IP address, status, etc.)
- Uniquely identifies each agent using the `instance_id` in the heartbeat
payload.
3. **Heartbeat server writes directly to the database**:
The heartbeat server uses a Rust NAPI module that:
- Bypasses the application ORM (Object-Relational Mapping) layer entirely
- Uses `sqlx` (Rust SQL library) to write directly to the database
- Implements batch processing to efficiently process multiple heartbeats
- Provides much higher throughput than going through the API layer
The Rust module performs these operations:
- Creates a new agent if it does not already exist
- Adds or updates the `last_seen` timestamp
- Adds or updates the agent status to the status reported in the heartbeat
- Updates other agent metadata (hostname, IP, etc.)
4. **API layer reads agent data**:
The API layer has read-only access for agent data and performs the following
actions:
- `GET /api/agents` - List agents
- `GET /api/agents/summary` - Agent status summary
The API never writes to the agents table. Only the heartbeat server does.
5. **The Web UI displays updated agent data**:
The web interface polls the API endpoints to display:
- Real-time agent status
- Last seen timestamps
- Agent health metrics
6. **The background scheduler evaluates agent statuses**:
Every 60 seconds, a Rust-based scheduler (also part of the NAPI module):
- Scans all agents in the database
- Checks `last_seen` timestamps against the agent's assigned reporting rule
- Updates agent statuses:
- `ok` → `not_reporting` (if a heartbeat is missed beyond the threshold)
- `not_reporting` → `ok` (if heartbeats resume)
- Auto-deletes agents that have exceeded the auto-delete threshold
(if enabled for the reporting rule)
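The scheduler's status check in step 6 can be sketched as simple timestamp arithmetic, using the defaults from the Reporting Rules section (60s interval, 3x multiplier). The controller's internal implementation is not shown here; this only illustrates the described comparison:

```shell
# Defaults from the Reporting Rules section: 60s interval, 3x multiplier.
interval=60
multiplier=3
threshold=$((interval * multiplier))      # 180 seconds

now=$(date +%s)
last_seen=$((now - 200))                  # example: last heartbeat 200s ago

# The comparison the scheduler is described as making:
if [ $((now - last_seen)) -gt "$threshold" ]; then
  status="not_reporting"
else
  status="ok"
fi
echo "$status"                            # not_reporting (200s > 180s)
```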
### Configuration distribution
1. **An agent requests a configuration**:
Telegraf agents request their configuration from the main API server
(port `8888`):
```bash
telegraf --config "http://localhost:8888/api/configs/{config-id}/toml?location=datacenter1&env=prod"
```
The agent makes a `GET` request with:
- **Config ID**: Unique identifier for the configuration template
- **Query Parameters**: Variables for parameter substitution
- **Accept Header**: Can specify `text/x-toml` or `application/octet-stream`
for download
2. **The API server receives the request**:
The API server on port `8888` handles the request at
`/api/configs/{id}/toml` and does the following:
- Validates the configuration ID
- Extracts all query parameters for substitution
- Checks the `Accept` header to determine response format
3. **The application retrieves the configuration from the database**:
{{% product-name %}} fetches configuration data from the database:
- **Configuration TOML**: The raw configuration with parameter placeholders
- **Configuration name**: Used for filename if downloading
- **Updated timestamp**: For the `Last-Modified` header
4. **{{% product-name %}} substitutes parameters**:
{{% product-name %}} processes the TOML template and replaces parameters
with parameter values specified in the `GET` request.
5. **{{% product-name %}} sets response headers**:
- Content-Type
- Last-Modified
Telegraf uses the `Last-Modified` header to determine whether a configuration
has been updated and, if so, downloads and uses the updated configuration.
6. **{{% product-name %}} delivers the response**:
Based on the `Accept` header:
{{< tabs-wrapper >}}
{{% tabs "medium" %}}
[text/x-toml (TOML)](#)
[application/octet-stream (Download)](#)
{{% /tabs %}}
{{% tab-content %}}
<!------------------------------- BEGIN TOML ------------------------------>
```
HTTP/1.1 200 OK
Content-Type: text/x-toml; charset=utf-8
Last-Modified: Mon, 05 Jan 2025 07:28:00 GMT
[agent]
hostname = "server-01"
environment = "prod"
...
```
<!-------------------------------- END TOML ------------------------------->
{{% /tab-content %}}
{{% tab-content %}}
<!----------------------------- BEGIN DOWNLOAD ---------------------------->
```
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Disposition: attachment; filename="config_name.toml"
Last-Modified: Mon, 05 Jan 2025 07:28:00 GMT
[agent]
hostname = "server-01"
...
```
<!------------------------------ END DOWNLOAD ----------------------------->
{{% /tab-content %}}
{{< /tabs-wrapper >}}
7. _(Optional)_ **Telegraf regularly checks the configuration for updates**:
Telegraf agents can regularly check {{% product-name %}} for configuration
updates and automatically load updates when detected. When starting a
Telegraf agent, include the `--config-url-watch-interval` option with the
interval that you want the agent to use to check for updates. For example:
```bash
telegraf \
--config http://localhost:8888/api/configs/xxxxxx/toml \
--config-url-watch-interval 1h
```
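The `Last-Modified` behavior in step 5 can be sketched as a conditional fetch. Whether the server answers `304 Not Modified` to an `If-Modified-Since` header is an assumption; in practice, Telegraf's `--config-url-watch-interval` option (step 7) handles change detection for you:

```shell
# The config URL from step 1; {config-id} is a placeholder.
CONFIG_URL="http://localhost:8888/api/configs/{config-id}/toml"
# Re-fetch only if the configuration changed since the given time.
# Whether the server honors If-Modified-Since is an assumption.
# curl -sS "$CONFIG_URL" \
#   -H "If-Modified-Since: Mon, 05 Jan 2025 07:28:00 GMT"
echo "$CONFIG_URL"
```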
## Reporting Rules
{{% product-name %}} uses reporting rules to determine when agents should be
marked as not reporting:
- **Default Rule**: Created automatically on first run
- **Heartbeat Interval**: Expected frequency of agent heartbeats (default: 60s)
- **Threshold Multiplier**: How many intervals to wait before marking an agent as `not_reporting` (default: 3x)
Access reporting rules via:
- **Web UI**: Reporting Rules
- **API**: `GET /api/reporting-rules`


@ -0,0 +1,342 @@
# yaml-language-server: $schema=app-instance-schema.json
apiVersion: kubecfg.dev/v1alpha1
kind: AppInstance
metadata:
name: influxdb
namespace: influxdb
spec:
# One or more secrets that are used to pull the images from an authenticated registry.
# This will either be the secret provided to you, if using our registry, or a secret for your own registry
# if self-hosting the images.
imagePullSecrets:
- name: <name of the secret>
package:
# The version of the clustered package that will be used.
# This determines the version of all of the individual components.
# When a new version of the product is released, this version should be updated and any
# new config options should be updated below.
image: us-docker.pkg.dev/influxdb2-artifacts/clustered/influxdb:20251218-1946608
apiVersion: influxdata.com/v1alpha1
spec:
# # Provides a way to pass down hosting-environment-specific configuration, such as a role ARN when using EKS IRSA.
# # This section contains three mutually exclusive "blocks". Uncomment the block named after the hosting environment
# # you run: "aws", "openshift" or "gke".
# hostingEnvironment:
# # # Uncomment this block if you're running in EKS.
# # aws:
# # eksRoleArn: 'arn:aws:iam::111111111111:role/your-influxdb-clustered-role'
# #
# # # Uncomment this block if you're running inside OpenShift.
# # # Note: there are currently no OpenShift-specific parameters. You have to pass an empty object
# # # as a marker that you're choosing OpenShift as hosting environment.
# # openshift: {}
# #
# # # Uncomment this block if you're running in GKE:
# # gke:
# # # Authenticate to Google Cloud services via workload identity; this
# # # annotates the 'iox' ServiceAccount with the role name you specify.
# # # NOTE: This setting only enables the GKE-specific authentication mechanism.
# # # You still need to enable `spec.objectStore.google` below if you want to use GCS.
# # workloadIdentity:
# # # Google Service Account name to use for the workload identity.
# # serviceAccountEmail: <service-account>@<project-name>.iam.gserviceaccount.com
catalog:
# A postgresql style DSN that points at a postgresql compatible database.
# eg: postgres://[user[:password]@][netloc][:port][/dbname][?param1=value1&...]
dsn:
valueFrom:
secretKeyRef:
name: <your secret name here>
key: <the key in the secret that contains the dsn>
# images:
# # This can be used to override a specific image name with its FQIN
# # (Fully Qualified Image Name) for testing. eg.
# overrides:
# - name: influxdb2-artifacts/iox/iox
# newFQIN: mycompany/test-iox-build:aninformativetag
#
# # Set this variable to the prefix of your internal registry. This will be prefixed to all expected images.
# # eg. us-docker.pkg.dev/iox:latest => registry.mycompany.io/us-docker.pkg.dev/iox:latest
# registryOverride: <the domain name portion of your registry (registry.mycompany.io in the example above)>
objectStore:
# Bucket that the parquet files will be stored in
bucket: <bucket name>
# Uncomment one of the following (s3, azure)
# to enable the configuration of your object store
s3:
# URL for S3 Compatible object store
endpoint: <S3 url>
# Set to true to allow communication over HTTP (instead of HTTPS)
allowHttp: "false"
# S3 Access Key
# This can also be provided as a valueFrom: secretKeyRef:
accessKey:
value: <your access key>
# S3 Secret Key
# This can also be provided as a valueFrom: secretKeyRef:
secretKey:
value: <your secret>
# This value is required for AWS S3; it may or may not be required by other providers.
region: <region>
# azure:
# Azure Blob Storage Access Key
# This can also be provided as a valueFrom: secretKeyRef:
# accessKey:
# value: <your access key>
# Azure Blob Storage Account
# This can also be provided as a valueFrom: secretKeyRef:
# account:
# value: <your account name>
# There are two main ways you can authenticate to Google Cloud Storage:
#
# a) GKE Workload Identity: configure workload identity in the top level `hostingEnvironment.gke` section.
# b) Explicit service account secret (JSON) file: use the `serviceAccountSecret` field here
#
# If you pick (a) you may not need to uncomment anything else in this section,
# but you still need to tell InfluxDB that you intend to use Google Cloud Storage,
# so you need to specify an empty object. Uncomment the following line:
#
# google: {}
#
#
# If you pick (b), uncomment the following block:
#
# google:
# # If you're authenticating to Google Cloud services using a Service Account credentials file, as opposed
# # to using workload identity (see above), you need to provide a reference to a k8s secret containing the credentials file.
# serviceAccountSecret:
# # Kubernetes Secret name containing the credentials for a Google IAM Service Account.
# name: <secret name>
# # The key within the Secret containing the credentials.
# key: <key name>
# Parameters to tune observability configuration, such as Prometheus ServiceMonitor's.
observability: {}
# retention: 12h
# serviceMonitor:
# interval: 10s
# scrapeTimeout: 30s
# Ingester pods have a volume attached.
ingesterStorage:
# (Optional) Set the storage class. This will differ based on the K8s environment and desired storage characteristics.
# If not set, the default storage class will be used.
# storageClassName: <storage-class>
# Set the storage size (minimum 2Gi recommended)
storage: <storage-size>
# Monitoring pods have a volume attached.
monitoringStorage:
# (Optional) Set the storage class. This will differ based on the K8s environment and desired storage characteristics.
# If not set, the default storage class will be used.
# storageClassName: <storage-class>
# Set the storage size (minimum 10Gi recommended)
storage: <storage-size>
# Uncomment the following block if using our provided Ingress.
#
# We currently only support the NGINX ingress controller: https://github.com/kubernetes/ingress-nginx
#
# ingress:
# hosts:
# # This is the host on which you will access InfluxDB 3.0, for both reads and writes
# - <influxdb-host>
# (Optional)
# The name of the Kubernetes Secret containing a TLS certificate; this should exist in the same namespace as the Clustered installation.
# If you are using cert-manager, enter a name for the Secret it should create.
# tlsSecretName: <secret-name>
# http:
# # Usually you have only one ingress controller installed in a given cluster.
# # In case you have more than one, you have to specify the "class name" of the ingress controller you want to use
# className: nginx
# grpc:
# # Usually you have only one ingress controller installed in a given cluster.
# # In case you have more than one, you have to specify the "class name" of the ingress controller you want to use
# className: nginx
#
# Enables specifying which 'type' of Ingress to use and whether to place additional annotations
# onto those objects. This is useful for third-party software in your environment, such as cert-manager.
# template:
# apiVersion: 'route.openshift.io/v1'
# kind: 'Route'
# metadata:
# annotations:
# 'example-annotation': 'annotation-value'
# Enables specifying customizations for the various components in InfluxDB 3.0.
# components:
# # router:
# # template:
# # containers:
# # iox:
# # env:
# # INFLUXDB_IOX_MAX_HTTP_REQUESTS: "5000"
# # nodeSelector:
# # disktype: ssd
# # tolerations:
# # - effect: NoSchedule
# # key: example
# # operator: Exists
# # Common customizations for all components go in a pseudo-component called "common"
# # common:
# # template:
# # # Metadata contains custom annotations (and labels) to be added to a component. E.g.:
# # metadata:
# # annotations:
# # telegraf.influxdata.com/class: "foo"
# Example of setting nodeAffinity for the querier component to ensure it runs on nodes with specific labels
# components:
# # querier:
# # template:
# # affinity:
# # nodeAffinity:
# # requiredDuringSchedulingIgnoredDuringExecution:
# # Node must have these labels to be considered for scheduling
# # nodeSelectorTerms:
# # - matchExpressions:
# # - key: required
# # operator: In
# # values:
# # - ssd
# # preferredDuringSchedulingIgnoredDuringExecution:
# # Scheduler will prefer nodes with these labels but they're not required
# # - weight: 1
# # preference:
# # matchExpressions:
# # - key: preferred
# # operator: In
# # values:
# # - postgres
# Example of setting podAntiAffinity for the querier component to ensure it runs on nodes with specific labels
# components:
# # querier:
# # template:
# # affinity:
# # podAntiAffinity:
# # requiredDuringSchedulingIgnoredDuringExecution:
# # Ensures that the pod will not be scheduled on a node if another pod matching the labelSelector is already running there
# # - labelSelector:
# # matchExpressions:
# # - key: app
# # operator: In
# # values:
# # - querier
# # topologyKey: "kubernetes.io/hostname"
# # preferredDuringSchedulingIgnoredDuringExecution:
# # Scheduler will prefer not to schedule pods together but may do so if necessary
# # - weight: 1
# # podAffinityTerm:
# # labelSelector:
# # matchExpressions:
# # - key: app
# # operator: In
# # values:
# # - querier
# # topologyKey: "kubernetes.io/hostname"
# Uncomment the following block to tune the various pods for their cpu/memory/replicas based on workload needs.
# Only uncomment the specific resources you want to change, anything uncommented will use the package default.
# (You can read more about k8s resources and limits in https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits)
#
# resources:
# # The ingester handles data being written
# ingester:
# requests:
# cpu: <cpu amount>
# memory: <ram amount>
# replicas: <num replicas> # The default for ingesters is 3 to increase availability
#
# # optionally you can specify the resource limits which improves isolation.
# # (see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits)
# # limits:
# # cpu: <cpu amount>
# # memory: <ram amount>
# # The compactor reorganizes old data to improve query and storage efficiency.
# compactor:
# requests:
# cpu: <cpu amount>
# memory: <ram amount>
# replicas: <num replicas> # the default is 1
# # The querier handles querying data.
# querier:
# requests:
# cpu: <cpu amount>
# memory: <ram amount>
# replicas: <num replicas> # the default is 3
# # The router performs some api routing.
# router:
# requests:
# cpu: <cpu amount>
# memory: <ram amount>
# replicas: <num replicas> # the default is 3
admin:
# The list of users to grant access to Clustered via influxctl
users:
# First name of user
- firstName: <first-name>
# Last name of user
lastName: <last-name>
# Email of user
email: <email>
# The ID that the configured Identity Provider uses for the user in oauth flows
id: <id>
# Optional list of user groups to assign to the user, rather than the default groups. The following groups are currently supported: Admin, Auditor, Member
userGroups:
- <group-name>
# The dsn for the postgres compatible database (note this is the same as defined above)
dsn:
valueFrom:
secretKeyRef:
name: <secret name>
key: <dsn key>
# The identity provider to be used, e.g. "keycloak", "auth0", "azure", etc.
# Note: for Azure Active Directory it must be exactly "azure"
identityProvider: <identity-provider>
# The JWKS endpoint provided by the Identity Provider
jwksEndpoint: <endpoint>
# # This (optional) section controls how InfluxDB issues outbound requests to other services
# egress:
# # If you're using a custom CA you will need to specify the full custom CA bundle here.
# #
# # NOTE: the custom CA is currently only honoured for outbound requests used to obtain
# # the JWT public keys from your identity provider (see `jwksEndpoint`).
# customCertificates:
# valueFrom:
# configMapKeyRef:
# key: ca.pem
# name: custom-ca
# You can also enable some features that are not yet ready for general availability
# or that don't yet have a dedicated place in the configuration file.
# To turn these on, include the name of the feature flag in the `featureFlags` array.
#
# featureFlags:
# # Uncomment to install a Grafana deployment.
# # Depends on one of the prometheus features being deployed.
# # - grafana
# # The following 2 flags should be uncommented for k8s API 1.21 support.
# # Note that this is an experimental configuration.
# # - noMinReadySeconds
# # - noGrpcProbes