Update design to reflect discussions.

parent c35bed3cff · commit d9a3a957a9 · docs/design.md

### Initial Goals

1. Produce pre-canned graphs for devops telegraf data for docker containers or system stats.
2. Up and running in 2 minutes.
3. User administration for Influx Enterprise.
4. Leverage our existing enterprise front-end code.
5. Leverage lessons-learned for enterprise back-end code.
6. Minimum viable product by Oct 10th.
7. Three to four weeks of testing and polishing before release.

### Versions

Each version will contain more and more features around monitoring various devops components.

#### Cycles

Two month cycles (typically, one month feature / one month polish).

#### Features

1. v1
    - Data explorer for both OSS and Enterprise
    - Dashboards for telegraf system metrics
    - User and Role administration
    - Proxy queries over OSS and Enterprise
    - Authenticate against OSS/Enterprise

2. v2
    - Telegraf agent service
    - Additional Dashboards for telegraf agent

### Supported Versions of TICK Stack

We will only support 1.0 of the TICK stack.

### Closed source vs Open Source

We'll use GDM as the vendoring solution to maintain consistency with other pieces.

#### REST

We'll use a Swagger interface definition to specify the API and JSON validation. The goal is to emphasize designing to an interface, which facilitates parallel development.

#### Queries

Features would include:

1. Load balancing against all data nodes in the cluster.
1. Formatting the output results to be simple to use in the frontend.
1. Decimating the results to minimize network traffic.
1. Use parameters to move the query time range.
1. Allow different types of response protocols (HTTP GET, websocket, etc.).
1. Efficiently handle many queries at once (for a dashboard).
1. Support multiple InfluxQL queries per request.
1. Only support `SELECT` queries. (No explicit validation? Happens in plutonium.)
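Decimation (item 3 above) can be sketched as stride-based downsampling. This is an illustrative Python sketch under assumptions, not the planned implementation; `max_points` mirrors the request parameter of the same name used in the examples below.

```python
def decimate(points, max_points):
    """Return at most max_points samples, evenly strided across points."""
    if len(points) <= max_points:
        return points
    step = len(points) / max_points
    # Pick every step-th point; int() keeps indices in range.
    return [points[int(i * step)] for i in range(max_points)]

# A 10,000-point series reduced to 1,000 points for cheap frontend rendering.
reduced = decimate(list(range(10000)), 1000)
```

A production version would likely aggregate per bucket (e.g. min/max or mean) rather than drop points, so spikes are not lost.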

Chronograf will take a different approach than InfluxEnterprise 1.0 and Grafana 2 & 3, which use a GET request with parameters.
It provides two endpoints; both accept a POST request with a JSON object containing similar parameters:

- **`/proxy`:** used to send queries directly to the Influx backend. These should be most useful for the data explorer or other ad hoc query functionality.
- **`/queries`:** a permanent resource that can repeatedly return results. Queries are specified using a JSON representation; the server constructs the raw InfluxQL query from the representation.
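To make "the server constructs the raw InfluxQL query from the representation" concrete, here is a hedged Python sketch. The function name and quoting rules are illustrative assumptions, not the actual implementation; the input shape follows the JSON representation used in this document.

```python
def to_influxql(q):
    """Build a SELECT statement from the JSON query representation (sketch)."""
    fields = ", ".join(
        f'{func}("{f["field"]}")'
        for f in q["fields"]
        for func in f["funcs"]
    )
    source = f'"{q["database"]}"."{q["retentionPolicy"]}"."{q["measurement"]}"'
    # Each tag key becomes an OR group; groups are ANDed together.
    where = " AND ".join(
        "(" + " OR ".join(f"\"{key}\" = '{v}'" for v in vals) + ")"
        for key, vals in sorted(q["tags"].items())
    )
    group = ", ".join([f'time({q["groupBy"]["time"]})'] +
                      [f'"{t}"' for t in q["groupBy"]["tags"]])
    return f"SELECT {fields} FROM {source} WHERE {where} GROUP BY {group}"

rep = {
    "database": "telegraf",
    "measurement": "cpu",
    "retentionPolicy": "autogen",
    "fields": [{"field": "usage_system", "funcs": ["mean"]}],
    "tags": {"cpu": ["cpu0", "cpu1"], "host": ["host1"]},
    "groupBy": {"time": "1h", "tags": ["cpu"]},
}
statement = to_influxql(rep)
```

Doing this server-side keeps the frontend free of InfluxQL string handling and gives one place to enforce the `SELECT`-only rule.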

##### `/proxy` Queries

Queries to the `/proxy` endpoint do not create new REST resources. Instead, the endpoint returns the results of the query.

This endpoint uses POST with a JSON object to specify the query and its parameters. The response will be the results of the query, or the errors from the backend InfluxDB.

Errors in the 4xx range come from the InfluxDB data source.

```sequence
App->/proxy: POST query
Note right of /proxy: Query Validation
Note right of /proxy: Load balance query
/proxy->Influx/Relay/Cluster: SELECT
Influx/Relay/Cluster-->/proxy: Time Series
Note right of /proxy: Format
/proxy-->App: Formatted results
```

Request:

```http
POST /enterprise/v1/sources/{id}/proxy HTTP/1.1
Accept: application/json
Content-Type: application/json

{
    "query": "SELECT * from telegraf where time > $value",
    "format": "dygraph"
}
```

Response:

```http
HTTP/1.1 200 OK
Content-Type: application/json

{
    "results": "..."
}
```

Error Response:

```http
HTTP/1.1 400 Bad Request
Content-Type: application/json

{
    "code": 400,
    "message": "error parsing query: found..."
}
```

##### `/queries` Queries

A POST to the `/queries` endpoint creates a resource that will provide results for frequent queries.
Query resources should be most useful for dashboards.

The `resultsLink` key in the POST response is the URI to `GET` the results of the query. The parameters `lower` and `upper` are passed to the `GET` request to specify the time range of the query.
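For illustration, a client would url-encode `lower` and `upper` onto the `resultsLink` it received. The source id `1` and query id `42` below are made-up placeholders for the `{id}`/`{qid}` path segments.

```python
from urllib.parse import urlencode

# resultsLink as returned by the POST response (ids are hypothetical).
results_link = "/enterprise/v1/sources/1/queries/42/results"
window = {"lower": "now()-1h", "upper": "now()"}
url = f"{results_link}?{urlencode(window)}"
```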

```sequence
App->/queries: POST query
Note right of /queries: Query Validation
/queries-->App: Location of query resource: /queries/{id}
App->/queries: GET /queries/{id}/results?lower=now()-1h&upper=now()
Note right of /queries: Load balance query
/queries->Influx/Relay/Cluster: SELECT
Influx/Relay/Cluster-->/queries: Time Series
Note right of /queries: Format and Decimate
/queries-->App: Formatted results
```

Example Request:

```http
POST /enterprise/v1/sources/{id}/queries HTTP/1.1
Accept: application/json
Content-Type: application/json

{
    "database": "telegraf",
    "measurement": "cpu",
    "retentionPolicy": "autogen",
    "fields": [
        {"field": "usage_system", "funcs": ["mean"]},
        {"field": "usage_user", "funcs": ["mean"]}
    ],
    "tags": {
        "cpu": ["cpu0", "cpu1", "cpu2"],
        "host": ["host1"]
    },
    "groupBy": {
        "time": "1h",
        "tags": ["cpu0", "cpu1"]
    },
    "ttl": "6h",
    "every": "15s"
}
```

Whether `every` can be supported is still an open question.

Response:

```http
HTTP/1.1 201 Created
Content-Type: application/json
Location: /enterprise/v1/sources/{id}/queries/{qid}

{
    "database": "telegraf",
    "measurement": "cpu",
    "retentionPolicy": "autogen",
    "fields": [
        {"field": "usage_system", "funcs": ["mean"]},
        {"field": "usage_user", "funcs": ["mean"]}
    ],
    "tags": {
        "cpu": ["cpu0", "cpu1", "cpu2"],
        "host": ["host1"]
    },
    "groupBy": {
        "time": "1h",
        "tags": ["cpu0", "cpu1"]
    },
    "format": "raw",
    "resultsLink": "/enterprise/v1/sources/{id}/queries/{qid}/results"
}
```

GET Request:

```http
GET /enterprise/v1/sources/{id}/queries/{qid}/results?lower=now()-1h&upper=now() HTTP/1.1
```

Response:

```http
HTTP/1.1 200 OK
Content-Type: application/json

{
    "results": [
        "..."
    ]
}
```

##### Websockets

__Not a priority for v1.1__

A websocket protocol will be developed to allow:

1. dynamic parameter changes
2. streaming new points

Setting the `type` parameter to `ws` will indicate the client wants to initiate a websockets request.
Simple protocol sketch:

```
c:begin
s:data
s:end

c:ping
s:refresh
c:accept
s:data
s:end
c:disconnect

c:ping
c:update
```

##### Load balancing

Use simple round-robin load balancing for requests to data nodes.
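A minimal sketch of round-robin selection that skips nodes marked down. The class name, node addresses, and the health-tracking approach are assumptions for illustration; real node discovery would come from the meta client, and retry/rate-limit policy (see the questions below) is out of scope here.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Rotate over data nodes, skipping any currently marked down (sketch)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._cycle = cycle(self.nodes)
        self.down = set()  # addresses believed unhealthy

    def next_node(self):
        # At most one full pass; if every node is down, fail fast.
        for _ in range(len(self.nodes)):
            node = next(self._cycle)
            if node not in self.down:
                return node
        raise RuntimeError("no healthy data nodes")

lb = RoundRobinBalancer(["data1:8086", "data2:8086", "data3:8086"])
first_three = [lb.next_node() for _ in range(3)]
```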
##### Questions

_For back-end:_

1. Define how the load balancer behaves when influx breaks:
    - retry handling and when to return an error to the app
    - rate limit to avoid a stampede if a node is down

_For front-end:_

1. Is it desirable to use InfluxQL prepared statements, or to construct the query from components (server-side or in-app)?
1. Does separating ephemeral and persistent queries make sense? (Perhaps it sucks to have to do a parameterized GET.)
1. SHOW/DROP statements on separate endpoints. (`/users`, `/roles`, `/measurements`, `/tags`, `/fields`)

_For core:_

1. Which influx client do we use?
    - Current clients are not flexible
    - New client in design phase
    - non-SELECT queries need either the InfluxDB or InfluxEnterprise client
1. How should we handle caching?
1. Use websockets to support a subscription model?
1. Best way to discover whether a node is up?
    - Discover active data nodes using the Plutonium meta client.

#### Backend-server store

We will build an interface for storing API resources.
Future versions will support more HA data stores.

1. User
    - Version 1.1 will be a one-to-one mapping to influx.
    - Do we want to extend this user object at some point? What other properties or relations are important?

1. Dashboards
    - We need to have another discussion about the goals.
    - For now the design is an opaque JSON blob until we know how to structure this.
    - Pre-canned dashboards for telegraf.
    - Includes the location of query resources.

1. Queries
    - We may store the queries object.
    - Downside: when to expire? TTL?
    - Upside: single endpoint for a specific query with bindable parameters.
    - Used to construct InfluxQL.

1. Sessions
    - We could simply use the JWT token as the session information.

1. Server Configuration

We want the backend data store (influx oss or influx meta) to handle authentication so that the web server has less responsibility.

##### Questions

1. Do we want shared secret authentication between the server and the influx data store?
2. How will users be authenticated to the web server? We'll use JWT throughout.
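Since we'll use JWT throughout, here is a hedged, stdlib-only sketch of HS256 signing and verification with a shared secret. The function names and the single `sub` claim are illustrative assumptions; a real deployment would use a vetted JWT library and include expiry claims.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url without padding, per the JWT spec."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(claims: dict, secret: bytes) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"

def verify_jwt(token: str, secret: bytes) -> dict:
    header, payload, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("bad signature")
    pad = "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload + pad))

token = sign_jwt({"sub": "admin"}, b"shared-secret")
```

The same shared secret would let the influx data store verify tokens minted by the web server without a round trip.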

### Testing

Talk with Mark and Michael about larger testing efforts. This will impact the repository layout.

Because we are pulling together so many TICK stack components we will need strong:

- Deployment experience.
- Ease of use.
- Speed to accomplish a task, e.g. find specific info, change a setting.

### Collection Agent

The collection agent is, at the very least, telegraf and a configuration.

The collection agent post-version 1 will feel similar to [Datadog](https://app.datadoghq.com/account/settings#agent).

Good user experience is the key.

#### Questions

1. Talk to Cameron about distribution/service.
1. Get his opinions on our basic designs (env vars?).
1. Use confd vs something built into telegraf for dynamic configuration?
1. Telegraf authentication with JWT?
1. Should there be prebuilt packages (rpm, deb) vs something else? Are there different config files for each one of these packages?
1. We could just have environment variables `INFLUX_URL` and `INFLUX_SHARED_SECRET`. Anything else?
1. What product order are we supporting?
    - v1 system stats
    - v2 Docker stats
1. Multiple telegrafs to support other services? E.g. a telegraf instance with only the Postgres plugin.

### User Stories

#### Initial Setup v2

1. User clicks on an icon that represents their system (e.g. Redhat).
2. User fills out a form that includes the information needed to configure telegraf:
    - influx url
    - influx authentication
    - does telegraf have shared secret jwt?
    - let's talk to nathaniel about this... in regards to how it worked with kapacitor.
3. User gets a download command (sh?). This command has enough to start a telegraf for docker container monitoring (v1.1) and send the data to influx.

- Question: how do we remove machines? (e.g. I don't want to see my testing mac laptop anymore)
    - Could use retention policies (fast)
        - testing rp
        - production rp
    - Could use namespaced databases
    - We should talk to the people working on the new series index to help us handle paging off old/inactive instances gracefully.

1. `SHOW SERIES WHERE time > 1w` gets all host names for the last week; also `SHOW TAG VALUES WITH KEY = "host" WHERE time > 1w`
    - Performance??
2. `DROP SERIES WHERE "host" = 'machine1'`
3. We could have a machine endpoint allowing GET/DELETE.
4. We want to filter machines by the times they were active and the times they first showed up.

#### Update telegraf configuration on host

confd for telegraf configuration?
Survey the prometheus service discovery methodology and compare it to our telegraf design (stand-alone service or built-in to telegraf).