From d9a3a957a925a8c5c653f29831655b8005c8514e Mon Sep 17 00:00:00 2001
From: Chris Goller
Date: Fri, 9 Sep 2016 13:53:42 -0500
Subject: [PATCH] Update design to reflect discussions.

---
 docs/design.md | 365 +++++++++++++++++--------------------------------
 1 file changed, 124 insertions(+), 241 deletions(-)

diff --git a/docs/design.md b/docs/design.md
index 956387d66..ebfebac74 100644
--- a/docs/design.md
+++ b/docs/design.md
@@ -15,37 +15,29 @@
 ### Initial Goals
 1. Produce pre-canned graphs for devops telegraf data for docker containers or system stats.
 2. Up and running in 2 minutes
-3. User administration for OSS and Plutonium
+3. User administration for Influx Enterprise.
 4. Leverage our existing enterprise front-end code.
 5. Leverage lessons-learned for enterprise back-end code.
-6. Minimum viable product by Oct 10th.
-7. Three to four weeks of testing and polishing before release.

 ### Versions
 Each version will contain more and more features around monitoring various devops components.

-#### Cycles
-Two month cycles (typically, one month feature/one month polish)
-
 #### Features
-1. Nov
+1. v1
    - Data explorer for both OSS and Enterprise
    - Dashboards for telegraf system metrics
    - User and Role administration
    - Proxy queries over OSS and Enterprise
    - Authenticate against OSS/Enterprise
-2. Jan
+2. v2
    - Telegraf agent service
    - Additional Dashboards for telegraf agent
-3. Mar
-   - The next stuff

-### Supported Versions of Tick Stack
-... what versions are we supporting and not supporting? Nothing pre-1.0?
+### Supported Versions of TICK Stack
+We will support only version 1.0 of the TICK stack.

 ### Closed source vs Open Source

@@ -82,60 +74,48 @@ We'll use GDM as the vendoring solution to maintain consistency with other piece
 #### REST
 We'll use a swagger interface definition to specify the API and JSON validation. The goal is to emphasize designing to an interface, facilitating parallel development.

-#### Query Proxy
-
-The query proxy is a special endpoint to query InfluxDB/InfluxEnterprise.
+#### Queries

 Features would include:

 1. Load balancing against all data nodes in cluster.
 1. Formatting the output results to be simple to use in frontend.
 1. Decimating the results to minimize network traffic.
-1. Use prepared queries to move query window and specify dimensions.
+1. Use parameters to move query time range.
 1. Allow different types of response protocols (http GET, websocket, etc.).
-1. Efficiently handle many queries at once (for a dashboard).
-1. Support multiple InfluxQL queries per request.
-1. Only support `SELECT` queries. (no explicit validation? happens in plutonium)

-Chronograf will take a different approach than InfluxEnterprise 1.0 and Grafana 2 & 3, which use a GET request with parameters. It provides two endpoints for ephemeral and persistent queries. Both endpoints accept a POST request with a JSON object containing similar parameters.
+Chronograf provides two endpoints:
+
+- **`/proxy`:** used to send queries directly to the Influx backend. These queries are most useful for the data explorer or other ad hoc query functionality.
+- **`/queries`:** a persistent resource that can repeatedly return results. Queries are specified using a JSON representation; the server constructs the raw InfluxQL query from that representation.

-##### Ephemeral Queries
-
-Ephemeral queries are transient and unlikely to be requested multiple times.
-They should be most useful for the data explorer or other ad hoc query functionality.
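
The two endpoints above, plus the per-query results resource described later, imply a small routing surface. The following is a minimal sketch, assuming Go's standard `net/http` mux with the Go 1.22+ pattern syntax; the handler names, port, and wiring are placeholders rather than part of this design.

```go
package main

import (
	"log"
	"net/http"
)

// Placeholder handlers for the endpoints described in this document; the
// bodies are intentionally empty because behavior is specified below.
func proxyHandler(w http.ResponseWriter, r *http.Request)   {} // send a raw query to the Influx backend
func queriesHandler(w http.ResponseWriter, r *http.Request) {} // create a persistent query resource
func resultsHandler(w http.ResponseWriter, r *http.Request) {} // return results for a query resource

func main() {
	mux := http.NewServeMux()

	// Paths mirror the API examples in this document; {id} and {qid} use the
	// Go 1.22+ ServeMux wildcard syntax.
	mux.HandleFunc("POST /enterprise/v1/sources/{id}/proxy", proxyHandler)
	mux.HandleFunc("POST /enterprise/v1/sources/{id}/queries", queriesHandler)
	mux.HandleFunc("GET /enterprise/v1/sources/{id}/queries/{qid}/results", resultsHandler)

	log.Fatal(http.ListenAndServe(":8888", mux))
}
```
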
+##### `/proxy` Queries

-Uses a POST request to the `/queries` endpoint and returns results in the response.
-No resource is created.
-Caching may or may not be available, but should not be relied upon.
+Queries to the `/proxy` endpoint do not create new REST resources. Instead, the endpoint returns the results of the query.
+
+This endpoint uses POST with a JSON object to specify the query and its parameters. The response is either the results of the query or the errors from the backend InfluxDB.
+
+Errors in the 4xx range come from the InfluxDB data source.

 ```sequence
-App->Proxy: POST query
-Note right of Proxy: Query Validation
-Note right of Proxy: Load balance query
-Proxy->Influx/Relay/Cluster: SELECT
-Influx/Relay/Cluster-->Proxy: Time Series
-Note right of Proxy: Format and Decimate
-Proxy-->App: Formatted results
+App->/proxy: POST query
+Note right of /proxy: Query Validation
+Note right of /proxy: Load balance query
+/proxy->Influx/Relay/Cluster: SELECT
+Influx/Relay/Cluster-->/proxy: Time Series
+Note right of /proxy: Format
+/proxy-->App: Formatted results
 ```

-Example:
+Request:

 ```http
-POST /enterprise/v1/sources/{id}/query HTTP/1.1
+POST /enterprise/v1/sources/{id}/proxy HTTP/1.1
 Accept: application/json
 Content-Type: application/json

 {
-    "query": [
-        {
-            "name": "query1",
-            "query": "SELECT * from telegraf where time > $value"
-        }
-    ],
-    "format": "dygraph",
-    "max_points": 1000,
-    "type": "http"
+    "query": "SELECT * from telegraf where time > $value",
+    "format": "dygraph"
 }
 ```

@@ -143,166 +123,131 @@ Response:

 ```http
 HTTP/1.1 200 OK
+Content-Type: application/json
+
 {
-    "type": "http",
-    "format": "dygraph",
-    "query": [
-        {
-            "name": "explorer1",
-            "results": "..."
-        }
-    ]
+    "results": "..."
 }
 ```

-##### Persistent Queries
-
-Persistent queries are create a resource that provides results for frequent queries.
-They should be most useful for dashboards where many queries are needed with similar dimensions.
+Error Response:
+
+```http
+HTTP/1.1 400 Bad Request
+Content-Type: application/json
+
+{
+    "code": 400,
+    "message": "error parsing query: found..."
+}
+```
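
As a rough illustration of the `/proxy` flow above, the sketch below decodes the request body, forwards the raw query to a single InfluxDB node's `/query` endpoint, and relays either the results or an error object shaped like the example error response. The `influxURL` value is a placeholder for whichever data node the load balancer picks; the 502 for an unreachable node is illustrative, and the `db` parameter, authentication, decimation, and dygraph formatting are all omitted here.

```go
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/url"
)

// proxyRequest mirrors the JSON body in the /proxy request example above.
type proxyRequest struct {
	Query  string `json:"query"`
	Format string `json:"format"`
}

// proxyError mirrors the error object in the example error response.
type proxyError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// proxyHandler forwards a raw query to one InfluxDB /query endpoint and
// relays the response. Load balancing and result formatting would wrap this.
func proxyHandler(influxURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")

		var req proxyRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			w.WriteHeader(http.StatusBadRequest)
			json.NewEncoder(w).Encode(proxyError{Code: 400, Message: err.Error()})
			return
		}

		// InfluxDB's HTTP API accepts SELECT queries as GET /query?q=...
		resp, err := http.Get(influxURL + "/query?" + url.Values{"q": {req.Query}}.Encode())
		if err != nil {
			w.WriteHeader(http.StatusBadGateway)
			json.NewEncoder(w).Encode(proxyError{Code: 502, Message: err.Error()})
			return
		}
		defer resp.Body.Close()

		// Relay InfluxDB's status code and body so that 4xx errors come
		// straight from the data source, as described above.
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	}
}

func main() {
	// Placeholder wiring; the real route is the sources/{id}/proxy path above.
	http.Handle("/proxy", proxyHandler("http://localhost:8086"))
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```
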
+##### `/queries` Queries
+
+A POST to the `/queries` endpoint creates a resource, which will provide results for frequent queries.
+Query resources should be most useful for dashboards.
+
+The `resultsLink` key in the POST response is the URI used to `GET` the results of the query. The parameters `lower` and `upper` are passed to the `GET` request to specify the time range of the query.
-Uses a POST request to the `/series` endpoint which returns the location of the resource in the response.
-Using a GET request will return results for the original queries.
-Parameters for prepared statements can be passed in through the GET request.

 ```sequence
-App->Proxy: POST query
-Note right of Proxy: Query Validation
-Proxy-->App: Location of query resource
-App->Proxy: GET Location
-Note right of Proxy: Load balance query
-Proxy->Influx/Relay/Cluster: SELECT
-Influx/Relay/Cluster-->Proxy: Time Series
-Note right of Proxy: Format and Decimate
-Proxy-->App:
-Note left of App: Format to dygraph
+App->/queries: POST query
+Note right of /queries: Query Validation
+/queries-->App: Location of query resource: /queries/{id}
+App->/queries: GET /queries/{id}/results?lower=now()-1h&upper=now()
+Note right of /queries: Load balance query
+/queries->Influx/Relay/Cluster: SELECT
+Influx/Relay/Cluster-->/queries: Time Series
+Note right of /queries: Format and Decimate
+/queries-->App:
 ```

-Example POST Request:
+Example Request:

 ```http
-POST /enterprise/v1/sources/{id}/series HTTP/1.1
+POST /enterprise/v1/sources/{id}/queries HTTP/1.1
 Accept: application/json
 Content-Type: application/json

 {
-    "query": [
-        {
-            "name": "query1",
-            "query": "SELECT * from telegraf where time > $value"
-        }
-    ],
-    "format": "dygraph",
-    "max_points": 1000,
-    "type": "http"
-    "ttl": "6h",
-    "every": "15s", // possible?
+    "database": "telegraf",
+    "measurement": "cpu",
+    "retentionPolicy": "autogen",
+    "fields": [
+        {"field": "usage_system", "funcs": ["mean"]},
+        {"field": "usage_user", "funcs": ["mean"]}
+    ],
+    "tags": {
+        "cpu": ["cpu0", "cpu1", "cpu2"],
+        "host": ["host1"]
+    },
+    "groupBy": {
+        "time": "1h",
+        "tags": [
+            "cpu0",
+            "cpu1"
+        ]
+    },
+    "format": "raw"
 }
 ```

-Response:
-
-```http
-HTTP/1.1 202 OK
-{
-    "link": {
-        "rel": "self",
-        "href": "/enterprise/v1/sources/{id}/series/{qid}",
-        "type": "http"
-    }
-}
-```
-
-Example GET Request:
-
-```http
-GET /enterprise/v1/sources/{id}/series/{qid} HTTP/1.1
-Accept: application/json
-Content-Type: application/json
-
-{
-    "query": [
-        {
-            "name": "query1",
-            "query": "SELECT * from telegraf where time > $value"
-        }
-    ],
-    "format": "dygraph",
-    "max_points": 1000,
-    "type": "http"
-    "ttl": "6h",
-    "every": "15s", // possible?
-}
-```
-
-Response:
-
-```http
-HTTP/1.1 200 OK
-{
-    "type": "http",
-    "format": "dygraph",
-    "query": [
-        {
-            "name": "explorer1",
-            "results": "..."
-        }
-    ]
-}
-```
+Response:
+
+```http
+HTTP/1.1 201 Created
+Content-Type: application/json
+Location: /enterprise/v1/sources/{id}/queries/{qid}
+
+{
+    "database": "telegraf",
+    "measurement": "cpu",
+    "retentionPolicy": "autogen",
+    "fields": [
+        {"field": "usage_system", "funcs": ["mean"]},
+        {"field": "usage_user", "funcs": ["mean"]}
+    ],
+    "tags": {
+        "cpu": ["cpu0", "cpu1", "cpu2"],
+        "host": ["host1"]
+    },
+    "groupBy": {
+        "time": "1h",
+        "tags": [
+            "cpu0",
+            "cpu1"
+        ]
+    },
+    "format": "raw",
+    "resultsLink": "/enterprise/v1/sources/{id}/queries/{qid}/results"
+}
+```
+
+GET Request:
+
+```http
+GET /enterprise/v1/sources/{id}/queries/{qid}/results?lower=now()-1h&upper=now() HTTP/1.1
+```
+
+Response:
+
+```http
+HTTP/1.1 200 OK
+Content-Type: application/json
+
+{
+    "results": [
+        "..."
+    ]
+}
+```
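
Since the design calls for the server to construct raw InfluxQL from the JSON representation, here is a minimal sketch of that translation under a few assumptions: the struct fields simply mirror the example payload, `lower` and `upper` arrive as already-valid InfluxQL time expressions from the results `GET`, and identifier quoting, escaping, and validation are reduced to naive string formatting.

```go
package main

import (
	"fmt"
	"strings"
)

// QueryRequest mirrors the JSON representation in the examples above.
type QueryRequest struct {
	Database        string              `json:"database"`
	Measurement     string              `json:"measurement"`
	RetentionPolicy string              `json:"retentionPolicy"`
	Fields          []Field             `json:"fields"`
	Tags            map[string][]string `json:"tags"`
	GroupBy         GroupBy             `json:"groupBy"`
	Format          string              `json:"format"`
}

type Field struct {
	Field string   `json:"field"`
	Funcs []string `json:"funcs"`
}

type GroupBy struct {
	Time string   `json:"time"`
	Tags []string `json:"tags"`
}

// InfluxQL builds the raw query string from the representation plus the
// lower/upper bounds supplied on the GET request. Escaping is simplified.
func InfluxQL(q QueryRequest, lower, upper string) string {
	// SELECT clause: wrap each field in its aggregate functions.
	var fields []string
	for _, f := range q.Fields {
		expr := fmt.Sprintf(`"%s"`, f.Field)
		for _, fn := range f.Funcs {
			expr = fmt.Sprintf("%s(%s)", fn, expr)
		}
		fields = append(fields, expr)
	}

	// WHERE clause: time range, then tag filters OR'd per key and AND'd across keys.
	conds := []string{fmt.Sprintf("time > %s AND time < %s", lower, upper)}
	for key, values := range q.Tags {
		var vals []string
		for _, v := range values {
			vals = append(vals, fmt.Sprintf(`"%s" = '%s'`, key, v))
		}
		conds = append(conds, "("+strings.Join(vals, " OR ")+")")
	}

	// GROUP BY clause: the time interval plus any tag groupings.
	groups := []string{fmt.Sprintf("time(%s)", q.GroupBy.Time)}
	for _, t := range q.GroupBy.Tags {
		groups = append(groups, fmt.Sprintf(`"%s"`, t))
	}

	return fmt.Sprintf(`SELECT %s FROM "%s"."%s"."%s" WHERE %s GROUP BY %s`,
		strings.Join(fields, ", "),
		q.Database, q.RetentionPolicy, q.Measurement,
		strings.Join(conds, " AND "),
		strings.Join(groups, ", "))
}

func main() {
	// The query resource from the example POST, rendered with lower/upper
	// taken from the example GET. Tag predicates may print in any order.
	q := QueryRequest{
		Database:        "telegraf",
		Measurement:     "cpu",
		RetentionPolicy: "autogen",
		Fields: []Field{
			{Field: "usage_system", Funcs: []string{"mean"}},
			{Field: "usage_user", Funcs: []string{"mean"}},
		},
		Tags: map[string][]string{
			"cpu":  {"cpu0", "cpu1", "cpu2"},
			"host": {"host1"},
		},
		GroupBy: GroupBy{Time: "1h", Tags: []string{"cpu0", "cpu1"}},
	}
	fmt.Println(InfluxQL(q, "now() - 1h", "now()"))
}
```

For the example payload this prints a query along the lines of `SELECT mean("usage_system"), mean("usage_user") FROM "telegraf"."autogen"."cpu" WHERE time > now() - 1h AND time < now() AND ... GROUP BY time(1h), "cpu0", "cpu1"`.
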
- -##### Questions -_For back-end:_ -1. Define how load balancer behaves when influx breaks? - - - retry handling and when to return an error to app - - rate limit to avoid stampede if node is down - -_For front-end:_ -1. Is it desirable to use InfluxQL prepared statements or construct query from components (server-side or in-app)? -1. Does separating ephemeral and persistent queries make sense? (Perhaps it sucks to have to do a parameterized GET) -1. SHOW/DROP statements on separate endpoints. (`/users`, `/roles`, `/measurements`, `/tags`, `/fields`) - -_For core:_ -1. Which influx client do we use? - - - Current clients are not flexible - - New client in design phase - - non-SELECT queries need either InfluxDB or InfluxEnterprise client - -1. How should we handle cacheing? -1. Use websockets to support a subscription model? -1. Best way to discover whether a node is up? +Discover active data nodes using Plutonium meta client. #### Backend-server store We will build a interface for storing API resources. @@ -326,21 +271,17 @@ Future versions will support more HA data stores. 1. User - Version 1.1 will be a one-to-one mapping to influx. - - Do we want to extend this user object at some point? What other properties or relations are important? -1. Layouts +1. Dashboards - - We need to have another discussion about the goals. - - For now the design is an opaque JSON blob until we know how to structure this. - precanned dashboards for telegraf + - Includes location of query resources. 1. Queries - - We may store the queries object. - - Downside: when to expire? TTL? - - Upside: single endpoint for specific query with bindable parameters. + - Used to construct influxql. -1. Sessions for particular user? - - What data do we persist about a user's session, if any? +1. Sessions + - We could simply use the JWT token as the session information 1. Server Configuration @@ -351,11 +292,7 @@ Future versions will support more HA data stores. We want the backend data store (influx oss or influx meta) handle the authentication so that the web server has less responsibility. -##### Questions -1. Do we want shared secret authentication between the server and the influx data store? - -2. How will users be authenticated to the web server? - +We'll use JWT throughout. ### Testing Talk with Mark and Michael and talk about larger efforts. This will impact the repository layout. @@ -380,57 +317,3 @@ Because we are pulling together so many TICK stack components we will need stron - Deployment experience - Ease of use. - Speed to accomplish task, e.g. find specific info, change setting. - - -### Collection Agent - -The collection agent is at the very least telegraf and a configuration. - -The collection agent post-version 1 will feel similar to [Datadog](https://app.datadoghq.com/account/settings#agent). - -Good user experience is the key - -#### Quesions -1. Talk to Cameron about distribution/service - -1. Get his opinions on our basic designs (env vars?) - -1. Use confd vs something built into telegraf for dynamic configuration? - -1. Telegraf authentication with jwt? - -1. Should there be prebuilt package (rpm, deb) vs something else. Are there different config files for each one of these packages? -1. We could just have environment variables `INFLUX_URL` and `INFLUX_SHARED_SECRET`. Anything else? - -1. what product order are we supporting? - - v1 system stats - - v2 Docker stats - -1. Multiple telegrafs to support other services? E.g. 
 ### Testing
 Talk with Mark and Michael about larger testing efforts. This will impact the repository layout.

@@ -380,57 +317,3 @@ Because we are pulling together so many TICK stack components we will need stron
 - Deployment experience
 - Ease of use.
 - Speed to accomplish task, e.g. find specific info, change setting.
-
-
-### Collection Agent
-
-The collection agent is at the very least telegraf and a configuration.
-
-The collection agent post-version 1 will feel similar to [Datadog](https://app.datadoghq.com/account/settings#agent).
-
-Good user experience is the key
-
-#### Quesions
-1. Talk to Cameron about distribution/service
-
-1. Get his opinions on our basic designs (env vars?)
-
-1. Use confd vs something built into telegraf for dynamic configuration?
-
-1. Telegraf authentication with jwt?
-
-1. Should there be prebuilt package (rpm, deb) vs something else. Are there different config files for each one of these packages?
-1. We could just have environment variables `INFLUX_URL` and `INFLUX_SHARED_SECRET`. Anything else?
-
-1. what product order are we supporting?
-   - v1 system stats
-   - v2 Docker stats
-
-1. Multiple telegrafs to support other services? E.g. a telegraf instance with only Postgres plugin
-
-### User Stories
-#### Initial Setup v2
-1. User clicks on an icon that represents their system (e.g. Redhat).
-2. User fills out a form that includes the information needed to configure telegraf.
-   - influx url
-   - influx authentication
-   - does telegraf have shared secret jwt?
-     - let's talk to nathaniel about this... in regards to how it worked with kapacitor.
-3. User gets a download command (sh?) This command has enough to start a telegraf for docker container monitoring (v1.1) and send the data to influx.
-
-   - Question: how do we remove machines? (e.g. I don't want to see my testing mac laptop anymore)
-     - Could use retention policies (fast)
-       - testing rp
-       - production rp
-     - Could use namespaced databases
-     - We should talk to the people that working on the new series index to help us handle paging-off of old/inactive instances problem gracefully. DROP SERIES WHERE "host" = 'machine1'
-       1. SHOW SERIES WHERE time > 1w # gets all host names for the last week
-          SHOW TAG VALUE WITH KEY = 'host' WHERE time > 1w
-       2. Performance??
-       2. DROP SERIES WHERE "host" = 'machine1'
-       3. We could have a machine endpoint allowing GET/DELETE
-       4. we want to filter machines by the times whey were active and the times they first showed up
-
-#### Update telegraf configuration on host
-confd for telegraf configuration?
-survey prometheus service discovery methodology and compare to our telegraf design (stand-alone service or built-in to telegraf)