From d9a3a957a925a8c5c653f29831655b8005c8514e Mon Sep 17 00:00:00 2001
From: Chris Goller
Date: Fri, 9 Sep 2016 13:53:42 -0500
Subject: [PATCH] Update design to reflect discussions.

---
 docs/design.md | 365 +++++++++++++++++--------------------------------
 1 file changed, 124 insertions(+), 241 deletions(-)

diff --git a/docs/design.md b/docs/design.md
index 956387d66..ebfebac74 100644
--- a/docs/design.md
+++ b/docs/design.md
@@ -15,37 +15,29 @@
 ### Initial Goals
 1. Produce pre-canned graphs for devops telegraf data for docker containers or system stats.
 2. Up and running in 2 minutes
-3. User administration for OSS and Plutonium
+3. User administration for Influx Enterprise.
 4. Leverage our existing enterprise front-end code.
 5. Leverage lessons-learned for enterprise back-end code.
-6. Minimum viable product by Oct 10th.
-7. Three to four weeks of testing and polishing before release.

 ### Versions
 Each version will contain more and more features around monitoring various devops components.

-#### Cycles
-Two month cycles (typically, one month feature/one month polish)
-
 #### Features
-1. Nov
+1. v1
    - Data explorer for both OSS and Enterprise
    - Dashboards for telegraf system metrics
    - User and Role administration
    - Proxy queries over OSS and Enterprise
    - Authenticate against OSS/Enterprise
-2. Jan
+2. v2
    - Telegraf agent service
    - Additional Dashboards for telegraf agent
-3. Mar
-   - The next stuff

-### Supported Versions of Tick Stack
-... what versions are we supporting and not supporting? Nothing pre-1.0?
+### Supported Versions of TICK Stack
+We will support only version 1.0 of the TICK stack.

 ### Closed source vs Open Source

@@ -82,60 +74,48 @@ We'll use GDM as the vendoring solution to maintain consistency with other piece
 #### REST
 We'll use a swagger interface definition to specify the API and JSON validation. The goal is to emphasize designing to an interface, facilitating parallel development.

-#### Query Proxy
-
-The query proxy is a special endpoint to query InfluxDB/InfluxEnterprise.
+#### Queries

 Features would include:

 1. Load balancing against all data nodes in cluster.
 1. Formatting the output results to be simple to use in frontend.
 1. Decimating the results to minimize network traffic.
-1. Use prepared queries to move query window and specify dimensions.
+1. Use parameters to move query time range.
 1. Allow different types of response protocols (http GET, websocket, etc.).
-1. Efficiently handle many queries at once (for a dashboard).
-1. Support multiple InfluxQL queries per request.
-1. Only support `SELECT` queries. (no explicit validation? happens in plutonium)

-Chronograf will take a different approach than InfluxEnterprise 1.0 and Grafana 2 & 3, which use a GET request with parameters. It provides two endpoints for ephemeral and persistent queries. Both endpoints accept a POST request with a JSON object containing similar parameters.
+Chronograf provides two endpoints:
+
+- **`/proxy`:** used to send queries directly to the Influx backend. These queries are most useful for the data explorer or other ad hoc query functionality.
+- **`/queries`:** a persistent resource that can repeatedly return results. Queries are specified using a JSON representation; the server constructs the raw InfluxQL query from that representation.

-##### Ephemeral Queries
-
-Ephemeral queries are transient and unlikely to be requested multiple times.
-They should be most useful for the data explorer or other ad hoc query functionality.
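
The two endpoints above, plus the per-query results resource described later, imply a small routing surface. The following is a minimal sketch, assuming Go's standard `net/http` mux with the Go 1.22+ pattern syntax; the handler names, port, and wiring are placeholders rather than part of this design.

```go
package main

import (
	"log"
	"net/http"
)

// Placeholder handlers for the endpoints described in this document; the
// bodies are intentionally empty because behavior is specified below.
func proxyHandler(w http.ResponseWriter, r *http.Request)   {} // send a raw query to the Influx backend
func queriesHandler(w http.ResponseWriter, r *http.Request) {} // create a persistent query resource
func resultsHandler(w http.ResponseWriter, r *http.Request) {} // return results for a query resource

func main() {
	mux := http.NewServeMux()

	// Paths mirror the API examples in this document; {id} and {qid} use the
	// Go 1.22+ ServeMux wildcard syntax.
	mux.HandleFunc("POST /enterprise/v1/sources/{id}/proxy", proxyHandler)
	mux.HandleFunc("POST /enterprise/v1/sources/{id}/queries", queriesHandler)
	mux.HandleFunc("GET /enterprise/v1/sources/{id}/queries/{qid}/results", resultsHandler)

	log.Fatal(http.ListenAndServe(":8888", mux))
}
```
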
+##### `/proxy` Queries

-Uses a POST request to the `/queries` endpoint and returns results in the response.
-No resource is created.
-Caching may or may not be available, but should not be relied upon.
+Queries to the `/proxy` endpoint do not create new REST resources. Instead, the endpoint returns the results of the query.
+
+This endpoint uses POST with a JSON object to specify the query and its parameters. The response is either the results of the query or the errors from the backend InfluxDB.
+
+Errors in the 4xx range come from the InfluxDB data source.

 ```sequence
-App->Proxy: POST query
-Note right of Proxy: Query Validation
-Note right of Proxy: Load balance query
-Proxy->Influx/Relay/Cluster: SELECT
-Influx/Relay/Cluster-->Proxy: Time Series
-Note right of Proxy: Format and Decimate
-Proxy-->App: Formatted results
+App->/proxy: POST query
+Note right of /proxy: Query Validation
+Note right of /proxy: Load balance query
+/proxy->Influx/Relay/Cluster: SELECT
+Influx/Relay/Cluster-->/proxy: Time Series
+Note right of /proxy: Format
+/proxy-->App: Formatted results
 ```

-Example:
+Request:

 ```http
-POST /enterprise/v1/sources/{id}/query HTTP/1.1
+POST /enterprise/v1/sources/{id}/proxy HTTP/1.1
 Accept: application/json
 Content-Type: application/json

 {
-    "query": [
-        {
-            "name": "query1",
-            "query": "SELECT * from telegraf where time > $value"
-        }
-    ],
-    "format": "dygraph",
-    "max_points": 1000,
-    "type": "http"
+    "query": "SELECT * from telegraf where time > $value",
+    "format": "dygraph"
 }
 ```

@@ -143,166 +123,131 @@ Response:

 ```http
 HTTP/1.1 200 OK
+Content-Type: application/json
+
 {
-    "type": "http",
-    "format": "dygraph",
-    "query": [
-        {
-            "name": "explorer1",
-            "results": "..."
-        }
-    ]
+    "results": "..."
 }
 ```

-##### Persistent Queries
-
-Persistent queries are create a resource that provides results for frequent queries.
-They should be most useful for dashboards where many queries are needed with similar dimensions.
+Error Response:
+
+```http
+HTTP/1.1 400 Bad Request
+Content-Type: application/json
+
+{
+    "code": 400,
+    "message": "error parsing query: found..."
+}
+```
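
As a rough illustration of the `/proxy` flow above, the sketch below decodes the request body, forwards the raw query to a single InfluxDB node's `/query` endpoint, and relays either the results or an error object shaped like the example error response. The `influxURL` value is a placeholder for whichever data node the load balancer picks; the 502 for an unreachable node is illustrative, and the `db` parameter, authentication, decimation, and dygraph formatting are all omitted here.

```go
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/url"
)

// proxyRequest mirrors the JSON body in the /proxy request example above.
type proxyRequest struct {
	Query  string `json:"query"`
	Format string `json:"format"`
}

// proxyError mirrors the error object in the example error response.
type proxyError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// proxyHandler forwards a raw query to one InfluxDB /query endpoint and
// relays the response. Load balancing and result formatting would wrap this.
func proxyHandler(influxURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")

		var req proxyRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			w.WriteHeader(http.StatusBadRequest)
			json.NewEncoder(w).Encode(proxyError{Code: 400, Message: err.Error()})
			return
		}

		// InfluxDB's HTTP API accepts SELECT queries as GET /query?q=...
		resp, err := http.Get(influxURL + "/query?" + url.Values{"q": {req.Query}}.Encode())
		if err != nil {
			w.WriteHeader(http.StatusBadGateway)
			json.NewEncoder(w).Encode(proxyError{Code: 502, Message: err.Error()})
			return
		}
		defer resp.Body.Close()

		// Relay InfluxDB's status code and body so that 4xx errors come
		// straight from the data source, as described above.
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	}
}

func main() {
	// Placeholder wiring; the real route is the sources/{id}/proxy path above.
	http.Handle("/proxy", proxyHandler("http://localhost:8086"))
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```
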
+##### `/queries` Queries
+
+A POST to the `/queries` endpoint creates a resource, which will provide results for frequent queries.
+Query resources should be most useful for dashboards.
+
+The `resultsLink` key in the POST response is the URI used to `GET` the results of the query. The parameters `lower` and `upper` are passed to the `GET` request to specify the time range of the query.
-Uses a POST request to the `/series` endpoint which returns the location of the resource in the response.
-Using a GET request will return results for the original queries.
-Parameters for prepared statements can be passed in through the GET request.

 ```sequence
-App->Proxy: POST query
-Note right of Proxy: Query Validation
-Proxy-->App: Location of query resource
-App->Proxy: GET Location
-Note right of Proxy: Load balance query
-Proxy->Influx/Relay/Cluster: SELECT
-Influx/Relay/Cluster-->Proxy: Time Series
-Note right of Proxy: Format and Decimate
-Proxy-->App:
-Note left of App: Format to dygraph
+App->/queries: POST query
+Note right of /queries: Query Validation
+/queries-->App: Location of query resource: /queries/{id}
+App->/queries: GET /queries/{id}/results?lower=now()-1h&upper=now()
+Note right of /queries: Load balance query
+/queries->Influx/Relay/Cluster: SELECT
+Influx/Relay/Cluster-->/queries: Time Series
+Note right of /queries: Format and Decimate
+/queries-->App:
 ```

-Example POST Request:
+Example Request:

 ```http
-POST /enterprise/v1/sources/{id}/series HTTP/1.1
+POST /enterprise/v1/sources/{id}/queries HTTP/1.1
 Accept: application/json
 Content-Type: application/json

 {
-    "query": [
-        {
-            "name": "query1",
-            "query": "SELECT * from telegraf where time > $value"
-        }
-    ],
-    "format": "dygraph",
-    "max_points": 1000,
-    "type": "http"
-    "ttl": "6h",
-    "every": "15s", // possible?
+    "database": "telegraf",
+    "measurement": "cpu",
+    "retentionPolicy": "autogen",
+    "fields": [
+        {"field": "usage_system", "funcs": ["mean"]},
+        {"field": "usage_user", "funcs": ["mean"]}
+    ],
+    "tags": {
+        "cpu": ["cpu0", "cpu1", "cpu2"],
+        "host": ["host1"]
+    },
+    "groupBy": {
+        "time": "1h",
+        "tags": [
+            "cpu0",
+            "cpu1"
+        ]
+    },
+    "format": "raw"
 }
 ```

-Response:
-
-```http
-HTTP/1.1 202 OK
-{
-    "link": {
-        "rel": "self",
-        "href": "/enterprise/v1/sources/{id}/series/{qid}",
-        "type": "http"
-    }
-}
-```
-
-Example GET Request:
-
-```http
-GET /enterprise/v1/sources/{id}/series/{qid} HTTP/1.1
-Accept: application/json
-Content-Type: application/json
-
-{
-    "query": [
-        {
-            "name": "query1",
-            "query": "SELECT * from telegraf where time > $value"
-        }
-    ],
-    "format": "dygraph",
-    "max_points": 1000,
-    "type": "http"
-    "ttl": "6h",
-    "every": "15s", // possible?
-}
-```
-
-Response:
-
-```http
-HTTP/1.1 200 OK
-{
-    "type": "http",
-    "format": "dygraph",
-    "query": [
-        {
-            "name": "explorer1",
-            "results": "..."
-        }
-    ]
-}
-```
+Response:
+
+```http
+HTTP/1.1 201 Created
+Content-Type: application/json
+Location: /enterprise/v1/sources/{id}/queries/{qid}
+
+{
+    "database": "telegraf",
+    "measurement": "cpu",
+    "retentionPolicy": "autogen",
+    "fields": [
+        {"field": "usage_system", "funcs": ["mean"]},
+        {"field": "usage_user", "funcs": ["mean"]}
+    ],
+    "tags": {
+        "cpu": ["cpu0", "cpu1", "cpu2"],
+        "host": ["host1"]
+    },
+    "groupBy": {
+        "time": "1h",
+        "tags": [
+            "cpu0",
+            "cpu1"
+        ]
+    },
+    "format": "raw",
+    "resultsLink": "/enterprise/v1/sources/{id}/queries/{qid}/results"
+}
+```
+
+GET Request:
+
+```http
+GET /enterprise/v1/sources/{id}/queries/{qid}/results?lower=now()-1h&upper=now() HTTP/1.1
+```
+
+Response:
+
+```http
+HTTP/1.1 200 OK
+Content-Type: application/json
+
+{
+    "results": [
+        "..."
+    ]
+}
+```
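
Since the design calls for the server to construct raw InfluxQL from the JSON representation, here is a minimal sketch of that translation under a few assumptions: the struct fields simply mirror the example payload, `lower` and `upper` arrive as already-valid InfluxQL time expressions from the results `GET`, and identifier quoting, escaping, and validation are reduced to naive string formatting.

```go
package main

import (
	"fmt"
	"strings"
)

// QueryRequest mirrors the JSON representation in the examples above.
type QueryRequest struct {
	Database        string              `json:"database"`
	Measurement     string              `json:"measurement"`
	RetentionPolicy string              `json:"retentionPolicy"`
	Fields          []Field             `json:"fields"`
	Tags            map[string][]string `json:"tags"`
	GroupBy         GroupBy             `json:"groupBy"`
	Format          string              `json:"format"`
}

type Field struct {
	Field string   `json:"field"`
	Funcs []string `json:"funcs"`
}

type GroupBy struct {
	Time string   `json:"time"`
	Tags []string `json:"tags"`
}

// InfluxQL builds the raw query string from the representation plus the
// lower/upper bounds supplied on the GET request. Escaping is simplified.
func InfluxQL(q QueryRequest, lower, upper string) string {
	// SELECT clause: wrap each field in its aggregate functions.
	var fields []string
	for _, f := range q.Fields {
		expr := fmt.Sprintf(`"%s"`, f.Field)
		for _, fn := range f.Funcs {
			expr = fmt.Sprintf("%s(%s)", fn, expr)
		}
		fields = append(fields, expr)
	}

	// WHERE clause: time range, then tag filters OR'd per key and AND'd across keys.
	conds := []string{fmt.Sprintf("time > %s AND time < %s", lower, upper)}
	for key, values := range q.Tags {
		var vals []string
		for _, v := range values {
			vals = append(vals, fmt.Sprintf(`"%s" = '%s'`, key, v))
		}
		conds = append(conds, "("+strings.Join(vals, " OR ")+")")
	}

	// GROUP BY clause: the time interval plus any tag groupings.
	groups := []string{fmt.Sprintf("time(%s)", q.GroupBy.Time)}
	for _, t := range q.GroupBy.Tags {
		groups = append(groups, fmt.Sprintf(`"%s"`, t))
	}

	return fmt.Sprintf(`SELECT %s FROM "%s"."%s"."%s" WHERE %s GROUP BY %s`,
		strings.Join(fields, ", "),
		q.Database, q.RetentionPolicy, q.Measurement,
		strings.Join(conds, " AND "),
		strings.Join(groups, ", "))
}

func main() {
	// The query resource from the example POST, rendered with lower/upper
	// taken from the example GET. Tag predicates may print in any order.
	q := QueryRequest{
		Database:        "telegraf",
		Measurement:     "cpu",
		RetentionPolicy: "autogen",
		Fields: []Field{
			{Field: "usage_system", Funcs: []string{"mean"}},
			{Field: "usage_user", Funcs: []string{"mean"}},
		},
		Tags: map[string][]string{
			"cpu":  {"cpu0", "cpu1", "cpu2"},
			"host": {"host1"},
		},
		GroupBy: GroupBy{Time: "1h", Tags: []string{"cpu0", "cpu1"}},
	}
	fmt.Println(InfluxQL(q, "now() - 1h", "now()"))
}
```

For the example payload this prints a query along the lines of `SELECT mean("usage_system"), mean("usage_user") FROM "telegraf"."autogen"."cpu" WHERE time > now() - 1h AND time < now() AND ... GROUP BY time(1h), "cpu0", "cpu1"`.
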
- -##### Questions -_For back-end:_ -1. Define how load balancer behaves when influx breaks? - - - retry handling and when to return an error to app - - rate limit to avoid stampede if node is down - -_For front-end:_ -1. Is it desirable to use InfluxQL prepared statements or construct query from components (server-side or in-app)? -1. Does separating ephemeral and persistent queries make sense? (Perhaps it sucks to have to do a parameterized GET) -1. SHOW/DROP statements on separate endpoints. (`/users`, `/roles`, `/measurements`, `/tags`, `/fields`) - -_For core:_ -1. Which influx client do we use? - - - Current clients are not flexible - - New client in design phase - - non-SELECT queries need either InfluxDB or InfluxEnterprise client - -1. How should we handle cacheing? -1. Use websockets to support a subscription model? -1. Best way to discover whether a node is up? +Discover active data nodes using Plutonium meta client. #### Backend-server store We will build a interface for storing API resources. @@ -326,21 +271,17 @@ Future versions will support more HA data stores. 1. User - Version 1.1 will be a one-to-one mapping to influx. - - Do we want to extend this user object at some point? What other properties or relations are important? -1. Layouts +1. Dashboards - - We need to have another discussion about the goals. - - For now the design is an opaque JSON blob until we know how to structure this. - precanned dashboards for telegraf + - Includes location of query resources. 1. Queries - - We may store the queries object. - - Downside: when to expire? TTL? - - Upside: single endpoint for specific query with bindable parameters. + - Used to construct influxql. -1. Sessions for particular user? - - What data do we persist about a user's session, if any? +1. Sessions + - We could simply use the JWT token as the session information 1. Server Configuration @@ -351,11 +292,7 @@ Future versions will support more HA data stores. We want the backend data store (influx oss or influx meta) handle the authentication so that the web server has less responsibility. -##### Questions -1. Do we want shared secret authentication between the server and the influx data store? - -2. How will users be authenticated to the web server? - +We'll use JWT throughout. ### Testing Talk with Mark and Michael and talk about larger efforts. This will impact the repository layout. @@ -380,57 +317,3 @@ Because we are pulling together so many TICK stack components we will need stron - Deployment experience - Ease of use. - Speed to accomplish task, e.g. find specific info, change setting. - - -### Collection Agent - -The collection agent is at the very least telegraf and a configuration. - -The collection agent post-version 1 will feel similar to [Datadog](https://app.datadoghq.com/account/settings#agent). - -Good user experience is the key - -#### Quesions -1. Talk to Cameron about distribution/service - -1. Get his opinions on our basic designs (env vars?) - -1. Use confd vs something built into telegraf for dynamic configuration? - -1. Telegraf authentication with jwt? - -1. Should there be prebuilt package (rpm, deb) vs something else. Are there different config files for each one of these packages? -1. We could just have environment variables `INFLUX_URL` and `INFLUX_SHARED_SECRET`. Anything else? - -1. what product order are we supporting? - - v1 system stats - - v2 Docker stats - -1. Multiple telegrafs to support other services? E.g. 
 ### Testing
 Talk with Mark and Michael about larger testing efforts. This will impact the repository layout.

@@ -380,57 +317,3 @@ Because we are pulling together so many TICK stack components we will need stron
 - Deployment experience
 - Ease of use.
 - Speed to accomplish task, e.g. find specific info, change setting.
-
-
-### Collection Agent
-
-The collection agent is at the very least telegraf and a configuration.
-
-The collection agent post-version 1 will feel similar to [Datadog](https://app.datadoghq.com/account/settings#agent).
-
-Good user experience is the key
-
-#### Quesions
-1. Talk to Cameron about distribution/service
-
-1. Get his opinions on our basic designs (env vars?)
-
-1. Use confd vs something built into telegraf for dynamic configuration?
-
-1. Telegraf authentication with jwt?
-
-1. Should there be prebuilt package (rpm, deb) vs something else. Are there different config files for each one of these packages?
-1. We could just have environment variables `INFLUX_URL` and `INFLUX_SHARED_SECRET`. Anything else?
-
-1. what product order are we supporting?
-   - v1 system stats
-   - v2 Docker stats
-
-1. Multiple telegrafs to support other services? E.g. a telegraf instance with only Postgres plugin
-
-### User Stories
-#### Initial Setup v2
-1. User clicks on an icon that represents their system (e.g. Redhat).
-2. User fills out a form that includes the information needed to configure telegraf.
-   - influx url
-   - influx authentication
-   - does telegraf have shared secret jwt?
-     - let's talk to nathaniel about this... in regards to how it worked with kapacitor.
-3. User gets a download command (sh?) This command has enough to start a telegraf for docker container monitoring (v1.1) and send the data to influx.
-
-   - Question: how do we remove machines? (e.g. I don't want to see my testing mac laptop anymore)
-     - Could use retention policies (fast)
-       - testing rp
-       - production rp
-     - Could use namespaced databases
-     - We should talk to the people that working on the new series index to help us handle paging-off of old/inactive instances problem gracefully. DROP SERIES WHERE "host" = 'machine1'
-       1. SHOW SERIES WHERE time > 1w # gets all host names for the last week
-          SHOW TAG VALUE WITH KEY = 'host' WHERE time > 1w
-       2. Performance??
-       2. DROP SERIES WHERE "host" = 'machine1'
-       3. We could have a machine endpoint allowing GET/DELETE
-       4. we want to filter machines by the times whey were active and the times they first showed up
-
-#### Update telegraf configuration on host
-confd for telegraf configuration?
-survey prometheus service discovery methodology and compare to our telegraf design (stand-alone service or built-in to telegraf)