The InfluxQL Transpiler exists to rewrite an InfluxQL query into its equivalent query in Flux. The transpiler works off of a few simple rules that match the equivalent method of constructing queries in InfluxDB.
**NOTE:** The transpiler code is not finished and may not reflect what is in this document. When the two conflict, this document is considered correct. If you wish to change how the transpiler works, modify this file first.
### <a name="identify-cursors"></a> Identify the cursors
The InfluxQL query engine works by filling in variables and evaluating the query for the values in each row. The first step of transforming a query is identifying the cursors so we can figure out how to fill them correctly. A cursor is any point in the query that has a **variable or a function call**. Math functions do not count as function calls and are handled in the eval phase.
There are four types of queries: meta, raw, aggregate, and selector. A meta query is one that retrieves descriptive information about a measurement or series, rather than about the data within the measurement or series. A raw query is one where all of the cursors reference a variable. An aggregate is one where all of the cursors reference a function call. A selector is one where there is exactly one function call that is a selector (such as `max()` or `min()`) and the remaining cursors, if there are any, are variables. If there is only one function call with no variables and that function is a selector, then the query type is a selector. For example, `SELECT usage_user FROM cpu` is a raw query, `SELECT mean(usage_user) FROM cpu` is an aggregate, and `SELECT max(usage_user), usage_system FROM cpu` is a selector.
We group the cursors based on the query type. For raw queries and selectors, all of the cursors are put into the same group. For aggregates, each function call is put into a separate group so they can be joined at the end.
Each of the variables in the group is identified. This involves inspecting the condition to collect the common variables in the expression while also retrieving the variables for each expression within the group. For a function call, the variable used as the function argument is retrieved rather than the function itself.
If a wildcard is identified in the fields, then the field filter is cleared and only the measurement filter is used. If a regex wildcard is identified, it is added as one of the field filters.
The `<measurement>` is equal to the measurement name from the `FROM` clause. The `<field_expr>` section is generated differently depending on the fields that were found. If more than one field was selected, then each of the field filters is combined by using `or` and the expression itself is surrounded by parentheses. For a non-wildcard field, the following expression is used:
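A sketch of the generated expressions, assuming hypothetical fields `usage_user` and `usage_system` on a measurement `cpu`:

```
// single non-wildcard field
r._field == "usage_user"

// two fields combined with `or`, alongside the measurement filter
filter(fn: (r) => r._measurement == "cpu" and (r._field == "usage_user" or r._field == "usage_system"))
```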
If the `GROUP BY time(...)` clause is not present, `window()` is skipped. Grouping defaults to [`_measurement`, `_start`], regardless of whether a `GROUP BY` clause is present. If there are keys in the `GROUP BY` clause, they are concatenated with the default list. If a wildcard is used for grouping, then this step is skipped.
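A sketch of the resulting grouping and windowing, assuming a hypothetical `GROUP BY time(1m), host`:

```
|> group(columns: ["_measurement", "_start", "host"])
|> window(every: 1m)
```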
If a function was evaluated and the query type is an aggregate type, or if we are grouping by time, then all of the functions need to have their time normalized. If the function is an aggregate, the window start time is copied into the `_time` column, as sketched below.
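A minimal sketch of that normalization, assuming `mean()` as the aggregate (the exact calls emitted by the transpiler may differ):

```
|> mean()
|> duplicate(column: "_start", as: "_time")
```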
If there are multiple groups, as is the case when there are multiple function calls, then we perform an `outer_join` using the time and any remaining group keys.
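A minimal sketch using Flux's `join()` (the prose above specifies an outer join; the table names and the remaining group keys here are illustrative):

```
join(tables: {t0: t0, t1: t1}, on: ["_time", "_measurement"])
```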
After joining the results if a join was required, then a `map` call is used to both evaluate the math functions and name the columns. The time is also passed through the `map()` function so it is available for the encoder.
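A sketch of that `map()` call, assuming a hypothetical `SELECT usage_user + usage_system AS usage`:

```
|> map(fn: (r) => ({_time: r._time, usage: r.usage_user + r.usage_system}))
```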
TODO(jsternberg): The `_time` variable is only needed for selectors and raw queries. We can actually drop this variable for aggregate queries and use the `_start` time from the group key. Consider whether or not we should do this and if it is worth it.
In 2.0, not all "buckets" will be conceptually equivalent to a 1.X database. If a bucket is intended to represent a collection of 1.X data, it will be specifically identified as such. `flux` provides a special function `databases()` that will retrieve information about all registered 1.X compatible buckets.
The cursor is trivially implemented as a no-argument call to the `databases` function:
```
databases()
```
### <a name="show-databases-name"></a>Rename and Keep the databaseName Column
The result of `databases()` has several columns. For this application we only need `databaseName`, but in the 1.X output the label is `name`:
```
databases()
|> rename(columns: {databaseName: "name"})
|> keep(columns: ["name"])
```
## <a name="show-retention-policies"></a> Show Retention Policies
Similar to `SHOW DATABASES`, `SHOW RETENTION POLICIES` also returns information only for 1.X compatible buckets. It uses different columns from the same `databases()` function.
The cursor is trivially implemented as a no-argument call to the `databases` function:
```
databases()
```
### <a name="show-retention-policies-database-filter"></a> Filter by the database name
The databases function will return rows of database/retention policy pairs for all databases. The result of `SHOW RETENTION POLICIES` is defined for a single database, so we filter:
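A sketch of that filter, assuming the statement targets a hypothetical database `telegraf`:

```
databases()
    |> filter(fn: (r) => r.databaseName == "telegraf")
```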
### <a name="show-retention-policies-static-cols"></a> Set Static Columns
Two static columns are set. In 1.X, the columns for `shardGroupDuration` and `replicaN` could vary depending on the database/retention policy definition. In 2.0, there are no shard groups to configure, and the replication level is always 2.
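A sketch of setting those columns with `set()` (encoding the values as strings is an assumption):

```
|> set(key: "shardGroupDuration", value: "0")
|> set(key: "replicaN", value: "2")
```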
In Flux, retrieving the tag values is different from InfluxQL. In InfluxDB 1.x, tags were included in the index, and restricting them by time did not exist or make any sense. In the 2.0 platform, tag keys and values are scoped by time and it is more expensive to retrieve all of the tag values for all time. For this reason, there are some small changes to how the command works and therefore how it is transpiled.
If no time range is specified, as would be expected of most transpiled queries, we default to the last hour. If a time range is present in the `WHERE` clause, that time is used instead.
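A sketch of the default range:

```
|> range(start: -1h)
```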
### <a name="show-tag-values-measurement-filter"></a> Filter by the measurement
If a `FROM <measurement>` clause is present in the statement, then we filter by the measurement name.
This step is skipped if the `FROM` clause is not present, in which case the tag values for every measurement are returned.
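A sketch of the measurement filter, assuming a hypothetical `FROM cpu`:

```
|> filter(fn: (r) => r._measurement == "cpu")
```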
### <a name="show-tag-values-evaluate-condition"></a> Evaluate the condition
The condition within the `WHERE` clause is evaluated. It generates a filter in the same way that a [select statement](#evaluate-condition) would, but with the added assumption that all of the values refer to tags. There is no attempt made at determining whether a value is a field or a tag.
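For example, a hypothetical `WHERE host = 'server01'` would generate a filter like:

```
|> filter(fn: (r) => r.host == "server01")
```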
### <a name="show-tag-values-key-values"></a> Retrieve the key values
TODO(jsternberg): The schema function has not been solidified, but the basics are that we take the list of group keys and then run a filter using the condition.
At this point, we have a table with a group key organized by the keys and values of the selected columns.
### <a name="show-tag-values-distinct-key-values"></a> Find the distinct key values
We group by the measurement and the key and then use `distinct` on the values. After we find the distinct values, we group these values back by their measurements again so all of the tag values for a measurement are grouped together. We then rename the columns to the expected names.
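A minimal sketch of those steps (the `_key`/`_value` column names and the final labels are assumptions):

```
|> group(columns: ["_measurement", "_key"])
|> distinct(column: "_value")
|> group(columns: ["_measurement"])
|> rename(columns: {_key: "key", _value: "value"})
```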
Each statement will be terminated by a `yield()` call. This call will embed the statement id as the result name. The result name is always of type string, but the transpiler will encode an integer in this field so it can be parsed by the encoder. For example:
```
result |> yield(name: "0")
```
The edge nodes from the query specification will be used to encode the results back to the user in the JSON format used in 1.x. The 1.x JSON format is sketched below (the series shown is illustrative):
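```
{
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "cpu",
                    "tags": {
                        "host": "server01"
                    },
                    "columns": [
                        "time",
                        "usage_user"
                    ],
                    "values": [
                        ["2015-01-29T21:55:43.702900257Z", 0.64]
                    ]
                }
            ]
        }
    ]
}
```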
The measurement name is retrieved from the `_measurement` column in the results. For the tags, the values in the group key that are of type string are included, with the keys and values mapped to each other. Any values in the group key that are not strings, like the start and stop times, are discarded. If the `_field` key is still present in the group key, it is also discarded. All normal fields are included in the array of values for each row. The `_time` field will be renamed to `time` (or whatever the time alias is set to by the query).
The chunking options that existed in 1.x are not supported by the encoder and should not be used. To minimize the amount of code that breaks, a chunking option will be ignored and the encoder will operate as normal, but it will include a message in the result so the user can be informed that an invalid query option was used. The 1.x format already has a field for sending back informational messages.
**TODO(jsternberg):** Find a way for a column to be both used as a tag and a field. This is not currently possible because the encoder can't tell the difference between the two.