docs-v2/content/shared/influxql-v3-reference/internals.md

155 lines
6.7 KiB
Markdown

Learn about the implementation of InfluxQL to understand how
results are processed and how to create efficient queries:
- [Query life cycle](#query-life-cycle)
- [Understanding iterators](#understanding-iterators)
- [Cursors](#cursors)
- [Auxiliary fields](#auxiliary-fields)
- [Built-in iterators](#built-in-iterators)
- [Call iterators](#call-iterators)
## Query life cycle
1. InfluxQL query string is tokenized and then parsed into an abstract syntax
tree (AST). This is the code representation of the query itself.
2. The AST is passed to the `QueryExecutor` which directs queries to the
appropriate handlers. For example, queries related to meta data are executed
by the **meta service** and `SELECT` statements are executed by the shards
themselves.
3. The query engine then determines the shards that match the `SELECT`
statement's time range. From these shards, iterators are created for each
field in the statement.
4. Iterators are passed to the emitter which drains them and joins the resulting
points. The emitter's job is to convert simple time/value points into the
more complex result objects that are returned to the client.
### Understanding iterators
Iterators provide a simple interface for looping over a set of points.
For example, this is an iterator over Float points:
```
type FloatIterator interface {
Next() *FloatPoint
}
```
These iterators are created through the `IteratorCreator` interface:
```
type IteratorCreator interface {
CreateIterator(opt *IteratorOptions) (Iterator, error)
}
```
The `IteratorOptions` provide arguments about field selection, time ranges,
and dimensions that the iterator creator can use when planning an iterator.
The `IteratorCreator` interface is used at many levels such as the `Shards`,
`Shard`, and `Engine`. This allows optimizations to be performed when applicable
such as returning a precomputed `COUNT()`.
Iterators aren't just for reading raw data from storage, though. Iterators can be
composed so that they provide additional functionality around an input
iterator. For example, a `DistinctIterator` can compute the distinct values for
each time window for an input iterator. Or a `FillIterator` can generate
additional points that are missing from an input iterator.
This composition also lends itself well to aggregation.
For example, in the following SQL, `MEAN(value)` is a `MeanIterator` that wraps an iterator from the
underlying shards:
```sql
SELECT MEAN(value) FROM cpu GROUP BY time(10m)
```
The following example wraps `MEAN(value)` with an additional iterator (`DERIVATIVE()`) to determine
the derivative of the mean:
```sql
SELECT DERIVATIVE(MEAN(value), 20m) FROM cpu GROUP BY time(10m)
```
### Cursors
A **cursor** identifies data by shard in tuples (time, value) for a single series (measurement, tag set and field). The cursor traverses data stored as a log-structured merge-tree and handles deduplication across levels, tombstones for deleted data, and merging the cache (Write Ahead Log). A cursor sorts the `(time, value)` tuples by time in ascending or descending order.
For example, a query that evaluates one field for 1,000 series over 3 shards constructs a minimum of 3,000 cursors (1,000 per shard).
### Auxiliary fields
Because InfluxQL allows users to use selector functions such as `FIRST()`,
`LAST()`, `MIN()`, and `MAX()`, the engine must provide a way to return related
data at the same time with the selected point.
Let's look at the following query:
```sql
SELECT FIRST(value), host FROM cpu GROUP BY time(1h)
```
We are selecting the first `value` that occurs every hour but we also want to
retrieve the `host` associated with that point. Since the `Point` types only
specify a single typed `Value` for efficiency, we push the `host` into the
auxiliary fields of the point. These auxiliary fields are attached to the point
until it is passed to the emitter where the fields get split off to their own
iterator.
### Built-in iterators
There are many helper iterators that let us build queries:
* Merge Iterator - This iterator combines one or more iterators into a single
new iterator of the same type. This iterator guarantees that all points
within a window will be output before starting the next window, but does not
provide ordering guarantees within the window. This allows for fast access
for aggregate queries that don't need stronger sorting guarantees.
* Sorted Merge Iterator - Like `MergeIterator`, this iterator combines one or more iterators
into a new iterator of the same type. However, this iterator guarantees
time ordering of every point. This makes it slower than the `MergeIterator`
but this ordering guarantee is required for non-aggregate queries which
return the raw data points.
* Limit Iterator - This iterator limits the number of points per name or tag
group. This is the implementation of the `LIMIT` & `OFFSET` syntax.
* Fill Iterator - This iterator injects extra points if they are missing from
the input iterator. It can provide `null` points, points with the previous
value, or points with a specific value.
* Buffered Iterator - This iterator provides the ability to "unread" a point
back onto a buffer so it can be read again next time. This is used extensively
to provide lookahead for windowing.
* Reduce Iterator - This iterator calls a reduction function for each point in
a window. When the window is complete, then all points for that window are
output. This is used for simple aggregate functions such as `COUNT()`.
* Reduce Slice Iterator - This iterator collects all points for a window first,
and then passes them all to a reduction function at once. The results are
returned from the iterator. This is used for aggregate functions such as
`DERIVATIVE()`.
* Transform Iterator - This iterator calls a transform function for each point
from an input iterator. This is used for executing binary expressions.
* Dedupe Iterator - This iterator only outputs unique points. Because it is resource-intensive, this iterator is only used for small queries such as meta query statements.
### Call iterators
Function calls in InfluxQL are implemented at two levels:
- Some calls can be
wrapped at multiple layers to improve efficiency. For example, a `COUNT()` can
be performed at the shard level and then multiple `CountIterator`s can be
wrapped with another `CountIterator` to compute the count of all shards. These
iterators can be created using `NewCallIterator()`.
- Some iterators are more complex or need to be implemented at a higher level.
For example, the `DERIVATIVE()` function needs to retrieve all points for a window
before performing the calculation. This iterator is created by the engine itself
and is never requested to be created by the lower levels.