* feat(query): hyper log log counting in query engine
In addition to helping with normal queries, this can improve the 'SHOW CARDINALITY'
meta-queries:
time influx -database mydb -execute 'select count_hll(sum_hll(_seriesKey)) from big'
name: big
time count_hll
---- ---------
0 200767781
influx -database mydb -execute 0.06s user 0.12s system 0% cpu 8:49.99 total
The sort order of points when performing aggregates never took into
account if they were ascending or descending so when multiple series
were aggregated, it would ensure they were sorted in the correct order.
But it wouldn't reverse this order when descending was used.
Additionally, it seems that the iterator template and the iterator file
itself became out of sync. It seems the template was not reverted
correctly from a previously incorrect change and only the float type was
changed to the correct version and the tests used the float version.
This adds numerous technical analysis algorithms:
* exponential_moving_average
* double_exponential_moving_average
* triple_exponential_moving_average
* relative_strength_index
* triple_exponential_average
* kaufmans_efficiency_ratio (commonly referred to as just "Efficiency Ratio")
* kaufmans_adaptive_moving_average
* chande_momentum_oscillator (both the common 'smoothed' version, and the ta-lib version)
This also fixes the cursor system to abandon iterators that will not
produce meaningful results since the variables are all unknown types.
This creates a weird behavior that existed in previous releases and we
are keeping here for backwards compatibility. If a subquery referenced a
field that didn't exist in the subquery, it will return nothing. But, if
there are two subqueries and one of them has the field exist and the
other doesn't, the second will return all null values.
This is not complete, but it is a starting point for more thorough tests
of subqueries.
This also reorders the use of `cmp.Diff` so the `want` is first and
`got` is second. This way, the `want` shows up as a minus sign in the
diff rather than, confusingly, as a plus sign.
This change makes it so that we simplify the math engine so it doesn't
use a complicated set of nested iterators. That way, we have to change
math in one fewer place.
It also greatly simplifies the query engine as now we can create the
necessary iterators, join them by time, name, and tags, and then use the
cursor interface to read them and use eval to compute the result. It
makes it so the auxiliary iterators and all of their complexity can be
removed.
This also makes use of the new eval functionality that was recently
added to the influxql package.
No math functions have been added, but the scaffolding has been included
so things like trigonometry functions are just a single commit away.
This also introduces a small breaking change. Because of the call
optimization, it is now possible to use the same selector multiple times
as a selector. So if you do this:
SELECT max(value) * 2, max(value) / 2 FROM cpu
This will now return the timestamp of the max value rather than zero
since this query is considered to have only a single selector rather
than multiple separate selectors. If any aspect of the selector is
different, such as different selector functions or different arguments,
it will consider the selectors to be aggregates like the old behavior.
The Cursor returned will be capable of scanning rows into a structure.
It replaces part of the function for why the Emitter existed. The
Emitter would both join the resulting rows and then transform the values
into a models.Row so it could be returned to the results.
In the future, we will be able to use the Cursor directly to write out
values which should be more memory efficient.
Along with modifying ExecutionContext to be a context and have the
TaskManager return the context itself, this also creates a Monitor
interface and exposes the Monitor through the Context. This way, we can
access the monitor from within the query.Select method and keep all of
the limits inside of the query package instead of leaking them into the
statement executor.
An eventual goal is to remove the InterruptCh from the IteratorOptions
and use the Context instead, but for now, we'll just assign the done
channel from the Context to the IteratorOptions so at least they refer
to the same channel.
The implicit time range for an interval is supposed to be now when no
end is specified. In a subquery though, the interval doesn't exist and
so it doesn't set the end time to now, but to the max time. Since the
subquery qualifies as something that should have the implicit end time
apply, this results in a query that runs slowly because it is filling in
a bunch of unasked for intervals if a fill is specified.
This hack adds the implicit end time if it sees the parent query's end
time is set to the maximum available time.
This is a temporary fix for this problem. The query compilation should
perform these time range calculations in the compilation stage and the
subqueries should use the compilation stage during execution instead of
ignoring it. That work takes a lot more effort though and is more prone
to running into unforeseen bugs.
This fix introduces a subtle, but likely rare to run into bug. If the
top level query specifies the maximum time as the end time and the
subquery has an interval, the subquery should use the end time rather
than now as the time range. With this hack, it will interpret it as an
implicit time rather than an explicit one. This is unlikely to matter
though.
When a meta query does not include a time component then it can be
answered exclusively by the index. This should result in a much faster
query execution that if the TSM engine was engaged.
This commit rewrites the following queries such that they make use
of the index where no time component is present:
- SHOW MEASUREMENTS
- SHOW SERIES
- SHOW TAG KEYS
- SHOW FIELD KEYS
* Introduces EXPLAIN ANALYZE command, which
produces a detailed tree of operations used to
execute the query.
introduce context.Context to APIs
metrics package
* create groups of named measurements
* safe for concurrent access
tracing package
EXPLAIN ANALYZE implementation for OSS
Serialize EXPLAIN ANALYZE traces from remote nodes
use context.Background for tests
group with other stdlib packages
additional documentation and remove unused API
use influxdb/pkg/testing/assert
remove testify reference
Field math works similar to condition evaluation, but not the exact same
because we have more information to work with in field expressions than
we do in conditional math because fields retain the information about
their source while conditions do not.
The main difference is that you cannot add an unsigned literal to the
output of an integer iterator while you can inside of a condition. You
can perform math on a positive integer literal to an unsigned iterator.
Inside of the condition, we aren't sure if an integer is because of a
literal or because of an iterator so we can't make that distinction.
It prints the statistics of each iterator that will access the storage
engine. For each access of the storage engine, it will print the number
of shards that will potentially be accessed, the number of files that
may be accessed, the number of series that will be created, the number
of blocks, and the size of those blocks.
Now, the prepared statement keeps the open resource and closing the open
resource created from `Prepare` is the responsibility of the prepared
statement.
This also nils out the local shard mapping after it is closed to prevent
it from being used after it is closed.
The first call is to compile the query. This performs some initial
processing that can be done before having any access to the shards. At
the moment, it does very little, but it's intended to be changed to
eventually perform initial validations of the query and create an
internal graph structure for the execution of the query.
The second call is to prepare the query. This step has access to the
shard mapper. Right now, it just maps the shards and rewrites the fields
of the query for any wildcards. In the future, it is intended to do the
above, but also to prepare the final directed acyclical graph that will
execute the query.
The third call is to select the query. This step is intended to create
all of the iterators for processing the query. At the moment, much of
the work intended for the second step is performed in the third step.
The statement rewriting logic should be in the query engine as part of
preparing a query. This creates a shard mapper interface that the query
engine expects and then passes it to the query engine instead of
requiring the query to be preprocessed before being input into the query
engine. This interface is (mostly) the same as the old interface, just
moved to a different package.
This change provides a clear separation between the query engine
mechanics and the query language so that the language can be parsed and
dealt with separate from the query engine itself.