* feat(query): HyperLogLog counting in the query engine
In addition to helping with normal queries, this can improve the `SHOW CARDINALITY` meta-queries:

    $ time influx -database mydb -execute 'select count_hll(sum_hll(_seriesKey)) from big'
    name: big
    time count_hll
    ---- ---------
    0    200767781
    influx -database mydb -execute  0.06s user 0.12s system 0% cpu 8:49.99 total
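For readers unfamiliar with the technique, here is a minimal, self-contained HyperLogLog sketch in Go. The precision, hash function, and names are illustrative assumptions, not InfluxDB's implementation, and it omits the small- and large-range bias corrections a production implementation applies:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

const p = 14     // precision: 2^14 = 16384 registers
const m = 1 << p // number of registers

type hll struct{ reg [m]uint8 }

// Add hashes the key, uses the top p bits to pick a register, and keeps
// the maximum "rank" (position of the first 1-bit) seen in the rest.
func (h *hll) Add(key string) {
	f := fnv.New64a()
	f.Write([]byte(key))
	x := f.Sum64()
	i := x >> (64 - p)                             // register index
	rank := uint8(bits.LeadingZeros64(x<<p|1)) + 1 // rank of remaining bits
	if rank > h.reg[i] {
		h.reg[i] = rank
	}
}

// Count returns the raw HyperLogLog estimate (harmonic mean of registers).
func (h *hll) Count() uint64 {
	alpha := 0.7213 / (1 + 1.079/float64(m))
	sum := 0.0
	for _, r := range h.reg {
		sum += 1 / math.Pow(2, float64(r))
	}
	return uint64(alpha * m * m / sum)
}

func main() {
	var h hll
	for i := 0; i < 1000000; i++ {
		h.Add(fmt.Sprintf("cpu,host=server%d", i))
	}
	fmt.Println(h.Count()) // ~1e6, within roughly 1% error
}
```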
When `top()` or `bottom()` were used and selected auxiliary values, they
would return wrong values, all equal to those of the last point selected.
This is because the aggregators saved the memory address of the auxiliary
fields instead of copying them. Since the storage engine reuses the same
memory location for the auxiliary fields of every value it returns, each
saved reference ended up pointing at data that had since been overwritten.
This fixes that by copying the auxiliary fields out when they are saved
rather than saving only the memory location.
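A contrived Go sketch of the underlying aliasing pattern (not the actual aggregator code) may make the bug and the fix clearer:

```go
package main

import "fmt"

func main() {
	// The storage engine reuses one buffer for the auxiliary fields of
	// every point it returns; this sketch reproduces that pattern.
	buf := make([]interface{}, 1)

	var buggy, fixed [][]interface{}
	for _, v := range []int{1, 2, 3} {
		buf[0] = v

		// Buggy: saving the slice header aliases the shared buffer, so
		// every saved entry ends up holding the last point's value.
		buggy = append(buggy, buf)

		// Fixed: copy the auxiliary fields out before saving them.
		aux := make([]interface{}, len(buf))
		copy(aux, buf)
		fixed = append(fixed, aux)
	}
	fmt.Println(buggy) // [[3] [3] [3]]
	fmt.Println(fixed) // [[1] [2] [3]]
}
```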
This is not complete, but it is a starting point for more thorough tests
of subqueries.
This also reorders the use of `cmp.Diff` so the `want` is first and
`got` is second. This way, the `want` shows up as a minus sign in the
diff rather than, confusingly, as a plus sign.
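The resulting test pattern looks like this; `runQuery` is a hypothetical helper standing in for whatever produces the results under test:

```go
package query_test

import (
	"testing"

	"github.com/google/go-cmp/cmp"
)

// runQuery is a hypothetical stand-in for executing the subquery under test.
func runQuery(t *testing.T) []float64 { return []float64{1, 2, 3} }

func TestSubquery(t *testing.T) {
	want := []float64{1, 2, 3}
	got := runQuery(t)

	// want first, got second: an expected value missing from got shows
	// up with a minus sign in the diff instead of a plus sign.
	if diff := cmp.Diff(want, got); diff != "" {
		t.Fatalf("unexpected results -want/+got:\n%s", diff)
	}
}
```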
This change simplifies the math engine so that it doesn't use a
complicated set of nested iterators, leaving one fewer place where the
math code has to be changed.
It also greatly simplifies the query engine: we can now create the
necessary iterators, join them by time, name, and tags, and then use the
cursor interface to read them and eval to compute the result. This
allows the auxiliary iterators and all of their complexity to be
removed.
This also makes use of the new eval functionality that was recently
added to the influxql package.
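As a rough sketch of that flow, assuming a cursor has already joined the rows by time, name, and tags (the row data below is made up, and the import path shown is the standalone influxql package):

```go
package main

import (
	"fmt"

	"github.com/influxdata/influxql"
)

func main() {
	// Hypothetical rows as a cursor would yield them after joining by
	// time, name, and tags: one map of named values per timestamp.
	rows := []map[string]interface{}{
		{"a": 4.0, "b": 2.0},
		{"a": 6.0, "b": 3.0},
	}

	expr, err := influxql.ParseExpr("a + b * 2")
	if err != nil {
		panic(err)
	}

	// Read each row and compute the result with Eval instead of wiring
	// up a tree of nested math iterators.
	for _, row := range rows {
		fmt.Println(influxql.Eval(expr, row)) // 8, then 12
	}
}
```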
No math functions have been added, but the scaffolding has been included
so things like trigonometry functions are just a single commit away.
This also introduces a small breaking change. Because of the call
optimization, the same selector can now be used multiple times and still
count as a single selector. So if you do this:

    SELECT max(value) * 2, max(value) / 2 FROM cpu
This will now return the timestamp of the max value rather than zero
since this query is considered to have only a single selector rather
than multiple separate selectors. If any aspect of the selectors
differs, such as different selector functions or different arguments,
they will be treated as aggregates, as in the old behavior.
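For contrast, a hypothetical query whose selectors differ keeps the old aggregate semantics:

    SELECT max(value) * 2, min(value) / 2 FROM cpu

Because `max` and `min` are different selector functions, this query still returns a timestamp of zero.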
The Cursor returned will be capable of scanning rows into a structure.
It takes over part of the role the Emitter existed for: the Emitter
would both join the resulting rows and transform the values into a
`models.Row` so they could be returned in the results.
In the future, we will be able to use the Cursor directly to write out
values which should be more memory efficient.
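A rough sketch of what such a cursor might look like; the names and method set here are assumptions based on this description, not the final API:

```go
package query

// Row is a single joined result row; the fields are illustrative.
type Row struct {
	Time   int64
	Series string // name + tags of the series this row belongs to
	Values []interface{}
}

// Cursor scans joined rows one at a time instead of materializing them
// into models.Row values the way the Emitter did.
type Cursor interface {
	// Scan copies the next row into row and reports whether one existed.
	Scan(row *Row) bool
	Close() error
}

// drain shows the intended usage: read rows directly and write them out,
// which should be more memory efficient than buffering models.Row values.
func drain(cur Cursor, write func(*Row) error) error {
	defer cur.Close()
	var row Row
	for cur.Scan(&row) {
		if err := write(&row); err != nil {
			return err
		}
	}
	return nil
}
```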
* Introduces the EXPLAIN ANALYZE command, which produces a detailed tree
  of the operations used to execute the query.
* introduce `context.Context` to APIs
* metrics package (see the sketch after this list):
  * create groups of named measurements
  * safe for concurrent access
* tracing package
* EXPLAIN ANALYZE implementation for OSS
* serialize EXPLAIN ANALYZE traces from remote nodes
* use `context.Background` for tests
* group with other stdlib packages
* additional documentation and remove unused API
* use `influxdb/pkg/testing/assert`; remove testify reference
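A minimal sketch of what "groups of named measurements, safe for concurrent access" could look like; the package layout and names are illustrative assumptions, not the actual influxdb metrics API:

```go
package metrics

import "sync/atomic"

// Counter is a single named measurement.
type Counter struct {
	name string
	n    int64
}

func (c *Counter) Add(d int64)  { atomic.AddInt64(&c.n, d) }
func (c *Counter) Value() int64 { return atomic.LoadInt64(&c.n) }

// Group holds a fixed set of counters created up front; because the map
// is never mutated after construction, concurrent Add/Value calls are
// safe without locking.
type Group struct {
	counters map[string]*Counter
}

// NewGroup creates a named group of counters.
func NewGroup(names ...string) *Group {
	g := &Group{counters: make(map[string]*Counter, len(names))}
	for _, name := range names {
		g.counters[name] = &Counter{name: name}
	}
	return g
}

// Counter returns the named counter, or nil if it wasn't registered.
func (g *Group) Counter(name string) *Counter { return g.counters[name] }
```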
Field math works similarly to condition evaluation, but not identically,
because we have more information to work with in field expressions than
in conditional math: fields retain information about their source while
conditions do not.
The main difference is that you cannot add an unsigned literal to the
output of an integer iterator, while you can inside of a condition. You
can, however, perform math with a positive integer literal on an
unsigned iterator. Inside of a condition, we aren't sure whether an
integer comes from a literal or from an iterator, so we can't make that
distinction.
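A toy model of that asymmetry, as an assumption-laden sketch rather than InfluxDB's actual type checker: in a field expression we still know which operand is a literal, so a non-negative integer literal can be promoted to unsigned, while an integer iterator's output cannot.

```go
package main

import "fmt"

type operand struct {
	typ       string // "integer" or "unsigned"
	isLiteral bool
	val       int64 // only meaningful when isLiteral is true
}

func canAdd(a, b operand) bool {
	if a.typ == b.typ {
		return true
	}
	// Mixed integer/unsigned math is only allowed when the integer side
	// is a literal whose value is known to be non-negative.
	for _, pair := range [][2]operand{{a, b}, {b, a}} {
		if pair[0].typ == "integer" && pair[1].typ == "unsigned" {
			return pair[0].isLiteral && pair[0].val >= 0
		}
	}
	return false
}

func main() {
	unsignedIter := operand{typ: "unsigned"}
	integerIter := operand{typ: "integer"}
	intLit := operand{typ: "integer", isLiteral: true, val: 2}
	uintLit := operand{typ: "unsigned", isLiteral: true, val: 2}

	fmt.Println(canAdd(intLit, unsignedIter)) // true: the literal promotes
	fmt.Println(canAdd(uintLit, integerIter)) // false: the iterator cannot
}
```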
It prints the statistics of each iterator that will access the storage
engine. For each storage-engine access, it prints the number of shards
that will potentially be accessed, the number of files that may be
accessed, the number of series that will be created, the number of
blocks, and the size of those blocks.
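A hypothetical aggregate of those per-iterator statistics, just to fix the shape in mind; the engine's real types and names likely differ:

```go
package query

// planStats is an illustrative summary of the statistics listed above.
type planStats struct {
	Shards int   // shards that will potentially be accessed
	Files  int   // files that may be accessed
	Series int   // series that will be created
	Blocks int   // blocks that will be read
	Size   int64 // total size of those blocks in bytes
}
```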
This refactors the validation code so it is more flexible and performs a
small bit of work to make preparing and executing the query easier.
The general idea is that compilation will eventually do more heavy
lifting in creating the initial plan and prepare will construct an
actual plan rather than just doing some basic field rewriting.
This at least sets us up for that future work and moves the validation
code into query execution instead of the parser.
This also frees up the parser to parse the complete AST without worrying
if the query itself is valid. That could be useful for client code that
wants to compile a partial query to an AST and then perform
modifications on the AST for some reason.
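For example, a client could do something like the following sketch, which parses a statement and edits the AST before execution (the import path assumes the standalone influxql package, and the rewrite itself is illustrative):

```go
package main

import (
	"fmt"

	"github.com/influxdata/influxql"
)

func main() {
	// Parse without validating: the statement need not be executable yet.
	stmt, err := influxql.ParseStatement(`SELECT max(value) FROM cpu`)
	if err != nil {
		panic(err)
	}

	// Modify the AST before handing it to the query engine, e.g. attach
	// a time-range condition.
	sel := stmt.(*influxql.SelectStatement)
	cond, err := influxql.ParseExpr(`time > now() - 1h`)
	if err != nil {
		panic(err)
	}
	sel.Condition = cond

	fmt.Println(sel.String())
}
```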
The statement rewriting logic should be in the query engine as part of
preparing a query. This creates a shard mapper interface that the query
engine expects and passes the mapper to the engine instead of requiring
the query to be preprocessed beforehand. The interface is (mostly) the
same as the old one, just moved to a different package.
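The interface is roughly of this shape; this is a sketch reconstructed from the description, with stub types so it is self-contained, not a verbatim copy:

```go
package query

import (
	"time"

	"github.com/influxdata/influxql"
)

// TimeRange, SelectOptions, and ShardGroup are stand-ins for the
// engine's real types, stubbed here so the sketch compiles on its own.
type TimeRange struct{ Min, Max time.Time }
type SelectOptions struct{}
type ShardGroup interface{}

// ShardMapper is roughly the interface the query engine now expects: it
// asks the mapper for the shards covering the statement's sources and
// time range instead of receiving a preprocessed statement.
type ShardMapper interface {
	MapShards(sources influxql.Sources, t TimeRange, opt SelectOptions) (ShardGroup, error)
}
```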
This will allow refactoring the query engine and the select statement
more easily since fewer code locations will need to be changed. It also
reduces the amount of code while still keeping individual tests that are
filterable.
This change provides a clear separation between the query engine
mechanics and the query language so that the language can be parsed and
dealt with separate from the query engine itself.