When `SELECT ... INTO ...` is used with a `top()` or `bottom()` call that
selects tags, the points will be written with the tags still intact
instead of being converted to fields.
The previous version of `top()` and `bottom()` would gather all of the
points to use in a slice, filter them (if necessary), then use a
slightly modified heap sort to retrieve the top or bottom values.
This performed horrendously from the standpoint of memory. Since it
consumed so much memory and spent so much time in allocations (along
with sorting a potentially very large slice), this affected speed too.
These calls have now been modified so they keep the top or bottom points
in a min or max heap. For `top()`, each new point is compared against the
minimum value in the heap. If the new point is greater than the minimum
point, it replaces the minimum point and the heap is fixed up with the new
value. If the new point is smaller, it is discarded. For `bottom()`, the
process is the opposite.
It will then sort the final result to ensure the correct ordering of the
selected points.
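
As a rough illustration of the heap approach described above, here is a
minimal Go sketch of the `top()` case. The `point` type, `minHeap`, and
`topN` are hypothetical names made up for this example and are not the
engine's actual code. Sorting the final result by time is also an
assumption of the sketch.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

// point is a hypothetical stand-in for the engine's point type.
type point struct {
	Time  int64
	Value float64
}

// minHeap keeps the smallest retained value at index 0.
type minHeap []point

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].Value < h[j].Value }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(point)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	p := old[n-1]
	*h = old[:n-1]
	return p
}

// topN keeps at most n points in memory and returns the n largest,
// sorted by time (the final ordering is an assumption of this sketch).
func topN(points []point, n int) []point {
	if n <= 0 {
		return nil
	}
	h := make(minHeap, 0, n)
	for _, p := range points {
		if h.Len() < n {
			heap.Push(&h, p)
			continue
		}
		// Replace the current minimum only if the new point is larger.
		if p.Value > h[0].Value {
			h[0] = p
			heap.Fix(&h, 0)
		}
	}
	out := []point(h)
	sort.Slice(out, func(i, j int) bool { return out[i].Time < out[j].Time })
	return out
}

func main() {
	pts := []point{{0, 3}, {10, 9}, {20, 1}, {30, 7}}
	fmt.Println(topN(pts, 2)) // prints [{10 9} {30 7}]
}
```

For `bottom()`, the same sketch would use a max heap and replace the
maximum when a smaller point arrives.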
When `top()` or `bottom()` contain a tag to select, they have now been
modified so this query:
    SELECT top(value, host, 2) FROM cpu
Essentially becomes this query:
    SELECT top(value, 2), host FROM (
        SELECT max(value) FROM cpu GROUP BY host
    )
This should drastically increase the performance of all `top()` and
`bottom()` queries.
`top()` and `bottom()` will now organize the points by time and also
keep the point's original time even when a time grouping is used. At the
same time, `top()` and `bottom()` will no longer honor any fill options
that are present since they don't really make sense for these specific
functions.
This also fixes the aggregates and selectors to honor the ordered
iterator option so iterators remain ordered. They also now respect the
buckets that are created by the final dimensions of the query so that
two buckets don't overlap each other within the same reducer. A test has
been added for this situation. This should clarify and encourage the use
of the ordered attribute within the query engine.
This adds query syntax support for subqueries and adds support to the
query engine to execute queries on subqueries.
Subqueries act as a source for another query. It is the equivalent of
writing the results of a query to a temporary database, executing
a query on that temporary database, and then deleting the database
(except this is all performed in-memory).
The syntax is like this:
    SELECT sum(derivative) FROM (SELECT derivative(mean(value)) FROM cpu GROUP BY *)
This will execute derivative and then sum the result of those derivatives.
Another example:
    SELECT max(min) FROM (SELECT min(value) FROM cpu GROUP BY host)
This would let you find the maximum minimum value of each host.
There is complete freedom to mix subqueries with auxiliary fields. The only
caveat is that the following two queries:
    SELECT mean(value) FROM cpu
    SELECT mean(value) FROM (SELECT value FROM cpu)
Have different performance characteristics. The first will calculate
`mean(value)` at the shard level and will be faster, especially when it comes to
clustered setups. The second will process the mean at the top level and will not
include that optimization.
`percentile()` is supposed to be a selector and return the time of the
point, but that was only changed when the input was a float. This updates
the integer processor to also return the time of the point rather than
the beginning of the interval.
Strings would always return an empty string and stddev is meaningless
when it comes to strings. This removes that functionality so strings
don't automatically get picked up when using a wildcard.
The `cumulative_sum()` function can be used to sum each new point and
output the current total. For the following points:
    cpu value=2 0
    cpu value=4 10
    cpu value=6 20
This would output the following points:
    > SELECT cumulative_sum(value) FROM cpu
    time value
    ---- -----
    0    2
    10   6
    20   12
As can be seen, each new point adds to the sum of the previous point and
outputs the value with the same timestamp.
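
The behavior above can be sketched in a few lines of Go. This is only an
illustration; the `point` type and `cumulativeSum` function are
hypothetical names and not the engine's implementation.

```go
package main

import "fmt"

// point is a hypothetical stand-in for the engine's point type.
type point struct {
	Time  int64
	Value float64
}

// cumulativeSum emits one output point per input point, carrying the
// running total and the input point's timestamp.
func cumulativeSum(points []point) []point {
	out := make([]point, 0, len(points))
	var total float64
	for _, p := range points {
		total += p.Value
		out = append(out, point{Time: p.Time, Value: total})
	}
	return out
}

func main() {
	in := []point{{0, 2}, {10, 4}, {20, 6}}
	for _, p := range cumulativeSum(in) {
		fmt.Println(p.Time, p.Value)
	}
	// prints:
	// 0 2
	// 10 6
	// 20 12
}
```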
As with `derivative()`, the function can also be used with an aggregate.
    > SELECT cumulative_sum(mean(value)) FROM cpu WHERE time >= now() - 10m GROUP BY time(1m)
First Pass at implementing sample
Add sample iterators for all types
Remove size from sample struct
Fix off by one error when generating random number
Add benchmarks for sample iterator
Add test and associated fixes for off by one error
Add test for sample function
Remove NumericLiteral from sample function call
Make clear that the counter is incr w/ each call
Rename IsRandom to AllSamplesSeen
Add a rng for each reducer that is created
The default rng that comes with math/rand has a global lock. To avoid
having to worry about any contention on the lock, each reducer now has
its own time seeded rng.
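
A minimal sketch of the per-reducer rng idea, assuming a hypothetical
`sampleReducer` type (the real reducer has more state than this). Note
that a `*rand.Rand` created this way is not safe for concurrent use, so
the sketch assumes each reducer is only used from one goroutine.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// sampleReducer is a hypothetical reducer that owns its own rng instead
// of sharing the lock-protected global source in math/rand.
type sampleReducer struct {
	rng *rand.Rand
}

// newSampleReducer seeds a private rng from the current time.
func newSampleReducer() *sampleReducer {
	src := rand.NewSource(time.Now().UnixNano())
	return &sampleReducer{rng: rand.New(src)}
}

// pick returns a random index in [0, n) without touching the global lock.
func (r *sampleReducer) pick(n int) int {
	return r.rng.Intn(n)
}

func main() {
	r := newSampleReducer()
	fmt.Println(r.pick(10))
}
```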
Add sample function to changelog
For aggregate queries, derivatives will now alter the start time to one
interval behind and will use that interval to find the derivative of the
first point instead of giving no value for that interval. Null values
will still be discarded so if the interval before the one you are
querying is null, then it will be discarded as if it were in the
middle of the query. You can use `fill(0)` to fill in these values.
This does not apply to raw queries yet.
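
To illustrate the idea, here is a rough Go sketch that assumes the
caller has already widened the time range by one interval. The
`intervalValue` type and `derivative` function are hypothetical, and the
handling of elapsed time across a skipped null interval is an assumption
of the sketch rather than a statement about the engine.

```go
package main

import "fmt"

// intervalValue is a hypothetical stand-in for one aggregated interval.
type intervalValue struct {
	Time  int64
	Value float64
	Null  bool
}

// derivative expects in[0] to be the extra interval before the query
// start. Null intervals are skipped, so a null lookbehind interval
// yields no derivative for the first queried interval.
func derivative(in []intervalValue, unit int64) []intervalValue {
	var out []intervalValue
	var prev *intervalValue
	for i := range in {
		cur := &in[i]
		if cur.Null {
			continue
		}
		if prev != nil {
			elapsed := float64(cur.Time-prev.Time) / float64(unit)
			out = append(out, intervalValue{
				Time:  cur.Time,
				Value: (cur.Value - prev.Value) / elapsed,
			})
		}
		prev = cur
	}
	return out
}

func main() {
	// The interval at time 0 is the lookbehind; the query covers 10 and 20.
	in := []intervalValue{{Time: 0, Value: 2}, {Time: 10, Value: 4}, {Time: 20, Value: 8}}
	for _, p := range derivative(in, 10) {
		fmt.Println(p.Time, p.Value)
	}
}
```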
Also modified the derivative and difference aggregates to use the stream
iterator instead of the reduce slice iterator for space efficiency.
Fixes #3247. Contributes to #5943.
Change distinct so it uses a custom reducer that keeps internal state
instead of requiring all of the points to be kept as a slice in memory.
Fixes #6261.
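
A minimal sketch of a reducer that keeps internal state instead of a
slice of every point. The `distinctReducer` type and its method names
are hypothetical, and emitting values in first-seen order is an
assumption made to keep the example short.

```go
package main

import "fmt"

// distinctReducer is a hypothetical reducer that remembers only the
// unique values it has seen, not every incoming point.
type distinctReducer struct {
	seen   map[float64]struct{}
	values []float64
}

func newDistinctReducer() *distinctReducer {
	return &distinctReducer{seen: make(map[float64]struct{})}
}

// Aggregate records v the first time it is seen and ignores repeats.
func (r *distinctReducer) Aggregate(v float64) {
	if _, ok := r.seen[v]; ok {
		return
	}
	r.seen[v] = struct{}{}
	r.values = append(r.values, v)
}

// Emit returns the distinct values in first-seen order.
func (r *distinctReducer) Emit() []float64 {
	return r.values
}

func main() {
	r := newDistinctReducer()
	for _, v := range []float64{1, 2, 1, 3, 2} {
		r.Aggregate(v)
	}
	fmt.Println(r.Emit()) // prints [1 2 3]
}
```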
The simple moving average will gradually emit points instead of waiting
until the end. This should apply to derivative and difference in the
future too.
Fixes #6112.
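
A rough sketch of a moving average that emits a point as soon as the
window is full rather than buffering all input. The `movingAverage` type
and `Push` method are hypothetical names, and the sketch assumes a
window size greater than zero.

```go
package main

import "fmt"

// movingAverage is a hypothetical streaming reducer backed by a ring
// buffer that holds only the last len(buf) values.
type movingAverage struct {
	buf []float64
	pos int     // next slot to overwrite
	n   int     // number of values seen, capped at len(buf)
	sum float64 // sum of the values currently in buf
}

func newMovingAverage(size int) *movingAverage {
	return &movingAverage{buf: make([]float64, size)}
}

// Push adds v and reports the current average once the window is full.
func (m *movingAverage) Push(v float64) (float64, bool) {
	m.sum += v - m.buf[m.pos]
	m.buf[m.pos] = v
	m.pos = (m.pos + 1) % len(m.buf)
	if m.n < len(m.buf) {
		m.n++
	}
	if m.n < len(m.buf) {
		return 0, false
	}
	return m.sum / float64(len(m.buf)), true
}

func main() {
	m := newMovingAverage(2)
	for _, v := range []float64{1, 2, 3, 4} {
		if avg, ok := m.Push(v); ok {
			fmt.Println(avg)
		}
	}
	// prints:
	// 1.5
	// 2.5
	// 3.5
}
```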
The difference function is implemented very similarly to how derivative is
implemented. It is an aggregate function that acts over the entire
aggregate. This function will also have the same problems that
derivative has with getting values from the previous interval or point.
This will be fixed separately as part of #5943.
Fixes #1825.
Numbers in the query without any decimal will now be parsed as an
IntegerLiteral and emitted as integers instead. This ensures we keep the
original context that a query was issued with and allows us to act more
similarly to how programming languages are typically structured when it
comes to floats and ints.
This adds functionality for dealing with integers promoting to floats in
the various places where math is used.
Fixes #5744 and #5629.
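
The promotion rule can be sketched roughly as follows. The `add`
function is a hypothetical helper written for this example. It only
shows the idea that an integer operand is promoted when the other
operand is a float.

```go
package main

import "fmt"

// add is a hypothetical helper: int64 + int64 stays an integer, and any
// mix of int64 and float64 promotes the integer to a float first.
func add(lhs, rhs interface{}) interface{} {
	switch l := lhs.(type) {
	case int64:
		switch r := rhs.(type) {
		case int64:
			return l + r
		case float64:
			return float64(l) + r
		}
	case float64:
		switch r := rhs.(type) {
		case int64:
			return l + float64(r)
		case float64:
			return l + r
		}
	}
	return nil // unsupported operand types
}

func main() {
	fmt.Println(add(int64(1), int64(2))) // prints 3
	fmt.Println(add(int64(1), 2.5))      // prints 3.5
}
```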
Normalize the time for the distinct() call to either be at the beginning
of the group by interval or the start time similar to every other call.
The timestamp previously just showed the first time found and didn't
make a lot of sense in the context of what the function was supposed to
do.
Fixes #6040.
All three of these iterators are supposed to support all four types of
iterators, but the implementation was never done for string or boolean.
Fixes #5886.
This refactor is primarily to support Kapacitor. Kapacitor doesn't care
about the iterators and mostly keeps the points it handles in memory.
The iterator interface is more than Kapacitor cares about.
This commit refactors and opens up the internals of aggregating and
reducing incoming points so it can be used by an outside library with
the same code. It also makes the iterators used by the call iterators
publicly usable with new functionality.
Reducers are split into two methods which are separate interfaces that
can be combined for dealing with casting between different types. The
Aggregator interfaces accept points into the aggregator and retain any
internal state they need. The Emitter interface will then create a point
from that aggregated state which can be fed to the iterator. The
Emitters do not fill in the name or tag of the point as that is expected
to be done by the person aggregating the point. While the Emitters do
sometimes fill in the time, that value will also be overwritten by the
iterator. Filling in the time is to allow a future version that will
allow returning the point time instead of just the interval time.
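
A simplified sketch of the split described above. The interface and
type names here (`Aggregator`, `Emitter`, `Point`, `sumReducer`) are
illustrative only and should not be read as the exact exported API.

```go
package main

import "fmt"

// Point is a hypothetical, simplified point type.
type Point struct {
	Name  string
	Time  int64
	Value float64
}

// Aggregator accepts points and retains whatever state it needs.
type Aggregator interface {
	Aggregate(p *Point)
}

// Emitter creates points from the aggregated state.
type Emitter interface {
	Emit() []Point
}

// sumReducer implements both halves for a simple sum.
type sumReducer struct {
	sum float64
}

func (r *sumReducer) Aggregate(p *Point) { r.sum += p.Value }

// Emit leaves Name and Time unset; the caller is expected to fill them.
func (r *sumReducer) Emit() []Point { return []Point{{Value: r.sum}} }

func main() {
	r := &sumReducer{}
	var agg Aggregator = r
	var em Emitter = r
	for _, v := range []float64{1, 2, 3} {
		agg.Aggregate(&Point{Value: v})
	}
	for _, p := range em.Emit() {
		// The caller, not the emitter, sets the name and time.
		p.Name, p.Time = "cpu", 0
		fmt.Println(p.Name, p.Time, p.Value) // prints cpu 0 6
	}
}
```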
A new attribute has been added to points to track how many points were
used to calculate that point. This is particularly useful for finding
the mean as we can then split mean calculation into two phases: one at
the shard level and a second at the shards level.
This optimization is now used so we don't have to hold so many points in
memory while calculating the mean.
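
A rough sketch of why the count is needed for a two-phase mean. The
`point` type, its `Aggregated` field, and both functions are
hypothetical names used only for this illustration.

```go
package main

import "fmt"

// point is a hypothetical point carrying the number of raw points that
// were used to calculate its value.
type point struct {
	Value      float64
	Aggregated int
}

// shardMean is the first phase: one partial mean per shard.
func shardMean(values []float64) point {
	if len(values) == 0 {
		return point{}
	}
	var sum float64
	for _, v := range values {
		sum += v
	}
	return point{Value: sum / float64(len(values)), Aggregated: len(values)}
}

// mergeMeans is the second phase: a mean of the partial means weighted
// by how many points each one represents.
func mergeMeans(parts []point) point {
	var sum float64
	var n int
	for _, p := range parts {
		sum += p.Value * float64(p.Aggregated)
		n += p.Aggregated
	}
	if n == 0 {
		return point{}
	}
	return point{Value: sum / float64(n), Aggregated: n}
}

func main() {
	a := shardMean([]float64{1, 2, 3}) // mean 2 over 3 points
	b := shardMean([]float64{10})      // mean 10 over 1 point
	fmt.Println(mergeMeans([]point{a, b}).Value) // prints 4
}
```

Without the count, averaging the two partial means would give 6 instead
of the correct 4.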
top() and bottom() point ordering was incorrect and used an inefficient
method of sorting. It has now been updated to use a heap and ordering is
being done by value first and time second (with earlier times always
taking priority).
Removed the unit tests that use `time` inside of the query to get the
real time instead of the interval time; only the default behavior is
now allowed. We will have another mechanism to get the real time during
an interval, but the current method is deprecated.
The top() and bottom() methods now have integer support.
last() would always return the last output of the iterator (which isn't
necessarily the last time value due to how the merge iterator works) and
first() would always return the first output of the iterator (wrong for
the same reason).
Now the time is kept by the reduce function and the times are wiped as
part of the reduce iterator after the value has been found.
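
A minimal sketch of a reduce function that tracks time itself, so the
result does not depend on the order in which the iterator produces
points. The `point` type and both functions are hypothetical, and
keeping the earlier-seen point on equal timestamps is an assumption of
the sketch.

```go
package main

import "fmt"

// point is a hypothetical stand-in for the engine's point type.
type point struct {
	Time  int64
	Value float64
}

// reduceFirst keeps the point with the earliest time seen so far.
func reduceFirst(prev *point, curr point) *point {
	if prev == nil || curr.Time < prev.Time {
		return &curr
	}
	return prev
}

// reduceLast keeps the point with the latest time seen so far.
func reduceLast(prev *point, curr point) *point {
	if prev == nil || curr.Time > prev.Time {
		return &curr
	}
	return prev
}

func main() {
	// Points arrive out of time order, as they may from a merge iterator.
	pts := []point{{20, 3}, {0, 1}, {10, 2}}
	var first, last *point
	for _, p := range pts {
		first = reduceFirst(first, p)
		last = reduceLast(last, p)
	}
	fmt.Println(*first, *last) // prints {0 1} {20 3}
}
```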
It matches more in functionality to the functions in call_iterator.go
than iterator.go. iterator.go mostly has base iterators and
call_iterator.go has iterators related to functional calls, which is
the only time integerReduceSliceFloatIterator is used.
Also fixes the `first()` and `last()` calls to do the same thing as
`min()` and `max()` by returning the time corresponding to the start of
the interval rather than the point's real time.
This does not implement the time selector, but everything else is
implemented. Unfortunately, there are no tests for bottom() in the old
query engine, so only top() is properly tested.