This interface was once specially implemented by the RUB. The only
actual implementation of it is within the querier that just forwards it
to a simple schema scan. Lift this semantic to `iox_query_influxrpc`
instead so all the chunks can use it.
If we ever want to optimize this again, we should use `QueryChunk::data`
instead (i.e. instead of implementing it within the chunk it should use
the data method and do something smart based on that).
First half of #8096.
* feat: metrics for main tokio runtime
* feat: instrument executor tokio runtime
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Do not (ab)use per-chunk delete predicates for the retention policy.
Instead use a per-table predicate.
This makes the code way cleaner, since the scoping is correct (i.e.
delete predicates are a table-wide attribute, not a chunk-based one) and
it is consistent time predicates that the user providers (e.g. via
`WHERE time > x`).
It also allows us to remove delete predicates (in their current,
non-scalable form) from the query path. A potential future version would
likely not use per chunk predicates (and "is processed" markers) but use
the timestamp / chunk order to determine to which data the predicate
should be applied.
Note that the lowering of the retention policy changed slightly from
```text
(time > (now() - retention)) AND (time < MAX)
```
to
```text
time > (now() - retention)
```
Since the `MAX` cut is just an artifact of the lowering and was unnecessary.
Closes#7409.
Closes#7410.
* chore: Update datafusion + arrow/arrow-flight/parquet to version `42.0.0`
* chore: Update for new APIs
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: Update DataFusion pin
* chore: Update API changes
* chore: Don't use deprecated API
* chore: Run cargo hakari tasks
* chore: Update tests due to changes in logical plan nodes from DF update
* chore: Fix broken links in docs
* chore: Adjust changes to expected output
---------
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
See optimizer pipeline here:
5d0bb68c5b/iox_query/src/physical_optimizer/mod.rs (L33-L35)
After generating the naive initial plan w/ many partitions, we must
consider more than 100 partitions to split the key space.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: Update DataFusion pin
* chore: Update cargo
* fix: update for API changes
* fix: Update plans
* chore: Update for new api
* fix: Update plans
* chore: Update for API changes more
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This is the major part of #7470. Additional clean ups (e.g. to remove
the actual types from `data_types`) will follow.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: aggregator for DataFusion statistics
Required to implement #7470, esp. to implement the statistics folding
done within `RecordBatchesExec`.
* docs: improve
* refactor: consolidate pruning code
Let's have a single chunk pruning implementation in our code, not two.
Also removes a bit of crust from `QueryChunk` since it is technically no
longer responsible for pruning (this part has been pushed into the
querier for early pruning and bits for the `iox_query_influxrpc` for
some RPC shenanigans).
* test: regression test for incident
* fix: chunk pruning
* docs: add some test notes
This commit fixes loads of crates (47!) had unused dependencies, or
mis-configured dependencies (test deps as normal deps).
I added the "unused_crate_dependencies" to all crates to help prevent
this mess from growing again!
https://doc.rust-lang.org/beta/nightly-rustc/rustc_lint_defs/builtin/static.UNUSED_CRATE_DEPENDENCIES.html
This has the minor downside of false-positives when specifying
dev-dependencies for test/bench binaries - these are files in /test or
/benches (not normal tests). This commit includes a workaround,
importing them in lib.rs (gated by a feature flag). I think the
trade-off of better dependency management is worth it!
Let's have a single chunk pruning implementation in our code, not two.
Also removes a bit of crust from `QueryChunk` since it is technically no
longer responsible for pruning (this part has been pushed into the
querier for early pruning and bits for the `iox_query_influxrpc` for
some RPC shenanigans).
* test: add dedup test for multiple partitions and ranges
* refactor: remove `RedudantSort` optimizer pass
Similar to #7807 this is now covered by DataFusion, as demonstrated by
the fact that all query tests (incl. explain tests) still pass.
The good thing is: passes that are no longer required don't require any
upstreaming, so this also closes#7411.
DataFusion is now smart enough to do that using the builtin passes. No
`EXPLAIN` tests regressed.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* test: reproducer for idpe_17556
* fix: `ParquetSortness` and partial opt
1. correctly handle cases where `ParquetSortness` would optimize one
child branch but not the other
2. handle cases where `ParquetSortness` recusion should stop a bit
clearer (using `TreeNodeRewriter`)
3. rename query tests to be a bit clearer
4. add test case with many (but not too many) duplicate files and an
ingester (basically a prod use case where the compactor is slightly
behind)
---------
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This is somewhat a left-over from the old phys. plan construction where
we tried to fold in the sorts at the right place. Now the optimizer
takes care of that, so we can just express this as a standard logical
node (the same as SQL and InfluxQL). This makes the plan construction a
bit cleaner since the actual scan provider only performs the minimal
work that is required by DataFusion and the users (SQL, InfluxQL, reorg)
request what they actually need.
The tests in `iox_query::frontend::reorg::tests` that assert the tests
still pass and proof that the actual physical plans are identical w/
this approach.
Closes#7785.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: update DataFusion and arrow/parquet/arrow-flight to 39.0.0
* chore: update DataFusion and arrow/parquet/arrow-flight to 39.0.0 in workspace-hack/Cargo.toml
* chore: Run cargo hakari tasks
* chore: fix CI test and lint
* chore: update csv schema
* refactor: remove type-annotate for `Arc`
---------
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* test: add test for gap fill query missing time bounds
* chore: update unit test
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
As of #7250 / #7449 the partition sort key is no longer required for
query planning. Instead we use a combination of
`QueryChunk::partition_id` and `QueryChunk::sort_key` which is more
robust and easier to reason about.
Removing it simplifies the querier code a lot since we no longer need to
have a sort key for the ingester chunks and also don't need to "sync"
the sort key between chunks for consistency.
This is helpful to test changes in our defaults but also for testing.
Required for https://github.com/influxdata/idpe/issues/17474 .
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: Update datafusion and arrow/parquet to 37, tonic to 0.9.1
* refactor: Update for FieldRef and other API changes
* fix: Update field size calculation
* fix: Use `NullBuffer` directly
* fix: remove outdated comment
* chore: Update test for tonic
* chore: Run cargo hakari tasks
* chore: cargo update
---------
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: move logic for knowing how much to buffer into GapFiller
* chore: clippy
* chore: add some clarifying comments
* refactor: clean up relationships between gap filling types
* refactor: remove use of RefCell from BufferedInput
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
1. Add loads of tests for `ChunkTableProvider::scan` (= the naive phys.
plan before running any phys. optimizers)
2. Fix interaction of "no de-dup" and predicate pushdown. This might
be used by the ingester at some point and I would like to have this
correct before someone silently introduces a bug by pushing field
predicates into the ingester.
This is mostly prep-work for #7406 so I know that test coverage is
sufficient.
* feat: "parquet sortness" optimizer pass
Trade wider fan-out for the not having to fully sort parquet files.
For #6098.
* test: rename
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
With #6098 our `TableProvider` will declare `supports_filter_pushdown`
as "exact" since we handle the predicate pushdown ourselves. This has
two effects:
1. The phys. plan no longer contains an additional `FilterExec` node
even if we already do all the correct filtering. This will improve
performance.
2. The logical plan no longer contains a `Filter` node but instead the
predicate is part of the `TableScan`. This simplifies the logical
plan.
For (2) we need to adjust the gap fill logical optimizer to find the
time range again. Otherwise the optimizer pass will fail (which is
currently somewhat swallowed by DataFusion even though it is logged) and
the physical plan will contain our placeholder UDFs that are not
executable.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
We should resort properly when performing projection pushdown. Extended
test utils to actually catch this by checking the plan schemas.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: update gap fill planner rule to use LOCF
* chore: cargo fmt
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: projection pushdown should project `ParquetExec` ordering
Bug found while working on the final steps for #6098.
* fix: Update expected output
* test: make test even harder
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Try to combine chunks even when not all Union-arms/inputs are
combinable. This will later help to transform
```yaml
---
union:
- parquet:
files: [f1]
- parquet:
files: [f2]
- dedup:
parquet:
files: [f3]
```
into
```yaml
---
union:
- parquet:
files: [f1, f2]
- dedup:
parquet:
files: [f3]
```
Helps #6098.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: Update DataFusion
* refactor: Update predicate crate for new transform API
* refactor: Update iox_query crate for new APIs
* refactor: Update influxql for new API
* chore: Run cargo hakari tasks
---------
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
`extract_chunks` never runs after predicate pushdown. However IF this
should ever happen, we would potentially forget the predicates attached
to `ParquetExec`. So let's make sure we refuse chunk extraction in this
case. This is similar to the existing behavior, i.e. we don't support
chunk extraction after filter pushdown (i.e. if there is a filter around
an `RecordBatchesExec`).
For #6098.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This is helpful so that optimizer passes to forget the sort key, esp.
when the run after `DedupNullColumns` and `DedupSortOrder`.
For #6098.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Similar to #7217 there is no need to convert the arrow schema to an IOx
schema. This also makes it easier to handle the chunk order column in #6098.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
We don't need a validated IOx schema in this method. This will simplify
some work on #6098.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: implement gap fill with previous value
* test: update fill prev test to include null value
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: projection pushdown phys. optimizer
The is by far the largest pass (at least test-wise), because projections
are added last in the naive plan and you have to push them through
everything else. The actual code however isn't that complicated mostly
because we can reuse some DataFusion functionality and the different
variants for the different "child nodes" are very similar.
For #6098.
* feat: projection pushdown for `RecordBatchesExec`
* test: `test_ignore_when_partial_impure_projection_rename`
* test: more dedup projection tests
* test: integration
* feat: `SchemaAdapterStream` may create virtual columns
For chunk order handling in #6098.
* fix: improve `SchemaAdapterStream` docs and error handling
* chore: Upgrade to Rust 1.68
* fix: Remove unnecessary into_iter, thanks Clippy!
* fix: Use the size of the type, not a reference to the type... oops.
Thanks clippy!
* fix: Return block directly instead of creating a variable
Thanks clippy!
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: Break unnecessary dependencies from `iox_query` crate
In the process, the test code has been simplified.
* refactor: Move InfluxQL plan module to iox_query_influxql crate
* refactor: Move remaining behaviour from iox_query to iox_query_influxql
* chore: rustfmt 🙄
I was under the impression `clippy` would catch formatting
* feat: determine cheap de-dup sort order
For #6098.
* test: `test_three_chunks_different_subsets`
* fix: ensure that columns can be drawn early
* docs: improve algo explaination
* refactor: make code clearer
* chore: Normalise name of Call expression to lowercase
Simplifies matching functions in planner, as they are guaranteed to be
lowercase.
This also ensures compatibility with InfluxQL when generating column
alias names, which are reflected in updated tests.
* chore: Ensure aggregate functions fail gracefully.
* feat: GROUP BY tag support
* feat: Ensure schema-level metadata is propagated
Requires: https://github.com/apache/arrow-rs/issues/3779
* chore: Add some tests to validate GROUP BY output
* chore: Add clarifying comment
* chore: Declare message in flight.proto
The metadata is public API, so best practice is to encode this in a way
that is most compatible for clients in other languages, and will also
document the history of schema changes.
Added tests to validate the metadata is encoded correctly.
* chore: Placate linters
* chore: Use correct column in test cases
* chore: Add `is_projected` to the TagKeyColumn message
`is_projected` is necessary to inform a client whether it should include
the tag key is used exclusively for the group key (false) or also
projected in the `SELECT` column list.
* refactor: Move constants to `schema` crate per PR feedback
* chore: rustfmt 🙄
* chore: Update docs for InfluxQlMetadata
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>