* feat: "parquet sortness" optimizer pass
Trade wider fan-out for not having to fully sort parquet files.
For #6098.
* test: rename
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
With #6098 our `TableProvider` will declare `supports_filter_pushdown`
as "exact" since we handle the predicate pushdown ourselves. This has
two effects:
1. The phys. plan no longer contains an additional `FilterExec` node
even if we already do all the correct filtering. This will improve
performance.
2. The logical plan no longer contains a `Filter` node but instead the
predicate is part of the `TableScan`. This simplifies the logical
plan.
For (2) we need to adjust the gap fill logical optimizer to find the
time range again. Otherwise the optimizer pass will fail (which is
currently somewhat swallowed by DataFusion even though it is logged) and
the physical plan will contain our placeholder UDFs that are not
executable.
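
As a rough illustration of the adjustment described above, the time range lookup has to consider both the old plan shape (a `Filter` node) and the new one (a predicate stored on the `TableScan`). The `Plan` enum, its fields, and `find_time_range` below are hypothetical stand-ins, not the DataFusion or IOx types:

```rust
/// Hypothetical, heavily simplified logical plan. The real optimizer works
/// on DataFusion `LogicalPlan` nodes and arbitrary predicate expressions.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Old shape: the predicate lives in a separate `Filter` node.
    Filter {
        time_range: Option<(i64, i64)>,
        input: Box<Plan>,
    },
    /// New shape: the predicate was pushed into the scan itself.
    TableScan {
        pushed_down_time_range: Option<(i64, i64)>,
    },
}

/// Find the time range regardless of where the predicate ended up.
fn find_time_range(plan: &Plan) -> Option<(i64, i64)> {
    match plan {
        // A filter that carries a time range wins.
        Plan::Filter { time_range: Some(r), .. } => Some(*r),
        // A filter without a time range: keep looking below it.
        Plan::Filter { input, .. } => find_time_range(input),
        // With "exact" pushdown the range is attached to the scan.
        Plan::TableScan { pushed_down_time_range } => *pushed_down_time_range,
    }
}
```

With this shape, a plan consisting only of a `TableScan` that carries the pushed-down range still yields a time range, so the gap fill rewrite does not have to fail.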
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
We should re-sort properly when performing projection pushdown. Extended
test utils to actually catch this by checking the plan schemas.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: update gap fill planner rule to use LOCF
* chore: cargo fmt
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: projection pushdown should project `ParquetExec` ordering
Bug found while working on the final steps for #6098.
* fix: Update expected output
* test: make test even harder
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Try to combine chunks even when not all Union-arms/inputs are
combinable. This will later help to transform
```yaml
---
union:
  - parquet:
      files: [f1]
  - parquet:
      files: [f2]
  - dedup:
      parquet:
        files: [f3]
```
into
```yaml
---
union:
  - parquet:
      files: [f1, f2]
  - dedup:
      parquet:
        files: [f3]
```
Helps #6098.
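
A minimal sketch of the transformation above, assuming a toy plan tree. The `Plan` enum and `combine_union_arms` are hypothetical names for illustration; the real pass operates on DataFusion `ExecutionPlan` nodes:

```rust
/// Hypothetical, simplified plan tree used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Parquet { files: Vec<String> },
    Dedup(Box<Plan>),
    Union(Vec<Plan>),
}

/// Merge all direct `Parquet` arms of a union into one arm and keep every
/// other arm (e.g. `Dedup`) untouched, in its original relative order.
fn combine_union_arms(plan: Plan) -> Plan {
    // Anything that is not a union is returned unchanged.
    let Plan::Union(arms) = plan else {
        return plan;
    };

    let mut files = Vec::new();
    let mut other = Vec::new();
    for arm in arms {
        match arm {
            // Combinable arm: collect its files.
            Plan::Parquet { files: f } => files.extend(f),
            // Non-combinable arm: pass through as-is.
            arm => other.push(arm),
        }
    }

    let mut out = Vec::with_capacity(other.len() + 1);
    if !files.is_empty() {
        out.push(Plan::Parquet { files });
    }
    out.extend(other);
    Plan::Union(out)
}
```

The point of the sketch is that a single non-combinable arm no longer prevents the remaining arms from being merged.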
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: Update DataFusion
* refactor: Update predicate crate for new transform API
* refactor: Update iox_query crate for new APIs
* refactor: Update influxql for new API
* chore: Run cargo hakari tasks
---------
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
`extract_chunks` never runs after predicate pushdown. However IF this
should ever happen, we would potentially forget the predicates attached
to `ParquetExec`. So let's make sure we refuse chunk extraction in this
case. This is similar to the existing behavior, i.e. we don't support
chunk extraction after filter pushdown (i.e. if there is a filter around
a `RecordBatchesExec`).
For #6098.
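
The refusal rule can be sketched as follows. The `Plan` enum and this `extract_chunks` signature are hypothetical simplifications; the real function inspects DataFusion physical plan nodes:

```rust
/// Hypothetical, simplified physical plan used only for this illustration.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Parquet { files: Vec<String>, predicate: Option<String> },
    RecordBatches { chunks: Vec<String> },
    Filter { predicate: String, input: Box<Plan> },
    Union(Vec<Plan>),
}

/// Return the chunk names found in `plan`, or `None` if extraction would
/// silently drop a predicate.
fn extract_chunks(plan: &Plan) -> Option<Vec<String>> {
    match plan {
        // A predicate was pushed into the scan: refuse, otherwise we
        // would forget it when rebuilding the plan from the chunks.
        Plan::Parquet { predicate: Some(_), .. } => None,
        Plan::Parquet { files, predicate: None } => Some(files.clone()),
        Plan::RecordBatches { chunks } => Some(chunks.clone()),
        // Existing behavior: a filter around a scan is not supported.
        Plan::Filter { .. } => None,
        // A union is extractable only if every arm is.
        Plan::Union(arms) => {
            let mut out = Vec::new();
            for arm in arms {
                out.extend(extract_chunks(arm)?);
            }
            Some(out)
        }
    }
}
```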
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This is helpful so that optimizer passes do not forget the sort key, esp.
when they run after `DedupNullColumns` and `DedupSortOrder`.
For #6098.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Similar to #7217 there is no need to convert the arrow schema to an IOx
schema. This also makes it easier to handle the chunk order column in #6098.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
We don't need a validated IOx schema in this method. This will simplify
some work on #6098.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: implement gap fill with previous value
* test: update fill prev test to include null value
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: projection pushdown phys. optimizer
This is by far the largest pass (at least test-wise), because projections
are added last in the naive plan and you have to push them through
everything else. The actual code however isn't that complicated mostly
because we can reuse some DataFusion functionality and the different
variants for the different "child nodes" are very similar.
For #6098.
* feat: projection pushdown for `RecordBatchesExec`
* test: `test_ignore_when_partial_impure_projection_rename`
* test: more dedup projection tests
* test: integration
* feat: `SchemaAdapterStream` may create virtual columns
For chunk order handling in #6098.
* fix: improve `SchemaAdapterStream` docs and error handling
* chore: Upgrade to Rust 1.68
* fix: Remove unnecessary into_iter, thanks Clippy!
* fix: Use the size of the type, not a reference to the type... oops.
Thanks clippy!
* fix: Return block directly instead of creating a variable
Thanks clippy!
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: Break unnecessary dependencies from `iox_query` crate
In the process, the test code has been simplified.
* refactor: Move InfluxQL plan module to iox_query_influxql crate
* refactor: Move remaining behaviour from iox_query to iox_query_influxql
* chore: rustfmt 🙄
I was under the impression `clippy` would catch formatting
* feat: determine cheap de-dup sort order
For #6098.
* test: `test_three_chunks_different_subsets`
* fix: ensure that columns can be drawn early
* docs: improve algo explanation
* refactor: make code clearer
* chore: Normalise name of Call expression to lowercase
Simplifies matching functions in planner, as they are guaranteed to be
lowercase.
This also ensures compatibility with InfluxQL when generating column
alias names, which are reflected in updated tests.
* chore: Ensure aggregate functions fail gracefully.
* feat: GROUP BY tag support
* feat: Ensure schema-level metadata is propagated
Requires: https://github.com/apache/arrow-rs/issues/3779
* chore: Add some tests to validate GROUP BY output
* chore: Add clarifying comment
* chore: Declare message in flight.proto
The metadata is public API, so best practice is to encode this in a way
that is most compatible for clients in other languages, and will also
document the history of schema changes.
Added tests to validate the metadata is encoded correctly.
* chore: Placate linters
* chore: Use correct column in test cases
* chore: Add `is_projected` to the TagKeyColumn message
`is_projected` is necessary to inform a client whether the tag key is
used exclusively for the group key (false) or is also projected in the
`SELECT` column list (true).
* refactor: Move constants to `schema` crate per PR feedback
* chore: rustfmt 🙄
* chore: Update docs for InfluxQlMetadata
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
When combining sort keys, we have to check the schema of the chunk to
differentiate between "column does not exist within this chunk" and
"column exists but is not sorted".
This is unlikely to be an issue in prod at the moment (if there is no bug in
the ingester or compactor), but this was found while working on tests
for #6098. Overall this should improve robustness.
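
The distinction can be sketched like this. `Chunk`, `ColumnSortState`, and `sort_state` are hypothetical names for illustration and do not mirror the actual IOx types; what the combining logic does with each state is left out on purpose:

```rust
use std::collections::HashSet;

/// The three cases a column can be in for a given chunk.
#[derive(Debug, PartialEq)]
enum ColumnSortState {
    /// Column is part of the chunk's sort key at the given position.
    Sorted { position: usize },
    /// Column does not exist within this chunk.
    NotInChunk,
    /// Column exists but is not sorted.
    PresentButUnsorted,
}

/// Hypothetical chunk: column names in its schema plus its sort key.
struct Chunk {
    schema: HashSet<String>,
    sort_key: Vec<String>,
}

/// Classify `column` by consulting the sort key first and the schema
/// second, so that "missing" and "unsorted" are not conflated.
fn sort_state(chunk: &Chunk, column: &str) -> ColumnSortState {
    if let Some(position) = chunk.sort_key.iter().position(|c| c == column) {
        ColumnSortState::Sorted { position }
    } else if chunk.schema.contains(column) {
        ColumnSortState::PresentButUnsorted
    } else {
        ColumnSortState::NotInChunk
    }
}
```

Looking only at the sort key would put the last two cases into the same bucket, which is the situation the fix avoids.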
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: remove unused `ColumnSort`
* refactor: remove invalid assertion
It is true that time SHOULD be the last sort key, but we absolutely
don't require that, esp. not in the query tier. The ingester will
currently always produce sort keys where time is last, but if we are ever
going to deal w/ external data sources like bulk loaded parquet files,
this may not always be the case.
Found while constructing some edge case tests.
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>