# IOx Physical Plan Construction

This document describes how DataFusion physical plans should be constructed in IOx. As a reminder: our logical plans
contain a `ChunkTableProvider` (implements [`TableProvider`]) that contains a set of `QueryChunk`s. The main entry point
for DataFusion is [`TableProvider::scan`], which receives the following information (see the sketch after this list):

- context (unused)
- projection (optional, otherwise all columns are retrieved)
- filter expression
- limit (unused)
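
To make these inputs concrete, here is a minimal Rust stand-in for the values that [`TableProvider::scan`] hands to IOx. `ScanArgs` is a hypothetical illustration type, not DataFusion's API; the context parameter is omitted because IOx does not use it:

```rust
/// Hypothetical stand-in for the inputs of `TableProvider::scan`.
/// Types are simplified (e.g. `String` instead of DataFusion's `Expr`).
struct ScanArgs<'a> {
    /// Column indices to keep; `None` means "retrieve all columns".
    projection: Option<&'a Vec<usize>>,
    /// Filter expressions that IOx must honor exactly
    /// (`TableProviderFilterPushDown::Exact`).
    filters: &'a [String],
    /// Row limit; currently unused by IOx.
    limit: Option<usize>,
}

fn main() {
    let cols = vec![0, 2];
    let args = ScanArgs { projection: Some(&cols), filters: &[], limit: None };
    println!("keep columns: {:?}", args.projection);
}
```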

We want to design a system that:

- is correct (esp. it handles deduplication)
- is efficient (esp. it should support [`TableProvider::supports_filter_pushdown`] = [`TableProviderFilterPushDown::Exact`])
- is scalable (esp. it should avoid large fan-outs)
- is easy to understand and extend
- works hand-in-hand w/ DataFusion's optimizer framework

We use the physical plan to fulfill the requirements instead of the logical planning system, because:

- it allows the querier to create a logical plan w/o contacting the ingesters (helpful for debugging and UX)
- currently (2023-02-14) the physical plan in DataFusion seems to be way more customizable than the logical plan
- it better expresses the "what? scan a table" (logical) vs the "how? perform dedup on this data" (physical)

The overall strategy is the following (a sketch of the pass wiring follows the list):

1. **Initial Plan:** Construct a semantically correct, naive physical plan.
2. **IOx Optimizer Passes:** Apply IOx optimizer passes (in the order in which they occur in this document).
3. **DataFusion Optimizer Passes:** Apply DataFusion's optimizer passes. This will esp. add all required sorts:
   - before `DeduplicateExec`
   - if output sorting is required (e.g. for SQL queries a la `SELECT ... FROM ... ORDER BY ...`)
   - if any group-by expression within the DataFusion plan can use sorting
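
A minimal sketch of the intended pass ordering, using hypothetical `Plan` and `OptimizerPass` placeholders rather than DataFusion's actual physical optimizer rule API:

```rust
/// Hypothetical plan representation; stands in for `Arc<dyn ExecutionPlan>`.
#[derive(Debug, Clone)]
struct Plan(String);

/// Hypothetical pass interface; the real counterpart would be a DataFusion
/// physical optimizer rule.
trait OptimizerPass {
    fn name(&self) -> &str;
    fn optimize(&self, plan: Plan) -> Plan;
}

/// Placeholder pass; the real passes are described in the sections below.
struct PartitionSplit;
impl OptimizerPass for PartitionSplit {
    fn name(&self) -> &str { "partition_split" }
    fn optimize(&self, plan: Plan) -> Plan { plan } // no-op for illustration
}

/// Passes run strictly in the order in which they occur in this document.
fn run_passes(mut plan: Plan, passes: &[Box<dyn OptimizerPass>]) -> Plan {
    for pass in passes {
        plan = pass.optimize(plan);
        println!("after {}: {:?}", pass.name(), plan);
    }
    plan
}

fn main() {
    let passes: Vec<Box<dyn OptimizerPass>> = vec![Box::new(PartitionSplit)];
    println!("final: {:?}", run_passes(Plan("naive plan".into()), &passes));
}
```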

This document uses [YAML] to illustrate the plans or plan transformations (in which case two plans are shown). Only the
relevant parameters are shown.

We assume that `QueryChunk`s can be transformed into `RecordBatchesExec`/[`ParquetExec`] and that we can always recover
the original `QueryChunk`s from these nodes. We use `ChunkExec` as a placeholder for `RecordBatchesExec`/[`ParquetExec`]
if the concrete node type is irrelevant.

## Initial Plan

The initial plan should be correct under [`TableProvider::supports_filter_pushdown`] =
[`TableProviderFilterPushDown::Exact`]. Hence it must capture the projection and filter parameters.

```yaml
---
ProjectionExec: # optional
  FilterExec:
    DeduplicateExec:
      UnionExec:
        - RecordBatchesExec
        - ParquetExec:
            store: A
        # if there are multiple stores (unlikely)
        - ParquetExec:
            store: B
```

The created [`ParquetExec`] does NOT contain any predicates or projections. The files may be grouped into
[`target_partitions`] partitions.

## Union Handling

There are some essential transformations around [`UnionExec`] that always apply. They may be used at any point (either
as an extra optimizer rule or built into some of the other rules). They are mentioned here once so the remaining
transformations are easier to follow.

### Union Un-nesting

```yaml
---
UnionExec:
  - UnionExec:
      - SomeExec1
      - SomeExec2
  - SomeExec3

---
UnionExec:
  - SomeExec1
  - SomeExec2
  - SomeExec3
```

### 1-Unions

```yaml
---
UnionExec:
  - SomeExec1

---
SomeExec1
```
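
Both rewrites are simple bottom-up tree transformations. A toy sketch (the `Plan` enum is a hypothetical stand-in for DataFusion's execution plan tree):

```rust
/// Toy plan tree; stands in for `Arc<dyn ExecutionPlan>`.
#[derive(Debug, PartialEq)]
enum Plan {
    Union(Vec<Plan>),
    Leaf(&'static str),
}

/// Un-nest child unions and unwrap 1-unions, bottom-up.
fn normalize(plan: Plan) -> Plan {
    match plan {
        Plan::Union(children) => {
            let mut flat = Vec::new();
            for child in children {
                match normalize(child) {
                    // union un-nesting: splice grand-children in place
                    Plan::Union(grand_children) => flat.extend(grand_children),
                    other => flat.push(other),
                }
            }
            if flat.len() == 1 {
                // 1-union: replace the union by its only child
                flat.pop().unwrap()
            } else {
                Plan::Union(flat)
            }
        }
        leaf => leaf,
    }
}

fn main() {
    let plan = Plan::Union(vec![
        Plan::Union(vec![Plan::Leaf("SomeExec1"), Plan::Leaf("SomeExec2")]),
        Plan::Leaf("SomeExec3"),
    ]);
    assert_eq!(
        normalize(plan),
        Plan::Union(vec![
            Plan::Leaf("SomeExec1"),
            Plan::Leaf("SomeExec2"),
            Plan::Leaf("SomeExec3"),
        ])
    );
}
```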

## Empty Chunk Nodes

`RecordBatchesExec` w/o any `RecordBatch`es and [`ParquetExec`] w/o any files may just be removed from the plan if the
parent node is a [`UnionExec`].

## Deduplication Scope

Deduplication must only be performed if there are duplicate tuples (based on their primary key). This is the case if
either duplicates may occur within a chunk (e.g. freshly after ingest) or if the key spaces of chunks overlap. Since
deduplication potentially requires sorting and is a costly operation in itself, we want to avoid it as much as
possible and also limit the scope (i.e. the set of tuples) on which a deduplication acts.

### Partition Split

```yaml
---
DeduplicateExec:
  UnionExec: # optional, may only contain a single child node
    - ChunkExec:
        partition: A
    - ChunkExec:
        partition: B
    - ChunkExec:
        partition: A

---
UnionExec:
  - DeduplicateExec:
      UnionExec:
        - ChunkExec:
            partition: A
        - ChunkExec:
            partition: A
  - DeduplicateExec:
      UnionExec:
        - ChunkExec:
            partition: B
```
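
A minimal sketch of the grouping step, assuming a hypothetical `Chunk` stand-in for `QueryChunk`: chunks are bucketed by partition, and each bucket would then be wrapped in its own `DeduplicateExec`:

```rust
use std::collections::BTreeMap;

/// Toy chunk: just a partition id plus a label.
#[derive(Debug)]
struct Chunk {
    partition: &'static str,
    name: &'static str,
}

/// Group chunks by partition; every group gets its own `DeduplicateExec`
/// (represented here as one `Vec<Chunk>` per partition).
fn partition_split(chunks: Vec<Chunk>) -> BTreeMap<&'static str, Vec<Chunk>> {
    let mut groups: BTreeMap<&'static str, Vec<Chunk>> = BTreeMap::new();
    for chunk in chunks {
        groups.entry(chunk.partition).or_default().push(chunk);
    }
    groups
}

fn main() {
    let chunks = vec![
        Chunk { partition: "A", name: "c1" },
        Chunk { partition: "B", name: "c2" },
        Chunk { partition: "A", name: "c3" },
    ];
    for (partition, group) in partition_split(chunks) {
        println!("DeduplicateExec over partition {partition}: {group:?}");
    }
}
```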

### Time Split

From the `QueryChunk` statistics we always know the time range of each chunk. If chunks do NOT overlap in these ranges,
they also do NOT overlap in their key space. Hence we can use the time range to split `DeduplicateExec` nodes.

```yaml
---
DeduplicateExec:
  UnionExec: # optional, may only contain a single child node
    - ChunkExec:
        ts_min: 1
        ts_max: 10
    - ChunkExec:
        ts_min: 2
        ts_max: 5
    - ChunkExec:
        ts_min: 5
        ts_max: 5
    - ChunkExec:
        ts_min: 8
        ts_max: 9
    - ChunkExec:
        ts_min: 11
        ts_max: 15
    - ChunkExec:
        ts_min: 16
        ts_max: 17
    - ChunkExec:
        ts_min: 17
        ts_max: 18

---
UnionExec:
  - DeduplicateExec:
      UnionExec:
        - ChunkExec:
            ts_min: 1
            ts_max: 10
        - ChunkExec:
            ts_min: 2
            ts_max: 5
        - ChunkExec:
            ts_min: 5
            ts_max: 5
        - ChunkExec:
            ts_min: 8
            ts_max: 9
  - DeduplicateExec:
      UnionExec:
        - ChunkExec:
            ts_min: 11
            ts_max: 15
  - DeduplicateExec:
      UnionExec:
        - ChunkExec:
            ts_min: 16
            ts_max: 17
        - ChunkExec:
            ts_min: 17
            ts_max: 18
```
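
A minimal sketch of the splitting logic: sort chunks by `ts_min`, then sweep once and start a new group whenever a chunk does not overlap the current group's time range. The `Chunk` type is a hypothetical stand-in for `QueryChunk` statistics:

```rust
/// Toy chunk with an inclusive time range taken from `QueryChunk` statistics.
#[derive(Debug, Clone)]
struct Chunk {
    ts_min: i64,
    ts_max: i64,
}

/// Split chunks into groups such that chunks from different groups do not
/// overlap in time (and hence cannot contain duplicates of one another).
fn time_split(mut chunks: Vec<Chunk>) -> Vec<Vec<Chunk>> {
    chunks.sort_by_key(|c| c.ts_min);
    let mut groups: Vec<Vec<Chunk>> = Vec::new();
    let mut group_ts_max = i64::MIN;
    for chunk in chunks {
        // Inclusive ranges overlap iff `chunk.ts_min <= group_ts_max`.
        if groups.is_empty() || chunk.ts_min > group_ts_max {
            groups.push(Vec::new());
            group_ts_max = chunk.ts_max;
        } else {
            group_ts_max = group_ts_max.max(chunk.ts_max);
        }
        groups.last_mut().unwrap().push(chunk);
    }
    groups
}

fn main() {
    // The chunks from the example above; yields the same three groups.
    let chunks = [(1, 10), (2, 5), (5, 5), (8, 9), (11, 15), (16, 17), (17, 18)]
        .map(|(ts_min, ts_max)| Chunk { ts_min, ts_max })
        .to_vec();
    for group in time_split(chunks) {
        println!("DeduplicateExec over {group:?}");
    }
}
```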

### No Duplicates

If a `DeduplicateExec` has a single child node and that node does NOT contain any duplicates (based on the primary key),
then we can remove the `DeduplicateExec`:

```yaml
---
DeduplicateExec:
  ChunkExec:
    may_contain_pk_duplicates: false

---
ChunkExec:
  may_contain_pk_duplicates: false
```

## Node Grouping

After the deduplication handling, chunks may or may not be contained in singular exec nodes. These transformations try
to reorganize them in a way that query execution is efficient.

### Type Grouping

`RecordBatchesExec`s can be grouped into a single node. [`ParquetExec`]s can be grouped by object store (this is a
single store in most cases):

```yaml
---
UnionExec:
  - RecordBatchesExec:
      chunks: [C1, C2]
  - ParquetExec:
      chunks: [C4]
      store: A
  - ParquetExec:
      chunks: [C5]
      store: B
  - RecordBatchesExec:
      chunks: [C6]
  - ParquetExec:
      chunks: [C7]
      store: A

---
UnionExec:
  - RecordBatchesExec:
      chunks: [C1, C2, C6]
  - ParquetExec:
      chunks: [C4, C7]
      store: A
  - ParquetExec:
      chunks: [C5]
      store: B
```

### Sort Grouping

Since DataFusion will insert the necessary [`SortExec`]s for us, it is important that we are able to tell it about
already sorted data. This only concerns [`ParquetExec`], since `RecordBatchesExec` is based on not-yet-sorted ingester
data. [`ParquetExec`] is able to express its sorting per partition, so we are NOT required to create one [`ParquetExec`]
per existing sort order. We just need to make sure that files/chunks with the same sorting end up in the same partition
and that DataFusion knows about it ([`FileScanConfig::output_ordering`]).

This somewhat interferes with the [`target_partitions`] setting. We shall find a good balance between avoiding resorts
and "too wide" fan-outs.
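
A minimal sketch of the partition assignment, using a hypothetical `File` type: files that share a sort key land in the same group, so the group's common ordering can be declared to DataFusion:

```rust
use std::collections::BTreeMap;

/// Toy parquet file: name plus the column order it is sorted by.
#[derive(Debug)]
struct File {
    name: &'static str,
    sort_key: Vec<&'static str>,
}

fn main() {
    let files = vec![
        File { name: "f1.parquet", sort_key: vec!["tag1", "time"] },
        File { name: "f2.parquet", sort_key: vec!["tag1", "tag2", "time"] },
        File { name: "f3.parquet", sort_key: vec!["tag1", "time"] },
    ];

    // Files with the same sort key end up in the same partition, so their
    // common ordering can be reported via `FileScanConfig::output_ordering`.
    let mut partitions: BTreeMap<Vec<&str>, Vec<&str>> = BTreeMap::new();
    for file in &files {
        partitions.entry(file.sort_key.clone()).or_default().push(file.name);
    }
    for (sort_key, names) in partitions {
        println!("partition sorted by {sort_key:?}: {names:?}");
    }
}
```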

## Predicate Pushdown

We may push down filters closer to the source under certain circumstances.

### Predicates & Unions

[`FilterExec`] can always be pushed through [`UnionExec`], since [`FilterExec`] only allows row-based operations:

```yaml
---
FilterExec:
  UnionExec:
    - SomeExec1
    - SomeExec2

---
UnionExec:
  - FilterExec:
      SomeExec1
  - FilterExec:
      SomeExec2
```

### Predicates & Projections

With the current "Initial Plan" and rule ordering, it should not be required to push predicates through projections. The
opposite, however, is the case; see "Projections & Predicates".

Note that we could add a transformation implementing this if we ever require it.

### Predicates & Dedup

[`FilterExec`]s contain a [`PhysicalExpr`]. If this is an AND-chain / logical conjunction, we can split it into
sub-expressions (otherwise we treat the whole expression as a single sub-expression). For each sub-expression, we can
tell which columns it uses. If it refers to only primary-key columns (i.e. no references to fields), we can push it
through `DeduplicateExec`:

```yaml
---
FilterExec:
  expr: (field = 1) AND (field=2 OR tag=3) AND (tag > 0)
  child:
    DeduplicateExec:
      SomeExec

---
FilterExec:
  expr: (field = 1) AND (field=2 OR tag=3)
  child:
    DeduplicateExec:
      FilterExec:
        expr: tag > 0
        child:
          SomeExec
```
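
A minimal sketch of the split, using a toy `Expr` enum instead of DataFusion's [`PhysicalExpr`]: flatten the AND-chain, then partition the conjuncts by whether they reference primary-key columns only:

```rust
/// Toy expression type; stands in for DataFusion's `PhysicalExpr`.
#[derive(Debug, Clone)]
enum Expr {
    And(Box<Expr>, Box<Expr>),
    /// Any non-AND sub-expression, tagged with the columns it references.
    Other { text: &'static str, columns: Vec<&'static str> },
}

/// Flatten an AND-chain into its conjuncts.
fn conjuncts(expr: Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::And(l, r) => {
            conjuncts(*l, out);
            conjuncts(*r, out);
        }
        other => out.push(other),
    }
}

/// Does this sub-expression reference primary-key columns only?
fn pk_only(expr: &Expr, pk: &[&str]) -> bool {
    match expr {
        Expr::And(l, r) => pk_only(l, pk) && pk_only(r, pk),
        Expr::Other { columns, .. } => columns.iter().all(|c| pk.contains(c)),
    }
}

fn main() {
    let pk = ["tag", "time"];
    // (field = 1) AND (field=2 OR tag=3) AND (tag > 0)
    let expr = Expr::And(
        Box::new(Expr::And(
            Box::new(Expr::Other { text: "field = 1", columns: vec!["field"] }),
            Box::new(Expr::Other { text: "field=2 OR tag=3", columns: vec!["field", "tag"] }),
        )),
        Box::new(Expr::Other { text: "tag > 0", columns: vec!["tag"] }),
    );

    let mut parts = Vec::new();
    conjuncts(expr, &mut parts);
    // Conjuncts touching only PK columns go below `DeduplicateExec`,
    // everything else stays above it.
    let (push_down, keep): (Vec<_>, Vec<_>) = parts.into_iter().partition(|e| pk_only(e, &pk));
    println!("push below dedup: {push_down:?}");
    println!("keep above dedup: {keep:?}");
}
```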

Note that empty filters are removed during this process:

```yaml
---
FilterExec:
  expr: tag > 0
  child:
    DeduplicateExec:
      SomeExec

---
DeduplicateExec:
  FilterExec:
    expr: tag > 0
    child:
      SomeExec
```

### Predicates & Parquet

Predicates can be pushed down into [`ParquetExec`] and are partially evaluated there (depending on various other configs
and the complexity of the filter). Note that the [`ParquetExec`] itself decides when/how to evaluate predicates, so we
are not required to perform any predicate manipulation here:

```yaml
---
FilterExec:
  expr: (tag1 > 0) AND (some_fun(tag2) = 2)
  child:
    DeduplicateExec:
      ParquetExec:
        files: ...

---
FilterExec:
  expr: (tag1 > 0) AND (some_fun(tag2) = 2)
  child:
    DeduplicateExec:
      ParquetExec:
        predicate: (tag1 > 0) AND (some_fun(tag2) = 2)
        files: ...
```

### Predicates & Record Batches

`RecordBatchesExec` does not have any filter mechanism built in and hence relies on [`FilterExec`] to evaluate
predicates. We therefore do NOT push down any predicates into `RecordBatchesExec`.

## Projection Pushdown

This concerns the pushdown of column selections only. Note that [`ProjectionExec`] may also rename columns or even
calculate new ones; these cases are NOT part of this rule and are never generated by the "Initial Plan".
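
A minimal sketch of the distinction, using a toy expression type: a projection qualifies for pushdown only if every expression is a plain column kept under its own name:

```rust
/// Toy projection expression; stands in for DataFusion's `PhysicalExpr`.
#[derive(Debug)]
enum ProjExpr {
    /// plain column reference, possibly under a new output name
    Column { name: &'static str, alias: &'static str },
    /// anything computed, e.g. `field1 + field2`
    Computed(&'static str),
}

/// The pushdown rules below only apply to pure column selections:
/// every expression is a column kept under its own name.
fn is_pure_selection(exprs: &[ProjExpr]) -> bool {
    exprs.iter().all(|e| matches!(e, ProjExpr::Column { name, alias } if name == alias))
}

fn main() {
    let selection = [
        ProjExpr::Column { name: "tag1", alias: "tag1" },
        ProjExpr::Column { name: "field2", alias: "field2" },
    ];
    let computed = [ProjExpr::Computed("field1 + field2")];
    assert!(is_pure_selection(&selection));
    assert!(!is_pure_selection(&computed));
    println!("ok");
}
```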

### Projections & Unions

Projections can always be pushed through union operations since they are only column-based and all union inputs are
required to have the same schema:

```yaml
---
ProjectionExec:
  keep: tag1, field2, time
  child:
    UnionExec:
      - SomeExec1
      - SomeExec2

---
UnionExec:
  - ProjectionExec:
      keep: tag1, field2, time
      child:
        SomeExec1
  - ProjectionExec:
      keep: tag1, field2, time
      child:
        SomeExec2
```

### Projections & Predicates

Projections may be pushed through [`FilterExec`] if they keep all columns required to evaluate the filter expression:

```yaml
---
ProjectionExec:
  keep: tag1, field2, time
  child:
    FilterExec:
      predicate: field3 > 0
      child:
        SomeExec

---
ProjectionExec:
  keep: tag1, field2, time
  child:
    FilterExec:
      predicate: field3 > 0
      child:
        ProjectionExec:
          keep: tag1, field2, field3, time
          child:
            SomeExec
```
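
The pushed-down projection is just the set union of the requested columns and the columns the filter reads; the same computation recurs for "Projections & Dedup" below with the primary key in place of the filter columns. A minimal sketch:

```rust
use std::collections::BTreeSet;

fn main() {
    // Columns the query wants to keep, and columns the filter needs.
    let keep = ["tag1", "field2", "time"];
    let filter_columns = ["field3"];

    // The pushed-down projection must retain everything the filter reads;
    // the outer projection then drops the extra columns again.
    let pushed: BTreeSet<&str> = keep.iter().chain(filter_columns.iter()).copied().collect();
    println!("inner projection: {pushed:?}"); // {"field2", "field3", "tag1", "time"}
    println!("outer projection: {keep:?}");
}
```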

### Projections & Dedup

Projections that do NOT remove primary keys can be pushed through the deduplication. This is also compatible with the
[`SortExec`]s added by DataFusion, since these will only act on the primary keys:

```yaml
---
ProjectionExec:
  keep: tag1, field2, time
  child:
    DeduplicateExec:
      # We assume a primary key of [tag1, tag2, time] here,
      # but `SomeExec` may have more fields (e.g. [field1, field2]).
      SomeExec

---
ProjectionExec:
  keep: tag1, field2, time
  child:
    DeduplicateExec:
      ProjectionExec:
        keep: tag1, tag2, field2, time
        child:
          SomeExec
```

### Projections & Parquet

[`ParquetExec`] can be instructed to only deserialize required columns via [`FileScanConfig::projection`]. Note that we
shall not modify [`FileScanConfig::file_schema`] because we MUST NOT remove columns that are used for pushdown
predicates.

```yaml
---
ProjectionExec:
  keep: tag1, field2, time
  child:
    ParquetExec:
      predicate: field1 > 0
      projection: null

---
ParquetExec:
  predicate: field1 > 0
  projection: tag1, field2, time
```
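
A minimal sketch of how the column names could be translated into the index-based [`FileScanConfig::projection`] (indices refer to the unmodified [`FileScanConfig::file_schema`]):

```rust
fn main() {
    // Column order of `FileScanConfig::file_schema` (which we leave untouched).
    let file_schema = ["tag1", "tag2", "field1", "field2", "time"];
    let keep = ["tag1", "field2", "time"];

    // `FileScanConfig::projection` is expressed as indices into the file schema.
    let projection: Vec<usize> = keep
        .iter()
        .map(|name| file_schema.iter().position(|c| c == name).expect("unknown column"))
        .collect();
    println!("projection indices: {projection:?}"); // [0, 3, 4]
}
```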

### Projections & Record Batches

While `RecordBatchesExec` does not implement any predicate evaluation, it does implement projection (column selection).
The reason is that it creates NULL-columns for batches that do not contain the required output columns. Hence it is
valuable to push down projections into `RecordBatchesExec` so we can avoid creating columns that we would throw away
anyway:

```yaml
---
ProjectionExec:
  keep: tag1, field2, time
  child:
    RecordBatchesExec:
      schema: tag1, tag2, field1, field2, time

---
RecordBatchesExec:
  schema: tag1, field2, time
```


[`FileScanConfig::file_schema`]: https://docs.rs/datafusion/18.0.0/datafusion/physical_plan/file_format/struct.FileScanConfig.html#structfield.file_schema
[`FileScanConfig::output_ordering`]: https://docs.rs/datafusion/18.0.0/datafusion/physical_plan/file_format/struct.FileScanConfig.html#structfield.output_ordering
[`FileScanConfig::projection`]: https://docs.rs/datafusion/18.0.0/datafusion/physical_plan/file_format/struct.FileScanConfig.html#structfield.projection
[`FilterExec`]: https://docs.rs/datafusion/18.0.0/datafusion/physical_plan/filter/struct.FilterExec.html
[`ParquetExec`]: https://docs.rs/datafusion/18.0.0/datafusion/physical_plan/file_format/struct.ParquetExec.html
[`PhysicalExpr`]: https://docs.rs/datafusion/18.0.0/datafusion/physical_plan/trait.PhysicalExpr.html
[`ProjectionExec`]: https://docs.rs/datafusion/18.0.0/datafusion/physical_plan/projection/struct.ProjectionExec.html
[`SortExec`]: https://docs.rs/datafusion/18.0.0/datafusion/physical_plan/sorts/sort/struct.SortExec.html
[`TableProvider`]: https://docs.rs/datafusion/18.0.0/datafusion/datasource/datasource/trait.TableProvider.html
[`TableProvider::scan`]: https://docs.rs/datafusion/18.0.0/datafusion/datasource/datasource/trait.TableProvider.html#tymethod.scan
[`TableProvider::supports_filter_pushdown`]: https://docs.rs/datafusion/18.0.0/datafusion/datasource/datasource/trait.TableProvider.html#method.supports_filter_pushdown
[`TableProviderFilterPushDown::Exact`]: https://docs.rs/datafusion/18.0.0/datafusion/datasource/datasource/enum.TableProviderFilterPushDown.html#variant.Exact
[`target_partitions`]: https://docs.rs/datafusion/18.0.0/datafusion/config/struct.ExecutionOptions.html#structfield.target_partitions
[`UnionExec`]: https://docs.rs/datafusion/18.0.0/datafusion/physical_plan/union/struct.UnionExec.html
[YAML]: https://yaml.org/