Commit Graph

129 Commits (eb6abb5d670799e9e1d9a1a4784305aadf5a6914)

Author SHA1 Message Date
Nga Tran 95ed41f140
feat: Projection pushdown for querier -> ingester for rpc queries (#5782)
* feat: initial step to identify where the projection should be provided

* feat: start getting columns of all expressions

* chore: format

* test: test for the table_chunk_stream

* fix: fix a compile error. Thanks @alamb

* test: full tests for table_chunk_stream

* chore: cleanup

* fix: do not cut any columns in case all fields are needed

* test: add one more test case of reading all columns

* refactor: move code that identify columbs ot push down to a function. Add the use of  field_columns

* chore: cleanup

* refactor: make sream_from_batch support empty batches

* chore: cleanup

* chore: fix clippy after auto merge

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-10-06 17:21:23 +00:00
Marco Neumann c4c83e0840
fix: query error propagation (#5801)
- treat OOM protection as "resource exhausted"
- use `DataFusionError` in more places instead of opaque `Box<dyn Error>`
- improve conversion from/into `DataFusionError` to preserve more
  semantics

Overall, this improves our error handling. DF can now return errors like
"resource exhausted" and gRPC should now automatically generate a
sensible status code for it.

Fixes #5799.
2022-10-06 08:54:01 +00:00
Dom Dwyer cd4087e00d style: add no todo!() or dbg!() lints
Some crates had theme, some not - lets be consistent and have the
compiler spot dbg!() and todo!() macro calls - they should never be in
prod code!
2022-09-29 13:10:07 +02:00
Andrew Lamb 66dbb9541f
chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to 23.0.0, `thrift` to 0.16.0 (#5694)
* chore: Update datafusion and `arrow`/`parquet`/`arrow-flight`  to 23.0.0

* chore: Update thrift / remove parquet_format

* fix: Update APIs

* chore: Update lock + Run cargo hakari tasks

* fix: use patched version of arrow-rs to work around https://github.com/apache/arrow-rs/issues/2779

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-27 12:50:54 +00:00
Carol (Nichols || Goulding) c8108f01e7
chore: Upgrade to Rust 1.64 (#5727)
* chore: Upgrade to Rust 1.64

* fix: Use iter find instead of a for loop, thanks clippy

* fix: Remove some needless borrows, thanks clippy

* fix: Use then_some rather than then with a closure, thanks clippy

* fix: Use iter retain rather than filter collect, thanks clippy

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-22 18:04:00 +00:00
dependabot[bot] ea1e822e3b
chore(deps): Bump itertools from 0.10.4 to 0.10.5 (#5707)
Bumps [itertools](https://github.com/rust-itertools/itertools) from 0.10.4 to 0.10.5.
- [Release notes](https://github.com/rust-itertools/itertools/releases)
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/commits)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-09-21 08:15:59 +00:00
Marco Neumann 7e00426d49
refactor: concurrent table scan for "tag values" (#5671)
Ref #5668.
2022-09-19 14:11:51 +00:00
Marco Neumann 274bd80ecd
refactor: concurrent table scan for "tag keys" (#5670)
* refactor: concurrent table scan for "tag keys"

Ref #5668.

* feat: add table name to context metadata
2022-09-19 13:27:18 +00:00
Marco Neumann ef09573255
refactor: concurrent table scan in "field columns" (#5651)
* refactor: concurrent table scan in "field columns"

Similar to #5647 and #5649.

* docs: improve

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-19 10:50:25 +00:00
Marco Neumann e346433914
refactor: concurrent table scan for "table names" (#5649)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-15 15:39:00 +00:00
Marco Neumann 159250e776
refactor: concurrent table planning in InfluxRPC (#5647)
* refactor: concurrent table planning in InfluxRPC

Some InfluxRPC can scan multiple tables. Prior to this PR we were always
scanning the tables in sequence, adding up potential latencies (catalog,
ingester, object store). There is no reason we need to do this,
"ordinary" SQL queries would not serialize this way either.

So let's scan tables concurrently. This add concurrency to:

- read filter
- read group
- read window aggregate

There are other query types that could benefit from a similar treatment.
They will be changed in a follow-up.

* docs: improve

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* test: explain `Send` assertion

* refactor: change `CONCURRENT_TABLE_JOBS` to 10

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-09-15 13:55:22 +00:00
dependabot[bot] 7e1f013346
chore(deps): Bump itertools from 0.10.3 to 0.10.4 (#5631)
Bumps [itertools](https://github.com/rust-itertools/itertools) from 0.10.3 to 0.10.4.
- [Release notes](https://github.com/rust-itertools/itertools/releases)
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.10.3...v0.10.4)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-14 14:02:14 +00:00
Andrew Lamb 45d795055a
feat: Support calling influxql/flux selector aggregates from IOx SQL (#5628)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-14 10:37:17 +00:00
Andrew Lamb 1fd31ee3bf
chore: Update datafusion / `arrow` / `arrow-flight` / `parquet` to version 22.0.0 (#5591)
* chore: Update datafusion / `arrow` / `arrow-flight` / `parquet` to version 22.0.0

* fix: enable dynamic comparison flag

* chore: derive Eq for clippy

* chore: update explain plans

* chore: Update sizes for ReadBuffer encoding

* chore: update more tests

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-12 17:45:03 +00:00
Marco Neumann 762c2af91e
refactor: do not store chunks in `Deduplicator` (#5617)
Only store context, settings (if any) and the schema interner within the
de-duplicator. Extract a new `Chunks` type that handles the chunk
classification and can passed around in a somewhat clean fashion.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-12 15:29:27 +00:00
Marco Neumann 8933f47ec1
refactor: make `QueryChunk::partition_id` non-optional (#5614)
In our data model, a chunk always belongs to a partition[^1], so let's
not make this attribute optional. The optional value only leads to
-- mostly surprising -- conditional behavior, ranging from "do not equalize
the partition sort key" (querier) to "always consider the chunk overlapping"
(iox_query when dealing with ingester chunks).

[^1]: This is even true when the chunk belongs to a parquet file that is not
      yet added to the catalog, contrary to what a comment in the ingester
      stated. The catalog and data model used by the querier are two totally
      different things.
2022-09-12 13:52:51 +00:00
Marco Neumann b676049358
fix: apply selection in `TestChunk::read_filter` (#5613)
* fix: apply selection in `TestChunk::read_filter`

TBH I have no idea how this worked so well before, but the chunks are
expected to apply the given selection. This is because
`IOxReadFilterNode::execute` will wrap the `QueryChunk::read_filter`
output into a `SchemaAdapterStream` and this one expects that there are
no input columns that are absent in the output schema (i.e. it will only
add null columns, it won't remove any). Funnily the `SchemaAdapterStream`
error will blame DataFusion for the mess.

* test: make `test_storage_rpc_tag_values_grouped_by_measurement_and_tag_key` a bit harder

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-12 13:10:37 +00:00
Marco Neumann caa0dfd1e0
refactor: query code clean ups (#5612)
* refactor: remove dead code

* refactor: `Deduplicator::build_scan_plan` consumes `self`

There is no good reason to use the same `Deduplicator` twice. In
contrast I'm quite sure that this would lead to nasty bugs, because
`split_overlapped_chunks` exists early in some cases so the 2nd plan
would have old and new chunks mixed together.
2022-09-12 13:00:56 +00:00
YIXIAO SHI fa6c26b38d
chore: fix comment typo (#5550)
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-07 08:57:34 +00:00
YIXIAO SHI 52ae60bf2e
chore: fix comment typo (#5551)
Co-authored-by: Dom <dom@itsallbroken.com>
2022-09-07 08:49:29 +00:00
Marco Neumann adeacf416c
ci: fix (#5569)
* ci: use same feature set in `build_dev` and `build_release`

* ci: also enable unstable tokio for `build_dev`

* chore: update tokio to 1.21 (to fix console-subscriber 0.1.8

* fix: "must use"
2022-09-06 14:13:28 +00:00
Andrew Lamb 6669d85fb4
chore: Update datafusion + arrow/parquet to `21.0.0` (#5519)
* chore: Update arrow/arrow-flight/parquet to 21.0.0

* chore: Update datafusion pin

* chore: Fix arrow update script

* chore: Update Cargo.lock

* chore: Update for new API
2022-08-31 13:30:47 +00:00
Sam Arnold 05657ea068
fix: optimizations for metadata fetch and chunk pruning (#5467)
* fix: hoist repeated computation out of chunk creation

We have hundreds of chunks per table, so it is beneficial to only
do common work once.

* chore: remove TableCache as it is no longer used

* fix: prune chunks both before and after metadata fetch

Fetching the metadata for all the chunks in a table is expensive,
especially when we have a narrow time range query that only
needs a few chunks.

* chore: fix clippy

* fix: fix up some last tests

* fix: review comments

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-29 14:59:05 +00:00
Andrew Lamb 9aac78d30b
fix: Correctly lexigraphically sort `_field` and `_measurement` with upper case tag keys (#5436)
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-29 13:45:03 +00:00
Carol (Nichols || Goulding) d4a472f775
fix: Remove let bindings that are immediately returned
As now caught by clippy. https://rust-lang.github.io/rust-clippy/master/index.html#let_and_return
2022-08-11 15:21:02 -04:00
Carol (Nichols || Goulding) 3a501a4a10
fix: Remove an immediate ref to a deref
Caught by clippy now. https://rust-lang.github.io/rust-clippy/master/index.html#borrow_deref_ref
2022-08-11 15:04:14 -04:00
Carol (Nichols || Goulding) b982bdaf2f
fix: Derive Eq when we derive PartialEq and members can derive Eq
Allow this in generated code that we don't control, though.

Recommended by clippy now. https://rust-lang.github.io/rust-clippy/master/index.html#derive_partial_eq_without_eq
2022-08-11 15:04:06 -04:00
Marco Neumann 90fec1365f
feat: intern schemas during query planning (#5215)
* feat: intern schemas during query planning

Helps with #5202.

* refactor: `SchemaMerger::build` shall return an `Arc`

* feat: `SchemaMerger::with_interner`

* refactor: hash-based schema interning
2022-08-11 12:28:51 +00:00
Andrew Lamb b834bc630c
chore: more readability improvements to sort keys (#5366)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-10 17:59:25 +00:00
Andrew Lamb ce3e2c3a15
chore: make terminology in iox_query::Provider consistent (remove super notation) (#5349)
* chore: make terminology in iox_query::Provider consistent (remove super notation)

* refactor: be more specific about *which* sort key is meant

* refactor: rename another sort_key --> output_sort_key

* refactor: rename additional sort_key to output_sort_key

* refactor: rename sort_key --> chunk_sort_key

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-10 10:59:47 +00:00
Andrew Lamb 16ddc5efc6
chore: Update datafusion / arrow/parquet/arrow-flight and prost/tonic ecosystem (#5360)
* chore: Update datafusion and arrow

* chore: Update Cargo.lock

* chore: update to Decimal128

* chore: Update tonic/prost/pbjson/etc

* chore: Run cargo hakari tasks

* fix: doctest in generated types

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-08-09 17:30:44 +00:00
Marco Neumann fc1870ff76
fix: chunk pruning stats (#5319)
- emit a warning if we cannot even attempt to prune chunks due to an
  error. This is always either a missing feature or a bug (even though
  it does not impact correctness but _only_ performance). Also see
  https://github.com/influxdata/conductor/issues/1107
- change metrics to clearly differentiate between "could not prune" and
  "not pruned"
- add new "not pruned" observer hook (this was missing for some reason,
  the "pruned" hook existed though)
2022-08-05 10:50:31 +00:00
Andrew Lamb e82214ed38
chore: fix `cargo audit`, update deps to get new chrono (#5316)
* chore: update deps to get new chrono

* chore: Run cargo hakari tasks

* chore: migrate away from deprecated API

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-08-04 20:49:28 +00:00
Marco Neumann 0d714878ca
feat: chunk pruning metrics (#5273)
* refactor: make could-not-prune reason a static string

* refactor: introduce `QuerierTableArgs`

* feat: chunk pruning metrics

Closes #4974.

* refactor: address review comments

* refactor: use static typing for not-pruned reason

* refactor: pass chunk to not-pruned observer and use it for some metrics

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-04 15:29:21 +00:00
Nga Tran 34ccc9c7f5 chore: Revert "chore: Revert "refactor: bump batch size (#5251)" (#5288)" (#5300)
This reverts commit 471b8be92f.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-04 13:19:46 +00:00
Nga Tran 471b8be92f
chore: Revert "refactor: bump batch size (#5251)" (#5288)
This reverts commit bb172f8fa8.
2022-08-03 14:23:45 +00:00
Marco Neumann bb172f8fa8
refactor: bump batch size (#5251)
This is what DataFusion uses by default and I don't see a reason why we
should use such small batch sizes.

The affect is probably only visible in certain filter-aggregate queries
that don't focus on a single series (because there we likely end up with
1 or 2 batches only, esp. after #5250) for coarse-grained filters, esp.
  when the filter key is not the first sort key.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-01 13:49:58 +00:00
Andrew Lamb 9215a534d0
chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to `19.0.0` (#5229)
* chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to `19.0.0`

* chore: Run cargo hakari tasks

* fix: Update for API changes

* fix: clippy

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-28 08:10:47 +00:00
Marco Neumann 9a9a1a4777
feat: limit per-table chunk data for every query (#5223)
* feat: `QueryChunk::as_any`

* feat: allo `ChunkPruner::prune_chunks` to fail

* feat: limit per-table chunk data for every query

Closes #5211.

* fix: address review comments

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-07-27 13:20:05 +00:00
Andrew Lamb fbf672015e
refactor: Reduce ceremony requried to create a `Span` from `SpanContext` (#5181)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-22 11:19:38 +00:00
Marco Neumann 0561423475
refactor: enforce proper `IOxSessionContext` (#5158)
- remove `IOxSessionContext::default()` because untracked contexts
  should only be created by tests
- remove `Option<IOxSessionContext>` because it is a typed workaround
  for `IOxSessionContext::default`

Tests should use `IOxSessionContext::testing` and all _normal_ users
should create proper contexts.

I suspect this will help tracing or at least prevent silent regressions.
See #5129.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-20 16:25:43 +00:00
Marko Mikulicic b8236e2b9d
fix: Fix SeriesKey sort order for special _measurement and _field (#5150)
* fix: Fix SeriesKey sort order for special _measurement and _field

* fix: Update expected test output

* fix: Update more tests

* fix: Re-sort tag key when using binary encoding

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-07-20 08:45:17 +00:00
Marco Neumann b8d9799a26
feat: wire span all the way to `QuerierTable::chunks` (#5134)
* feat: pass context to `QueryDatabase::chunks`

* feat: wire span all the way to `QuerierTable::chunks`

This is required for #5129.
2022-07-19 14:12:55 +00:00
Nga Tran c8f4000f04
feat: Select compaction candidates (#5131)
* feat: initial implementation for selecting compaction candidates

* feat: 2 catalog functions to choose the most thorughput partitions to compact and the selecting candidate function itself

* test: tests for the new 2 queries

* feat: more tests and metrics for chooing compaction candidates

* chore: Apply self suggestions from self review

* chore: cleanup

* chore: fix doc comment

* chore: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

* refactor: address review comments

* fix: get the right time provider for the tests

* refactor: remove the left over compaction_

* fix: typos

* fix: make the param name and env name consistent

* refactor: make relevant iSomething to uSomething

* fix: typo

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
2022-07-18 18:05:13 +00:00
Andrew Lamb e2d871b00b
chore: Update datafusion and arrow/parquet/arrow-flight to `18.0.0` (#5079)
* chore: Update datafusion to 10.0.0, arrow/parquet/arrow-flight to 18

* chore: Run cargo hakari tasks

* fix: update cargo pin

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-18 15:01:03 +00:00
Marco Neumann 9c2b6cd96c
fix: always pass proper context to `InfluxRpcPlanner` (#5144)
There were some instances were we forgot to pass context (and therefore
tracing) information to `InfluxRpcPlanner`. This removes the `Default`
implementation requires to always pass a context when creating
`InfluxRpcPlanner` to prevent this type of bug.

Ref #5129.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-18 14:45:22 +00:00
dependabot[bot] 9b67de2f43
chore(deps): Bump tokio from 1.19.2 to 1.20.0
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.19.2 to 1.20.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.19.2...tokio-1.20.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-07-14 01:21:43 +00:00
Marco Neumann b1b2cb5d4a
feat: load read buffer on demand (#5091)
* refactor: extract `select_schema`

* refactor: improve `InternalLostInputField` error message

* test: improve SQL runner output

* feat: load read buffer on demand

Closes #5032.

* refactor: move `[Half]OwnedSelection` to `schema` crate`
2022-07-13 08:51:40 +00:00
Marco Neumann f1467cf4d8
refactor: try to query "cheap" chunks first (#5075)
This should help a lot once #5032 is implemented. Currently it doesn't
really make a difference.

See #5037, which also proposes a more advanced but more complex system.
The team however agreed to try something simple first.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-08 11:19:25 +00:00
Andrew Lamb c46e1c6347
chore: Update datafusion + arrow/parquet/arrow-flight to `17.0.0` (#5021)
* fix: correct nullability declaration of system tables

* chore: Update datafusion and arrow/parquet/arrow-flight

* chore: Run cargo hakari tasks

* fix: Update tests

* fix: Update tests

* fix: predicate pruning

* fix: add some tests

* fix: query_functions

* fix: fix read_buffer test

* fix: fix clippy

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-07 19:22:15 +00:00
Marco Neumann aacdeaca52
refactor: prep work for #5032 (#5060)
* refactor: remove parquet chunk ID to `ChunkMeta`

* refactor: return `Arc` from `QueryChunk::summary`

This is similar to how we handle other chunk data like schemas. This
allows a chunk to change/refine its "believe" over its own payload while
it is passed around in the query stack.

Helps w/ #5032.
2022-07-07 13:21:48 +00:00
Sam Arnold e193913ed3
fix: optimize field columns for all-time predicates (#5046)
* fix: optimize field columns for all-time predicates

Also fix timestamp range to allow selecting points at MAX_NANO_TIME

* fix: clamp end to MIN_NANO_TIME for safety

* refactor: add contains_all method to TimestampRange
2022-07-06 12:01:28 +00:00
Marco Neumann 16bd3e67c0
refactor: unify `apply_predicate_to_metadata` (#5030)
Instead of using some hand-rolled timestamp-based logic (or just
"unknown") all over the place, just use logic introduced in #5017.

This requires slightly improved table summaries within the querier that
at least has min/max for the timestamp column. For that, the former
`IngesterChunk`-specific `calculate_summary` method was extended to
`create_basic_summary` to include that data and is now also used by
`QuerierParquetChunk`.

Note: `QuerierRBChunk` already has detailled metrics that are provided
by the read buffer implementation.

Should we ever need even better pruning for `QuerierParquetChunk` (or
`IngesterChunk`) then we _only_ need add extra data to the table
summaries.

Closes #4976.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-05 12:51:59 +00:00
Sam Arnold 03f456d8fd
fix: optimize tag_keys to go only to schema when predicate is empty (#4985)
* docs: fix comment

* test: add test for delete behaviour

* fix: tag_keys optimization for empty predicate

Also need to eliminate 'true' predicates from simplified predicate so
is_empty works correctly.

* refactor: use lit instead of spelling out literal true

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-05 12:45:25 +00:00
Marco Neumann 05189fdb00
fix: ensure panic is propagated throw `AdapterStream` (#4980)
Prior to this change background tasks that we feed into `AdapterStream`
can panic but that would just end the stream without any user-visible
error (except for the panic message on stdout/stderr).

This was found while developing #4964. I have proposed another fix in #4966
but found that I actually developed an existing solution a 2nd time:
`watch_task`. But I also see a major issue with the existing API: one
can create `AdapterStream` with ordinary tokio tasks that are not
watched at all, leaving the burden to the implementor to check for that
(and actually we forgot that in `parquet_file`).

So this change takes a slightly different approach:
The `AdapterStream` does NOT accept ordinary join handles any longer but
requires that you pass a "watched task". The newly introduced
`WatchedTask` does the same as we did manually before: wrapping a future
into a tokio task, watch it and wrap the watcher into a task.

It is now way more difficult to do anything stupid (sure you can still
mix up the tasks and the channels, but we need at least some flexibility
here to allow for "split" and potential future fan-in/out constructs).

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-30 07:57:58 +00:00
Andrew Lamb 01fb2e132d
chore: Update datafusion pin (#4969)
* chore: Update datafusion pin

* fix: Update for api

* fix: Explicitly set coalsce batch size

* fix: Update batch size as well

* fix: update tests for new explain plan, and improved coercion
2022-06-29 17:52:37 +00:00
Nga Tran cfcc4b8426
refactor: change level 1 to level 2 preparing for next design changes (#4954)
* refactor: change level 1 to level 2 preparing for next design changes

* fix: make level-2 consistent everywhere

* chore: remove unused comments

* refactor: change all the name level_1 to level_2 to completely replace 1 with 2 to amke everything consistent

* chore: add correspinding constants for the comapction levels in the comments

Co-authored-by: Dom <dom@itsallbroken.com>
2022-06-29 14:08:58 +00:00
Marco Neumann 3b78bf1c48
refactor: remove binary parquet file MD from compactor (#4938)
* refactor: simplify sort key calculation

* refactor: use schema from catalog instead from file

* refactor: do not request parquet file MD in compactor

* test: ensure that `QueryableParquetChunk` works correctly
2022-06-27 15:11:15 +00:00
Nga Tran 35dacf388b
feat: Compact now can split compacted results into multiple non-overlapped files based on config max file size (#4918)
* feat: split times of compacting results based on the max file size

* feat: cosider max file size while computing split time

* test: tests for comput_split_time

* feat: first step to teach the function split_the_steam to know how to split data into n streams using n-1 input PhysicalExprs

* feat: make StreamSplitNode support a list of expression

* docs: explain how StreamSplitNode works

* feat: Teach compute_split_time to split a time range into many contiguous ranges and split compacted result into multiple non-overlapped files based on the config comapction_max_size_bytes

* chore: cleanup

* chore: clean up doc

* chore: address review comments

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 18:54:03 +00:00
Andrew Lamb 776c34e03d
chore: Update datafusion (#4927)
* chore: Update datafusion

* fix: Update for API changes

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 09:30:43 +00:00
Andrew Lamb 164e75f328
refactor: Remove unused `Option` (#4839)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-15 10:24:51 +00:00
Andrew Lamb e91d00b10c
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `16.0.0 (#4851)
* chore: TEMP Update DataFusion to pre-release

* chore: update arrow et al to 16.0.0

* chore: Run cargo hakari tasks

* fix: update reader read_dictionary API

* chore: Update to real Datafusion release

* fix: Update parquet API

* fix: update test

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-06-14 16:31:40 +00:00
Andrew Lamb 34e8659876
refactor: consolidate plan creation from `QueryChunk`s in `iox_query` (#4837)
* refactor: consolidate plan creation from Chunks

* docs: update docstrings

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-14 14:36:07 +00:00
Andrew Lamb 9fdbfb05e7
refactor: Use scan_and_filter in ReorgPlanner (#4822)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-10 17:31:25 +00:00
Nga Tran 13c57d524a
feat: Change data type of catalog partition's sort_key from a string to an array of string (#4801)
* feat: Change data type of catalog Postgres partition's sort_key from a string to an array of string

* test: add column with comma

* fix: use new protonuf field to avoid incompactible

* fix: ensure sort_key is an empty array rather than NULL

* refactor: address review comments

* refactor: address more comments

* chore: clearer comments

* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql

* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql

* fix: Rename migration so it will be applied after

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
2022-06-10 13:31:31 +00:00
Andrew Lamb 11cec18edc
refactor: Move `scan_and_filter` into a `common` module for reuse (#4823)
* refactor: remove unused error variants

* refactor: move scan_and_filter into a module so it can be reused

* docs: update comments about pruning
2022-06-10 11:15:47 +00:00
Andrew Lamb 2ec7764fdd
refactor: rename builder like predicate methods to be `with_` (#4808)
* refactor: rename builder like predicate methods to be `with_`

* fix: merge conflict

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-09 11:26:03 +00:00
Andrew Lamb 107e5f7284
docs: Add some docs about `StreamSplit` (#4810)
* docs: Add some docs about `StreamSplit`

* docs: fix struct name
2022-06-09 10:53:34 +00:00
Andrew Lamb f34282be2c
fix: Do not run DataFusion optimizer pass twice (#4809)
* fix: Do not run DataFusion optimizer pass twice

* docs: improve docstring and logging
2022-06-08 21:01:22 +00:00
Andrew Lamb afc1c12062
refactor: consolidate `PredicateBuilder` into `Predicate` (#4799)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-08 12:21:24 +00:00
Andrew Lamb 8e96a2721d
chore: Update datafusion (again) (#4788)
* chore: Update datafusion

* chore: Update imports

* refactor: update API usage

* refactor: clean up some uses of binary_expr

* fix: remove unused export

* fix: update explain output

* chore: update more explain tests

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-07 08:17:56 +00:00
dependabot[bot] e03bf94420
chore(deps): Bump tokio from 1.18.2 to 1.19.1 (#4783)
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.18.2 to 1.19.1.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.18.2...tokio-1.19.1)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-06 14:15:12 +00:00
Carol (Nichols || Goulding) e1061ce623
docs: Don't attempt to list out chunk types exhaustively 2022-06-03 16:29:29 -04:00
Andrew Lamb 3592aa52d8
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0` (#4743)
* chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0`

* chore: Update APIs

* chore: Run cargo hakari tasks

* feat: normalize parquet file metadata

* chore: update size tests

* chore: add docs on metadata stripping

* chore: TEMP UPDATE TO DF BRANCH

* chore: Update for new API

* fix: Update to latest DF

* fix: cargo hakari

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
2022-06-03 10:32:26 +00:00
Ryan Russell d279deddad
docs(various): Improve Readability (#4768)
Signed-off-by: Ryan Russell <git@ryanrussell.org>

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-02 18:01:06 +00:00
Andrew Lamb 95e6a8ed46
chore: Update datafusion (again) (#4679)
* chore: Update datafusion deps

* fix: fix for changes in ScalarValue

* fix: fix for using TableSource rather than TableProvider

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-24 15:54:39 +00:00
Dom Dwyer 8ff1a73797 revert: fix: compaction deadlock
This reverts commit 00b5c1b296.

This change reverts the StreamSplitExec plan to using bounded, blocking
channels, with the possibility of deadlock added to the docs.

This is now tolerable because of the concurrent consumption of both
output partitions in the compactor.
2022-05-24 14:12:00 +01:00
Marco Neumann addc45327e
fix: ensure that query tokio background tasks are canceled (#4643)
* fix: ensure that query tokio background tasks are canceled

While I am not entirely sure if this explains some of the memory leaks I
am seeing in prod, not canceling the tasks correctly certainly makes
debugging way harder and also renders certain form of throttling (e.g.
max. concurrent queries) somewhat ineffective.

Note that parquet file downloads are currently NOT canceled because
tokios `spawn_blocking` cannot be canceled.

* refactor: `Vec` -> `Option`

* refactor: `spawn_blocking` creates a join handle, even though it is useless

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-20 07:18:52 +00:00
Marco Neumann 52346642a0
ci: fix cargo deny (#4629)
* ci: fix cargo deny

* chore: downgrade `socket2`, version 0.4.5 was yanked

* chore: rename `query` to `iox_query`

`query` is already taken on crates.io and yanked and I am getting tired
of working around that.
2022-05-18 09:38:35 +00:00