Commit Graph

592 Commits (29462d0fe570990f7eb08c3af545f1ba6d2d3800)

Author SHA1 Message Date
dependabot[bot] ed7d02a225
chore(deps): Bump tokio from 1.24.2 to 1.25.0
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.24.2 to 1.25.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/commits/tokio-1.25.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-01-30 01:57:27 +00:00
Marko Mikulicic db7e6335ca
feat(ingester2): New object store paths will have no shard id (#6735)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-27 17:16:54 +00:00
Nga Tran b8a80869d4
feat: introduce a new way of handling max_sequence_number for ingester, compactor and querier (#6692)
* feat: introduce a new way of handling max_sequence_number for ingester, compactor and querier

* chore: cleanup

* feat: new column max_l0_created_at to order files for deduplication

* chore: cleanup

* chore: debug info for changing cpu.parquet

* fix: update test parquet file

Co-authored-by: Marco Neumann <marco@crepererum.net>
2023-01-26 10:52:47 +00:00
Marco Neumann ed694d3be4
feat: introduce scratchpad store for compactor (#6706)
* feat: introduce scratchpad store for compactor

Use an intermediate in-memory store (can be a disk later if we want) to
stage all inputs and outputs of the compaction. The reasons are:

- **fewer IO ops:** DataFusion's streaming IO requires slightly more
  IO requests (at least 2 per file) due to the way it is optimized to
  read as little as possible. It first reads the metadata and then
  decides which content to fetch. In the compaction case this is (esp.
  w/o delete predicates) EVERYTHING. So in contrast to the querier,
  there is no advantage to this approach; on the contrary, it easily adds
  100ms of latency to every single input file.
- **less traffic:** For divide&conquer partitions (i.e. when we need to
  run multiple compaction steps to deal with them) it is kinda pointless
  to upload an intermediate result just to download it again. The
  scratchpad avoids that.
- **higher throughput:** We want to limit the number of concurrent
  DataFusion jobs because we don't wanna blow up the whole process by
  having too much in-flight arrow data at the same time. However, while
  we performed the actual computation, we were waiting for object store
  IO, which limited our throughput substantially.
- **shadow mode:** De-coupling the stores in this way makes it easier to
  implement #6645.

Note that we assume here that the input parquet files are WAY SMALLER
than the uncompressed Arrow data during compaction itself.

Closes #6650.

* fix: panic on shutdown

* refactor: remove shadow scratchpad (for now)

* refactor: make scratchpad safe to use
2023-01-26 10:03:08 +00:00
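A minimal sketch of the scratchpad idea described in the commit above, using a hypothetical in-memory `Scratchpad` type (names and structure are illustrative, not the actual compactor implementation): inputs are staged once, intermediate results stay local, and only final artifacts are uploaded.

```rust
use std::collections::HashMap;

/// Illustrative in-memory scratchpad: stage compaction inputs once, work on the
/// local copies, then publish only the final outputs. (Hypothetical type.)
#[derive(Default)]
struct Scratchpad {
    staged: HashMap<String, Vec<u8>>,
}

impl Scratchpad {
    /// Download an input file from the (simulated) object store exactly once.
    fn stage_input(&mut self, path: &str, fetch: impl Fn(&str) -> Vec<u8>) -> &[u8] {
        self.staged
            .entry(path.to_string())
            .or_insert_with(|| fetch(path))
    }

    /// Keep intermediate results local; divide&conquer steps never hit the store.
    fn stage_output(&mut self, path: &str, bytes: Vec<u8>) {
        self.staged.insert(path.to_string(), bytes);
    }

    /// Upload only the final artifacts to the real object store.
    fn publish(&self, path: &str, upload: impl Fn(&str, &[u8])) {
        if let Some(bytes) = self.staged.get(path) {
            upload(path, bytes);
        }
    }
}

fn main() {
    let mut pad = Scratchpad::default();
    let input = pad.stage_input("in/1.parquet", |_| vec![1, 2, 3]);
    let compacted = input.to_vec(); // stand-in for the DataFusion compaction plan
    pad.stage_output("out/1.parquet", compacted);
    pad.publish("out/1.parquet", |p, b| println!("upload {p}: {} bytes", b.len()));
}
```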
Marco Neumann cb02262b9d
refactor: extract "exec DF plan" and "store stream to file" components (#6663)
* refactor: extract `PartitionInfo`

* refactor: extract DF exec component

* feat: add some error conversions

* refactor: make fn public

* refactor: extract file sink component

* fix: clippy
2023-01-23 14:40:35 +00:00
Andrew Lamb 57f08dbccd
chore: Update datafusion to Jan 9, 2023 (1 / 2) (#6603)
* refactor: Update DataFusion pin to early Jan 2023

* fix: Update tests now that planning is async

* fix: Updates for API changes

* chore: Run cargo hakari tasks

* fix: Update comment

* refactor: nicer config setup

* fix: gapfill async

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2023-01-18 12:19:32 +00:00
dependabot[bot] b49cc2e35e
chore(deps): Bump tokio from 1.24.0 to 1.24.1 (#6545)
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.24.0 to 1.24.1.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.24.0...tokio-1.24.1)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-10 09:48:44 +00:00
dependabot[bot] 3fac475b63
chore(deps): Bump base64 from 0.20.0 to 0.21.0 (#6530)
* chore(deps): Bump base64 from 0.20.0 to 0.21.0

Bumps [base64](https://github.com/marshallpierce/rust-base64) from 0.20.0 to 0.21.0.
- [Release notes](https://github.com/marshallpierce/rust-base64/releases)
- [Changelog](https://github.com/marshallpierce/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/marshallpierce/rust-base64/compare/v0.20.0...v0.21.0)

---
updated-dependencies:
- dependency-name: base64
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

* fix: deprecations

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-09 09:39:40 +00:00
Raphael Taylor-Davies e1036a0c63
refactor: cleanup schema boxing (#6511)
* refactor: cleanup Schema boxing

* chore: clippy
2023-01-06 10:57:39 +00:00
Andrew Lamb dbe52f1ca1
chore: Upgrade datafusion (#6467)
* chore: Update datafusion

* fix: Update for new apis

* chore: Update expected plan

* fix: Update for new config construction

* chore: update clippy

* fix: Fix error codes

* fix: update another test

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-03 15:29:11 +00:00
Carol (Nichols || Goulding) 72aab99951
fix: Remove needless borrow. Thanks clippy! 2022-12-21 14:32:34 -05:00
dependabot[bot] 299f0e99f9
chore(deps): Bump thiserror from 1.0.37 to 1.0.38
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.37 to 1.0.38.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.37...1.0.38)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-12-19 10:33:32 +00:00
dependabot[bot] 95969ad24f
chore(deps): Bump base64 from 0.13.1 to 0.20.0 (#6371)
* chore(deps): Bump base64 from 0.13.1 to 0.20.0

Bumps [base64](https://github.com/marshallpierce/rust-base64) from 0.13.1 to 0.20.0.
- [Release notes](https://github.com/marshallpierce/rust-base64/releases)
- [Changelog](https://github.com/marshallpierce/rust-base64/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/marshallpierce/rust-base64/compare/v0.13.1...v0.20.0)

---
updated-dependencies:
- dependency-name: base64
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-12 07:07:19 +00:00
dependabot[bot] 1d38d400f0
chore(deps): Bump object_store from 0.5.1 to 0.5.2 (#6339)
* chore(deps): Bump object_store from 0.5.1 to 0.5.2

Bumps [object_store](https://github.com/apache/arrow-rs) from 0.5.1 to 0.5.2.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/object_store_0.5.1...object_store_0.5.2)

---
updated-dependencies:
- dependency-name: object_store
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-06 07:53:54 +00:00
Andrew Lamb 14a9bc92e9
Revert "Revert "chore: Update Datafusion and arrow/arrow-flight/parquet to `28.0.0` (#6279)" (#6294)" (#6296)
This reverts commit b7e52c0d8d.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-01 14:20:43 +00:00
Andrew Lamb b7e52c0d8d
Revert "chore: Update Datafusion and arrow/arrow-flight/parquet to `28.0.0` (#6279)" (#6294)
This reverts commit 039a45ddd1.
2022-12-01 11:38:42 +00:00
Andrew Lamb 039a45ddd1
chore: Update Datafusion and arrow/arrow-flight/parquet to `28.0.0` (#6279)
* chore: Update Datafusion and arrow/arrow-flight/parquet to `28.0.0`

* chore: Update thrift to 0.17

* fix: use workspace arrow-flight in ingester2

* chore: Update for API changes

* fix: test

* chore: Update hakari

* chore: Update hakari again

* chore: Update trace_exporters to latest thrift

* fix: update test

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-30 14:12:30 +00:00
dependabot[bot] 74cf99eac7
chore(deps): Bump zstd from 0.11.2+zstd.1.5.2 to 0.12.0+zstd.1.5.2 (#6225)
* chore(deps): Bump zstd from 0.11.2+zstd.1.5.2 to 0.12.0+zstd.1.5.2

Bumps [zstd](https://github.com/gyscos/zstd-rs) from 0.11.2+zstd.1.5.2 to 0.12.0+zstd.1.5.2.
- [Release notes](https://github.com/gyscos/zstd-rs/releases)
- [Commits](https://github.com/gyscos/zstd-rs/commits)

---
updated-dependencies:
- dependency-name: zstd
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-25 10:00:28 +00:00
Andrew Lamb 1a1ea74cb7
chore: Upgrade datafusion again (#6160)
* Revert "Revert "chore: Update datafusion again (#6108)""

This reverts commit 766b3bbeb440618cfe332f6ee7d4f8a8217acc48.

* fix: Respect the partition sort key

* chore: update plans

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-22 19:28:26 +00:00
dependabot[bot] 04c00bbb62
chore(deps): Bump bytes from 1.2.1 to 1.3.0 (#6199)
Bumps [bytes](https://github.com/tokio-rs/bytes) from 1.2.1 to 1.3.0.
- [Release notes](https://github.com/tokio-rs/bytes/releases)
- [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tokio-rs/bytes/commits)

---
updated-dependencies:
- dependency-name: bytes
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-22 08:23:24 +00:00
dependabot[bot] a9db7581cd
chore(deps): Bump tokio from 1.21.2 to 1.22.0 (#6183)
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.21.2 to 1.22.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.21.2...tokio-1.22.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-21 10:21:24 +00:00
Andrew Lamb 67712b595c
Revert "chore: Update datafusion again (#6108)" (#6159)
This reverts commit fbe9f27f10.
2022-11-16 21:14:55 +00:00
Andrew Lamb fbe9f27f10
chore: Update datafusion again (#6108)
* chore: Update datafusion pin + api code

* chore: Run cargo hakari tasks

* refactor: make combine_sort_key more idiomatic and add rationale comments

* refactor: satisfy borrow checker and updated comments

* fix: Add test case for combine_sort_key

* fix:  Apply suggestions from code review

Co-authored-by: Marco Neumann <marco@crepererum.net>

* fix: Add back test for deeply nested expression

* fix: Update output ordering

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-16 14:41:52 +00:00
dependabot[bot] a969754819
chore(deps): Bump chrono from 0.4.22 to 0.4.23 (#6129)
* chore(deps): Bump chrono from 0.4.22 to 0.4.23

Bumps [chrono](https://github.com/chronotope/chrono) from 0.4.22 to 0.4.23.
- [Release notes](https://github.com/chronotope/chrono/releases)
- [Changelog](https://github.com/chronotope/chrono/blob/main/CHANGELOG.md)
- [Commits](https://github.com/chronotope/chrono/compare/v0.4.22...v0.4.23)

---
updated-dependencies:
- dependency-name: chrono
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* refactor: chrono future compat

Integer->timestamp conversions should not silently panic.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-14 13:34:09 +00:00
Carol (Nichols || Goulding) bdff4e8848
fix: Consistently use 'namespace' instead of 'database' in comments and other internal text 2022-11-11 15:46:04 -05:00
Andrew Lamb 6c17ee29a5
feat: make logging clearer when parquet files upload is retried (#6056)
* feat: log success when parquet files are retried

* fix: Update parquet_file/src/storage.rs

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

* fix: fmt

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-10 11:59:54 +00:00
Carol (Nichols || Goulding) 43687a86d2
fix: Remove lots of needless borrows that Clippy can now identify
Except for in generated code that we don't control.
2022-11-09 10:54:18 -05:00
Andrew Lamb 4fb2843d05
refactor: Rename `schema::selection::Selection` to `schema::projection::Projection` (#6037)
* chore: Rename `schema::selection::Selection` to `schema::projection::Projection`

* fix: docs

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-02 18:15:04 +00:00
Marco Neumann 45b3984aa3
refactor: simplify `QueryChunk` data access (#6015)
* refactor: simplify `QueryChunk` data access

We have only two types for chunks (now that the RUB is gone):

1. In-memory RecordBatches
2. Parquet files

Loads of logic is duplicated in the different `read_filter`
implementations. Also `read_filter` hides a solid amount of logic from
DataFusion, which will prevent certain (future) optimizations. To enable #5897
and to simplify the interface, let the chunks return the data (batches
or metadata for parquet files) directly and let `iox_query` perform the
actual heavy-lifting.

* docs: improve

Co-authored-by: Andrew Lamb <alamb@influxdata.com>

* docs: improve

Co-authored-by: Andrew Lamb <alamb@influxdata.com>

Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-02 08:18:33 +00:00
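A sketch of the kind of interface change described in the commit above, with placeholder types standing in for the real Arrow/parquet ones (illustrative only, not the crate's actual definitions): chunks return their data directly and leave the heavy lifting to the query layer.

```rust
/// Placeholder for an Arrow record batch (illustrative).
struct RecordBatch;
/// Placeholder for whatever is needed to build a parquet scan (illustrative).
struct ParquetFileMeta {
    object_store_path: String,
}

/// Instead of every chunk implementing its own `read_filter`, a chunk just
/// exposes what kind of data it holds and hands it to the query layer.
enum QueryChunkData {
    /// In-memory data (e.g. from the ingester).
    RecordBatches(Vec<RecordBatch>),
    /// Metadata pointing at a parquet file; the query layer builds the scan.
    Parquet(ParquetFileMeta),
}

trait QueryChunk {
    /// Return the raw data; filtering and projection are left to DataFusion.
    fn data(&self) -> QueryChunkData;
}
```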
Andrew Lamb 9c1f0a3644
refactor: move SessionConfig creation into datafusion_utils (#6011)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-10-31 20:04:49 +00:00
Carol (Nichols || Goulding) b27e2bd7d1
feat: Use workspace dep inheritance for the parquet crate 2022-10-26 10:37:51 -04:00
Carol (Nichols || Goulding) 3145e2c05b
feat: Use workspace dep inheritance for the arrow crate 2022-10-26 10:34:29 -04:00
Carol (Nichols || Goulding) 44936f661a
feat: Use workspace dep inheritance for datafusion instead of shim crate 2022-10-26 10:33:56 -04:00
Marco Neumann 9b48437711
refactor: make influx column type mandatory (#5978)
We basically assume everywhere that a column falls into one of the three
known categories (time, tag, field), so lets encode this in our type
system instead of defining "unknown" as "undefined behavior, may or may
not crash".

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-10-26 11:20:29 +00:00
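A minimal sketch of what encoding the three known categories (time, tag, field) in the type system might look like; the names below are illustrative, not necessarily the crate's actual definitions.

```rust
/// Every column must be one of the known InfluxDB categories; there is no
/// "unknown" variant, so the compiler forces callers to handle all cases.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InfluxColumnType {
    Timestamp,
    Tag,
    Field(InfluxFieldType),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InfluxFieldType {
    Float,
    Integer,
    UInteger,
    String,
    Boolean,
}

fn is_primary_key(t: InfluxColumnType) -> bool {
    // Exhaustive match: adding a new variant is a compile error here.
    match t {
        InfluxColumnType::Timestamp | InfluxColumnType::Tag => true,
        InfluxColumnType::Field(_) => false,
    }
}

fn main() {
    assert!(is_primary_key(InfluxColumnType::Tag));
    assert!(!is_primary_key(InfluxColumnType::Field(InfluxFieldType::Float)));
}
```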
Carol (Nichols || Goulding) 2e83e04eab
feat: Use workspace package metadata to reduce differences and repetition 2022-10-24 13:04:09 -04:00
Marco Neumann 284f253846
refactor: remove unused constant (#5956)
Now that we read through `ParquetExec`, `ROW_GROUP_READ_SIZE` is no longer
used.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-10-24 11:08:44 +00:00
Marco Neumann e0062f2d40
refactor: do NOT use fake DF context for parquet reading (#5942)
Use the proper top-level DataFusion context and register the object
store there.

Note that we still hide the `ParquetExec` behind an opaque record batch
stream. Fixing that is next on my list.

Helps with #5897.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-10-24 08:20:26 +00:00
Andrew Lamb 7781ed0455
chore: Update datafusion (#5928)
* chore: Upgrade datafusion

* chore: Update for new API

* chore: Update expected output

* fix: Update comment

* fix: compilation

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-10-20 14:37:49 +00:00
Marco Neumann 42b89ade03
refactor: use `SendableRecordBatchStream` to write parquets (#5911)
Use a proper typed stream instead of peeking the first element. This is
more in line with our remaining stack and shall also improve error
handling.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-10-19 12:59:53 +00:00
Marco Neumann eb5a661ab3
refactor: prep work for #5897 (#5907)
* refactor: add ID to `ParquetStorage`

* refactor: remove duplicate code

* refactor: use dedicated `StorageId`
2022-10-19 11:54:42 +00:00
Andrew Lamb d706f8221d
chore: Update datafusion and arrow / parquet / arrow-flight 25.0.0 (#5900)
* chore: Update datafusion and  `arrow` / `parquet` / `arrow-flight` 25.0.0

* chore: Update for structure changes

* chore: Update for new projection pushdown

* chore: Run cargo hakari tasks

* fix: fmt

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-10-18 20:58:47 +00:00
Andrew Lamb d57c99638c
chore: Update datafusion + `arrow`, `arrow-flight`, and `parquet` to 24.0.0.0 (#5792)
* chore: Update datafusion + `arrow`, `arrow-flight`, and `parquet` to 24.0.0.0

* fix: Update for coercion, fix explain plans for change in column name display

* chore: Update datafusion lock

* fix: Update for other API changes

* chore: Update to latest datafusion pin

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-10-12 16:19:14 +00:00
dependabot[bot] 933493fab3
chore(deps): Bump object_store from 0.5.0 to 0.5.1
Bumps [object_store](https://github.com/apache/arrow-rs) from 0.5.0 to 0.5.1.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/object_store_0.5.0...object_store_0.5.1)

---
updated-dependencies:
- dependency-name: object_store
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-10-11 01:19:10 +00:00
Jake Goulding 7389fbe528
feat: Write data to Parquet files from the data generator 2022-09-29 16:03:33 -04:00
dependabot[bot] 227dde1dfc
chore(deps): Bump thiserror from 1.0.36 to 1.0.37 (#5753)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.36 to 1.0.37.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.36...1.0.37)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-29 10:37:14 +00:00
Andrew Lamb 66dbb9541f
chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to 23.0.0, `thrift` to 0.16.0 (#5694)
* chore: Update datafusion and `arrow`/`parquet`/`arrow-flight`  to 23.0.0

* chore: Update thrift / remove parquet_format

* fix: Update APIs

* chore: Update lock + Run cargo hakari tasks

* fix: use patched version of arrow-rs to work around https://github.com/apache/arrow-rs/issues/2779

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-27 12:50:54 +00:00
dependabot[bot] b1740f45d6
chore(deps): Bump thiserror from 1.0.35 to 1.0.36 (#5737)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.35 to 1.0.36.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.35...1.0.36)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-09-26 14:44:36 +00:00
Andrew Lamb 65f1550126
feat: Implement `debug parquet_to_lp` command to convert parquet to line protocol (#5734)
* feat: add `influxdb_iox debug parquet_to_lp` command

* chore: Run cargo hakari tasks

* fix: update command description

* fix: remove unnecessary Result import

* fix: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
2022-09-26 14:17:27 +00:00
dependabot[bot] 0cc29300ce
chore(deps): Bump pbjson-types from 0.4.0 to 0.5.0 (#5703)
Bumps [pbjson-types](https://github.com/influxdata/pbjson) from 0.4.0 to 0.5.0.
- [Release notes](https://github.com/influxdata/pbjson/releases)
- [Commits](https://github.com/influxdata/pbjson/commits)

---
updated-dependencies:
- dependency-name: pbjson-types
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-09-21 09:44:26 +00:00
Andrew Lamb ea51feadf4
chore: Improve debug logging when parquet files are created (#5699)
* chore: Improve debug logging when parquet files are created

* fix: add duration for encoding parquet

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-20 20:43:34 +00:00
Carol (Nichols || Goulding) 414b0f02ca
fix: Use time helper methods in more places 2022-09-19 13:24:08 -04:00
dependabot[bot] b4a25fdb0e
chore(deps): Bump thiserror from 1.0.34 to 1.0.35 (#5629)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.34 to 1.0.35.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.34...1.0.35)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-14 12:54:12 +00:00
Andrew Lamb 9d63edb299
chore: remove unnecessary batch check (#5620)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-13 12:41:59 +00:00
Andrew Lamb f86d3e31da
chore: Update datafusion + object_store (#5619)
* chore: Update datafusion pin

* chore: update object_store to 0.5.0

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-13 12:34:54 +00:00
Andrew Lamb 1fd31ee3bf
chore: Update datafusion / `arrow` / `arrow-flight` / `parquet` to version 22.0.0 (#5591)
* chore: Update datafusion / `arrow` / `arrow-flight` / `parquet` to version 22.0.0

* fix: enable dynamic comparison flag

* chore: derive Eq for clippy

* chore: update explain plans

* chore: Update sizes for ReadBuffer encoding

* chore: update more tests

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-12 17:45:03 +00:00
Nga Tran e56ebb5e31
fix: remove unnecessary warning (#5594)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-09 14:15:54 +00:00
Marco Neumann adeacf416c
ci: fix (#5569)
* ci: use same feature set in `build_dev` and `build_release`

* ci: also enable unstable tokio for `build_dev`

* chore: update tokio to 1.21 (to fix console-subscriber 0.1.8)

* fix: "must use"
2022-09-06 14:13:28 +00:00
dependabot[bot] 9f0b0328f7
chore(deps): Bump thiserror from 1.0.33 to 1.0.34 (#5556)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.33 to 1.0.34.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.33...1.0.34)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-09-06 09:18:41 +00:00
Marco Neumann 064f0e9b29
refactor: use DataFusion to read parquet files (#5531)
Remove our own hand-rolled logic and let DataFusion read the parquet
files.

As a bonus, this now supports predicate pushdown to the deserialization
step, so we can use parquet files as an in-memory buffer.

Note that this currently uses some "nested" DataFusion hack due to the
way the `QueryChunk` interface works. Midterm I'll change the interface
so that the `ParquetExec` nodes are directly visible to DataFusion
instead of some opaque `SendableRecordBatchStream`.
2022-09-05 09:25:04 +00:00
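For illustration, reading a parquet file through DataFusion's own reader instead of hand-rolled logic might look roughly like this; a sketch against the public `datafusion` crate API (file path and filter are made up), not the IOx code path, which wires `ParquetExec` into its `QueryChunk` machinery.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Let DataFusion handle decoding, projection and predicate pushdown.
    let ctx = SessionContext::new();
    let df = ctx
        .read_parquet("data/measurements.parquet", ParquetReadOptions::default())
        .await?;

    // The filter can be pushed down into the parquet deserialization step.
    let batches = df.filter(col("host").eq(lit("a")))?.collect().await?;
    println!("read {} record batches", batches.len());
    Ok(())
}
```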
dependabot[bot] 00ed79ff1b
chore(deps): Bump thiserror from 1.0.32 to 1.0.33 (#5524)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.32 to 1.0.33.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.32...1.0.33)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-09-01 09:11:31 +00:00
Andrew Lamb 6669d85fb4
chore: Update datafusion + arrow/parquet to `21.0.0` (#5519)
* chore: Update arrow/arrow-flight/parquet to 21.0.0

* chore: Update datafusion pin

* chore: Fix arrow update script

* chore: Update Cargo.lock

* chore: Update for new API
2022-08-31 13:30:47 +00:00
Dom Dwyer 7698264768 refactor: raise error for no rows in parquet file
Previously when attempting to serialise a stream of one or more
RecordBatch containing no rows (resulting in an empty file), the parquet
serialisation code would panic.

This changes the code path to raise an error instead, to support the
compactor making multiple splits at once, which may overlap a single
chunk:

                  ────────────── Time ────────────▶

                          │                │
                  ┌█████──────────────────────█████┐
                  │█████  │    Chunk 1     │  █████│
                  └█████──────────────────────█████┘
                          │                │

                          │                │

                      Split T1         Split T2

In the example above, the chunk has an unusual distribution of write
timestamps over the time range it covers, with all data having a
timestamp before T1 or after T2. When running a SplitExec to slice
this chunk at T1 and T2, the middle of the resulting 3 subsets will
contain no rows. Because we store only the min/max timestamps in the
chunk statistics, it is unfortunately impossible to prune one of these
split points from the plan ahead of time.
2022-08-30 14:52:31 +02:00
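A sketch of the changed behaviour in the commit above, with a placeholder batch type (illustrative; the real code streams Arrow `RecordBatch`es into the parquet writer): an all-empty input now returns an error the caller can handle instead of panicking.

```rust
/// Placeholder for an Arrow record batch (illustrative).
struct RecordBatch {
    num_rows: usize,
}

#[derive(Debug)]
enum SerializeError {
    /// Returned instead of panicking when every batch in the stream is empty,
    /// e.g. the middle output of a SplitExec whose split range holds no rows.
    NoRows,
}

fn serialize(batches: &[RecordBatch]) -> Result<Vec<u8>, SerializeError> {
    let total_rows: usize = batches.iter().map(|b| b.num_rows).sum();
    if total_rows == 0 {
        // Previously this path panicked; now the caller can skip the empty split.
        return Err(SerializeError::NoRows);
    }
    Ok(vec![0u8; total_rows]) // stand-in for the actual parquet encoding
}

fn main() {
    assert!(serialize(&[RecordBatch { num_rows: 0 }]).is_err());
    assert!(serialize(&[RecordBatch { num_rows: 10 }]).is_ok());
}
```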
Carol (Nichols || Goulding) fe9c474620
fix: rustfmt 2022-08-29 14:06:45 -04:00
Carol (Nichols || Goulding) 240946d8f5
fix: Deprecate proto sequencer_id fields; add shard_id fields 2022-08-29 14:06:44 -04:00
Jake Goulding 4abf21c724
refactor: Rename Sequencer (and its entourage) to Shard 2022-08-29 14:06:43 -04:00
Andrew Lamb 7f0ae53d6f
chore: Update to (almost) released object_store 0.4.0 (#5419)
* chore: update object_store

* chore: update hakari config

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-08-17 13:44:48 +00:00
Carol (Nichols || Goulding) b982bdaf2f
fix: Derive Eq when we derive PartialEq and members can derive Eq
Allow this in generated code that we don't control, though.

Recommended by clippy now. https://rust-lang.github.io/rust-clippy/master/index.html#derive_partial_eq_without_eq
2022-08-11 15:04:06 -04:00
Andrew Lamb 16ddc5efc6
chore: Update datafusion / arrow/parquet/arrow-flight and prost/tonic ecosystem (#5360)
* chore: Update datafusion and arrow

* chore: Update Cargo.lock

* chore: update to Decimal128

* chore: Update tonic/prost/pbjson/etc

* chore: Run cargo hakari tasks

* fix: doctest in generated types

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-08-09 17:30:44 +00:00
Marco Neumann e24cecd926
fix: buffer allocation while reading parquet files (#5312)
Work around https://github.com/apache/arrow-rs/issues/2321 by limiting the
reader batch size to the number of rows (based on file-level metadata).

Fixes https://github.com/influxdata/conductor/issues/1103 .
2022-08-04 14:21:05 +00:00
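The workaround amounts to clamping the reader batch size, roughly like this (a sketch; the real code pulls the row count from the parquet file-level metadata, and the default below is illustrative).

```rust
/// Default number of rows the reader would otherwise allocate buffers for.
const DEFAULT_BATCH_SIZE: usize = 8 * 1024;

/// Clamp the batch size to the file's actual row count so the reader does not
/// over-allocate buffers for small files (workaround for arrow-rs#2321).
fn effective_batch_size(file_num_rows: usize) -> usize {
    DEFAULT_BATCH_SIZE.min(file_num_rows.max(1))
}

fn main() {
    assert_eq!(effective_batch_size(100), 100);
    assert_eq!(effective_batch_size(1_000_000), DEFAULT_BATCH_SIZE);
}
```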
dependabot[bot] 55e1e2ec2b
chore(deps): Bump thiserror from 1.0.31 to 1.0.32 (#5294)
* chore(deps): Bump thiserror from 1.0.31 to 1.0.32

Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.31 to 1.0.32.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.31...1.0.32)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-03 16:20:36 +00:00
Nga Tran ee151c8b41
fix: set the batch size back to 1024 to see if the OOM in the compactor goes away (#5289)
* fix: set the batch size back to 1024 to see if the OOM in the compactor goes away

* fix: address review comments

* chore: Apply suggestions from code review

Co-authored-by: Marco Neumann <marco@crepererum.net>

* fix: import needed constant

Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-03 15:13:04 +00:00
Nga Tran 8f1b6f2465
chore: reduce log info (#5254)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-01 16:00:34 +00:00
Marco Neumann 3ae89de324
refactor: increase batch size when reading parquet (#5250)
* refactor: increase batch size when reading parquet

This reduced our overhead when reading parquet files quite a lot.

In an internal benchmark, this reduces the time to perform a single
series aggregation of a rather large series from 58s to 48s with cold
caches. No real difference could be measured for warm caches (~21ms for
both).

This should also help the compactor since the record batches should be
larger.

* refactor: ensure that parquet row group size is in-sync

Ensure that we use the same row group size for reading and writing
parquet files. This is the same value as upstream currently uses as a
default, but let's make sure we don't diverge from that:

3032a521c9/parquet/src/file/properties.rs (L65)

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-01 10:31:26 +00:00
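Keeping the write-side row group size and the read-side batch size in sync could look roughly like this; a sketch against the `parquet` crate's `WriterProperties` builder, with the shared constant being illustrative (it matches the upstream default referenced in the commit above).

```rust
use parquet::file::properties::WriterProperties;

/// One shared constant so reads and writes cannot silently diverge
/// (illustrative; same value as the upstream parquet default).
pub const ROW_GROUP_SIZE: usize = 1024 * 1024;

/// Writer side: cap row groups at the shared size.
pub fn writer_props() -> WriterProperties {
    WriterProperties::builder()
        .set_max_row_group_size(ROW_GROUP_SIZE)
        .build()
}

/// Reader side: use the same value as the record-batch size hint.
pub fn reader_batch_size() -> usize {
    ROW_GROUP_SIZE
}
```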
Andrew Lamb 9215a534d0
chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to `19.0.0` (#5229)
* chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to `19.0.0`

* chore: Run cargo hakari tasks

* fix: Update for API changes

* fix: clippy

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-28 08:10:47 +00:00
Nga Tran d05f383a98
refactor: reduce compaction size and compacted file size to prevent the compactor from waiting forever to read a large file (#5206)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-25 20:08:11 +00:00
Nga Tran 69cb3f2b19
refactor: remove min_sequence_number from Compactor and Querier, add `count_by_overlaps_with_level_0` and `count_by_overlaps_with_level_1` to catalog (#5151)
* refactor: remove min_sequence_number

* fix: typos

* fix: remove min_sequencer_number from new files from merging main

* fix: add back throwing error if the compactor compacts files persisted by the ingester after the ingester sends max seq_num back to querier

* test: add test_compactor_collision back but modify the input to make it work with new changes

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-21 13:51:54 +00:00
dependabot[bot] 278a7f91af
chore(deps): Bump bytes from 1.1.0 to 1.2.0 (#5156)
Bumps [bytes](https://github.com/tokio-rs/bytes) from 1.1.0 to 1.2.0.
- [Release notes](https://github.com/tokio-rs/bytes/releases)
- [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tokio-rs/bytes/compare/v1.1.0...v1.2.0)

---
updated-dependencies:
- dependency-name: bytes
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-20 10:00:08 +00:00
Andrew Lamb e2d871b00b
chore: Update datafusion and arrow/parquet/arrow-flight to `18.0.0` (#5079)
* chore: Update datafusion to 10.0.0, arrow/parquet/arrow-flight to 18

* chore: Run cargo hakari tasks

* fix: update cargo pin

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-18 15:01:03 +00:00
Jake Goulding 635f535e0e refactor: replace level_2 with level_1 2022-07-16 21:49:45 -04:00
dependabot[bot] 9b67de2f43
chore(deps): Bump tokio from 1.19.2 to 1.20.0
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.19.2 to 1.20.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.19.2...tokio-1.20.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-07-14 01:21:43 +00:00
Carol (Nichols || Goulding) 61c023139b
refactor: Switch compaction levels to an enum with values rather than separate consts
Bonuses:

- Type checking
- Validation
- Less casting
- Exhaustiveness checking
- Less use of the numerical value
2022-07-13 11:30:36 -04:00
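A sketch of the enum-with-values approach described above; the variant names and numeric values are illustrative, not necessarily the actual ones used in the codebase.

```rust
/// Compaction levels as a closed set of values instead of loose integer consts.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(i16)]
enum CompactionLevel {
    /// Freshly persisted, possibly overlapping files.
    Initial = 0,
    /// Files rewritten so they no longer overlap.
    FileNonOverlapped = 1,
}

impl TryFrom<i16> for CompactionLevel {
    type Error = String;

    /// Validation lives in one place instead of at every cast site.
    fn try_from(value: i16) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(Self::Initial),
            1 => Ok(Self::FileNonOverlapped),
            other => Err(format!("invalid compaction level: {other}")),
        }
    }
}

fn main() {
    assert_eq!(
        CompactionLevel::try_from(1),
        Ok(CompactionLevel::FileNonOverlapped)
    );
    assert!(CompactionLevel::try_from(42).is_err());
}
```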
Marco Neumann b1b2cb5d4a
feat: load read buffer on demand (#5091)
* refactor: extract `select_schema`

* refactor: improve `InternalLostInputField` error message

* test: improve SQL runner output

* feat: load read buffer on demand

Closes #5032.

* refactor: move `[Half]OwnedSelection` to `schema` crate`
2022-07-13 08:51:40 +00:00
Carol (Nichols || Goulding) 80b6c5c82f
fix: Correct typo in constant name so searching for COMPACTION_LEVEL returns all (#5077) 2022-07-08 16:31:52 +00:00
Andrew Lamb c46e1c6347
chore: Update datafusion + arrow/parquet/arrow-flight to `17.0.0` (#5021)
* fix: correct nullability declaration of system tables

* chore: Update datafusion and arrow/parquet/arrow-flight

* chore: Run cargo hakari tasks

* fix: Update tests

* fix: Update tests

* fix: predicate pruning

* fix: add some tests

* fix: query_functions

* fix: fix read_buffer test

* fix: fix clippy

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-07 19:22:15 +00:00
Marco Neumann 16bd3e67c0
refactor: unify `apply_predicate_to_metadata` (#5030)
Instead of using some hand-rolled timestamp-based logic (or just
"unknown") all over the place, just use logic introduced in #5017.

This requires slightly improved table summaries within the querier that
at least has min/max for the timestamp column. For that, the former
`IngesterChunk`-specific `calculate_summary` method was extended to
`create_basic_summary` to include that data and is now also used by
`QuerierParquetChunk`.

Note: `QuerierRBChunk` already has detailed metrics that are provided
by the read buffer implementation.

Should we ever need even better pruning for `QuerierParquetChunk` (or
`IngesterChunk`) then we _only_ need add extra data to the table
summaries.

Closes #4976.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-07-05 12:51:59 +00:00
Marco Neumann 016dd93d9c
feat: filter chunks before requesting read buffers (#4996)
Fixes #4976.
2022-07-01 08:59:07 +00:00
Marco Neumann be53716e4d
refactor: use IDs for `parquet_file.column_set` (#4965)
* feat: `ColumnRepo::list_by_table_id`

* refactor: use IDs for `parquet_file.column_set`

Closes #4959.

* refactor: introduce `TableSchema::column_id_map`
2022-06-30 15:08:41 +00:00
Marco Neumann 05189fdb00
fix: ensure panic is propagated through `AdapterStream` (#4980)
Prior to this change, background tasks that we fed into `AdapterStream`
could panic, but that would just end the stream without any user-visible
error (except for the panic message on stdout/stderr).

This was found while developing #4964. I have proposed another fix in #4966
but found that I actually developed an existing solution a 2nd time:
`watch_task`. But I also see a major issue with the existing API: one
can create `AdapterStream` with ordinary tokio tasks that are not
watched at all, leaving the burden to the implementor to check for that
(and actually we forgot that in `parquet_file`).

So this change takes a slightly different approach:
The `AdapterStream` does NOT accept ordinary join handles any longer but
requires that you pass a "watched task". The newly introduced
`WatchedTask` does the same as we did manually before: wrapping a future
into a tokio task, watch it and wrap the watcher into a task.

It is now way more difficult to do anything stupid (sure you can still
mix up the tasks and the channels, but we need at least some flexibility
here to allow for "split" and potential future fan-in/out constructs).

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-30 07:57:58 +00:00
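A rough sketch of the "watched task" idea from the commit above, assuming tokio (the helper below is hypothetical, not the crate's actual `WatchedTask` type): the future is wrapped in a task, and a watcher turns a panic into an error value the stream can surface.

```rust
use tokio::task::JoinHandle;

/// Spawn `fut` and wrap it in a watcher that converts a panic into an error,
/// so a panicking background task cannot silently end the stream.
fn spawn_watched<T, F>(fut: F) -> JoinHandle<Result<T, String>>
where
    T: Send + 'static,
    F: std::future::Future<Output = T> + Send + 'static,
{
    let inner = tokio::spawn(fut);
    tokio::spawn(async move {
        match inner.await {
            Ok(value) => Ok(value),
            Err(e) if e.is_panic() => Err("background task panicked".to_string()),
            Err(e) => Err(format!("background task failed: {e}")),
        }
    })
}

#[tokio::main]
async fn main() {
    let ok = spawn_watched(async { 42 }).await.unwrap();
    assert_eq!(ok, Ok(42));

    let panicked = spawn_watched::<(), _>(async { panic!("boom") }).await.unwrap();
    assert!(panicked.is_err());
}
```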
Raphael Taylor-Davies 835e1c91c7
chore: update object_store to 0.3.0 (#4707)
* chore: update object_store to 0.3.0

* chore: review feedback

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-29 21:44:03 +00:00
Nga Tran cfcc4b8426
refactor: change level 1 to level 2 preparing for next design changes (#4954)
* refactor: change level 1 to level 2 preparing for next design changes

* fix: make level-2 consistent everywhere

* chore: remove unused comments

* refactor: rename all occurrences of level_1 to level_2 to completely replace 1 with 2 and make everything consistent

* chore: add corresponding constants for the compaction levels in the comments

Co-authored-by: Dom <dom@itsallbroken.com>
2022-06-29 14:08:58 +00:00
Marco Neumann 9b66d02229
fix: parquet reader sort order (#4964) 2022-06-28 15:28:38 +00:00
Marco Neumann 215f297162
refactor: parquet file metadata from catalog (#4949)
* refactor: remove `ParquetFileWithMetadata`

* refactor: remove `ParquetFileRepo::parquet_metadata`

* refactor: parquet file metadata from catalog

Closes #4124.
2022-06-27 15:38:39 +00:00
Marco Neumann 1a74f84494
refactor: remove `ParquetFileWithMetadata` usage outside the catalog (#4948)
* refactor: remove `DecodedParquetFile` from `iox_tests`

* refactor: remove `DecodedParquetFile` from querier

Also pull out all the chunk schema and sort key handling into a function
so that RB chunks and parquet chunks mostly use the same code path.

* refactor: remove `DecodedParquetFile`

* refactor: remove `ParquetFileWithMetadata` usage

* fix: test data consistency
2022-06-27 15:19:29 +00:00
Nga Tran 3c0fb6e8ef
fix: avoid using min_time, which can be negative, for ChunkId; use the object store id, which is a UUID, instead (#4942)
* fix: avoid using min_time, which can be negative, for ChunkId; use the object store id, which is a UUID, instead

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* chore: run fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 19:00:13 +00:00
Marco Neumann bd6c4659af
refactor: slim down parquet chunk (remove Metadata) (#4934)
* feat: conversion from `ParquetFile` to `ParquetFilePath`

* refactor: slim down parquet chunk

- ensure it works without binary parquet metadata
- timestamp range is no longer optional (ensured by the NG type system)
- remove table summary: this is only needed for SOME API users. The
  compactor can work perfectly well without statistics since it has the timestamp
  range, which is sufficient for the current overlap check (we don't use
  any other primary key stats at the moment). The querier currently does
  NOT use parquet chunks (was replaced by read buffer) but if it will
  again in some future it will likely need to find a way to fetch and
  cache the statistics.
- the schema is now provided by the API user since it can be
  reconstructed using the NG catalog only (and "wrong" column orders are
  tolerated as of #4921)

Ref #4124

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 10:55:16 +00:00
Marco Neumann c899c3a0f4
fix: column handling when reading parquet files (#4921)
* fix: column handling when reading parquet files

This improves/fixes/tests a few aspects when reading parquet files:

- fix usage of `Selection::Some(...)`. This was broken since #4912 but
  apparently no test caught that.
- ensure that the order of `Selection::Some(...)` is preserved
- ensure that schema metadata is attached to output batches
- ignore parquet columns that we don't care about (i.e. do not select)
- allow parquet file to have a different column order than our internal
  bookkeeping, this makes it way simpler to read parquet files w/o
  scanning the metadata first
- extend the test coverage

Ref #4124.

* test: even more tests for parquet reader
2022-06-22 13:51:30 +00:00
Marco Neumann 59accfe862
refactor: assorted fixes and prep work for #4124 (#4912)
* refactor: `TestPartition::update_sort_key` should return an `Arc`

The whole test framework is built around `Arc`s, so let's fix this
consistency issue.

* fix: actually calculate correct column set in test framework

* feat: check expected parquet file schema

While working on the querier I made some mistakes regarding schemas and
such a check would have greatly improved the debugging experience.

* feat: namespace cache expiration

* fix: improve parquet schema check

* fix: remove clone
2022-06-21 16:08:28 +00:00
Marco Neumann 0f63be26c3
refactor: pass path instead of metadata around to load parquet files (#4909) 2022-06-21 10:57:10 +00:00
Marco Neumann c3912e34e9
refactor: store per-file column set in catalog (#4908)
* refactor: store per-file column set in catalog

Together with the table-wide schema and the partition-wide sort key, this should
be everything we need to read a parquet file directly into memory
without peeking any file-level metadata.

The querier will use this to directly load parquet files into the read
buffer.

**WARNING: This requires a catalog wipe!**

Ref #4124.

* refactor: use proper `ColumnSet` type
2022-06-21 10:26:12 +00:00
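A small sketch of what storing the per-file column set as catalog IDs (rather than names or file metadata) might look like; the types below are illustrative newtypes, not the crate's actual definitions.

```rust
/// Catalog-assigned column identifier (illustrative newtype).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ColumnId(i64);

/// Sorted, de-duplicated set of columns present in one parquet file, stored in
/// the catalog so the file can be read without touching its own metadata first.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ColumnSet(Vec<ColumnId>);

impl ColumnSet {
    fn new(ids: impl IntoIterator<Item = ColumnId>) -> Self {
        let mut ids: Vec<_> = ids.into_iter().collect();
        ids.sort();
        ids.dedup();
        Self(ids)
    }

    fn contains(&self, id: ColumnId) -> bool {
        self.0.binary_search(&id).is_ok()
    }
}

fn main() {
    let set = ColumnSet::new([ColumnId(3), ColumnId(1), ColumnId(3)]);
    assert!(set.contains(ColumnId(1)));
    assert!(!set.contains(ColumnId(2)));
}
```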
Marco Neumann 0fbff981ec
chore(deps): Bump sqlx to 0.6.0 and uuid to 1 (#4894)
Closes #4889.
Closes #4890.

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-17 10:28:28 +00:00