570 Commits (77d8967c8e04be7357abca6bd8b264bb8a25dd51)

Author SHA1 Message Date
Marco Neumann 7907a2bae3
fix: column summary conversion for "unknown" TS (#4379)
* fix: column summary conversion for "unknown" TS

Both IOx and DataFusion have the same data model for min/max statistics:

`Option<Option<i64>>` (or any other inner type)

The interpretation is:

1. **`None`:** Value unknown.
2. **`Some(None)`:** Value known to be NULL.
3. **`Some(Some(x))`:** Value known and non-NULL.

The bug was that during the conversion from the IOx statistics type to
the DataFusion statistics type for timestamps, case 1 was converted into
case 2.

Up until now this didn't make a difference because timestamps were
basically known all the time, but during the development of NG there are
cases where the timestamps are unknown (this might change, but the query
engine should be correct w/o assuming that).

* docs: explain test

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-22 07:44:55 +00:00
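A minimal sketch of the three-state model described in the column summary fix above. `TargetStat`, `convert_buggy`, and `convert` are hypothetical names used only for illustration; they are not the IOx or DataFusion types, and the buggy variant shows just one way case 1 could collapse into case 2.

```rust
/// Three-state min/max statistic, following the commit message:
/// `None` = unknown, `Some(None)` = known NULL, `Some(Some(x))` = known value.
type Stat = Option<Option<i64>>;

/// Hypothetical stand-in for the target-side representation.
#[derive(Debug, PartialEq)]
enum TargetStat {
    Unknown,
    Null,
    Value(i64),
}

/// Illustrative bug: flattening first erases the difference between
/// "unknown" and "known NULL".
fn convert_buggy(stat: Stat) -> TargetStat {
    match stat.flatten() {
        Some(x) => TargetStat::Value(x),
        None => TargetStat::Null, // case 1 silently becomes case 2
    }
}

/// Conversion that keeps all three cases apart.
fn convert(stat: Stat) -> TargetStat {
    match stat {
        None => TargetStat::Unknown,
        Some(None) => TargetStat::Null,
        Some(Some(x)) => TargetStat::Value(x),
    }
}
```

With these definitions, `convert(None)` yields `TargetStat::Unknown`, while `convert_buggy(None)` yields `TargetStat::Null`, which is the kind of mix-up the fix addresses.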
Andrew Lamb e67cc9dbce
chore: Update datafusion again (#4385)
* chore: Update datafusion

* fix: Update imports

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-21 21:05:16 +00:00
Carol (Nichols || Goulding) c7a1c496cf
fix: incorrect overlapped grouping (#4082)
* test: Failing test for finding overlapped groups

* test: Failing test for query overlap too :(

* fix: Group parquet files overlapped by time correctly

Inspired by https://towardsdatascience.com/overlapping-time-period-problem-b7f1719347db

Not sure what the real name for this algorithm is

* refactor: Group items without an intermediate hashmap needed

* chore: cleanup

Co-authored-by: NGA-TRAN <nga-tran@live.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-21 18:51:30 +00:00
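The overlapped grouping fix above concerns grouping files whose time ranges overlap. One common way to do this is a sort-and-sweep pass, sketched below. This is an assumption about the general technique, not the code from the commit; `TimeRange` and `group_overlapping` are hypothetical names, and bounds are treated as inclusive.

```rust
/// Hypothetical stand-in for a parquet file's time range (inclusive bounds).
#[derive(Debug, Clone, PartialEq)]
struct TimeRange {
    min_time: i64,
    max_time: i64,
}

/// Groups ranges so that any two ranges connected by a chain of overlaps
/// end up in the same group.
fn group_overlapping(mut ranges: Vec<TimeRange>) -> Vec<Vec<TimeRange>> {
    // Sort by start time so each group can be built in a single pass.
    ranges.sort_by_key(|r| r.min_time);

    let mut groups: Vec<Vec<TimeRange>> = Vec::new();
    // Largest end time seen in the group currently being built.
    let mut group_max = i64::MIN;

    for r in ranges {
        let overlaps = !groups.is_empty() && r.min_time <= group_max;
        if overlaps {
            // Extend the current group and widen its end time if needed.
            group_max = group_max.max(r.max_time);
            groups.last_mut().expect("groups is non-empty").push(r);
        } else {
            // Start a new group with this range.
            group_max = r.max_time;
            groups.push(vec![r]);
        }
    }
    groups
}
```

Tracking the running maximum end time, rather than only the end of the previous range, is what keeps a long range and a later short range that falls inside it in the same group.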
Andrew Lamb 73bed810da
chore: Update arrow, arrow-flight, parquet, tonic, prost, etc (#4357)
* chore: Update datafusion

* chore: Update arrow/arrow-flight/parquet to 12

* chore: update datafusion correctly

* chore: Update prost, tonic, and dependents

* fix: Fixup some api changes

* fix: Update test output in db

* fix: Update test output in parquet_file

* fix: remove old pbjson types

* fix: Add "--experimental_allow_proto3_optional" flag

* chore: Run cargo hakari tasks

* fix: compile error

* chore: Update heappy

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-20 11:12:17 +00:00
Andrew Lamb e3d83fe757
chore: update datafusion (#4342)
* chore: update datafusion

* fix: Update imports for change in datafusion organization
2022-04-19 13:38:12 +00:00
Nga Tran 2a601c3099
fix: Revert "chore: Revert "fix: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)" (#4327)" (#4328)
* fix: Revert "chore: Revert "fix: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)" (#4327)"

This reverts commit 7e5d719027.

* chore: resolve merge conflict

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-18 15:27:39 +00:00
Nga Tran 8e2d158a37
test: deadlock test and add more debug log (#4319)
* test: use Paul's deadlock reproducer and add more debug log

* test: remove comparison of many output rows

* test: verify the test output

* chore: cleanup

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-14 18:06:22 +00:00
Nga Tran 7e5d719027
chore: Revert "fix: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)" (#4327)
This reverts commit fe8d9948d5.
2022-04-14 17:11:55 +00:00
Carol (Nichols || Goulding) fe8d9948d5
fix: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)
This reverts commit 7ddbf7c025.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-14 15:42:28 +00:00
Dom Dwyer 31fdeaaabc refactor: log split worker panics at error level
When the split background worker panics, it now causes an ERROR level
log to be emitted.
2022-04-14 15:39:35 +01:00
Dom Dwyer 00b5c1b296 fix: compaction deadlock
This commit resolves the compaction deadlock described in #4306.

The deadlock occurs during StreamSplitExec execution, where a background
worker is spawned to read input record batches and partition them into
two groups. This code pushes the resulting split record batches into two
channels - one for records that match a given predicate, and another
channel for those that do not. These channels buffer at most 2 record
batches each.

The compactor that executes this plan reads the resulting partitions
sequentially to completion. Completion is indicated by reading until the
results stream ends, which ends when the underlying channel is closed,
and therefore the split worker task must have finished and closed the
results channel for the partition to be successfully read.

While the compactor is reading from the first partition, the worker is
attempting to push record batches into the second partition and blocks
due to the channel capacity being reached. The worker never drops the
channel for the first partition, so the compactor never finishes reading
the first partition, and nothing is reading the second partition to
unblock the worker. Deadlock!
2022-04-14 15:39:35 +01:00
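The stall described in the compaction deadlock commit above can be reproduced in miniature with two bounded channels. The sketch below uses `std::sync::mpsc::sync_channel` with capacity 2 and shows one way to avoid the stall, draining both partitions concurrently. This is an illustration of the problem shape only and is not necessarily how the commit resolves it; `split_concurrently` is a hypothetical name, and plain `i64` values stand in for record batches.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Splits `input` into (matching, non_matching) through two bounded
/// channels that buffer at most 2 items each, as in the commit message.
fn split_concurrently(input: Vec<i64>, pred: fn(i64) -> bool) -> (Vec<i64>, Vec<i64>) {
    let (tx_match, rx_match) = sync_channel::<i64>(2);
    let (tx_rest, rx_rest) = sync_channel::<i64>(2);

    // Background worker: `send` blocks whenever the chosen channel is full.
    let worker = thread::spawn(move || {
        for v in input {
            let tx = if pred(v) { &tx_match } else { &tx_rest };
            if tx.send(v).is_err() {
                break; // the receiver was dropped
            }
        }
        // Both senders are dropped when the closure returns, which closes
        // both channels and ends the readers' iterators.
    });

    // Reading `rx_match` to completion before touching `rx_rest` could
    // stall: once `rx_rest` holds 2 items the worker blocks, so it never
    // closes `rx_match`. Draining both sides at once avoids that.
    let rest_reader = thread::spawn(move || rx_rest.iter().collect::<Vec<i64>>());
    let matching: Vec<i64> = rx_match.iter().collect();

    let rest = rest_reader.join().expect("reader thread panicked");
    worker.join().expect("worker thread panicked");
    (matching, rest)
}
```

Calling `split_concurrently((0..100).collect(), |v| v % 2 == 0)` returns the even values and the odd values in input order, even though each channel only ever buffers two items.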
Nga Tran 3070d78e8c
chore: add more compactor debug info (#4310)
* chore: add more compactor debug info

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* chore: fix format

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-13 19:22:19 +00:00
Carol (Nichols || Goulding) 7ddbf7c025
fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-13 14:11:10 +00:00
kodiakhq[bot] 21f748062e
Merge branch 'main' into cn/sort-in-compactor 2022-04-13 12:43:31 +00:00
Andrew Lamb e96aed6949
chore: add comments and `trace` calls to query provider regarding sort keys (#4274)
* chore: add comments and debug to query provider

* docs: Update query/src/provider.rs

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-12 16:36:39 +00:00
Carol (Nichols || Goulding) 55fe3b8d50
feat: Use the sort key stored in the catalog during compaction
Fixes #4249.
2022-04-11 14:09:45 -04:00
Andrew Lamb be4ebe2563
feat: Add more context to error messages (#4263)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-11 10:51:50 +00:00
Nga Tran f838cb78a2
fix: do not add IOxReadFilterNode for empty non-duplicated chunks (#4264)
* fix: do not add IOxReadFilterNode for empty non-duplicated chunks if there is already a scan node for overlapped/duplicated chunks

* refactor: address review comments

* chore: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-08 21:03:22 +00:00
Andrew Lamb bbbdcc75a8
feat: `QuerierDatabase::chunks` returns `Result` (#4260)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-08 18:54:17 +00:00
Andrew Lamb 34e65c23fa
fix: Update for signature change (#4252)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-08 11:21:07 +00:00
Carol (Nichols || Goulding) b16fcc284d
feat: Add new columns to the sort key during compaction
Connects to #4196.
2022-04-06 09:31:42 -04:00
Carol (Nichols || Goulding) 9043966443
docs: Fix some typos in comments as I noticed them 2022-03-31 16:34:47 -04:00
Andrew Lamb 22b24bdab3
chore: Update datafusion again (#4148)
* chore: update datafusion

* refactor: Update for DataFusion API changes

* chore: TEMP TEMP change df to local copy

* chore: Update to datafusion again

* fix: Update Cargo.lock

* fix: logical conflict
2022-03-30 16:51:48 +00:00
Marco Neumann 20bbb88dc5
refactor: remove table name from `TableSummary` (#4170)
This allows us to remove the table name from the low-level chunk
representations (like `ParquetFile`, RUB, ...) since table names are
already tracked by the higher-level data structures (e.g. catalog,
catalog chunk) that manage the low-level chunk representations.

This is similar to #4167.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-30 13:24:00 +00:00
Marco Neumann 2b76c31157
refactor: make statistics null counts optional (#4160)
Min/max values and distinct counts are already optional, so let's make
the null counts optional as well. This will be helpful for NG to deal w/
partial statistics (e.g. we only populate stats for the time column).

Note that the total count is still mandatory, but we normally have the
chunk/file-level row count at hand.
2022-03-29 17:47:57 +00:00
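A rough sketch of what partial statistics with an optional null count might look like, following the description in the commit above. `ColumnStats` and `known_non_null_count` are hypothetical names and a simplified layout, not the actual IOx statistics types.

```rust
/// Hypothetical, simplified per-column statistics. Everything except the
/// total row count may be missing.
#[derive(Debug, Default, Clone, PartialEq)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    distinct_count: Option<u64>,
    null_count: Option<u64>,
    /// Still mandatory: usually available from the chunk/file row count.
    total_count: u64,
}

/// Number of non-NULL values, if the null count is known at all.
fn known_non_null_count(stats: &ColumnStats) -> Option<u64> {
    stats
        .null_count
        .map(|nulls| stats.total_count.saturating_sub(nulls))
}
```

A caller that only has the row count can then build `ColumnStats { total_count: 10, ..Default::default() }` and consumers see `None` rather than a made-up null count of zero.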
dependabot[bot] 17af5fcbd1
chore(deps): Bump tokio-util from 0.7.0 to 0.7.1 (#4154)
* chore(deps): Bump tokio-util from 0.7.0 to 0.7.1

Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.7.0 to 0.7.1.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-util-0.7.0...tokio-util-0.7.1)

---
updated-dependencies:
- dependency-name: tokio-util
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-29 08:39:02 +00:00
Andrew Lamb 5c69a3f43b
chore: Update deps: datafusion, arrow/arrow-flight/parquet to 11, zstd to 0.11 (#4119)
* chore: update datafusion

* chore(deps): Bump arrow from 10.0.0 to 11.0.0

Bumps [arrow](https://github.com/apache/arrow-rs) from 10.0.0 to 11.0.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/10.0.0...11.0.0)

---
updated-dependencies:
- dependency-name: arrow
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): Bump arrow-flight from 10.0.0 to 11.0.0

Bumps [arrow-flight](https://github.com/apache/arrow-rs) from 10.0.0 to 11.0.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/10.0.0...11.0.0)

---
updated-dependencies:
- dependency-name: arrow-flight
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: update parquet to 11.0.0

* fix: error on create schema, test for same

* fix: upgrade zstd

* chore: Run cargo hakari tasks

* fix: fix logical merge conflict

* fix: hakari

* fix: hakari

* fix: update newly introduced dep

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-24 15:27:36 +00:00
Andrew Lamb b83b000590
chore: Update datafusion (#4071)
* chore: update to datafusion 5936edc2a94d5fb20702a41eab2b80695961b9dc

* chore: Update apis to match datafusion changes
2022-03-22 13:17:41 +00:00
Marco Neumann c9908b260c
refactor: dyn-dispatch database in query subsystem (#4083)
* refactor: dyn-dispatch database in query subsystem

This is similar to #4080 but concerns the database itself.

For #3934.

* docs: improve wording

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-22 09:15:52 +00:00
Marco Neumann d1df95df87 refactor: dyn-dispatch chunks in query subsystem
- this is what DataFusion is doing as well; it's also fast enough
  because the number of chunks in a query is not THAT massive (it's not
  like we are doing row-level dyn dispatching)
- it simplifies abstracting over different databases
- it allows us to drop our enum-based dispatching that we have for
  `DbChunk` and that we would also need for the querier (e.g. depending
  on if a chunk is backed by a parquet file or ingester data)
- it likely speeds up compile times because the `query` crate no longer
  contains massive amounts of generic code

For #3934.
2022-03-21 12:47:54 +01:00
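The dyn-dispatch approach from the commit above can be illustrated with a heavily reduced example. The trait and struct names here (`Chunk`, `ParquetChunk`, `IngesterChunk`, `total_rows`) are hypothetical stand-ins and do not reflect the real `QueryChunk` API.

```rust
use std::sync::Arc;

/// Hypothetical, heavily reduced stand-in for a chunk trait.
trait Chunk {
    fn row_count(&self) -> usize;
}

/// Stand-in for a chunk backed by a parquet file.
struct ParquetChunk {
    rows: usize,
}

/// Stand-in for a chunk backed by ingester data.
struct IngesterChunk {
    buffered: Vec<i64>,
}

impl Chunk for ParquetChunk {
    fn row_count(&self) -> usize {
        self.rows
    }
}

impl Chunk for IngesterChunk {
    fn row_count(&self) -> usize {
        self.buffered.len()
    }
}

/// No generic parameter and no dispatch enum: mixed chunk types share one
/// slice, and each call goes through the trait object's vtable.
fn total_rows(chunks: &[Arc<dyn Chunk>]) -> usize {
    chunks.iter().map(|c| c.row_count()).sum()
}
```

The cost is one indirect call per chunk, which matches the commit's argument that dispatching per chunk is cheap compared to dispatching per row.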
Marco Neumann ca152e7934 refactor: avoid generics in `QueryDatabase`
A step to make this trait object-safe.

Ref #3934.
2022-03-21 10:45:05 +01:00
Marco Neumann 0071b85c22 refactor: make `ExecutionContextProvider` object-safe
Ref #3934.
2022-03-21 10:40:53 +01:00
Marco Neumann 169fa2fb2f refactor: make `QueryChunk` object-safe
This makes it way easier to dyn-type database implementations. The only
real change is that we make `QueryChunk::Error` opaque. Nobody is going
to inspect that anyways, it's just printed to the user.

This is a follow-up of #4053.

Ref #3934.
2022-03-18 11:40:31 +01:00
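A small sketch of the opaque-error idea from the commit above: instead of an associated `Error` type that every user of the trait object would have to name, methods return a boxed error. `QueryChunkLike`, `BoxError`, `MissingData`, `Good`, `Broken`, and `read_all` are hypothetical names; the real `QueryChunk` trait is considerably larger.

```rust
use std::error::Error;
use std::fmt;

/// Opaque error type: callers can print it but do not match on it.
type BoxError = Box<dyn Error + Send + Sync>;

/// Hypothetical reduced trait with no associated error type, so
/// `dyn QueryChunkLike` can be used without naming a concrete error.
trait QueryChunkLike {
    fn read(&self) -> Result<Vec<i64>, BoxError>;
}

#[derive(Debug)]
struct MissingData;

impl fmt::Display for MissingData {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "missing data")
    }
}

impl Error for MissingData {}

struct Good(Vec<i64>);
struct Broken;

impl QueryChunkLike for Good {
    fn read(&self) -> Result<Vec<i64>, BoxError> {
        Ok(self.0.clone())
    }
}

impl QueryChunkLike for Broken {
    fn read(&self) -> Result<Vec<i64>, BoxError> {
        Err(Box::new(MissingData))
    }
}

/// Different implementations, each with its own internal error type, can
/// be handled uniformly; the error is only formatted for the user.
fn read_all(chunks: &[Box<dyn QueryChunkLike>]) -> Vec<String> {
    chunks
        .iter()
        .map(|c| match c.read() {
            Ok(rows) => format!("{} rows", rows.len()),
            Err(e) => format!("error: {}", e),
        })
        .collect()
}
```

The trade-off is that callers lose the ability to match on specific error variants, which the commit message argues nobody was doing anyway.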
Marco Neumann 0850a93f20
refactor: make `QueryDatabase::chunks` async (#4047)
For OG we can determine the chunks w/o any IO, for NG however this might
require a few catalog queries.

This is likely not the last change of this sort, i.e. the whole schema
handling is currently sync as well.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-17 12:55:25 +00:00
Nga Tran 5a29d070ea
feat: Implement the compact function for NG Compactor (#4001)
* feat: initial implementation of compacting a given list of overlapped parquet files

* feat: Add QueryableParquetChunk and some refactoring

* feat: build queryable parquet chunks for parquet files with tombstones

* feat: second half of the implementation for Compactor's compact. Tests will be next

* fix: comments for trait functions of QueryChunkMeta

* test: add tests for compactor's compact function

* fix: typos

* refactor: address Jake's review comments

* refactor: address Andrew's comments and add one more test for files in different order in the vector

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-11 20:25:19 +00:00
Andrew Lamb 2c3d30ca32
chore: Update datafusion, arrow, flight and parquet (#4000)
* chore: Update datafusion, arrow, flight and parquet

* fix: api change

* fix: fmt

* fix: update test metadata size

* fix: Update sizes in parquet test

* fix: more metadata size update
2022-03-10 12:24:47 +00:00
Marco Neumann 77f6153f72
refactor: remove `QueryDatabase::chunk_summaries` (#3977)
- This is not used by the query engine at all.
- The query engine should not care about ALL chunks but only about the
  chunks it gets via `QueryDatabase::chunks` (which includes a table
  name and a predicate).
- All other users of that API are NOT really query-related.
2022-03-08 11:34:26 +00:00
Marco Neumann 5cc1c697fc
refactor: remove `QueryDatabase::partition_addr` (#3976)
- This was not actually used by the query engine.
- The query engine doesn't have a concept of a "partition", it only
  cares about chunks.
- Unbound access to all partitions in the database is quite expensive
  (esp. on NG).
2022-03-08 11:17:31 +00:00
Raphael Taylor-Davies 80fb75d90b
feat: add a flag to enable per-partition tracing (#3928)
* feat: add a flag to enable per-partition tracing

* chore: rename constant

* feat: use BooleanFlag and cache result
2022-03-07 13:49:23 +00:00
Raphael Taylor-Davies 7b28fb4366
feat: improve trace naming (#3931)
* feat: improve trace naming

* test: test span description

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-07 11:49:19 +00:00
Andrew Lamb 9d8bceccbf
test: Add test to verify deduplicating is working (#3937)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-04 20:05:17 +00:00
Andrew Lamb e09f39d6a0
chore: Update datafusion (#3943)
* chore: Update datafusion

* refactor: update for new datafusion

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-03-04 19:37:46 +00:00
Raphael Taylor-Davies e304613546
feat: include trace ID in query log (#3912) (#3923)
* feat: include trace ID in query log (#3912)

* chore: fmt

* chore: lint
2022-03-03 17:50:49 +00:00
Edd Robinson de7c46c9bb feat: add read_window_aggregate tracing 2022-03-03 14:30:27 +00:00
Edd Robinson ea32bc366a feat: add read_group tracing 2022-03-03 14:27:01 +00:00
Edd Robinson 32baaa1ee7 feat: add tracing to field_columns 2022-03-03 14:27:01 +00:00
Edd Robinson 787a848bf5 feat: add tracing for tag_values 2022-03-03 14:27:01 +00:00
Edd Robinson 6a6fbf73ae feat: add tracing support for tag_keys 2022-03-03 14:27:01 +00:00
Edd Robinson 998e205c2c feat: trace table_names 2022-03-03 14:27:01 +00:00
Edd Robinson 301ae886ce feat: add tracing down to the chunk level (#3804)
* refactor: wire execution context to Deduplicator

* feat: example trace to chunk read_filter

* refactor: make execution context required

* refactor: expose metadata API

* refactor: more span context for chunk read_filter

* refactor: fix build

* refactor: push context into result stream

* refactor: make executor optional
2022-03-03 14:27:00 +00:00