These were found by iterating over all of the dependencies of each
Cargo.toml, then grepping that crate for the dependency's name. If it
didn't show up, I attempted to remove it.
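In rough terms, the process was something like the following sketch (hypothetical, not the actual tool; it assumes the `toml` and `walkdir` crates):

```rust
// Rough sketch of the process described above: for each dependency listed
// in a crate's Cargo.toml, grep the crate's sources for the dependency's name.
use std::{fs, path::Path};

fn possibly_unused_deps(crate_dir: &Path) -> Vec<String> {
    let manifest = fs::read_to_string(crate_dir.join("Cargo.toml")).unwrap();
    let parsed: toml::Value = manifest.parse().unwrap();

    let deps: Vec<String> = parsed
        .get("dependencies")
        .and_then(|d| d.as_table())
        .map(|t| t.keys().cloned().collect())
        .unwrap_or_default();

    deps.into_iter()
        .filter(|dep| {
            // Crate names use `-`, but they show up in code as `_`.
            let ident = dep.replace('-', "_");
            !mentioned_in_sources(&crate_dir.join("src"), &ident)
        })
        .collect()
}

// True if any `.rs` file under `src` mentions `ident`.
fn mentioned_in_sources(src: &Path, ident: &str) -> bool {
    walkdir::WalkDir::new(src)
        .into_iter()
        .filter_map(Result::ok)
        .filter(|e| e.path().extension().map_or(false, |x| x == "rs"))
        .any(|e| {
            fs::read_to_string(e.path())
                .map(|s| s.contains(ident))
                .unwrap_or(false)
        })
}

fn main() {
    for dep in possibly_unused_deps(Path::new(".")) {
        println!("possibly unused: {dep}");
    }
}
```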
I left a few dependencies that this process flagged:
* generated_types
- `pbjson`, `serde`. Apparently used by the generated code.
* grpc-router-test-gen
- `prost`. Apparently used by the generated code.
* influxdb_iox
- `heappy`. Doesn't appear to be used, but it's behind enough feature
flags that I don't care to reason about them, and it's already optional.
- `tikv_jemalloc_sys`. Appears to be setting a feature flag of an
indirect dependency.
* iox_gitops_adapter
- `k8s_openapi`. Appears to be setting a feature flag of an indirect
dependency.
* chore: Tool for automating arrow version update
* chore: Update datafusion and arrow/parquet/arrow-flight
* fix: update for changes in Arrow API
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: use stored sort key to deduplicate data
* refactor: verify if one is a super sort key of the other
* test: unit tests for scan and deduplication plans
* fix: typo
* refactor: refactor and add comments
* feat: cache partition sort key to read during planning as needed
* test: tests for query plans with different overlap groups
* chore: cleanup
* chore: resolve merge conflicts
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: document and improve `MockIngesterConnection`
* refactor: split `OldOneMeasurementFourChunksWithDuplicates` for `EXPLAIN` queries
* fix: mark "IngesterPartition" chunks as unsorted
* fix: "group by" queries may require sorted comparison
* refactor: re-export a few more types from querier
* fix: ensure that test parquet files are de-duped
* test: chunks in ingester stage
* docs: explain test code
* refactor: grouping overlaps now uses the same overlap function in both compactor and deduplication
* chore: commit missing file
* chore: address review comments
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: column summary conversion for "unknown" TS
Both IOx and DataFusion have the same data model for min/max statistics:
`Option<Option<i64>>` (or any other inner type).
The interpretation is:
1. **`None`:** Value unknown.
2. **`Some(None)`:** Value known to be NULL.
3. **`Some(Some(x))`:** Value known and non NULL.
The bug was that during the conversion from the IOx statistics type to
the DataFusion statistics type for timestamps, case 1 was converted into
case 2.
Up until now this didn't make a difference because timestamps were
basically known all the time, but during the development of NG there are
cases where the timestamps are unknown (this might change, but the query
engine should be correct w/o assuming that).
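A minimal sketch of the bug and the fix (the `Stat` alias and function names are hypothetical stand-ins for the real conversion code):

```rust
// Hypothetical stand-in for the shared IOx/DataFusion statistics model.
type Stat = Option<Option<i64>>;

// The bug: flattening the two layers turns case 1 ("unknown") into
// case 2 ("known to be NULL").
fn buggy_convert(iox: Stat) -> Stat {
    Some(iox.flatten())
}

// The fix: keep the outer layer intact so all three cases survive.
fn fixed_convert(iox: Stat) -> Stat {
    iox
}

fn main() {
    let unknown: Stat = None;
    assert_eq!(buggy_convert(unknown), Some(None)); // wrongly "known NULL"
    assert_eq!(fixed_convert(unknown), None); // still "unknown", as desired
}
```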
* docs: explain test
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* test: Failing test for finding overlapped groups
* test: Failing test for query overlap too :(
* fix: Group parquet files overlapped by time correctly
Inspired by https://towardsdatascience.com/overlapping-time-period-problem-b7f1719347db
Not sure what the real name for this algorithm is; a sketch follows below.
* refactor: Group items without needing an intermediate hashmap
* chore: cleanup
Co-authored-by: NGA-TRAN <nga-tran@live.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
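A minimal sketch of that grouping sweep, with hypothetical `File` and function names: sort by start time, then a file joins the current group whenever it starts no later than the group's running maximum end time, so transitive overlaps land in the same group.

```rust
// Hypothetical file with a time range; stands in for parquet file metadata.
#[derive(Debug)]
struct File {
    min_time: i64,
    max_time: i64,
}

// Sort by start time, then sweep: a file joins the current group if it
// starts no later than the group's maximum end time seen so far.
fn group_overlapping(mut files: Vec<File>) -> Vec<Vec<File>> {
    files.sort_by_key(|f| f.min_time);
    let mut groups: Vec<Vec<File>> = vec![];
    let mut group_max = i64::MIN;

    for f in files {
        if groups.is_empty() || f.min_time > group_max {
            // No overlap with the current group: start a new one.
            groups.push(vec![]);
            group_max = f.max_time;
        } else {
            // Overlaps (possibly transitively): extend the current group.
            group_max = group_max.max(f.max_time);
        }
        groups.last_mut().unwrap().push(f);
    }
    groups
}

fn main() {
    let files = vec![
        File { min_time: 0, max_time: 5 },
        File { min_time: 3, max_time: 8 },   // overlaps [0, 5]
        File { min_time: 10, max_time: 12 }, // overlaps nothing
    ];
    // Prints two groups: the first two files together, the third alone.
    println!("{:?}", group_overlapping(files));
}
```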
* test: use Paul's deadlock reproducer and add more debug logging
* test: remove comparison of many output rows
* test: verify the test output
* chore: cleanup
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This commit resolves the compaction deadlock described in #4306.
The deadlock occurs during StreamSplitExec execution, where a background
worker is spawned to read input record batches and partition them into
two groups. This code pushes the resulting split record batches into two
channels - one for records that match a given predicate, and another
channel for those that do not. These channels buffer at most 2 record
batches each.
The compactor that executes this plan reads the resulting partitions
sequentially to completion. The results stream for a partition only ends
when the underlying channel is closed, and the channel is only closed
once the split worker task has finished, so the worker must run to
completion before a partition can be successfully read.
While the compactor is reading from the first partition, the worker is
attempting to push record batches into the second partition and blocks
due to the channel capacity being reached. The worker never drops the
channel for the first partition, so the compactor never finishes reading
the first partition, and nothing is reading the second partition to
unblock the worker. Deadlock!
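A self-contained sketch of the failure mode, with std channels standing in for the actual stream plumbing (this program hangs when run):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Each partition buffers at most 2 record batches, like the split channels.
    let (tx_a, rx_a) = sync_channel::<u32>(2);
    let (tx_b, rx_b) = sync_channel::<u32>(2);

    // The split worker pushes each input batch into both partitions.
    thread::spawn(move || {
        for batch in 0..10 {
            tx_a.send(batch).unwrap();
            // Blocks once partition B's buffer is full, because nothing is
            // draining it yet, so the worker never finishes and the
            // channels are never closed.
            tx_b.send(batch).unwrap();
        }
    });

    // The compactor drains partition A to completion first. This loop only
    // ends when `tx_a` is dropped, which requires the worker to finish.
    for batch in rx_a {
        println!("partition A: {batch}");
    }

    // Never reached: deadlock.
    for batch in rx_b {
        println!("partition B: {batch}");
    }
}
```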
* chore: add more compactor debug info
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: fix format
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: do not add an IOxReadFilterNode for non-duplicated chunks with no data if there is already a scan node for the overlapped/duplicated chunks
* refactor: address review comments
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This allows us to remove the table name from the low-level chunk
representations (like `ParquetFile`, RUB, ...) since table names are
already tracked by the higher-level data structures (e.g. catalog,
catalog chunk) that manage the low-level chunk representations.
This is similar to #4167.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Min/max values and distinct counts are already optional, so let's make
the null counts optional as well. This will be helpful for NG to deal w/
partial statistics (e.g. we only populate stats for the time column).
Note that the total count is still mandatory, but we normally have the
chunk/file-level row count at hand.
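Sketched with hypothetical field names (the real IOx types differ), the per-column statistics shape after this change is roughly:

```rust
// Illustrates which statistics fields are optional after this change.
#[allow(dead_code)]
struct ColumnStatistics {
    min: Option<Option<i64>>,    // already optional ("unknown" vs "known NULL")
    max: Option<Option<i64>>,    // already optional
    distinct_count: Option<u64>, // already optional
    null_count: Option<u64>,     // newly optional
    total_count: u64,            // still mandatory: the row count is at hand
}
```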
* refactor: dyn-dispatch database in query subsystem
This is similar to #4080 but concerns the database itself.
For #3934.
* docs: improve wording
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>