* ci: use same feature set in `build_dev` and `build_release`
* ci: also enable unstable tokio for `build_dev`
* chore: update tokio to 1.21 (to fix console-subscriber 0.1.8
* fix: "must use"
Remove our own hand-rolled logic and let DataFusion read the parquet
files.
As a bonus, this now supports predicate pushdown to the deserialization
step, so we can use parquets as in in-mem buffer.
Note that this currently uses some "nested" DataFusion hack due to the
way the `QueryChunk` interface works. Midterm I'll change the interface
so that the `ParquetExec` nodes are directly visible to DataFusion
instead of some opaque `SendableRecordBatchStream`.
* refactor: do not override parquet file size in querier
This is going to be an issue when we actually rely on the size for
reading, see #5531.
* refactor: use selected file size mocking in compactor
Do not blindly override parquet file sizes for all subsystems.
This is going to be an issue when we actually rely on the size for
reading, see #5531.
* refactor: remove ability to override file sizes in catalog
Blindly overriding data for all subsystems is dangerous, because some
parts of our stack actually rely on the actual file size. See #5531.
* docs: explain `size_overrides`
* feat: make compactors to select candidates based on the last n minutes to reduce workload for postgres catalog query
* refactor: remove 1-minute case per review comment
* fix: loop forever in compact_hot_partition_candidates
* chore: cleanup
* fix: avoid using continues that will cause bugs in corner cases
* fix: Pass compaction fn as a closure instead to allow collection of groups in test
* fix: Add Send bound as suggested by clippy
* fix: fix the test to return data of round 3 instead of round 2
Co-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: initial implementation of memory estimation for a compaction
* feat: estimate size of files and have the right actions for the needed budget
* feat: run candidates in parallel
* fix: have the right name for the column field of the output struct
* feat: add metrics for estimated budgets
* chore: cleanup
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* fix: fix syntax after applying review's suggestions
* refactor: Convert a Vec to VecDeque to go well with pop and push
* chore: remove max_concurrent_size_bytes and input_size_threshold_bytes
* chore: remove input_file_count_threshold
* test: tests for estimate_arrow_bytes_for_file
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes the compactor code to tolerate a SplitExec yielding an empty
partition (with no rows).
This raises a WARN as the situation in which this is acceptable is very
rare, and is more likely indicative of an opportunity to improve the
SplitExec usage (i.e. pruning out unnecessary split points).
* feat: file file_count_threshold for comapcting cold partitions to make it consistent with the hot case and help set up to avoid oom easier
* chore: remove unecessary commments
Cold partition compaction will (in the next commit) upgrade a level 0
file without any overlaps rather than running compaction.
Cold partition filtering gathers all level 0 files in the (already
deemed cold) partition with all overlapping level 1 files, and does not
limit the set of files being compacted by their number or size.