This encodes the result directly and has the FilterResult hold only the
relevant data to the state. So no longer any need to create or check for
empty vectors or 0 budget_bytes. Also creates a new type after checking
the filter result state and handling the budget, as actual compaction
doesn't need to care about that.
This could still use more refactoring to become a clearer pipeline of
different states, but I think this is a good start.
Table columns for a partition don't change, so rather than carrying
around table columns for the partition and parquet files to look up
repeatedly, have the `PartitionCompactionCandidateWithInfo` keep track
of its column types and be able to estimate bytes given a number of rows
from a parquet file.
Temporarily disable full compaction from level 1 to 2.
Re-use the memory budget estimation and parallelization for cold
compaction. Rather than choosing cold compaction candidates and then in
parallel compacting each partition from level 0 to 1 and then 1 to 2,
this commit switches to compacting in parallel (by memory budget) all
candidates form level 0 to 1. The next commit will re-enable full
compaction of all partitions in parallel (by memory budget).
Which is currently the only step, compacting any remaining level 0 files
into level 1. Make a TODO function for performing full compaction of all
level 1 files next.
In our data model, a chunk always belongs to a partition[^1], so let's
not make this attribute optional. The optional value only leads to
-- mostly surprising -- conditional behavior, ranging from "do not equalize
the partition sort key" (querier) to "always consider the chunk overlapping"
(iox_query when dealing with ingester chunks).
[^1]: This is even true when the chunk belongs to a parquet file that is not
yet added to the catalog, contrary to what a comment in the ingester
stated. The catalog and data model used by the querier are two totally
different things.
I'm spending way too long with the wrong number of arguments to
CompactorConfig::new and not a lot of help from the compiler. If these
struct fields are pub, they can be set directly and destructured, etc,
which the compiler gives way more help on. This also reduces duplication
and boilerplate that has to be updated when the config fields change.
* ci: use same feature set in `build_dev` and `build_release`
* ci: also enable unstable tokio for `build_dev`
* chore: update tokio to 1.21 (to fix console-subscriber 0.1.8
* fix: "must use"
Remove our own hand-rolled logic and let DataFusion read the parquet
files.
As a bonus, this now supports predicate pushdown to the deserialization
step, so we can use parquets as in in-mem buffer.
Note that this currently uses some "nested" DataFusion hack due to the
way the `QueryChunk` interface works. Midterm I'll change the interface
so that the `ParquetExec` nodes are directly visible to DataFusion
instead of some opaque `SendableRecordBatchStream`.
* refactor: do not override parquet file size in querier
This is going to be an issue when we actually rely on the size for
reading, see #5531.
* refactor: use selected file size mocking in compactor
Do not blindly override parquet file sizes for all subsystems.
This is going to be an issue when we actually rely on the size for
reading, see #5531.
* refactor: remove ability to override file sizes in catalog
Blindly overriding data for all subsystems is dangerous, because some
parts of our stack actually rely on the actual file size. See #5531.
* docs: explain `size_overrides`
* feat: make compactors to select candidates based on the last n minutes to reduce workload for postgres catalog query
* refactor: remove 1-minute case per review comment
* fix: loop forever in compact_hot_partition_candidates
* chore: cleanup
* fix: avoid using continues that will cause bugs in corner cases
* fix: Pass compaction fn as a closure instead to allow collection of groups in test
* fix: Add Send bound as suggested by clippy
* fix: fix the test to return data of round 3 instead of round 2
Co-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: initial implementation of memory estimation for a compaction
* feat: estimate size of files and have the right actions for the needed budget
* feat: run candidates in parallel
* fix: have the right name for the column field of the output struct
* feat: add metrics for estimated budgets
* chore: cleanup
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* fix: fix syntax after applying review's suggestions
* refactor: Convert a Vec to VecDeque to go well with pop and push
* chore: remove max_concurrent_size_bytes and input_size_threshold_bytes
* chore: remove input_file_count_threshold
* test: tests for estimate_arrow_bytes_for_file
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes the compactor code to tolerate a SplitExec yielding an empty
partition (with no rows).
This raises a WARN as the situation in which this is acceptable is very
rare, and is more likely indicative of an opportunity to improve the
SplitExec usage (i.e. pruning out unnecessary split points).
* feat: file file_count_threshold for comapcting cold partitions to make it consistent with the hot case and help set up to avoid oom easier
* chore: remove unecessary commments
Cold partition compaction will (in the next commit) upgrade a level 0
file without any overlaps rather than running compaction.
Cold partition filtering gathers all level 0 files in the (already
deemed cold) partition with all overlapping level 1 files, and does not
limit the set of files being compacted by their number or size.
It seems that the buffering / parallelization code cannot deal with
empty lists and just freezes forever (which blocks shutdown but will
also freeze the compactor forever).
* feat: run many compact partitions in parallel
* refactor: Use rust futures fu to run compactor jobs in parallel
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* fix: reduce log verbosity
* refactor: sleep for a sec if no work, print debug
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: `QueryChunk::as_any`
* feat: allo `ChunkPruner::prune_chunks` to fail
* feat: limit per-table chunk data for every query
Closes#5211.
* fix: address review comments
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: Split large compactions into multiple compacted files
Connects to #5121
* refactor: Extract update catalog function and error type
* refactor: Share physical plan to object store streaming
And only differ in the logical plan building based on split times in
different compaction cases.
* fix: Test for a split time equal to the max time and don't split then
* chore: cherry pick the first 3 commits of branch cn/connect-new-compaction
* fix: modify the test to work correctly with compactor running
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* test: Document the behavior of compute_split_time when min time = max time
* fix: compute_split_time returns one value when min_time = max_time
Co-authored-by: NGA-TRAN <nga-tran@live.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
If compaction is called on one level 0 file, the work of compaction
doesn't really need to be done, but for simplicity's sake for now, run
it through the compaction query/write process rather than special casing
it. This will keep us from getting stuck in the unlikely event that a
partition with one level 0 file gets selected as a top candidate for
compaction.
* refactor: remove min_sequnce_number
* fix: typos
* fix: remove min_sequencer_number from new files from merging main
* fix: add back throwing error if the compactor compacts files persisted by the ingester after the ingester sends max seq_num back to querier
* test: add test_compactor_collision back but modify the input to make it work woth new changes
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
- remove `IOxSessionContext::default()` because untracked contexts
should only be created by tests
- remove `Option<IOxSessionContext>` because it is a typed workaround
for `IOxSessionContext::default`
Tests should use `IOxSessionContext::testing` and all _normal_ users
should create proper contexts.
I suspect this will help tracing or at least prevent silent regressions.
See #5129.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: initial implementation for selecting compaction candidates
* feat: 2 catalog functions to choose the most thorughput partitions to compact and the selecting candidate function itself
* test: tests for the new 2 queries
* feat: more tests and metrics for chooing compaction candidates
* chore: Apply self suggestions from self review
* chore: cleanup
* chore: fix doc comment
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* refactor: address review comments
* fix: get the right time provider for the tests
* refactor: remove the left over compaction_
* fix: typos
* fix: make the param name and env name consistent
* refactor: make relevant iSomething to uSomething
* fix: typo
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Fixes#5118.
Given a partition ID, look up the non-deleted Parquet files for that
partition. Separate them into level 0 and level 1, and sort the level 0
files by max sequence number.
This is not called anywhere yet.