But still pass them to hot_partitions_for_shard.
And make the argument order match recent_highest_throughput_partitions,
since I'd already mixed it up, and use the same names throughout.
This makes the closure passed to get_candidates_with_retry simpler.
This will work the same way that compacting level 0 -> level 1 does,
except that the result won't be split into multiple files. It will be
limited by the memory budget bytes, which should constrain the groups
more than max_file_size_bytes would.
This encodes the result directly and has the FilterResult hold only the
data relevant to each state, so there's no longer any need to create or
check for empty vectors or a budget_bytes of 0. It also creates a new
type after the filter result state has been checked and the budget
handled, as actual compaction doesn't need to care about either.
This could still use more refactoring to become a clearer pipeline of
different states, but I think this is a good start.
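A rough sketch of what encoding the outcome as an enum could look like; the variant and field names here are hypothetical, not the actual IOx definitions:

```rust
// Hypothetical sketch: each state carries only its own data, so there are no
// sentinel empty vectors and no budget_bytes of 0 to check for.
enum FilterResult {
    NothingToCompact,
    OverBudget { needed_bytes: u64 },
    Proceed { file_ids: Vec<i64>, budget_bytes: u64 },
}

fn main() {
    let result = FilterResult::Proceed {
        file_ids: vec![1, 2, 3],
        budget_bytes: 4_096,
    };
    match result {
        FilterResult::NothingToCompact => println!("skip partition"),
        FilterResult::OverBudget { needed_bytes } => {
            println!("defer: needs {needed_bytes} bytes")
        }
        // Only this arm feeds into actual compaction, which no longer needs
        // to care about the other states.
        FilterResult::Proceed { file_ids, budget_bytes } => {
            println!("compact {} files within {budget_bytes} bytes", file_ids.len())
        }
    }
}
```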
Table columns for a partition don't change, so rather than carrying
around table columns for the partition and parquet files to look up
repeatedly, have the `PartitionCompactionCandidateWithInfo` keep track
of its column types and be able to estimate bytes given a number of rows
from a parquet file.
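A minimal sketch of that idea; the field and method names and the per-column byte costs here are made up for illustration:

```rust
// The candidate caches the partition's column types once and estimates Arrow
// memory from a parquet file's row count, instead of re-fetching the table
// columns for every file.
#[derive(Clone, Copy)]
enum ColumnType {
    I64,
    F64,
    Tag,
    Time,
}

struct PartitionCompactionCandidateWithInfo {
    partition_id: i64,
    column_types: Vec<ColumnType>, // looked up once per partition
}

impl PartitionCompactionCandidateWithInfo {
    fn estimate_bytes(&self, row_count: usize) -> usize {
        // Illustrative per-row costs only; the real estimation differs.
        let per_row: usize = self
            .column_types
            .iter()
            .map(|c| match c {
                ColumnType::I64 | ColumnType::F64 | ColumnType::Time => 8,
                ColumnType::Tag => 16,
            })
            .sum();
        per_row * row_count
    }
}

fn main() {
    let candidate = PartitionCompactionCandidateWithInfo {
        partition_id: 1,
        column_types: vec![ColumnType::Time, ColumnType::Tag, ColumnType::F64],
    };
    // A 1000-row parquet file -> (8 + 16 + 8) * 1000 bytes.
    println!(
        "partition {}: ~{} bytes",
        candidate.partition_id,
        candidate.estimate_bytes(1000)
    );
}
```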
Temporarily disable full compaction from level 1 to 2.
Re-use the memory budget estimation and parallelization for cold
compaction. Rather than choosing cold compaction candidates and then in
parallel compacting each partition from level 0 to 1 and then 1 to 2,
this commit switches to compacting in parallel (by memory budget) all
candidates from level 0 to 1. The next commit will re-enable full
compaction of all partitions in parallel (by memory budget).
This is currently the only step: compacting any remaining level 0 files
into level 1. Add a TODO function for performing full compaction of all
level 1 files next.
In our data model, a chunk always belongs to a partition[^1], so let's
not make this attribute optional. The optional value only leads to
-- mostly surprising -- conditional behavior, ranging from "do not equalize
the partition sort key" (querier) to "always consider the chunk overlapping"
(iox_query when dealing with ingester chunks).
[^1]: This is even true when the chunk belongs to a parquet file that is not
yet added to the catalog, contrary to what a comment in the ingester
stated. The catalog and data model used by the querier are two totally
different things.
I've been spending way too long getting the wrong number of arguments to
CompactorConfig::new, with not a lot of help from the compiler. If these
struct fields are pub, they can be set directly and destructured, etc,
which the compiler gives way more help on. This also reduces duplication
and boilerplate that has to be updated when the config fields change.
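To illustrate the difference, a small sketch with made-up field names (not the real CompactorConfig fields):

```rust
// With pub fields, the struct literal names every field (and the compiler
// flags any missing or misspelled one), and the config can be destructured.
pub struct CompactorConfig {
    pub max_desired_file_size_bytes: u64,
    pub percentage_max_file_size: u16,
    pub split_percentage: u16,
    pub memory_budget_bytes: u64,
}

fn main() {
    let config = CompactorConfig {
        max_desired_file_size_bytes: 100 * 1024 * 1024,
        percentage_max_file_size: 30,
        split_percentage: 80,
        memory_budget_bytes: 10 * 1024 * 1024 * 1024,
    };

    // Destructuring also gets compiler help whenever the fields change.
    let CompactorConfig {
        memory_budget_bytes,
        ..
    } = config;
    println!("budget: {memory_budget_bytes} bytes");
}
```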
* ci: use same feature set in `build_dev` and `build_release`
* ci: also enable unstable tokio for `build_dev`
* chore: update tokio to 1.21 (to fix console-subscriber 0.1.8)
* fix: "must use"
Remove our own hand-rolled logic and let DataFusion read the parquet
files.
As a bonus, this now supports predicate pushdown to the deserialization
step, so we can use parquet files as an in-memory buffer.
Note that this currently uses some "nested" DataFusion hack due to the
way the `QueryChunk` interface works. Mid-term I'll change the interface
so that the `ParquetExec` nodes are directly visible to DataFusion
instead of some opaque `SendableRecordBatchStream`.
* refactor: do not override parquet file size in querier
This is going to be an issue when we actually rely on the size for
reading, see #5531.
* refactor: use selected file size mocking in compactor
Do not blindly override parquet file sizes for all subsystems.
This is going to be an issue when we actually rely on the size for
reading, see #5531.
* refactor: remove ability to override file sizes in catalog
Blindly overriding data for all subsystems is dangerous, because some
parts of our stack actually rely on the actual file size. See #5531.
* docs: explain `size_overrides`
* feat: make compactors select candidates based on the last n minutes to reduce the workload of the postgres catalog query
* refactor: remove 1-minute case per review comment
* fix: loop forever in compact_hot_partition_candidates
* chore: cleanup
* fix: avoid using continues that will cause bugs in corner cases
* fix: Pass compaction fn as a closure instead to allow collection of groups in test
* fix: Add Send bound as suggested by clippy
* fix: fix the test to return data of round 3 instead of round 2
Co-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: initial implementation of memory estimation for a compaction
* feat: estimate size of files and have the right actions for the needed budget
* feat: run candidates in parallel
* fix: have the right name for the column field of the output struct
* feat: add metrics for estimated budgets
* chore: cleanup
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* fix: fix syntax after applying review's suggestions
* refactor: Convert a Vec to VecDeque to go well with pop and push
* chore: remove max_concurrent_size_bytes and input_size_threshold_bytes
* chore: remove input_file_count_threshold
* test: tests for estimate_arrow_bytes_for_file
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes the compactor code to tolerate a SplitExec yielding an empty
partition (with no rows).
This raises a WARN as the situation in which this is acceptable is very
rare, and is more likely indicative of an opportunity to improve the
SplitExec usage (i.e. pruning out unnecessary split points).
* feat: add file_count_threshold for compacting cold partitions to make it consistent with the hot case and make it easier to avoid OOM
* chore: remove unnecessary comments
Cold partition compaction will (in the next commit) upgrade a level 0
file without any overlaps rather than running compaction.
Cold partition filtering gathers all level 0 files in the (already
deemed cold) partition with all overlapping level 1 files, and does not
limit the set of files being compacted by their number or size.
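A sketch of the decision described above, using hypothetical types; the real code works on catalog types and does more bookkeeping:

```rust
// Hypothetical sketch: a lone level-0 file with no overlapping level-1 files
// is upgraded in place (next commit); otherwise all level-0 files plus their
// overlapping level-1 files are compacted together, with no count/size cap.
#[derive(Clone, PartialEq)]
enum Level {
    L0,
    L1,
}

#[derive(Clone)]
struct File {
    id: i64,
    level: Level,
    min_time: i64,
    max_time: i64,
}

enum ColdAction {
    Upgrade(i64),       // upgrade this single level-0 file to level 1
    Compact(Vec<File>), // compact level-0 + overlapping level-1 files
}

fn overlaps(a: &File, b: &File) -> bool {
    a.min_time <= b.max_time && b.min_time <= a.max_time
}

fn plan_cold_partition(files: &[File]) -> ColdAction {
    let level_0: Vec<File> = files.iter().filter(|f| f.level == Level::L0).cloned().collect();
    let overlapping_l1: Vec<File> = files
        .iter()
        .filter(|&f| f.level == Level::L1 && level_0.iter().any(|l0| overlaps(l0, f)))
        .cloned()
        .collect();

    if level_0.len() == 1 && overlapping_l1.is_empty() {
        ColdAction::Upgrade(level_0[0].id)
    } else {
        let mut group = level_0;
        group.extend(overlapping_l1);
        ColdAction::Compact(group)
    }
}

fn main() {
    let files = vec![File { id: 1, level: Level::L0, min_time: 0, max_time: 10 }];
    match plan_cold_partition(&files) {
        ColdAction::Upgrade(id) => println!("upgrade file {id}"),
        ColdAction::Compact(group) => println!("compact {} files", group.len()),
    }
}
```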
It seems that the buffering / parallelization code cannot deal with
empty lists and just freezes forever (which blocks shutdown and also
hangs the compactor).
* feat: run many compact partitions in parallel
* refactor: Use rust futures fu to run compactor jobs in parallel
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* fix: reduce log verbosity
* refactor: sleep for a sec if no work, print debug
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: `QueryChunk::as_any`
* feat: allow `ChunkPruner::prune_chunks` to fail
* feat: limit per-table chunk data for every query
Closes #5211.
* fix: address review comments
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: Split large compactions into multiple compacted files
Connects to #5121
* refactor: Extract update catalog function and error type
* refactor: Share physical plan to object store streaming
And only differ in the logical plan building based on split times in
different compaction cases.
* fix: Test for a split time equal to the max time and don't split then
* chore: cherry pick the first 3 commits of branch cn/connect-new-compaction
* fix: modify the test to work correctly with compactor running
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* test: Document the behavior of compute_split_time when min time = max time
* fix: compute_split_time returns one value when min_time = max_time
Co-authored-by: NGA-TRAN <nga-tran@live.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
If compaction is called on one level 0 file, the work of compaction
doesn't really need to be done, but for simplicity's sake for now, run
it through the compaction query/write process rather than special casing
it. This will keep us from getting stuck in the unlikely event that a
partition with one level 0 file gets selected as a top candidate for
compaction.
* refactor: remove min_sequence_number
* fix: typos
* fix: remove min_sequencer_number from new files from merging main
* fix: add back the error thrown when the compactor compacts files persisted by the ingester after the ingester has sent the max seq_num back to the querier
* test: add test_compactor_collision back, but modify the input to make it work with the new changes
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
- remove `IOxSessionContext::default()` because untracked contexts
should only be created by tests
- remove `Option<IOxSessionContext>` because it is a typed workaround
for `IOxSessionContext::default`
Tests should use `IOxSessionContext::testing` and all _normal_ users
should create proper contexts.
I suspect this will help tracing or at least prevent silent regressions.
See #5129.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: initial implementation for selecting compaction candidates
* feat: 2 catalog functions to choose the highest-throughput partitions to compact, plus the candidate-selection function itself
* test: tests for the new 2 queries
* feat: more tests and metrics for choosing compaction candidates
* chore: Apply suggestions from self-review
* chore: cleanup
* chore: fix doc comment
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* refactor: address review comments
* fix: get the right time provider for the tests
* refactor: remove the leftover compaction_
* fix: typos
* fix: make the param name and env name consistent
* refactor: make relevant iSomething to uSomething
* fix: typo
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Fixes #5118.
Given a partition ID, look up the non-deleted Parquet files for that
partition. Separate them into level 0 and level 1, and sort the level 0
files by max sequence number.
This is not called anywhere yet.
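A minimal sketch of that lookup with hypothetical types (the real version queries the catalog):

```rust
// Take the partition's non-deleted parquet files, split them by compaction
// level, and sort the level-0 files by max sequence number.
#[derive(Debug, Clone)]
struct ParquetFile {
    compaction_level: i16,  // 0 or 1
    max_sequence_number: i64,
    to_delete: Option<i64>, // deletion marker; None = not deleted
}

fn parquet_files_for_compaction(files: Vec<ParquetFile>) -> (Vec<ParquetFile>, Vec<ParquetFile>) {
    let (mut level_0, level_1): (Vec<_>, Vec<_>) = files
        .into_iter()
        .filter(|f| f.to_delete.is_none())
        .partition(|f| f.compaction_level == 0);

    level_0.sort_by_key(|f| f.max_sequence_number);
    (level_0, level_1)
}

fn main() {
    let files = vec![
        ParquetFile { compaction_level: 0, max_sequence_number: 7, to_delete: None },
        ParquetFile { compaction_level: 1, max_sequence_number: 3, to_delete: None },
        ParquetFile { compaction_level: 0, max_sequence_number: 2, to_delete: None },
    ];
    let (level_0, level_1) = parquet_files_for_compaction(files);
    assert_eq!(level_0[0].max_sequence_number, 2);
    assert_eq!(level_1.len(), 1);
}
```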
* docs: add consensus for the desired final output of the compactor
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: add initial readme to the compactor
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: move parquet chunk ID to `ChunkMeta`
* refactor: return `Arc` from `QueryChunk::summary`
This is similar to how we handle other chunk data like schemas. This
allows a chunk to change/refine its belief about its own payload while
it is passed around in the query stack.
Helps w/ #5032.
* fix: avoid combining groups that overlap with other groups even if they are small
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: convert a panic into an error and throw a warning if we choose non-actionable candidates
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: run fmt
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Instead of using some hand-rolled timestamp-based logic (or just
"unknown") all over the place, just use logic introduced in #5017.
This requires slightly improved table summaries within the querier that
at least have min/max for the timestamp column. For that, the former
`IngesterChunk`-specific `calculate_summary` method was extended to
`create_basic_summary` to include that data and is now also used by
`QuerierParquetChunk`.
Note: `QuerierRBChunk` already has detailed metrics that are provided
by the read buffer implementation.
Should we ever need even better pruning for `QuerierParquetChunk` (or
`IngesterChunk`) then we _only_ need to add extra data to the table
summaries.
Closes #4976.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: Split overlapped files based on the order of sequence numbers and only group non-overlapped contiguous small files
* test: add one more test for grouping contiguous files
* refactor: address review comments
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: change level 1 to level 2 preparing for next design changes
* fix: make level-2 consistent everywhere
* chore: remove unused comments
* refactor: rename level_1 to level_2 everywhere to completely replace 1 with 2 and make everything consistent
* chore: add corresponding constants for the compaction levels in the comments
Co-authored-by: Dom <dom@itsallbroken.com>
* refactor: remove `DecodedParquetFile` from `iox_tests`
* refactor: remove `DecodedParquetFile` from querier
Also pull out all the chunk schema and sort key handling into a function
so that RB chunks and parquet chunks mostly use the same code path.
* refactor: remove `DecodedParquetFile`
* refactor: remove `ParquetFileWithMetadata` usage
* fix: test data consistency
* refactor: simplify sort key calculation
* refactor: use schema from catalog instead from file
* refactor: do not request parquet file MD in compactor
* test: ensure that `QueryableParquetChunk` works correctly
* refactor: avoid feeding sort key from struct into same struct
* feat: allow namespace schema query by ID
* refactor: do not use binary parquet file MD in compactor tests
* refactor: do not use in-parquet IOx metadata
* refactor: reduce number of catalog queries
* fix: avoid using min_time, which can be negative, for ChunkId; use the object store ID, which is a UUID, instead
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: run fmt
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: split times of compaction results based on the max file size
* feat: consider max file size while computing split time
* test: tests for compute_split_time
* feat: first step to teach the function split_the_stream how to split data into n streams using n-1 input PhysicalExprs
* feat: make StreamSplitNode support a list of expressions
* docs: explain how StreamSplitNode works
* feat: Teach compute_split_time to split a time range into many contiguous ranges and split the compacted result into multiple non-overlapped files based on the config compaction_max_size_bytes (see the sketch below)
* chore: cleanup
* chore: clean up doc
* chore: address review comments
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
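A rough sketch of the splitting idea under simplifying assumptions (even time distribution, made-up signature); the real compute_split_time works from the estimated sizes of the compaction input:

```rust
// Illustrative only: estimate how many output files are needed from the
// total size vs. the max file size, then cut [min_time, max_time] into that
// many contiguous ranges. A degenerate range (min == max) or a small result
// yields a single value, i.e. no split.
fn compute_split_time(min_time: i64, max_time: i64, total_size: u64, max_file_size: u64) -> Vec<i64> {
    if min_time == max_time || total_size <= max_file_size {
        return vec![max_time];
    }
    let n = ((total_size + max_file_size - 1) / max_file_size) as i64; // output files, rounded up
    (1..=n).map(|i| min_time + (max_time - min_time) * i / n).collect()
}

fn main() {
    // 3x the max size -> three contiguous ranges ending at max_time.
    assert_eq!(compute_split_time(0, 300, 30, 10), vec![100, 200, 300]);
    // min_time == max_time -> one value, nothing to split.
    assert_eq!(compute_split_time(5, 5, 30, 10), vec![5]);
}
```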
* feat: conversion from `ParquetFile` to `ParquetFilePath`
* refactor: slim down parquet chunk
- ensure it works without binary parquet metadata
- timestamp range is no longer optional (ensured by the NG type system)
- remove table summary: this is only needed for SOME API users. The
compactor can perfectly work without statistics since it has the
timestamp range, which is sufficient for the current overlap check (we
don't use any other primary key stats at the moment). The querier
currently does NOT use parquet chunks (they were replaced by the read
buffer), but if it uses them again in the future it will likely need to
find a way to fetch and cache the statistics.
- the schema is now provided by the API user since it can be
reconstructed using the NG catalog only (and "wrong" column orders are
tolerated as of #4921)
Ref #4124
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Query internals are not meant to be used by other crates. Only a
handful of selected interfaces should be used by IOxD and the query
tests. The compactor only used a very small subset, just to read parquet
files back into memory. It should use the official `parquet_file`
interface instead.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: store per-file column set in catalog
Together with the table-wide schema and the partition-wide sort key, this should
be everything we need to read a parquet file directly into memory
without peeking at any file-level metadata.
The querier will use this to directly load parquet files into the read
buffer.
**WARNING: This requires a catalog wipe!**
Ref #4124.
* refactor: use proper `ColumnSet` type
* fix: make ChunkOrder u64 data type to accept min sequence number 0
* fix: make ChunkOrder i64 to match with sequence number type
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: more info for i64-to-u128 panic message
* chore: Apply suggestions from code review
Co-authored-by: Dom <dom@itsallbroken.com>
* chore: fix fmt
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: TEMP Update DataFusion to pre-release
* chore: update arrow et al to 16.0.0
* chore: Run cargo hakari tasks
* fix: update reader read_dictionary API
* chore: Update to real Datafusion release
* fix: Update parquet API
* fix: update test
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
This commit changes the code base to use a new reference-counted
PartitionKey type wrapper, instead of passing a bare String around.
This allows the compiler to type-check & verify usage of the partition
key. By reference counting the underlying string, we reduce memory usage
for some use cases.
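A minimal sketch of the pattern (not the exact IOx definition):

```rust
// A reference-counted newtype: the compiler distinguishes partition keys
// from arbitrary strings, and clones are cheap refcount bumps.
use std::sync::Arc;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PartitionKey(Arc<str>);

impl From<&str> for PartitionKey {
    fn from(s: &str) -> Self {
        Self(Arc::from(s))
    }
}

fn persist(partition_key: &PartitionKey) {
    // A bare &str can no longer be passed here by accident.
    println!("persisting partition {:?}", partition_key);
}

fn main() {
    let key = PartitionKey::from("2022-09-06");
    let clone = key.clone(); // bumps the refcount; no string copy
    persist(&clone);
}
```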
* feat: add an option to compact all overlapped files no matter how large they are
* chore: Apply suggestions from code review
* feat: always compact overlapped files no matter how large they are
* chore: cleanup
Use a constructor to initialise a ParquetFileWithTombstone struct,
rather than making the fields pub.
This allows IDEs to "go to" places where this is constructed when
browsing the code, but also keeps the type closed for modification of
internals (SOLID).
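The shape of the change, sketched with illustrative fields:

```rust
// A constructor keeps the struct closed for modification while giving IDEs a
// concrete construction site to jump to.
struct ParquetFile;
struct Tombstone;

#[allow(dead_code)]
pub struct ParquetFileWithTombstone {
    data: ParquetFile,          // private: callers can't bypass invariants
    tombstones: Vec<Tombstone>, // private: only set via the constructor
}

impl ParquetFileWithTombstone {
    pub fn new(data: ParquetFile, tombstones: Vec<Tombstone>) -> Self {
        Self { data, tombstones }
    }
}

fn main() {
    let _file = ParquetFileWithTombstone::new(ParquetFile, vec![]);
}
```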
* fix: let us not compact no-data
* fix: split time must be greater than min_time, too
* fix: resolve merge conflict
* chore: increase size of a compactor job and level of concurrency
Co-authored-by: Dom <dom@itsallbroken.com>
* fix: let us not compact no-data
* fix: split time must be greater than min_time, too
* fix: resolve merge conflict
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Assert consistent metadata when evaluating candidate parquet files for
compaction.
Asserts all files have the same:
* Sequencer ID
* Namespace ID
* Table ID
* Partition ID
* Sort key
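A sketch of the assertion over the candidate set, with hypothetical struct fields:

```rust
// Every candidate file must agree with the first one on the listed metadata
// before the group is compacted together.
#[derive(Debug, PartialEq, Eq, Clone)]
struct FileMeta {
    sequencer_id: i64,
    namespace_id: i64,
    table_id: i64,
    partition_id: i64,
    sort_key: Vec<String>,
}

fn assert_consistent_metadata(files: &[FileMeta]) {
    if files.is_empty() {
        return;
    }
    let first = &files[0];
    for f in files {
        assert_eq!(f.sequencer_id, first.sequencer_id, "mixed sequencer IDs");
        assert_eq!(f.namespace_id, first.namespace_id, "mixed namespace IDs");
        assert_eq!(f.table_id, first.table_id, "mixed table IDs");
        assert_eq!(f.partition_id, first.partition_id, "mixed partition IDs");
        assert_eq!(f.sort_key, first.sort_key, "mixed sort keys");
    }
}

fn main() {
    let meta = FileMeta {
        sequencer_id: 1,
        namespace_id: 2,
        table_id: 3,
        partition_id: 4,
        sort_key: vec!["host".to_string(), "time".to_string()],
    };
    assert_consistent_metadata(&[meta.clone(), meta]);
}
```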
Changes the compaction logic to always reference the same SortKey
instance, rather than repeatedly querying for it.
The Partition metadata is always read from the catalog as part of
compact_partition(), where it previously threw away all metadata except
the sort key, which was passed into compact(). Then compact() would
always re-query the catalog to look up just the sort key again, and mix
up the two instances during use - one passed into the fn, one freshly
queried within the fn.
Now the Partition metadata is resolved in compact_partition() as it was
previously, but the entire Partition reference is passed to compact(),
and this is consistently used to access the sort key. This also removes
a catalog query per compaction call.
* refactor: split compact_partition into two functions to handle concurrency better
* feat: limit number of files to compact
* test: add test for limit num files
* chore: fix clippy
* feat: split group if over max size
* fix: split the overlapped group to limit size or file num
* chore: reduce config values
* test: add tests and clearer comments for the split_overlapped_groups and test_limit_size_and_num_files
* chore: more comments
* chore: cleanup
Changes the compactor to consume both StreamSplitExec output partitions
concurrently.
Practically speaking this means both Parquet files will be generated
concurrently, and uploaded to object store concurrently.
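A sketch of the concurrency shape, using `futures::join!` as a stand-in for driving both plan partitions; the per-partition work is stubbed out:

```rust
// Both output partitions of the split plan are processed (serialised and
// uploaded) concurrently instead of one after the other.
use futures::join;

async fn process_partition(partition: usize) -> usize {
    // Stand-in for "execute partition N of the split plan, serialise to
    // parquet, upload to object store".
    partition
}

async fn compact_both() -> (usize, usize) {
    // Both futures make progress concurrently on the same task.
    join!(process_partition(0), process_partition(1))
}

fn main() {
    let (a, b) = futures::executor::block_on(compact_both());
    assert_eq!((a, b), (0, 1));
}
```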
This commit changes the Compactor::compact() method to stream the
RecordBatch instances directly to the parquet serialiser, before being
uploaded directly to object storage.
Removes the min/max timestamp fields from the IoxMetadata proto
structure embedded within a Parquet file's metadata.
These values are redundant as they already exist within the Parquet
column statistics, and precluded streaming serialisation as these
removed min/max values were needed before serialising the file.
Remove the redundant row_count from the IoxMetadata structure that is
serialised into the Parquet file.
The reasoning is twofold:
* The Parquet file's native metadata already contains a row count
* Needing to know the number of rows up-front precludes streaming
Implements an upload() method on the ParquetStorage type, consuming a
stream of RecordBatch, serialising the Parquet file, and uploading the
result to object storage. Returns the IOx-specific file metadata.
Currently while the upload() method accepts a stream of RecordBatch, the
actual resulting Parquet file is buffered in memory before uploading to
object store, due to lack of streaming upload functionality in the
ObjectStore abstraction - this isn't the end of the world, as the files
tend to be relatively small with our current usage.
This impl should be easily modified to be fully streaming once streaming
object store puts are implemented:
https://github.com/influxdata/object_store_rs/issues/9
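A rough sketch of the upload() shape under simplifying assumptions: an iterator stands in for the RecordBatch stream, and the object-store put and the IOx-specific metadata are omitted. The whole parquet file is buffered in a Vec<u8> before upload, as described above.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn upload(
    batches: impl IntoIterator<Item = RecordBatch>,
    schema: Arc<Schema>,
) -> parquet::errors::Result<Vec<u8>> {
    // Serialise all batches into an in-memory buffer...
    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, schema, None)?;
    for batch in batches {
        writer.write(&batch)?;
    }
    writer.close()?;
    // ...then hand `buffer` to the object store put() and derive the
    // IOx-specific file metadata from it (both omitted here).
    Ok(buffer)
}

fn main() -> parquet::errors::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef],
    )
    .unwrap();
    let bytes = upload([batch], schema)?;
    assert!(!bytes.is_empty());
    Ok(())
}
```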
Changes the code paths that interact with Parquet files in the object
store to reference the ParquetStorage directly (DRY refactor).
This change takes us from a dependency graph of:
┌─────────────────┐
│ │
▼ │
Parquet Consumer │
│ ┌──────────────┐
├────────▶│ParquetStorage│
▼ └──────────────┘
┌──────────────┐
│ ObjectStore │
└──────────────┘
│
┌────┴────┐
▼ ▼
File s3
System (etc)
to:
Parquet Consumer
│
▼
┌──────────────┐
│ParquetStorage│
└──────────────┘
│
▼
┌──────────────┐
│ ObjectStore │
└──────────────┘
│
┌────┴────┐
▼ ▼
File s3
System (etc)
With the ParquetStorage being solely responsible for managing
interactions with the object store when dealing with Parquet files.
Renames the Storage type so the context is clear in usage (i.e. fn
args), rather than having to rely on knowing the fully-qualified import
path to know what the type stores.
Removes two unused constructors for a ParquetChunk, and moves the bare
fn constructor that is actually used to be an associated method (a
conventional constructor).
* refactor: require `Resource`s to be convertible to `u64`
* refactor: require `Resource`s to have a unit name
* refactor: make LRU cache IDs static
* feat: add LRU cache metrics
* docs: improve type names in LRU doctest
* docs: explain `MeasuredT`
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: explain `test_metrics`
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* ci: fix cargo deny
* chore: downgrade `socket2`, version 0.4.5 was yanked
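A hypothetical sketch of the two `Resource` requirements; the trait shape and names are illustrative, not the actual IOx cache code:

```rust
// A cache resource must be convertible to u64 (so usage can be reported as a
// metric) and must name its unit (for the metric labels).
trait Resource: Copy + Into<u64> {
    /// Unit used when exporting this resource as a metric, e.g. "bytes".
    fn unit() -> &'static str;
}

#[derive(Clone, Copy)]
struct MemoryBytes(u64);

impl From<MemoryBytes> for u64 {
    fn from(v: MemoryBytes) -> u64 {
        v.0
    }
}

impl Resource for MemoryBytes {
    fn unit() -> &'static str {
        "bytes"
    }
}

fn report_usage<R: Resource>(used: R) {
    println!("cache usage: {} {}", Into::<u64>::into(used), R::unit());
}

fn main() {
    report_usage(MemoryBytes(1024));
}
```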
* chore: rename `query` to `iox_query`
`query` is already taken on crates.io and yanked and I am getting tired
of working around that.