* feat: teach querier to use sort_key_ids
* chore: add an assert to capture bugs
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: make projection masks unsigned
* fix: buffer alignment
* feat: more precise serialization error
* refactor: make `client_util` tower helper public
This can be used for #8349 to set tracing headers.
* fix: impl `Eq` for `TimestampMinMax`
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Adds conversion functions to serialise a ParquetFile into a protobuf
representation, and back again.
Adds randomised testing to assert round-trip equality.
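A minimal sketch of that round-trip property, assuming hypothetical
`to_proto`/`from_proto` conversion functions and a pared-down
`ParquetFile` (the real catalog type has many more fields):

```rust
use proptest::prelude::*;

// Stand-in for the catalog type; illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct ParquetFile {
    object_store_id: u128,
    row_count: i64,
}

// Hypothetical generated protobuf message.
struct ProtoParquetFile {
    object_store_id: Vec<u8>,
    row_count: i64,
}

fn to_proto(f: &ParquetFile) -> ProtoParquetFile {
    ProtoParquetFile {
        object_store_id: f.object_store_id.to_be_bytes().to_vec(),
        row_count: f.row_count,
    }
}

fn from_proto(p: ProtoParquetFile) -> Result<ParquetFile, String> {
    let bytes: [u8; 16] = p
        .object_store_id
        .try_into()
        .map_err(|_| String::from("object_store_id must be 16 bytes"))?;
    Ok(ParquetFile {
        object_store_id: u128::from_be_bytes(bytes),
        row_count: p.row_count,
    })
}

proptest! {
    // Randomised round-trip: serialise, deserialise, assert equality.
    #[test]
    fn round_trips(id in any::<u128>(), rows in any::<i64>()) {
        let file = ParquetFile { object_store_id: id, row_count: rows };
        let decoded = from_proto(to_proto(&file)).expect("decode must succeed");
        prop_assert_eq!(file, decoded);
    }
}
```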
* feat: add CompactRanges RoundInfo type
* chore: insta test updates for adding CompactRange
* feat: simplify/improve ManySmallFiles logic, now that its problem set is simpler
* chore: insta test updates for ManySmallFiles improvement
* chore: upgrade files more aggressively
* chore: insta updates from more aggressive file upgrades
* chore: addressing review comments
* feat: teach compactor to use sort_key_ids instead of sort_key
* test: update the test output after chatting with Joe and understanding the reason for the changes
* chore: test changes and additions in preparation for functional changes
* feat: move vertical splitting to RoundInfo calculation, align splits to L1 files
* chore: insta test churn
* feat: detect non-linear data distribution in vertical splitting
* chore: add tests for non-linear data distribution
* chore: insta churn
* chore: cleanup & comment additions
* chore: some variable renaming
* feat: fill catalog sort_key_ids for partitions with incoming data
* test: sort_key_ids has an empty array for a newly created partition
* test: name of non-existing column
* chore: add comments to ask Andrew about the code
* chore: make comments clearer
* chore: fix a comment to avoid failure in doc
* chore: add comment for the panic if column name of sort key not found
* fix: during file import, the partition has to be created with an empty sort key first; then, after its files are created, the partition is updated with the sort key
* chore: remove no longer needed comments after the bug in build_catalog test is fixed
* chore: address review comments
* refactor: Use ColumnSet type
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* chore: fix a clippy lint
---------
Co-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
This commit implements a PartitionProvider decorator that uses a fast,
space-efficient, compressed bloom filter to probabilistically determine
whether a partition is an "old-style" row-addressed partition created
prior to #7963, or a "new-style" hash-addressed partition created after
it.
If a partition is identified as a new-style, hash-addressed partition,
the PartitionData is immediately initialised using the deterministic
hash ID without performing a catalog query at all.
If a partition is identified as an old-style, row-addressed partition, a
catalog query is performed to resolve the row ID as it would without
this filter.
A new-style, hash-addressed partition may sometimes be incorrectly
identified as a row-addressed partition, causing a spurious catalog
query, after which it is correctly identified as a hash-addressed
partition. This is tuned to happen ~0.1-1% of the time, eliminating 99%
to 99.9% of unnecessary catalog queries.
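A minimal sketch of the decision logic, assuming the filter holds the
addresses of the old-style partitions; all names here are illustrative,
not the actual IOx types:

```rust
use std::collections::HashSet;

/// Approximate-membership interface standing in for the compressed bloom
/// filter: it may return false positives, but never false negatives.
trait ApproxSet {
    fn maybe_contains(&self, key: &str) -> bool;
}

/// Exact set used only so the sketch runs; the real filter trades a small
/// false-positive rate (~0.1-1%) for far less space.
impl ApproxSet for HashSet<String> {
    fn maybe_contains(&self, key: &str) -> bool {
        self.contains(key)
    }
}

enum PartitionInit {
    /// New-style partition: initialise from the deterministic hash ID,
    /// with no catalog query at all.
    HashAddressed,
    /// Old-style partition (or a false positive): fall back to a catalog
    /// query to resolve the row ID.
    NeedsCatalogQuery,
}

/// Decision path of the decorator, with `old_style` holding the addresses
/// of all partitions created prior to #7963.
fn resolve(old_style: &impl ApproxSet, partition_addr: &str) -> PartitionInit {
    if old_style.maybe_contains(partition_addr) {
        PartitionInit::NeedsCatalogQuery
    } else {
        // No false negatives: definitely not old-style.
        PartitionInit::HashAddressed
    }
}
```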
* feat: Make parquet_file.partition_id optional in the catalog
This will acquire a short lock on the table in postgres, per:
<https://stackoverflow.com/questions/52760971/will-making-column-nullable-lock-the-table-for-reads>
This allows us to persist data for new partitions and associate the
Parquet file catalog records with the partition records using only the
partition hash ID, rather than both IDs as is done now.
* fix: Support transition partition ID in the catalog service
* fix: Use transition partition ID in import/export
This commit also removes support for the `--partition-id` flag of the
`influxdb_iox remote store get-table` command, which Andrew approved.
The `--partition-id` filter was getting the results of the catalog gRPC
service's query for Parquet files of a table and then keeping only the
files whose partition IDs matched. The gRPC query is no longer returning
the partition ID from the Parquet file table, and really, this command
should instead be using `GetParquetFilesByPartitionId` to only request
what's needed rather than filtering.
* feat: Support looking up Parquet files by either kind of Partition id
Regardless of which is actually stored on the Parquet file record.
That is, say there's a Partition in the catalog with:
    Partition {
        id: 3,
        hash_id: abcdefg,
    }
and a Parquet file that has:
    ParquetFile {
        partition_hash_id: abcdefg,
    }
calling `list_by_partition_not_to_delete(PartitionId(3))` should still
return this Parquet file because it is associated with the partition
that has ID 3.
This is important for the compactor, which is currently only dealing in
PartitionIds, and I'd like to keep it that way for now to avoid having
to change Even More in this PR.
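Roughly, the lookup treats the two IDs as interchangeable. A sketch of
the matching rule with simplified in-memory types (the real
implementation is a catalog SQL query):

```rust
#[derive(PartialEq)]
struct PartitionId(i64);
#[derive(PartialEq)]
struct PartitionHashId(String);

struct Partition {
    id: PartitionId,
    hash_id: Option<PartitionHashId>,
}

struct ParquetFile {
    partition_id: Option<PartitionId>,
    partition_hash_id: Option<PartitionHashId>,
    to_delete: bool,
}

/// Returns the live files belonging to `partition`, whether each file
/// record carries the row-addressed ID, the hash ID, or both.
fn list_by_partition_not_to_delete<'a>(
    files: &'a [ParquetFile],
    partition: &'a Partition,
) -> impl Iterator<Item = &'a ParquetFile> {
    files.iter().filter(move |f| {
        !f.to_delete
            && (f.partition_id.as_ref() == Some(&partition.id)
                // Guard against matching `None == None`.
                || (f.partition_hash_id.is_some()
                    && f.partition_hash_id == partition.hash_id))
    })
}
```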
* fix: Use and set new partition ID fields everywhere they want to be
---------
Co-authored-by: Dom <dom@itsallbroken.com>
This adds some computational overhead when merging a new namespace
schema with what's in the router's local cache, but will allow
gossiping of changes.
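A sketch of such a merge over a deliberately simplified schema (a plain
column-name to column-type map); the delta it returns is what a gossip
message would carry, and computing it is the extra work mentioned above:

```rust
use std::collections::BTreeMap;

type ColumnType = &'static str;

#[derive(Default)]
struct NamespaceSchema {
    columns: BTreeMap<String, ColumnType>,
}

/// Merges `incoming` into the cached schema, returning only the columns
/// that were actually new, i.e. the changes worth gossiping.
fn merge(
    cached: &mut NamespaceSchema,
    incoming: NamespaceSchema,
) -> Vec<(String, ColumnType)> {
    let mut delta = Vec::new();
    for (name, ty) in incoming.columns {
        if !cached.columns.contains_key(&name) {
            cached.columns.insert(name.clone(), ty);
            delta.push((name, ty));
        }
    }
    delta
}
```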
Cache the row count & timestamp min/max values within the partition FSM
/ buffer, and make them available through the Queryable trait.
This allows the PartitionData to read the row count of a buffer
(either "hot" for writes, a "snapshot" of immutable RecordBatches, or
"persisting" for in-flight persisting data).
These values will enable early partition pruning.
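A sketch of how those cached statistics might surface, with
illustrative names (the actual `Queryable` trait is shaped differently):

```rust
/// Min/max timestamps covered by some buffered data, in nanoseconds.
#[derive(Clone, Copy)]
struct TimestampMinMax {
    min: i64,
    max: i64,
}

/// Implemented by each buffer stage ("hot", "snapshot", "persisting") so
/// PartitionData can read statistics without touching the RecordBatches.
trait Queryable {
    /// Number of rows currently buffered in this stage.
    fn rows(&self) -> usize;
    /// `None` if the buffer holds no rows.
    fn timestamp_stats(&self) -> Option<TimestampMinMax>;
}

/// Early pruning: a partition whose buffered data lies entirely outside
/// the query's time range need not be scanned at all.
fn can_prune(q: &dyn Queryable, range_min: i64, range_max: i64) -> bool {
    match q.timestamp_stats() {
        Some(ts) => ts.max < range_min || ts.min > range_max,
        None => q.rows() == 0,
    }
}
```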
Time has a special meaning and can be partitioned on by the strftime
formatter. It should not be used as a tag value part in a custom
partitioning template.
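A sketch of that constraint, using hypothetical names for the template
parts:

```rust
enum TemplatePart<'a> {
    /// e.g. "%Y-%m-%d", applied to the `time` column.
    TimeFormat(&'a str),
    /// Partition by the value of an ordinary tag column.
    TagValue(&'a str),
}

/// Rejects templates that try to use `time` as a tag value part.
fn validate(parts: &[TemplatePart<'_>]) -> Result<(), String> {
    for part in parts {
        if let TemplatePart::TagValue(name) = part {
            if *name == "time" {
                return Err(String::from(
                    "`time` is reserved for the strftime formatter and \
                     cannot be used as a tag value part",
                ));
            }
        }
    }
    Ok(())
}
```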
There are a bunch of dependencies in `Cargo.lock` that are related to
mysql. These are NOT compiled at all, and are also not part of `cargo
tree`. The reason for their inclusion is a bug in cargo:
https://github.com/rust-lang/cargo/issues/10801
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Although callers could manually extend the sequence number set by continually
adding in an iterator loop or a fold expression, this enables other
combinator patterns when dealing with collections of sequence number
sets.
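For example, with `FromIterator`/`Extend` style impls a whole collection
of sets can be merged by a single combinator. A sketch using a
simplified, `BTreeSet`-backed stand-in for the real type:

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct SequenceNumber(u64);

#[derive(Debug, Default)]
struct SequenceNumberSet(BTreeSet<SequenceNumber>);

impl Extend<SequenceNumber> for SequenceNumberSet {
    fn extend<T: IntoIterator<Item = SequenceNumber>>(&mut self, iter: T) {
        self.0.extend(iter);
    }
}

impl FromIterator<SequenceNumber> for SequenceNumberSet {
    fn from_iter<T: IntoIterator<Item = SequenceNumber>>(iter: T) -> Self {
        Self(iter.into_iter().collect())
    }
}

/// Collections of sets now merge with one combinator instead of an
/// explicit loop or fold.
fn merge_all(sets: impl IntoIterator<Item = SequenceNumberSet>) -> SequenceNumberSet {
    sets.into_iter().flat_map(|s| s.0).collect()
}
```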
Now that sequence numbers are internal to the ingester and the WAL,
there's no need for them to be a signed integer. As noted in
[#7260](https://github.com/influxdata/influxdb_iox/issues/7260), this
was a quirk of the kafka-based IOx, as Postgres only supports signed
integers.