* chore: Update DataFusion pin
* chore: Update for new API
* fix: Update for API
* fix: update compactor test
* fix: Update to patched version of arrow 46.0.0
* fix: map `DataFusionError::Configuration` to an internal error
* fix: do not use deprecated API
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix(compactor): prevent sort order mismatches from creating overlapping regions
* chore: test additions for incorrectly created regions
* fix(compactor): more sort order mismatch fixes
* chore: insta updates
* chore: insta updates after merge
Optionally initialise the gossip subsystem in the compactor.
This will cause the compactor to perform PEX and join the cluster, but
as it registers no topic interests, it will not receive any
application-level payloads.
No messages are currently sent (in fact, gossip shuts down immediately).
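As a rough sketch of the shape this takes (stand-in `GossipConfig`/`GossipHandle` types for illustration only, not the real gossip crate API):

```rust
// Hypothetical stand-ins: the actual gossip subsystem has its own types and
// constructors; this only shows "optionally initialise, register no topics".

/// Hypothetical gossip configuration (seed peers to exchange with).
struct GossipConfig {
    seed_peers: Vec<String>,
}

/// Hypothetical handle; holding it keeps the node participating in PEX.
struct GossipHandle {
    _seeds: Vec<String>,
}

impl GossipHandle {
    fn join(config: GossipConfig) -> Self {
        // Join the cluster via peer exchange. Because no topic interests are
        // registered, application-level payloads are never delivered here.
        Self { _seeds: config.seed_peers }
    }
}

/// Compactor setup: gossip is strictly optional.
fn init_gossip(config: Option<GossipConfig>) -> Option<GossipHandle> {
    config.map(GossipHandle::join)
}

fn main() {
    // With no gossip config, the compactor runs exactly as before.
    assert!(init_gossip(None).is_none());
}
```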
* chore: add test case for L0 added after vertical splitting
* feat: use recurring L0 end time as hint for split times
* chore: insta test updates
* chore: add split time verification to simulator
* feat: add CompactRanges RoundInfo type
* chore: insta test updates for adding CompactRange
* feat: simplify/improve ManySmallFiles logic, now that its problem set is simpler
* chore: insta test updates for ManySmallFiles improvement
* chore: upgrade files more aggressively
* chore: insta updates from more aggressive file upgrades
* chore: addressing review comments
* feat: teach compactor to use sort_key_ids instead of sort_key
* test: update the test output after chatting with Joe and learning the reason for the changes
* chore: test changes and additions in preparation for functional changes
* feat: move vertical splitting to RoundInfo calculation, align splits to L1 files
* chore: insta test churn
* feat: detect non-linear data distribution in vertical splitting
* chore: add tests for non-linear data distribution
* chore: insta churn
* chore: cleanup & comment additions
* chore: some variable renaming
* feat: add tracking of why bytes are written in simulator
* chore: enable breakdown of why bytes are written in a few larger tests
* chore: enable writes breakdown in another test
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* `catalog.get_in_skipped_compaction()` should handle multiple partitions
* add the ability to perform transformations on sets of partitions (rather than filtering one by one), starting with the transformation that removes skipped partitions in the scheduler.
* move the env var and CLI flag setting for when to ignore skipped partitions to the scheduler config.
* rename PartitionDoneSink to CompactionJobSink and change the trait signature
* update all trait implementations, including local variables and comments
* rename partition_done_sink in the components and driver to compaction_job_done_sink
Also rename PartitionInfo's transition_partition_id to be partition_id
so that it's consistent with the QueryChunk method. We might want to
rename the partition_id field to catalog_partition_id, but for now I
think the types will make compactor usage clear enough.
This gets the compactor to compile and pass its tests.
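A minimal sketch of what the two renames above amount to, using stand-in types and a simplified signature rather than the real compactor definitions:

```rust
// Stand-ins for illustration only; the real types use newtype IDs and an
// async trait with a richer error type.

/// Formerly carried `transition_partition_id`; now `partition_id`, matching
/// the QueryChunk accessor name.
#[derive(Debug, Clone)]
struct PartitionInfo {
    partition_id: i64,
}

/// Formerly `PartitionDoneSink`: the unit of work recorded is a compaction
/// job, not a bare partition, hence the rename and the signature change.
trait CompactionJobSink {
    fn record(&self, job_partition: &PartitionInfo, outcome: Result<(), String>);
}

/// A trivial implementation, e.g. for tests.
#[derive(Debug, Default)]
struct MockSink;

impl CompactionJobSink for MockSink {
    fn record(&self, job_partition: &PartitionInfo, outcome: Result<(), String>) {
        println!("job for partition {} finished: {outcome:?}", job_partition.partition_id);
    }
}

fn main() {
    let sink = MockSink;
    sink.record(&PartitionInfo { partition_id: 3 }, Ok(()));
}
```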
* fix: selectively merge L1 to L2 when L0s still exist
* fix: avoid grouping files that undo previous splits
* chore: add test case for new fixes
* chore: insta test churn
* chore: lint cleanup
* feat: Make parquet_file.partition_id optional in the catalog
This will acquire a short lock on the table in postgres, per:
<https://stackoverflow.com/questions/52760971/will-making-column-nullable-lock-the-table-for-reads>
This allows us to persist data for new partitions and associate the
Parquet file catalog records with the partition records using only the
partition hash ID rather than both IDs, as is done now.
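A sketch of the resulting record shape, using stand-in types (the real catalog types live in `data_types` and use newtype IDs rather than raw integers and strings):

```rust
/// Catalog row for a Parquet file after the migration: `partition_id` becomes
/// nullable so new files can be associated with their partition purely via
/// the partition hash ID.
#[derive(Debug, Clone)]
struct ParquetFileRecord {
    partition_id: Option<i64>,         // NULL-able column after the migration
    partition_hash_id: Option<String>, // present for partitions created with a hash ID
}

fn main() {
    // A newly persisted file only needs the hash ID to link to its partition.
    let new_file = ParquetFileRecord {
        partition_id: None,
        partition_hash_id: Some("abcdefg".to_string()),
    };
    println!("{new_file:?}");
}
```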
* fix: Support transition partition ID in the catalog service
* fix: Use transition partition ID in import/export
This commit also removes support for the `--partition-id` flag of the
`influxdb_iox remote store get-table` command, which Andrew approved.
The `--partition-id` filter was getting the results of the catalog gRPC
service's query for Parquet files of a table and then keeping only the
files whose partition IDs matched. The gRPC query is no longer returning
the partition ID from the Parquet file table, and really, this command
should instead be using `GetParquetFilesByPartitionId` to only request
what's needed rather than filtering.
* feat: Support looking up Parquet files by either kind of Partition id
Regardless of which is actually stored on the Parquet file record.
That is, say there's a Partition in the catalog with:
    Partition {
        id: 3,
        hash_id: abcdefg,
    }
and a Parquet file that has:
    ParquetFile {
        partition_hash_id: abcdefg,
    }
then calling `list_by_partition_not_to_delete(PartitionId(3))` should still
return this Parquet file because it is associated with the partition
that has ID 3.
This is important for the compactor, which is currently only dealing in
PartitionIds, and I'd like to keep it that way for now to avoid having
to change Even More in this PR.
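A minimal sketch of the matching rule this describes, with stand-in types (the real catalog uses `PartitionId`/hash-ID newtypes and SQL queries rather than an in-memory filter):

```rust
#[derive(Debug, Clone)]
struct Partition {
    id: i64,
    hash_id: Option<String>,
}

#[derive(Debug, Clone)]
struct ParquetFile {
    partition_id: Option<i64>,
    partition_hash_id: Option<String>,
}

/// A file belongs to the partition if *either* identifier matches, regardless
/// of which one is actually stored on the Parquet file record.
fn belongs_to(file: &ParquetFile, partition: &Partition) -> bool {
    file.partition_id == Some(partition.id)
        || (file.partition_hash_id.is_some() && file.partition_hash_id == partition.hash_id)
}

fn main() {
    let partition = Partition { id: 3, hash_id: Some("abcdefg".into()) };
    // Stored with only the hash ID, yet still found when looked up by ID 3.
    let file = ParquetFile { partition_id: None, partition_hash_id: Some("abcdefg".into()) };
    assert!(belongs_to(&file, &partition));
}
```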
* fix: Use and set new partition ID fields everywhere they want to be
---------
Co-authored-by: Dom <dom@itsallbroken.com>