* feat(idpe-17789): scheduler job_status() (#8121)
This block of work moves into the scheduler some of the specific downstream actions affiliated with compaction outcomes. Whether a responsibility stays in the compactor or moves to the scheduler roughly followed this heuristic: (a) does the action affect global catalog state (i.e. commits and partition skipping), (b) is the logging affiliated with compactor health (e.g. PartitionDoneSink logging outcomes) or with system health (e.g. logging commits), and (c) does it report to the scheduler any errors encountered during compaction. This boundary is subject to change as we move forward.
Also, a noted caveat (TODO) on this commit: we have a CompactionJob which is used to track work handed off to each compactor. Currently it still uses the partition_id for tracking, but the follow-up PR will start making the compactor more aware of the CompactionJob uuid.
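For orientation, a rough sketch of the shape this gives the scheduler API. The names, fields, and synchronous signatures below are illustrative assumptions, not the actual IOx traits:
```rust
use uuid::Uuid;

/// Illustrative only: a job handed from the scheduler to a compactor.
/// The real type still tracks by partition_id (see the TODO above); the
/// uuid is what the follow-up PR starts leaning on.
#[derive(Debug, Clone)]
pub struct CompactionJob {
    pub uuid: Uuid,
    pub partition_id: i64,
}

/// Outcomes the compactor reports back. Commits and partition skipping are
/// acted on by the scheduler because they touch global catalog state.
#[derive(Debug)]
pub enum JobStatus {
    /// Progress update, e.g. files committed so far.
    Update,
    /// An error encountered during compaction, reported to the scheduler.
    Error(String),
    /// Request that the partition be skipped.
    RequestSkip(String),
}

pub trait Scheduler: Send + Sync {
    /// Multi-use reporting while a job is running.
    fn update_job_status(&self, job: CompactionJob, status: JobStatus) -> Result<(), String>;
    /// Single-use call when a job finishes, which also releases the
    /// partition from uniqueness tracking so it becomes available again.
    fn end_job(&self, job: CompactionJob) -> Result<(), String>;
}
```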
* fix(idpe-17789): need to remove partition from uniqueness tracking, so it becomes available again
* refactor(idpe-17789): split up the single-use end_job() from the multi-use update_job_status()
* feat(idpe-17789): Commit is now a scheduler trait, only used externally in the compactor_test_utils
* feat(idpe-17789): Propagate errors pertaining to commit, in both the scheduler and the compactor.
* feat(idpe-17789): PartitionDoneSink should have different crate-private traits for scheduler versus compactor.
* feat(idpe-17789): PartitionDoneSink should propagate errors
* test(idpe-17789): integration tests suite
* test(idpe-17789): test documenting what skip request does (as outcome)
* refactor(idpe-17789): make the validation of the upgrade commit, versus the replacement commit, more explicit.
* feat(idpe-17789): switch to using parking_lot Mutex within the scheduler
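A minimal sketch of the parking_lot switch together with the uniqueness-tracking release mentioned above; the type and method names are made up for illustration:
```rust
use parking_lot::Mutex;
use std::collections::HashSet;

/// Scheduler-internal state guarded by a parking_lot::Mutex. Unlike
/// std::sync::Mutex, lock() does not return a Result (no poisoning), so the
/// hot path avoids `.unwrap()` noise.
#[derive(Debug, Default)]
struct InFlight {
    partitions: Mutex<HashSet<i64>>,
}

impl InFlight {
    /// Returns true if the partition was not already being compacted.
    fn try_claim(&self, partition_id: i64) -> bool {
        self.partitions.lock().insert(partition_id)
    }

    /// Release the partition so it becomes available again
    /// (cf. the uniqueness-tracking fix above).
    fn release(&self, partition_id: i64) {
        self.partitions.lock().remove(&partition_id);
    }
}
```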
When a long-running query is in progress and the querier is shutting
down, it might happen that the executor (= thread pool and tokio
executor responsible for the CPU-bound DataFusion execution) is shut
down while the query is running. From a "systems interaction" PoV I
think this is totally fine and I would like to avoid some weird
ref-counting. Or in other words: if the system is shutting down, shut it
down.
However, the error was treated as "internal", which is not useful. The
client should rather be informed that its server was gone and that it is
OK (and desired) to retry. So as per
<https://grpc.github.io/grpc/core/md_doc_statuscodes.html> I think this
should signal "unavailable".
This change wires the error code in such a way that the gRPC service
layer can properly inspect it and then changes the error mapping.
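A hedged sketch of what that wiring can look like at the gRPC service layer, using illustrative error variants rather than the real IOx types:
```rust
use tonic::Status;

/// Illustrative error type: the executor reports that it was shut down
/// while a query was still running (not the actual IOx error enum).
#[derive(Debug)]
enum QueryError {
    /// The DataFusion executor was shut down mid-query.
    ExecutorShuttingDown,
    /// Any other failure.
    Internal(String),
}

/// Map the error at the service layer: a shutdown is not an internal bug,
/// it is "unavailable" and the client should retry, per
/// https://grpc.github.io/grpc/core/md_doc_statuscodes.html.
fn to_grpc_status(e: &QueryError) -> Status {
    match e {
        QueryError::ExecutorShuttingDown => {
            Status::unavailable("server is shutting down, please retry")
        }
        QueryError::Internal(msg) => Status::internal(msg.clone()),
    }
}
```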
Ref https://github.com/influxdata/idpe/issues/17917.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The ingester can project arbitrary columns at query time, and has no
special requirement that the "time" column be part of that projection.
Because the timestamp summary generation explicitly requires the time
column to exist, it panics when there's no "time" column in the
projection - this is a bit of a modelling mismatch more than anything.
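One possible shape of the fix, sketched with plain Arrow types; the function name and return type are assumptions, not the actual ingester code:
```rust
use arrow::array::TimestampNanosecondArray;
use arrow::record_batch::RecordBatch;

/// Only build a timestamp summary when the projection actually contains the
/// "time" column, instead of panicking when it is absent.
fn timestamp_min_max(batch: &RecordBatch) -> Option<(i64, i64)> {
    // Look the column up by name; a projection without "time" yields None.
    let idx = batch.schema().index_of("time").ok()?;
    let col = batch
        .column(idx)
        .as_any()
        .downcast_ref::<TimestampNanosecondArray>()?;
    let mut min = i64::MAX;
    let mut max = i64::MIN;
    for v in col.iter().flatten() {
        min = min.min(v);
        max = max.max(v);
    }
    (min <= max).then_some((min, max))
}
```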
Similar to #8109.
This was once implemented by the RUB but as it stands right now, no
chunk implements this anymore.
If we ever want to bring this back, we should use the output of
`QueryChunk::data` instead (i.e. use a data-based implementation instead
of a per-chunk one).
Closes #8096.
This interface was once specially implemented by the RUB. The only
actual implementation of it is within the querier, which just forwards it
to a simple schema scan. Lift this semantic to `iox_query_influxrpc`
instead so all the chunks can use it.
If we ever want to optimize this again, we should use `QueryChunk::data`
instead (i.e. instead of implementing it within the chunk it should use
the data method and do something smart based on that).
First half of #8096.
Do not (ab)use per-chunk delete predicates for the retention policy.
Instead use a per-table predicate.
This makes the code way cleaner, since the scoping is correct (i.e.
delete predicates are a table-wide attribute, not a chunk-based one) and
it is consistent with the time predicates that the user provides (e.g. via
`WHERE time > x`).
It also allows us to remove delete predicates (in their current,
non-scalable form) from the query path. A potential future version would
likely not use per-chunk predicates (and "is processed" markers) but would
instead use the timestamp / chunk order to determine which data the
predicate should be applied to.
Note that the lowering of the retention policy changed slightly from
```text
(time > (now() - retention)) AND (time < MAX)
```
to
```text
time > (now() - retention)
```
The `MAX` cut was just an artifact of the lowering and was unnecessary.
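As a sketch of the new lowering in DataFusion terms, with the cutoff shown as a bare i64 nanosecond value for brevity (the helper name here is made up):
```rust
use datafusion::prelude::{col, lit, Expr};

/// Single per-table retention predicate: `time > now() - retention`.
/// Note there is no upper `time < MAX` bound anymore.
fn retention_predicate(now_ns: i64, retention_ns: i64) -> Expr {
    let cutoff_ns = now_ns - retention_ns;
    col("time").gt(lit(cutoff_ns))
}
```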
Closes #7409.
Closes #7410.
* feat: provide convenience methods to create a Scheduler, and keep the scheduler implementations crate-private. External crates can only create a Scheduler based upon configs (see the sketch at the end of this list).
* feat: provide the Scheduler as a component to the compactor. Specifically, the scheduler configs are present within the compactor run config, and the scheduler is created within the compactor hardcoded components.
* feat: within the compactor ScheduledPartitionsSource, utilize the dyn Scheduler and Scheduler.get_jobs()
* feat: CompactionJob should be per partition, and have a uniqueness characteristic independent of the partition
* feat: keep compactor_scheduler separate from clap_blocks. Only interface is within ioxd_compactor where the CLI configs are transformed into ShardConfig and PartitionsSourceConfig.
* chore: make IdOnlyPartitionFilter pub(crate) only
* chore: update scheduler display to include any report information (a.k.a. shard_config, if present)
* chore: adjust with_max_num_files_per_plan to a more common setting
This significantly increases write amplification (see change in `written` at the conclusion of the cases)
* fix: compactor looping with unproductive compactions
* chore: formatting cleanup
* chore: fix typo in comment
* chore: add test case that compacts too many files at once
* fix: enforce max file count for compaction
* chore: insta churn from prior commit
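A rough sketch of the "create only from configs" pattern described in the first few items above; all type names are simplified stand-ins, not the real compactor_scheduler API:
```rust
use std::sync::Arc;

/// Hypothetical config types standing in for the real ShardConfig /
/// PartitionsSourceConfig carried on the compactor run config.
#[derive(Debug, Clone, Default)]
pub struct ShardConfig {
    pub n_shards: usize,
    pub shard_id: usize,
}

#[derive(Debug, Clone, Default)]
pub struct SchedulerConfig {
    pub shard_config: Option<ShardConfig>,
}

/// The only things external crates see: a trait object plus a constructor.
pub trait Scheduler: std::fmt::Debug + Send + Sync {
    // get_jobs(), update_job_status(), end_job(), ...
}

/// Crate-private implementation; never exported directly.
#[derive(Debug)]
struct LocalScheduler {
    config: SchedulerConfig,
}

impl Scheduler for LocalScheduler {}

/// Convenience constructor: external crates can only obtain a Scheduler
/// through configs, never by naming the concrete type.
pub fn create_scheduler(config: SchedulerConfig) -> Arc<dyn Scheduler> {
    Arc::new(LocalScheduler { config })
}
```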
---------
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This is purely a movement of code, not any definition of the interface methods yet. At best, it further solidifies the boundary between the partitions_source implementations that belong within the scheduler versus those within the compactor.
This will hold the deterministic ID for partitions.
Until all existing partitions have this value, this is optional/nullable.
The row ID still exists and is used as the main foreign key in the
parquet_file and skipped_compaction tables.
The hash_id has a unique index so that we can look up records based on
it (if it's available).
If the parquet file record has a partition_hash_id value, use that to
generate the object storage path instead of the partition_id.
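Sketched with hypothetical names, the path selection described above looks roughly like this:
```rust
/// Build the partition segment of the object store path from the
/// deterministic hash ID when the parquet file record carries one, falling
/// back to the numeric catalog row ID for partitions that predate it.
fn partition_path_segment(partition_id: i64, partition_hash_id: Option<&str>) -> String {
    match partition_hash_id {
        Some(hash_id) => hash_id.to_string(),
        None => partition_id.to_string(),
    }
}

// e.g. ".../<namespace>/<table>/<partition_path_segment>/<uuid>.parquet"
```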
* chore: delineate scheduler logic boundary in code comments
* refactor: move id_only_partition_filter mod into local scheduler
* chore: add docs for each IdOnlyPartitionFilter implementation
* refactor: make compactor_scheduler crate
* refactor: move PartitionsSource into the compactor_scheduler
The compactor currently uses PartitionsSource in two ways:
* for the preparation of PartitionIds prior to the compactor pipeline.
* for the abstractions which utilize the PartitionIds during the IO pipeline.
This commit is a refactoring to enable us to delineate between these two utilizations.
The former (preparation) utilization will now be done in the compactor_scheduler.
Since the compactor is dependent on the compactor_scheduler, it made sense to move the trait to the scheduler.
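A simplified, synchronous sketch of that split; the real trait is async and the names below are not exact:
```rust
use std::fmt::Debug;

/// The "preparation" side that moved into compactor_scheduler: decide which
/// partitions should be compacted next.
pub trait PartitionsSource: Debug + Send + Sync {
    /// Fetch the next batch of partition IDs to hand to the compactor.
    fn fetch(&self) -> Vec<i64>;
}

/// The compactor-side "IO pipeline" use keeps its own abstraction and simply
/// consumes the IDs the scheduler produced.
#[derive(Debug)]
pub struct ScheduledPartitionsSource<S: PartitionsSource> {
    scheduler_source: S,
}

impl<S: PartitionsSource> ScheduledPartitionsSource<S> {
    pub fn partitions(&self) -> Vec<i64> {
        self.scheduler_source.fetch()
    }
}
```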
This adds 4 small test cases intended to test how compaction decisions affect the final size of L1/L2 files.
The assumption is that when a steady stream of small L0 files is arriving, the compactor needs to be rewriting L1s so they grow to a reasonable size instead of being left small.
* feat(garbage-collector): batch parquet existence checks to catalog
The core feature of this PR is batching the existence checks of parquet
files in object store against the catalog. Before, there was one catalog
query per parquet file in object store. This can be a lot of
requests.
This PR instead performs a single catalog query for batches of at most 100
parquet file uuids. A hundred seems like a decent starting place.
The batch may not reach 100 because there is also a timeout on receiving
object store meta objects from the object store lister thread. That
timeout is set to 100 milliseconds. If more than 100 are received, they
are batched into groups of 100 for the catalog.
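A hedged sketch of that batching loop; `catalog_contains`, the channel wiring, and the constants' names are hypothetical stand-ins for the real garbage-collector code:
```rust
use std::time::Duration;
use tokio::sync::mpsc::Receiver;
use tokio::time::timeout;
use uuid::Uuid;

const MAX_BATCH: usize = 100;
const RECV_TIMEOUT: Duration = Duration::from_millis(100);

/// Collect up to 100 parquet file uuids from the object store lister, or
/// whatever has arrived when the 100 ms receive timeout fires, then issue a
/// single catalog existence query for the whole batch.
async fn check_batches(mut rx: Receiver<Uuid>) {
    let mut done = false;
    while !done {
        let mut batch = Vec::with_capacity(MAX_BATCH);
        while batch.len() < MAX_BATCH {
            match timeout(RECV_TIMEOUT, rx.recv()).await {
                Ok(Some(uuid)) => batch.push(uuid),
                Ok(None) => {
                    done = true; // lister thread finished / shut down
                    break;
                }
                Err(_) => break, // timeout: flush whatever we have so far
            }
        }
        if !batch.is_empty() {
            // One catalog query for the whole batch instead of one per file.
            let existing = catalog_contains(&batch).await;
            // ...uuids in `batch` but not in `existing` are deletion candidates...
            let _ = existing;
        }
    }
}

/// Hypothetical stand-in for the batched catalog query.
async fn catalog_contains(_uuids: &[Uuid]) -> Vec<Uuid> {
    unimplemented!("placeholder for the real catalog call")
}
```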
Additionally, this PR includes surrounding code changes to make it more
idiomatic (but not perfect). It follows up some suggested work from
#7652 for watching for shutdown on the threads.
* fixes #7784
* use hashset instead of vec to test for contains
* chore: add test for db failure path
* remove ParquetFileExistsByOSID and other single-field structs that are
just for sql deserialization; map to uuid explicitly
* fix the sqlite query by using a blob literal X'<hex>' for uuids (see the sketch after this list)
* comment clarifications
* adjust logging from debug to warn for expected rare events
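A tiny, hypothetical helper illustrating the blob-literal fix for the sqlite query:
```rust
use uuid::Uuid;

/// Uuids are compared as blobs in the sqlite query, so the batched
/// `IN (...)` list needs blob literals of the form X'<hex>' rather than
/// string literals.
fn uuid_blob_literal(id: &Uuid) -> String {
    // `simple()` renders the uuid as 32 hex characters without hyphens.
    format!("X'{}'", id.simple())
}

// e.g. WHERE object_store_id IN (X'67e5504410b1426f9247bb680e5fe0c8', ...)
```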
Many thanks to Carol for help implementing this!
Nothing gets the partition ID out of the metadata. The parts of the code
interacting with object storage that need the ID to create the object
store path were using the partition ID from the metadata out of
convenience, but I changed those places to pass in the partition ID in a
separate argument instead.
This will make the transition to deterministic partition IDs a bit
smoother.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This is the major part of #7470. Additional clean ups (e.g. to remove
the actual types from `data_types`) will follow.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>