* refactor: rename files and function to remove target level
* chore: update a comment
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: document and test split_percentage and percentage_max_file_size
* fix: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* chore: add test with both max file size and split percentage
* docs: whitespace engineering and small typo
---------
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* feat: initial implementation of the split
* feat: split many L0 files in groups and compact them into new and fewer L0 files
* test: remove inappropriate AllAtOnce test
* refactor: move file classification for initial target to its own function
* fix: pop the branch from start to end
* chore: address review comments
* feat: support splitting to many L1 files
* feat: only add an extra round to compact level-n files into same-level-n files if they plus the overlapped level-n-plus-1 files are over the limit
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
* chore: final cleanup and address comments
* chore: run fmt
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: Split layout tests into their own module
* feat: Add more tests, improve sizes shown in the simulator run display
* fix: Apply suggestions from code review
Co-authored-by: Nga Tran <nga-tran@live.com>
* fix: fix comment wording
* fix: reporting order of skipped compactions
* chore: Run cargo hakari tasks
* fix: revert changes to Cargo.lock
* fix: revert workspace hack change
---------
Co-authored-by: Nga Tran <nga-tran@live.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
* refactor: move ParquetFileSimulator to compactor2_test_utils
* chore: Test with new algorithm + update display
* chore: Updates
* chore: Update setting to match prod
* refactor: extract `FileClassifier` component
Make the driver slightly smaller. Also makes the "all-in-one" mode
easier to understand.
* docs: add some
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: extract compactor2 test utils into `compactor2_test_utils` and integration test
* fix: Update compactor2/src/components/mod.rs
Co-authored-by: Marco Neumann <marco@crepererum.net>
---------
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This commit adds initial support for "soft" namespace deletion, where
the actual records & data remain, but are no longer queryable /
writeable.
Soft deletion is eventually consistent - users can expect to continue
writing to and reading from a bucket after issuing a soft delete call,
until the various components either restart, or have their caches
flushed.
The components treat soft-deleted namespaces differently:
* router: ignore soft deleted namespaces
* ingester: accept soft deleted namespaces
* compactor: accept soft deleted namespaces
* querier: ignore soft deleted namespaces
* various gRPC services: ignore soft deleted namespaces
This ensures that the ingester & compactor do not see rows "vanishing"
from the database, and continue to make forward progress.
Writes for the deleted namespace that are buffered in the ingester will
be persisted as normal, allowing us to support "un-delete" operations
where the system is restored to the state at which the delete was
issued (rather than losing the buffered data).
Follow-on work is required to ensure GC drops the orphaned parquet files
after the configured GC time, and optimisations such as not compacting
parquet from soft-deleted namespaces seem like a trivial win.
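To make the ignore/accept split concrete, here is a minimal sketch of the visibility rule the ignoring components could apply, assuming a `deleted_at` marker; the type and field names are illustrative, not the actual IOx catalog schema:

```rust
// Hypothetical sketch: a router/querier-side visibility check. `Namespace`
// and `deleted_at` are assumed names, not the actual IOx catalog schema.
#[derive(Debug)]
struct Namespace {
    name: String,
    /// Soft-delete marker; `None` means the namespace is live.
    deleted_at: Option<i64>,
}

/// Keep only namespaces that should be visible to reads and writes.
fn visible(namespaces: Vec<Namespace>) -> Vec<Namespace> {
    namespaces
        .into_iter()
        .filter(|ns| ns.deleted_at.is_none())
        .collect()
}

fn main() {
    let all = vec![
        Namespace { name: "live".into(), deleted_at: None },
        Namespace { name: "gone".into(), deleted_at: Some(1_675_000_000) },
    ];
    let v = visible(all);
    assert_eq!(v.len(), 1);
    assert_eq!(v[0].name, "live");
}
```

Keeping the rows and only filtering on read is what makes the un-delete path cheap: visibility is restored by clearing the marker.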
* refactor: `PartitionInfoSource`
Clean up the driver code a bit. There is certainly a good point in
having all these three sources (partition, table, namespace) separate,
but the driver doesn't really need to know that. In the end, it just
wants to have a `PartitionInfo` instance.
* docs: typo
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
* refactor: introduce IR before creating actual DF plan
Let's have an IR that presents a machine-readable form of what the
output files may look like.
* docs: improve
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
* feat: also log plan type
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
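A minimal sketch of what such an IR could look like, including the plan type used for logging; `PlanIR`, `FileIR`, and their variants are assumptions for illustration, not the actual compactor2 types:

```rust
// Illustrative only: `PlanIR` / `FileIR` and their fields are assumptions,
// not the actual compactor2 types.
#[derive(Debug, Clone)]
struct FileIR {
    object_store_id: u128,
    min_time: i64,
    max_time: i64,
}

/// Machine-readable description of what a compaction plan will produce,
/// decided *before* the DataFusion plan is built.
#[derive(Debug, Clone)]
enum PlanIR {
    /// Compact all inputs into a single output file.
    Compact { files: Vec<FileIR> },
    /// Compact the inputs and split the result at the given times.
    Split { files: Vec<FileIR>, split_times: Vec<i64> },
}

impl PlanIR {
    /// Plan type as a string, e.g. for the "also log plan type" feature.
    fn plan_type(&self) -> &'static str {
        match self {
            Self::Compact { .. } => "compact",
            Self::Split { .. } => "split",
        }
    }
}
```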
* test: allow testing the compactor w/o any real data
Things that are missing:
- output files have nondeterministic IDs which interfere w/ snapshot
testing. We should probably normalize the IDs somehow.
- time ranges of output files are not captured correctly (because the
mock sink doesn't know how to calculate them)
* fix: Add output assertion
* fix: fmt
* docs: improve
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
* fix: fmt
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
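One way to address the nondeterministic-ID problem from the test notes above is a normalization map that assigns stable IDs in order of appearance; this is a hypothetical sketch, not the solution the codebase settled on:

```rust
use std::collections::HashMap;

/// Maps nondeterministic IDs to stable, order-of-appearance counters so
/// snapshot tests produce identical output across runs.
#[derive(Default)]
struct IdNormalizer {
    mapping: HashMap<u64, u64>,
}

impl IdNormalizer {
    /// Return a stable ID for `real`, assigning the next counter value on
    /// first sight.
    fn normalize(&mut self, real: u64) -> u64 {
        let next = self.mapping.len() as u64 + 1;
        *self.mapping.entry(real).or_insert(next)
    }
}

fn main() {
    let mut n = IdNormalizer::default();
    assert_eq!(n.normalize(4_711), 1);
    assert_eq!(n.normalize(42), 2);
    assert_eq!(n.normalize(4_711), 1); // same input, same normalized ID
}
```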
* feat: `PartitionRepo::list_ids`
* refactor: `CatalogPartitionsSource` => `CatalogToCompactPartitionsSource`
* feat: allow the compactor to process all known partitions
Closes #6648.
* docs: improve
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
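A plausible (simplified) shape for the new catalog method; the exact IOx trait definition differs, so treat this as an assumption-laden sketch:

```rust
// Simplified sketch; the real IOx `PartitionRepo` trait differs.
use async_trait::async_trait;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PartitionId(i64);

#[derive(Debug)]
pub struct CatalogError(String);

#[async_trait]
pub trait PartitionRepo {
    /// List the IDs of ALL partitions known to the catalog, enabling the
    /// compactor to consider every partition rather than only recent ones.
    async fn list_ids(&mut self) -> Result<Vec<PartitionId>, CatalogError>;
}
```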
- do not wait for a non-empty partition result (this doesn't make sense
if we are not running endlessly)
- modify entry point to allow the compactor to exit on its own (this is
normally not allowed for other server types)
Ignore partitions that were throttled or filtered due to the "not
unique" combo.
This is in line w/ the "partitions source", so the metrics for "partition
in" and "partition out" line up.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The maximum number of tables is part of the Namespace, which is already
loaded in its entirety. This commit copies the value into the
NamespaceSchema, making it available for the router to utilise.
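Roughly, the change amounts to carrying the limit along when the schema is built; the field names in this sketch are assumptions:

```rust
// Field names are assumptions; the point is that the limit is copied at
// schema-load time so the router needs no extra catalog lookup.
struct Namespace {
    id: i64,
    max_tables: i32,
}

struct NamespaceSchema {
    id: i64,
    /// Copied from `Namespace::max_tables` when the schema is loaded.
    max_tables: i32,
    // ... per-table schemas elided ...
}

impl NamespaceSchema {
    fn new(ns: &Namespace) -> Self {
        Self { id: ns.id, max_tables: ns.max_tables }
    }

    /// Router-side check before admitting a write that creates a new table.
    fn can_create_table(&self, current_table_count: usize) -> bool {
        current_table_count < self.max_tables as usize
    }
}
```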
* feat: add more information to commit logging
* feat: rework commit metrics
- more consistent metric names
- histograms per file and per job
- more histogram types (number of files, rows, bytes)
This is NOT used yet but will greatly help w/ logging and metrics. E.g.
it allows us to count rows and bytes of in/out-flow, create per-file
histograms of bytes/rows, and more.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
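A rough sketch of the per-file and per-job bookkeeping that could feed such histograms (plain `Vec`s stand in for real histogram instruments; all names are illustrative):

```rust
// Illustrative bookkeeping; `Vec`s stand in for real histogram metrics.
#[derive(Debug)]
struct FileInfo {
    rows: u64,
    bytes: u64,
}

#[derive(Debug, Default)]
struct CommitHistograms {
    // Per-job distributions.
    files_per_job: Vec<u64>,
    rows_per_job: Vec<u64>,
    bytes_per_job: Vec<u64>,
    // Per-file distributions.
    rows_per_file: Vec<u64>,
    bytes_per_file: Vec<u64>,
}

impl CommitHistograms {
    /// Record one commit (= one compaction job) worth of output files.
    fn record_job(&mut self, files: &[FileInfo]) {
        self.files_per_job.push(files.len() as u64);
        self.rows_per_job.push(files.iter().map(|f| f.rows).sum());
        self.bytes_per_job.push(files.iter().map(|f| f.bytes).sum());
        for f in files {
            self.rows_per_file.push(f.rows);
            self.bytes_per_file.push(f.bytes);
        }
    }
}
```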
Instead of looping and polling a fresh set of partitions and
constructing a stream from that, use an endless stream instead. This
helps w/ efficiency during roll-overs since we can already start to
process the next set of partitions while the last ones from the previous
round are still in-progress.
Closes #6750.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
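A sketch of the endless-stream idea using `futures::stream::unfold`; `fetch_partitions` is a placeholder for the real catalog-backed partitions source:

```rust
use futures::stream::{self, Stream};

// Placeholder for the real catalog-backed partitions source.
async fn fetch_partitions() -> Vec<u64> {
    vec![1, 2, 3]
}

/// An endless stream of partitions: yields buffered IDs one by one and
/// transparently fetches the next batch, so consumers can already work on
/// the new round while earlier partitions are still in-progress.
fn endless_partitions() -> impl Stream<Item = u64> {
    stream::unfold(Vec::new(), |mut buffer: Vec<u64>| async move {
        loop {
            if let Some(id) = buffer.pop() {
                return Some((id, buffer));
            }
            buffer = fetch_partitions().await;
        }
    })
}
```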
* feat: throttle partitions that do not receive commits
* test: add failing test
* fix: partition ID in "unique" combo
* fix: partition ID in "throttle" combo
* docs: improve
Co-authored-by: Dom <dom@itsallbroken.com>
---------
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: partition filters for TargetLevel version and a complete test
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
* chore: run fmt after applying review suggestions in git
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Rust 1.67 now says:
warning: `#[track_caller]` on async functions is a no-op
= note: see issue #87417 <https://github.com/rust-lang/rust/issues/87417> for more information
= note: `#[warn(ungated_async_fn_track_caller)]` on by default
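The fix is simply to drop the attribute from async functions; a minimal illustration (the `check` helper is made up):

```rust
// On Rust 1.67+ this would warn (`ungated_async_fn_track_caller`) because
// the attribute is a no-op on async functions:
//
//     #[track_caller]
//     async fn check(ok: bool) { assert!(ok) }
//
// The fix is to simply drop the attribute:
async fn check(ok: bool) {
    assert!(ok);
}

fn main() {
    // Drive the future without a full async runtime.
    futures::executor::block_on(check(true));
}
```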
* refactor: rename compact algo versions to reflect their actual work
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This will stall the compactor and only touch each partition once because
the "unique" combo thinks that partitions never finish. This will need
more thought.
* chore: add definition for the output of compacting a partition
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
* chore: address review comments
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This was a bit tricky to design so it is testable and modular, but I
think this turned out quite nicely. It will even work w/ #6750.
Fixes #6727.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This should greatly improve efficiency of the two filters ("by ID" -
used for mocking / tests / dev, "shard"). This also changes the metrics
slightly since the partitions filtered there no longer count into the
overall "backlog" (which makes sense for the two filters).
I've left the "never skipped" filter where it is, because it needs to
perform IO (i.e. shouldn't be done for all partitions upfront) and we
shall only have a few skipped partitions, so it's not such a big deal
if we needlessly fetch the parquet files for them.
Closes #6783.
I'm not saying we have to use this, but this is a demonstration how easy
it would be to add sharding to the compaction tier and also acts as a
"backup / insurance" if we ever need it.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
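Such sharding can be as small as a hash-based keep/skip filter per compactor instance; this sketch is a demonstration of the idea, not the shipped configuration:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Keep only the partitions that hash onto this compactor's shard.
struct ShardFilter {
    n_shards: u64,
    shard_id: u64,
}

impl ShardFilter {
    fn keep(&self, partition_id: i64) -> bool {
        let mut h = DefaultHasher::new();
        partition_id.hash(&mut h);
        h.finish() % self.n_shards == self.shard_id
    }
}

fn main() {
    let shard = ShardFilter { n_shards: 4, shard_id: 0 };
    let mine: Vec<i64> = (0..100i64).filter(|&p| shard.keep(p)).collect();
    // Roughly a quarter of the partitions land on this shard.
    assert!(!mine.is_empty());
}
```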
* test: `MockPartitionsSource::set`
* test: `AssertFutureExt`
* feat: throttle when there are no partitions to compact
Fixes #6727.
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
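A sketch of the empty-result throttle, assuming a simple poll loop (`fetch` stands in for the real source):

```rust
use std::time::Duration;
use tokio::time::sleep;

// Placeholder for the real partitions source.
async fn fetch() -> Vec<i64> {
    Vec::new()
}

/// Poll the source, but back off when it returns nothing instead of
/// spinning hot against the catalog.
async fn fetch_throttled(throttle: Duration) -> Vec<i64> {
    loop {
        let partitions = fetch().await;
        if !partitions.is_empty() {
            return partitions;
        }
        sleep(throttle).await;
    }
}
```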
Add some rough "partition is too big" filter for now until we can deal
with them (the framework allows that but we need to set up the proper
divide-and-conquer components).
This will hopefully prevent our prod compactor from dying that often.
Note that this is also duct-tape around two issues:
- DataFusion not accounting in-flight data all the time
- Our wide fan-out query plans (see https://github.com/influxdata/idpe/issues/16768#issuecomment-1387056833 )
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
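The guard itself can be a trivial sum-and-compare over the partition's input files; threshold and types in this sketch are illustrative:

```rust
// Threshold and types are illustrative.
struct ParquetFile {
    file_size_bytes: i64,
}

/// Reject partitions whose total input size exceeds the limit, rather than
/// handing DataFusion a plan that may blow up the process.
fn partition_small_enough(files: &[ParquetFile], max_bytes: i64) -> bool {
    files.iter().map(|f| f.file_size_bytes).sum::<i64>() <= max_bytes
}

fn main() {
    let files = vec![
        ParquetFile { file_size_bytes: 50 * 1024 * 1024 },
        ParquetFile { file_size_bytes: 200 * 1024 * 1024 },
    ];
    // With a 100 MiB limit, this partition would be skipped for now.
    assert!(!partition_small_enough(&files, 100 * 1024 * 1024));
}
```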
I meant to skip partitions w/ timeouts when I designed the
functionality but forgot to adjust the error filter accordingly. To
avoid running into this problem again (i.e. forgetting to adjust the
filter), make the code a bit more explicit.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: introduce a new way of handling max_sequence_number for ingester, compactor and querier
* chore: cleanup
* feat: new column max_l0_created_at to order files for deduplication
* chore: cleanup
* chore: debug info for changing cpu.parquet
* fix: update test parquet file
Co-authored-by: Marco Neumann <marco@crepererum.net>
* feat: introduce scratchpad store for compactor
Use an intermediate in-memory store (can be a disk later if we want) to
stage all inputs and outputs of the compaction. The reasons are:
- **fewer IO ops:** DataFusion's streaming IO requires slightly more
IO requests (at least 2 per file) due to the way it is optimized to
read as little as possible. It first reads the metadata and then
decides which content to fetch. In the compaction case this is (esp.
w/o delete predicates) EVERYTHING. So in contrast to the querier,
there is no advantage to this approach; on the contrary, it easily adds
100ms of latency to every single input file.
- **less traffic:** For divide&conquer partitions (i.e. when we need to
run multiple compaction steps to deal with them) it is kinda pointless
to upload an intermediate result just to download it again. The
scratchpad avoids that.
- **higher throughput:** We want to limit the number of concurrent
DataFusion jobs because we don't wanna blow up the whole process by
having too much in-flight arrow data at the same time. However, while
performing the actual computation we were also waiting for object store
IO, which limited our throughput substantially.
- **shadow mode:** De-coupling the stores in this way makes it easier to
implement #6645.
Note that we assume here that the input parquet files are WAY SMALLER
than the uncompressed Arrow data during compaction itself.
Closes #6650.
* fix: panic on shutdown
* refactor: remove shadow scratchpad (for now)
* refactor: make scratchpad safe to use
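A minimal in-memory sketch of the scratchpad idea described above; the method names are assumptions, not the real component's API:

```rust
use std::collections::HashMap;

// Method names are assumptions, not the real component's API.
#[derive(Default)]
struct Scratchpad {
    staged: HashMap<String, Vec<u8>>,
}

impl Scratchpad {
    /// Stage an input file once; later reads are served from memory, which
    /// avoids the 2+ object-store requests per file of streaming IO.
    fn stage(&mut self, path: &str, bytes: Vec<u8>) {
        self.staged.insert(path.to_owned(), bytes);
    }

    /// Serve a read locally. Intermediate divide & conquer results live
    /// here too, so they are never uploaded just to be downloaded again.
    fn read(&self, path: &str) -> Option<&[u8]> {
        self.staged.get(path).map(Vec::as_slice)
    }

    /// Only final outputs leave the scratchpad for the real object store.
    fn make_public(&mut self, path: &str) -> Option<Vec<u8>> {
        self.staged.remove(path)
    }
}
```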
Allows compactor2 to run a fixed-point loop (until all work is done) and
in every iteration it can run multiple jobs.
The jobs are currently organized by "branches". This is because our
upcoming OOM handling may split a branch further if it doesn't complete.
Also note that the current config resembles the state prior to this PR.
So the FP-loop will only iterate ONCE and then run out of L0 files. A
more advanced setup can be built using the framework though.
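A sketch of the fixed-point loop over branches; `compact_branch` is a placeholder for the real job execution, and the OOM-driven re-splitting is only hinted at:

```rust
/// Run compaction jobs until no branch produces further work (the fixed
/// point). `branches` holds groups of file IDs to compact together.
fn compact_partition(mut branches: Vec<Vec<u64>>) {
    while let Some(branch) = branches.pop() {
        let new_branches = compact_branch(branch);
        branches.extend(new_branches);
    }
}

/// Compact one branch of files; may return new branches, e.g. when the
/// upcoming OOM handling decides to split the branch further.
fn compact_branch(branch: Vec<u64>) -> Vec<Vec<u64>> {
    let _ = branch;
    Vec::new() // placeholder: one round, no follow-up work
}
```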
It seems that prod was hanging last night. This is pretty hard to debug
and in general we should protect the compactor against hanging /
malformed partitions that take forever. This is similar to the fact that
the querier also has a timeout for every query. Let's see if this shows
anything in prod (and if not it's still a desired safety net).
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
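The safety net can be a plain timeout wrapper around each partition job; a sketch assuming tokio (the function names are made up):

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Bound every compaction job with a timeout, mirroring the querier's
/// per-query timeout, so a hanging partition cannot stall the compactor.
async fn compact_with_timeout(partition_id: i64, limit: Duration) {
    match timeout(limit, compact_partition(partition_id)).await {
        Ok(()) => {}
        Err(_) => println!("partition {partition_id} timed out; skipping"),
    }
}

async fn compact_partition(_partition_id: i64) {
    // Placeholder for the real compaction job.
}
```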
With the upcoming divide-and-conquer approach, we may have multiple
commits per partition since we can divide it into multiple compaction
jobs. For metrics (and logs) however it is important to track the
overall process, so we shall also monitor the number of completed
partitions.
* refactor: planner as a component
Now everything except for the core algorithm structure is a component.
This also means that the driver no longer needs the whole config
structure.
* docs: explain V1
* chore: address review comment of previous PR
* refactor: execute compact plan
* refactor: we will now compact all L0 and L1 files of a partition and split them as needed
* chore: comments
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Filters can now inspect ALL files for a partition which may be useful
for limiters. This also moves the "is not empty" part into a filter.
Note that we still can only run ONE compaction job per partition for the
time being, so splitting the files into multiple sub-groups and running
a per-group DataFusion job is currently not possible. It should be a
rather easy addition if we ever want that (probably needs another
semaphore or something to limit the overall job count).
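A plausible shape for a filter that sees all files of a partition, with the "is not empty" check expressed as one such filter; trait name and signature are assumptions:

```rust
// Trait name and signature are assumptions, not the real compactor2 API.
struct ParquetFile {
    file_size_bytes: i64,
}

trait PartitionFilter {
    /// Decide whether the partition, given ALL of its files, should be
    /// compacted. Useful for limiters that need the full picture.
    fn apply(&self, files: &[ParquetFile]) -> bool;
}

/// The "is not empty" check, moved into a filter.
struct NotEmptyFilter;

impl PartitionFilter for NotEmptyFilter {
    fn apply(&self, files: &[ParquetFile]) -> bool {
        !files.is_empty()
    }
}
```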
Sets up crate and wires up the main binary. No tests yet, no algorithm
framework, just the bare minimum.
Also I decided to not offer a gRPC server in `compactor2` at the moment
and hence did not implement any handle/delegate infrastructure. We add
this later if we need it. This also means compactor2 does NOT provide a
catalog service for now.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>