I meant to skip partitions w/ timeouts when I designed the
functionality but forgot to adjust the error filter accordingly. To avoid
running into this problem again (i.e. forgetting to adjust the filter), make
the code a bit more explicit.
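A minimal sketch of what "more explicit" could look like, assuming an exhaustive match over a hypothetical error-kind enum (none of these names are the actual compactor2 types):

```rust
// Hypothetical sketch: classify compaction errors explicitly so that a new
// error kind forces a decision instead of silently falling through the filter.
#[derive(Debug)]
enum CompactionErrorKind {
    Timeout,
    ObjectStore,
    DataFusion,
}

/// Decide whether an error should mark the partition as "skipped".
/// Matching exhaustively (no `_` arm) means adding a new kind will not
/// compile until this filter is adjusted as well.
fn should_skip_partition(kind: &CompactionErrorKind) -> bool {
    match kind {
        CompactionErrorKind::Timeout => true,
        CompactionErrorKind::ObjectStore => false,
        CompactionErrorKind::DataFusion => false,
    }
}

fn main() {
    assert!(should_skip_partition(&CompactionErrorKind::Timeout));
}
```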
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: introduce a new way of handling max_sequence_number for ingester, compactor and querier
* chore: cleanup
* feat: new column max_l0_created_at to order files for deduplication
* chore: cleanup
* chore: debug info for changing cpu.parquet
* fix: update test parquet file
Co-authored-by: Marco Neumann <marco@crepererum.net>
* feat: introduce scratchpad store for compactor
Use an intermediate in-memory store (can be a disk later if we want) to
stage all inputs and outputs of the compaction. The reasons are:
- **fewer IO ops:** DataFusion's streaming IO requires slightly more
IO requests (at least 2 per file) due to the way it is optimized to
read as little as possible. It first reads the metadata and then
decides which content to fetch. In the compaction case this is (esp.
w/o delete predicates) EVERYTHING, so in contrast to the querier
there is no advantage to this approach. On the contrary, it easily adds
100ms of latency to every single input file.
- **less traffic:** For divide&conquer partitions (i.e. when we need to
run multiple compaction steps to deal with them) it is kinda pointless
to upload an intermediate result just to download it again. The
scratchpad avoids that.
- **higher throughput:** We want to limit the number of concurrent
DataFusion jobs because we don't want to blow up the whole process by
having too much in-flight Arrow data at the same time. However, while
we performed the actual computation, we were also waiting for object
store IO, and this was limiting our throughput substantially.
- **shadow mode:** De-coupling the stores in this way makes it easier to
implement #6645.
Note that we assume here that the input parquet files are WAY SMALLER
than the uncompressed Arrow data during compaction itself.
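As an illustration only, a scratchpad could look roughly like the following in-memory sketch; the type and method names are hypothetical and not the actual compactor2 API:

```rust
// Minimal sketch: a scratchpad stages input files locally, lets the compaction
// read them without extra object-store round trips, and uploads only the
// final outputs.
use std::collections::HashMap;

#[derive(Default)]
struct Scratchpad {
    /// Staged file contents, keyed by object-store path.
    files: HashMap<String, Vec<u8>>,
}

impl Scratchpad {
    /// Stage an input file that was downloaded from the object store once.
    fn load_to_scratchpad(&mut self, path: &str, bytes: Vec<u8>) {
        self.files.insert(path.to_owned(), bytes);
    }

    /// Read a staged file; repeated reads cost no additional IO requests.
    fn read(&self, path: &str) -> Option<&[u8]> {
        self.files.get(path).map(|b| b.as_slice())
    }

    /// Drain the outputs that must be uploaded; intermediate results can
    /// simply stay in the scratchpad and never hit the object store.
    fn take_outputs(&mut self) -> HashMap<String, Vec<u8>> {
        std::mem::take(&mut self.files)
    }
}

fn main() {
    let mut pad = Scratchpad::default();
    pad.load_to_scratchpad("partition/l0/file1.parquet", vec![0u8; 16]);
    assert!(pad.read("partition/l0/file1.parquet").is_some());
    let _final_outputs = pad.take_outputs();
}
```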
Closes #6650.
* fix: panic on shutdown
* refactor: remove shadow scratchpad (for now)
* refactor: make scratchpad safe to use
Allows compactor2 to run a fixed-point loop (until all work is done), and
in every iteration it can run multiple jobs.
The jobs are currently organized by "branches". This is because our
upcoming OOM handling may split a branch further if it doesn't complete.
Also note that the current config resembles the state prior to this PR.
So the FP-loop will only iterate ONCE and then run out of L0 files. A
more advanced setup can be built using the framework though.
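A rough sketch of the loop shape described above, with purely illustrative types (the real branch planning and job execution live in the framework's components):

```rust
// Hedged sketch of the fixed-point ("FP") loop; names are made up for this
// example and are not the real compactor2 code.
#[derive(Clone)]
struct ParquetFile {
    level: u8,
}

/// One "branch": a group of files that is compacted in a single job.
type Branch = Vec<ParquetFile>;

/// Split the remaining files into branches. Here: one branch with everything,
/// which resembles the pre-PR behavior (the loop effectively runs once).
fn plan_branches(files: &[ParquetFile]) -> Vec<Branch> {
    if files.is_empty() {
        vec![]
    } else {
        vec![files.to_vec()]
    }
}

/// Compact one branch; a real job would run DataFusion and return L1 files.
fn compact_branch(branch: Branch) -> Vec<ParquetFile> {
    vec![ParquetFile { level: 1 }; branch.len().min(1)]
}

fn main() {
    let mut files = vec![ParquetFile { level: 0 }, ParquetFile { level: 0 }];

    // Fixed-point loop: keep planning and running jobs until no L0 files remain.
    while files.iter().any(|f| f.level == 0) {
        files = plan_branches(&files)
            .into_iter()
            .flat_map(compact_branch)
            .collect();
    }

    assert!(files.iter().all(|f| f.level > 0));
}
```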
It seems that prod was hanging last night. This is pretty hard to debug
and in general we should protect the compactor against hanging /
malformed partitions that take forever. This is similar to the fact that
the querier also has a timeout for every query. Let's see if this shows
anything in prod (and if not, it's still a desired safety net).
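Assuming a tokio runtime, the safety net could be as simple as wrapping the per-partition future in a timeout; the names and the 30-minute limit below are illustrative, not the actual config:

```rust
// Sketch of the per-partition safety net; `compact_partition` is a placeholder.
use std::time::Duration;

use tokio::time::timeout;

async fn compact_partition(partition_id: i64) -> Result<(), String> {
    // Placeholder for the real compaction work.
    let _ = partition_id;
    Ok(())
}

#[tokio::main]
async fn main() {
    let partition_id = 42;

    // Abort partitions that hang instead of blocking the whole compactor.
    match timeout(Duration::from_secs(1800), compact_partition(partition_id)).await {
        Ok(Ok(())) => println!("partition {partition_id} compacted"),
        Ok(Err(e)) => eprintln!("partition {partition_id} failed: {e}"),
        Err(_elapsed) => eprintln!("partition {partition_id} timed out, skipping"),
    }
}
```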
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
With the upcoming divide-and-conquer approach, we may have multiple
commits per partition since we can divide it into multiple compaction
jobs. For metrics (and logs), however, it is important to track the
overall process, so we shall also monitor the number of completed
partitions.
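A toy sketch of the distinction, using plain atomic counters in place of the real metric types (all names are made up for illustration):

```rust
// Illustrative only: several commits per partition, but the partition itself
// is counted once when all of its jobs are done.
use std::sync::atomic::{AtomicU64, Ordering};

static COMMITS: AtomicU64 = AtomicU64::new(0);
static PARTITIONS_COMPLETED: AtomicU64 = AtomicU64::new(0);

fn commit_job() {
    // With divide-and-conquer there may be several commits per partition...
    COMMITS.fetch_add(1, Ordering::Relaxed);
}

fn finish_partition(jobs: usize) {
    for _ in 0..jobs {
        commit_job();
    }
    // ...but the overall progress is tracked per completed partition.
    PARTITIONS_COMPLETED.fetch_add(1, Ordering::Relaxed);
}

fn main() {
    finish_partition(3);
    assert_eq!(COMMITS.load(Ordering::Relaxed), 3);
    assert_eq!(PARTITIONS_COMPLETED.load(Ordering::Relaxed), 1);
}
```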
* refactor: planner as a component
Now everything except for the core algorithm structure is a component.
This also means that the driver no longer needs the whole config
structure.
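For illustration, the rough shape of such a component boundary could look like this; the trait and struct names are hypothetical, not the actual compactor2 interfaces:

```rust
// Hypothetical sketch: the driver depends on small components, not on the
// whole compactor config structure.
use std::sync::Arc;

#[derive(Debug, Clone)]
struct ParquetFile {
    id: i64,
}

/// Something that turns the files of a partition into an executable plan.
trait PartitionPlanner: Send + Sync {
    fn plan(&self, files: &[ParquetFile]) -> String;
}

/// A trivial planner used here only to keep the sketch runnable.
struct NoopPlanner;

impl PartitionPlanner for NoopPlanner {
    fn plan(&self, files: &[ParquetFile]) -> String {
        format!("compact {} files", files.len())
    }
}

/// The driver only sees the component trait.
struct Driver {
    planner: Arc<dyn PartitionPlanner>,
}

impl Driver {
    fn run(&self, files: &[ParquetFile]) -> String {
        self.planner.plan(files)
    }
}

fn main() {
    let driver = Driver { planner: Arc::new(NoopPlanner) };
    let plan = driver.run(&[ParquetFile { id: 1 }, ParquetFile { id: 2 }]);
    assert_eq!(plan, "compact 2 files");
}
```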
* docs: explain V1
* chore: address review comment of previous PR
* refactor: execute compact plan
* refactor: we will now compact all L0 and L1 files of a partition and split them as needed
* chore: comments
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Filters can now inspect ALL files for a partition, which may be useful
for limiters. This also moves the "is not empty" check into a filter.
Note that we can still only run ONE compaction job per partition for the
time being, so splitting the files into multiple sub-groups and running a
per-group DataFusion job is currently not possible. It should be a rather easy
addition if we ever want that (it probably needs another semaphore or
something to limit the overall job count).
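As an illustration of the filter idea (not the actual compactor2 trait), a filter that sees all files of a partition might look like this:

```rust
// Hedged sketch: a filter decides, given ALL files of a partition, whether
// the partition should be compacted; limiters are just another filter.
#[derive(Debug)]
struct ParquetFile {
    size_bytes: i64,
}

trait PartitionFilesFilter {
    fn apply(&self, files: &[ParquetFile]) -> bool;
}

/// The former special case "partition has no files" expressed as a filter.
struct NonEmptyFilter;

impl PartitionFilesFilter for NonEmptyFilter {
    fn apply(&self, files: &[ParquetFile]) -> bool {
        !files.is_empty()
    }
}

/// A limiter that rejects partitions whose total input exceeds a budget.
struct MaxBytesFilter {
    max_bytes: i64,
}

impl PartitionFilesFilter for MaxBytesFilter {
    fn apply(&self, files: &[ParquetFile]) -> bool {
        files.iter().map(|f| f.size_bytes).sum::<i64>() <= self.max_bytes
    }
}

fn main() {
    let files = vec![ParquetFile { size_bytes: 1024 }, ParquetFile { size_bytes: 2048 }];
    let filters: Vec<Box<dyn PartitionFilesFilter>> =
        vec![Box::new(NonEmptyFilter), Box::new(MaxBytesFilter { max_bytes: 10_000 })];
    assert!(filters.iter().all(|f| f.apply(&files)));
}
```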
Sets up the crate and wires up the main binary. No tests yet, no algorithm
framework, just the bare minimum.
Also I decided to not offer a gRPC server in `compactor2` at the moment
and hence did not implement any handle/delegate infrastructure. We can add
this later if we need it. This also means compactor2 does NOT provide a
catalog service for now.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>