Filters can now inspect ALL files for a partition which may be useful
for limiters. This also moves the "is not empty" part into a filter.
Note that we can still only run ONE compaction job per partition for the
time being, so splitting the files into multiple sub-groups and running a
per-group DataFusion job is currently not possible. It should be a rather easy
addition if we ever want that (it probably needs another semaphore or
something similar to limit the overall job count).
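As a rough illustration of what "filters inspect all files for a partition" can look like, here is a minimal sketch. The names `FileMeta`, `PartitionFilter`, `HasFilesFilter`, `MaxTotalSizeFilter` and `should_compact` are hypothetical stand-ins chosen for this example and are not necessarily the types used in `compactor2`.

```rust
/// Hypothetical stand-in for a Parquet file record (not the real catalog type).
#[derive(Debug, Clone)]
pub struct FileMeta {
    pub size_bytes: u64,
}

/// A filter that is handed every file of a partition at once.
pub trait PartitionFilter {
    /// Returns `true` if the partition should be compacted.
    fn apply(&self, files: &[FileMeta]) -> bool;
}

/// The "is not empty" check expressed as an ordinary filter.
pub struct HasFilesFilter;

impl PartitionFilter for HasFilesFilter {
    fn apply(&self, files: &[FileMeta]) -> bool {
        !files.is_empty()
    }
}

/// A limiter-style filter (assumed example): skip partitions whose total
/// file size exceeds a cap. This needs to see all files, not one at a time.
pub struct MaxTotalSizeFilter {
    pub max_bytes: u64,
}

impl PartitionFilter for MaxTotalSizeFilter {
    fn apply(&self, files: &[FileMeta]) -> bool {
        let total: u64 = files.iter().map(|f| f.size_bytes).sum();
        total <= self.max_bytes
    }
}

/// A partition is compacted only if every filter accepts it.
pub fn should_compact(filters: &[Box<dyn PartitionFilter>], files: &[FileMeta]) -> bool {
    filters.iter().all(|f| f.apply(files))
}
```

With both filters installed, an empty partition is rejected by `HasFilesFilter` and an oversized one by `MaxTotalSizeFilter`; neither case needs special handling outside the filter chain.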
- use a single data structure for CLI args (not two)
- set mem limit default to 8GB (same as querier). We can always tune
this later, but we should not run with "unlimited" to begin with.
* chore: Update datafusion and arrow/parquet/arrow-flight to `31.0.0`
* chore: Update for new API
* chore: Run cargo hakari tasks
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Sets up crate and wires up the main binary. No tests yet, no algorithm
framework, just the bare minimum.
Also, I decided not to offer a gRPC server in `compactor2` at the moment
and hence did not implement any handle/delegate infrastructure. We can add
this later if we need it. This also means `compactor2` does NOT provide a
catalog service for now.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: function to read partition IDs of all partitions with new writes
* chore: run fmt
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: Drop Expr::UnaryOp to simplify tree traversal
The UnaryOp doesn't provide any additional value and complicates
walking the AST, as literal values wrapped in a UnaryOp(Minus, ...)
require extra handling when reducing time range expressions, etc.
This change is also true to the InfluxQL Go implementation,
which represents whole number literals as signed integers unless
they exceed i64::MAX.
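A minimal sketch of the literal rule described above, assuming the sign is folded into the literal at parse time instead of being wrapped in a unary node. The `Literal` enum and `parse_whole_number` are hypothetical names for illustration, and using an unsigned variant for values above `i64::MAX` is an assumption of this sketch, not something this commit states.

```rust
/// Hypothetical literal type; only the whole-number variants are shown.
#[derive(Debug, PartialEq)]
pub enum Literal {
    /// Whole numbers that fit in an i64, including negative values.
    Integer(i64),
    /// Assumed fallback for positive values greater than i64::MAX.
    Unsigned(u64),
}

/// Parses an optionally negated run of decimal digits into a literal,
/// so no separate "unary minus" wrapper is needed in the tree.
pub fn parse_whole_number(input: &str) -> Option<Literal> {
    let (negative, digits) = match input.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, input),
    };
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    if negative {
        // Parse with the sign attached so that i64::MIN is representable.
        input.parse::<i64>().ok().map(Literal::Integer)
    } else {
        match digits.parse::<i64>() {
            Ok(v) => Some(Literal::Integer(v)),
            // Too large for i64: fall back to the unsigned representation.
            Err(_) => digits.parse::<u64>().ok().map(Literal::Unsigned),
        }
    }
}
```

Because a negative number is already a plain `Integer`, code that reduces time range expressions can match on the literal directly without first unwrapping a negation node.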
* chore: Refactor all usages of format!("{}", ?) to ?.to_string()
Per https://github.com/influxdata/influxdb_iox/pull/6600#discussion_r1072028895
* refactor: remove unused code
* refactor: make fn private
* feat: safely stream data from one tokio runtime to another
Closes #6577.
* refactor: review comments
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
* docs: improve
* test: explain
* test: make tests more tricky
* refactor: improve error message
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes the persist system to call into an abstract
PersistCompletionObserver after the persist task has completed, but
before releasing the job permit / notifying the enqueuer.
This call happens synchronously, driven by the persist worker to
completion. A sync construct can easily be made async (by enqueuing work
into a channel), but not the other way around, so this gives the best
flexibility.
This trait allows pluggable logic to be inserted into the persist
system, without tightly coupling it to the implementer's logic (for
example, replication). One or more observers may be chained together to
construct an arbitrary sequence of actors.
This commit uses a no-op observer, causing no functional change to the
system.
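The shape of this design can be sketched as follows. Only the trait name `PersistCompletionObserver` comes from this commit; the method name `persist_complete`, the `CompletedPersist` payload, `NopObserver`, `RecordingObserver` and `run_persist_job` are assumptions made for the sake of a self-contained example.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical completion notice handed to observers.
#[derive(Debug, Clone, PartialEq)]
pub struct CompletedPersist {
    pub partition_id: i64,
}

/// Synchronous observer, invoked by the persist worker itself.
pub trait PersistCompletionObserver: Send + Sync {
    fn persist_complete(&self, note: Arc<CompletedPersist>);
}

/// No-op observer: installing it causes no functional change.
pub struct NopObserver;

impl PersistCompletionObserver for NopObserver {
    fn persist_complete(&self, _note: Arc<CompletedPersist>) {}
}

/// Example of chaining: records the partition ID, then delegates to `inner`.
pub struct RecordingObserver<T> {
    seen: Mutex<Vec<i64>>,
    inner: T,
}

impl<T> RecordingObserver<T> {
    pub fn new(inner: T) -> Self {
        Self { seen: Mutex::new(Vec::new()), inner }
    }

    pub fn seen(&self) -> Vec<i64> {
        self.seen.lock().unwrap().clone()
    }
}

impl<T: PersistCompletionObserver> PersistCompletionObserver for RecordingObserver<T> {
    fn persist_complete(&self, note: Arc<CompletedPersist>) {
        self.seen.lock().unwrap().push(note.partition_id);
        self.inner.persist_complete(note);
    }
}

/// Stand-in for the worker: the observer runs to completion after the
/// persist work, and before the job permit would be released.
pub fn run_persist_job<O: PersistCompletionObserver>(observer: &O, partition_id: i64) {
    // (upload and catalog update would happen here)
    observer.persist_complete(Arc::new(CompletedPersist { partition_id }));
    // (only now would the permit be released and the enqueuer notified)
}
```

Wrapping one observer in another, as `RecordingObserver::new(NopObserver)` does, is one way to build the chain of actors mentioned above; an observer that needs async behaviour could push the notice into a channel from inside `persist_complete`.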
Adds an integration test of the persist system, covering:
* Node A starts a persist operation
* Node B starts a persist operation for the same partition
* Node A completes, setting the catalog sort key to a new value
* Node B attempts to update the catalog, observing the new sort key
* Node B re-compacts the data, re-uploads, and drives to completion
This scenario is/was tracked in:
https://github.com/influxdata/influxdb_iox/issues/6439
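The scenario above amounts to a compare-and-swap on the catalog sort key with a retry on conflict. The sketch below models that with an in-memory `Catalog`; the types, the `cas_sort_key` and `persist` functions, and the rule used to derive a new sort key are all hypothetical simplifications, not the actual catalog API.

```rust
use std::sync::Mutex;

/// In-memory stand-in for the catalog's per-partition sort key.
pub struct Catalog {
    sort_key: Mutex<Option<Vec<String>>>,
}

impl Catalog {
    pub fn new() -> Self {
        Self { sort_key: Mutex::new(None) }
    }

    pub fn sort_key(&self) -> Option<Vec<String>> {
        self.sort_key.lock().unwrap().clone()
    }

    /// Sets the sort key only if the stored value still equals `expected`;
    /// otherwise returns the value that is currently stored.
    pub fn cas_sort_key(
        &self,
        expected: Option<&[String]>,
        new: Vec<String>,
    ) -> Result<(), Option<Vec<String>>> {
        let mut guard = self.sort_key.lock().unwrap();
        if guard.as_deref() == expected {
            *guard = Some(new);
            Ok(())
        } else {
            Err(guard.clone())
        }
    }
}

/// Persists data with the given columns, starting from the sort key this
/// node last `observed`. Returns how many compaction attempts were needed.
pub fn persist(catalog: &Catalog, mut observed: Option<Vec<String>>, columns: &[&str]) -> u32 {
    let mut attempts = 0;
    loop {
        attempts += 1;
        // Assumed rule: keep the observed key and append any new columns.
        let mut new_key = observed.clone().unwrap_or_default();
        for c in columns {
            if !new_key.iter().any(|k| k.as_str() == *c) {
                new_key.push(c.to_string());
            }
        }
        // (compaction with `new_key` and the upload would happen here)
        match catalog.cas_sort_key(observed.as_deref(), new_key) {
            Ok(()) => return attempts,
            // Another node won the race: adopt its key and redo the work.
            Err(current) => observed = current,
        }
    }
}
```

In this model, node B's first attempt fails because node A has already changed the key, so B re-compacts against the new key and succeeds on its second attempt.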
The persist::Context struct carries the data to be persisted, a
reference to the partition from which it came, and various cached fields
to avoid re-acquiring the partition read lock all the time.
Prior to this commit, the Context also had the full persist logic as
methods, invoked by the persist worker. This tightly couples the data and
the logic. It's fairly clear that a worker should implement the work and
operate on the data, rather than the two being commingled. I even knew the
mess I was making when I wrote it, but effectively copy-pasted it from
ingester1 because of deadlines.
This commit decouples the persist logic from the Context.