This commit swaps the existing single "persisting batch" slot (field) in
a PartitionData for an ordered queue of outstanding partitions.
This decouples marking a partition buffer for persistence from the
actual persistence operation, allowing them to proceed at different
rates. This reduces the complexity of persistence management, but also
allows us to gracefully handle "hot" partitions; for example, this
problematic scenario in the ingester(1) implementation during recovery:
* Writes come into a partition, reaching a size/row/hotness limit
* Partition is enqueued for persistence
* More writes come into the new buffer, exceeding the same limit
* Cannot persist the hot buffer because of outstanding persist op
Without this change the only possibilities in this situation are:
* stop ingest for the partition and error all writes that
(partially!) write to the partition, or
* continue accepting writes, allowing the partition to exceed the
limit that marked it for persistence in the first place
The latter is what the ingester(1) implementation does today, which
results in partitions exceeding their row/size/age limits, which exist
to limit the cost of generating the Parquet file from the buffer - this
is a significant contributor to instability during recovery.
This strategy enforces configured the limits on the partition buffer,
but does not block / slow down recovery while persistence is completed.
* fix: check schemas in `pretty_print_batches`
I think most users of this function (and `assert_batches_eq`) assume
that all batches have the same schema. If not, `pretty_print_batches`
may either fail producing an actual table (some rows may have more or
less columns) or silently produce a table that looks "alright".
* fix: equalize schemas where it is required/desired
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Have a single global test executor w/ reasonable defaults. Also don't
require tests to join/await executor shutdowns (most tests forget this
anyways and will get a runtime warning).
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Pushes the (ref-counted/shared) deferred NamespaceName resolver into the
child TableData and PartitionData nodes in the BufferTree.
This allows the PartitionData to provide the name of the namespace to
which it belongs, in addition to the existing table name, and all
relevant IDs.
Replay the WAL log, if any, at startup.
Op replay is performed synchronously, during initialisation of the
ingester2 instance, and passes all ops through the "normal" write path
that the system uses once replay is complete, minus the WAL writer layer
- this helps to DRY the write path and minimise different behaviours.
Now partition cache entries are smaller, the number of entries held in
memory can be increased - this now uses ~2MiB of memory and drains the
cache during execution, amortising to 0.
* feat: Add a feature flag to switch to the router RPC write path
Fixes#6242.
* refactor: Remove a weird arc clone/rename that's not needed
I'm sure this was needed at some point, but it doesn't make much sense.
I wasn't going to change this, but I'm now trying to minimize the
differences between this function and the write path init function, so
make this one better too.
* fix: Add the namespace autocreation to the RPC write path too
The topic/query pool don't really apply to this case, but use them
anyway to be able to use the existing catalog methods.
Also add a bunch of comments pointing out where the RPC write path
initializer and the old router's initializer are the same and where
they're different, so that perhaps it'll be easier to keep them in sync
while they both exist.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This commit removes the invariant asserts of monotonicity carried over
from the "ingester" crate - ingester2 does not define any ordering of
writes within the system.
This commit also removes the SequenceNumberRange as it is no longer
useful to indirectly check the equality of two sets of ops -
non-monotonic writes means overlapping ranges does not guarantee a full
set of overlapping operations (gaps may be present). Likewise bounding
values (such as "min persisted") provide no useful meaning in an
out-of-order system.
Document the arbitrary reordering of concurrent writes in an ingester,
and the potential divergence of WAL entries / buffered state.
Also documents that in ingester2, sequence numbers only identify writes,
not their ordering.
Rather than naming WAL files with a UUID, give them a number that
indicates the order they were created in so that they can be read back
in order.
Fixes#6227.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>