Commit Graph

237 Commits (970371929492d6c1557368cb48dd0bc5c24339ea)

Author SHA1 Message Date
Dom Dwyer b3363639f5
chore: nudge CI 2022-12-20 17:05:03 +01:00
Dom Dwyer 5f4acf186d
docs: fix bad doc link
Rust hates indented URLs.
2022-12-20 15:25:34 +01:00
Dom Dwyer e083f3276c
feat(persist): accept concurrent matching updates
As an optimisation, allow a persist task to progress if it observes a
concurrent catalog sort key update that exactly matches the sort key it
was committing.
2022-12-20 15:15:39 +01:00
Dom Dwyer f64ffbe035
fix(ingester2): handle concurrent sort key updates
Allow an ingester2 instance to tolerate concurrent partition sort key
updates in the catalog.

A persist job is optimistically executed with the locally cached sort
key. If an ingester2 instance observes a concurrent update, it aborts
both the sort key update, and the overall persist operation (before
making the parquet file visible) and retries the operation with the
newly observed sort key. Concurrent sort key updates are theorised to be
relatively rare overall.

Any orphaned parquet files uploaded as part of a persist job that aborts
due to a concurrent sort key update are eventually removed by the
(external) object store GC task.

See https://github.com/influxdata/influxdb_iox/issues/6439
2022-12-20 15:15:39 +01:00
Dom Dwyer adc6fcfb04
feat(catalog): linearise sort key updates
Updating the sort key is not commutative and MUST be serialised. The
correctness of the current catalog interface relies on the caller
serialising updates globally, something it cannot reasonably assert in a
distributed system.

This change of the catalog interface pushes this responsibility to the
catalog itself where it can be effectively enforced, and allows a caller
to detect parallel updates to the sort key.
2022-12-20 12:31:00 +01:00
Dom dbbe43f241
Merge branch 'main' into dom/query-types-refactor 2022-12-19 15:03:16 +00:00
Dom Dwyer 84e29791e5
docs: fix incomplete comment
Finishes the incomplete sentence that
2022-12-19 12:46:21 +01:00
dependabot[bot] 299f0e99f9
chore(deps): Bump thiserror from 1.0.37 to 1.0.38
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.37 to 1.0.38.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.37...1.0.38)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-12-19 10:33:32 +00:00
Dom Dwyer 371857399c
refactor: avoid double flattening stream
Changes the into_record_batches() method to avoid creating an extra
stream out of the Option that must be flattened (iterating over the
option vs. filtering out all None first).
2022-12-19 11:31:41 +01:00
dependabot[bot] 8478d41bcb
chore(deps): Bump paste from 1.0.10 to 1.0.11 (#6430)
Bumps [paste](https://github.com/dtolnay/paste) from 1.0.10 to 1.0.11.
- [Release notes](https://github.com/dtolnay/paste/releases)
- [Commits](https://github.com/dtolnay/paste/compare/1.0.10...1.0.11)

---
updated-dependencies:
- dependency-name: paste
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-19 10:31:05 +00:00
Dom Dwyer b87d572e42
refactor: single PartitionResponse constructor
Removes the PartitionResponse::new_no_batches() constructor, instead
using an Option-wrapped data. Before that would have been confusing
(many Option in the constructor signature) but now there's only one!
2022-12-19 11:25:20 +01:00
Dom Dwyer c1db76bf9e
refactor: remove max seqnum in PartitionResponse
Removes the redundant max_persisted_sequence_number in
PartitionResponse, which was functionally replaced with
completed_persistence_count for the Querier's parquet file discovery
instead.
2022-12-19 11:21:53 +01:00
Dom Dwyer 13ed3f9acb
fix: show lack of partition data in query output
Show that a PartitionResponse does not contain data in the Debug output.
2022-12-19 11:18:21 +01:00
dependabot[bot] c72734473c
chore(deps): Bump async-trait from 0.1.59 to 0.1.60 (#6433)
Bumps [async-trait](https://github.com/dtolnay/async-trait) from 0.1.59 to 0.1.60.
- [Release notes](https://github.com/dtolnay/async-trait/releases)
- [Commits](https://github.com/dtolnay/async-trait/compare/0.1.59...0.1.60)

---
updated-dependencies:
- dependency-name: async-trait
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-19 10:09:23 +00:00
Carol (Nichols || Goulding) 07772e8d22
fix: Always return a PartitionRecord which maybe streams record batches
Connects to #6421.

Even if the ingester doesn't have data in memory for a query, we need to
send back metadata about the ingester UUID and the number of files
persisted so that the querier can decide whether it needs to refresh the
cache.
2022-12-16 17:02:41 -05:00
Carol (Nichols || Goulding) 473ce7a268
fix: Don't hardcode the transition shard id 2022-12-16 17:01:35 -05:00
Dom Dwyer c830a83105
feat(ingester2): hot partition persistence
This PR uses the MutableBatch persist cost estimation added in #6425 to
selectively mark "hot" partitions for persistence.

This uses a (composable!) "post-write" observer that is invoked after
each buffer call - this allows the HotPartitionPersister in this commit
to inspect the cost of the partition after applying the write, and if it
exceeds the configurable cost threshold, enqueue it for persistence
(rotating the buffer within the partition in the process).

Unlike ingester(1), this implementation prevents overrun - the
application of the write that exceeds the cost limit, and enqueueing the
partition for persistence is atomic.
2022-12-16 19:33:34 +01:00
Marko Mikulicic 69d5148729 fix(ingester2): Make ingester2 work with existing catalogs and document quickstart 2022-12-16 13:16:31 +01:00
Dom Dwyer 933ab1f8c7
feat(ingester2): optimal persist parallelism
This commit changes the behaviour of the persist system to enable
optimal parallelism of persist operations, and improve the accuracy of
the outstanding job bound / back-pressure.

Previously all persist operations for a given partition were
consistently hashed to a single worker task. This serialised persistence
per partition, ensuring all updates to the partition sort key were
serialised. However, this also unnecessarily serialises persist
operations that do not need to update the sort key, reducing the
potential throughput of the system; in the worst case of a single
partition receiving all the writes, only one worker would be persisting,
and the other N-1 workers would be idle.

After this change, the sort key is inspected when enqueuing the persist
operation and if it can be determined that no sort key update is
necessary (the typical case), then the persist task is placed into a
global work queue from which all workers consume. This allows for
maximal parallelisation of these jobs, and the removes the per-worker
head-of-line blocking.

In the case that the sort key does need updating, these jobs continue to
be consistently hashed to a single worker, ensuring serialised sort key
updates only where necessary.

To support these changes, the back-pressure system has been changed to
account for all outstanding persist jobs in the system, regardless of
type or assigned worker - a logical, bounded queue is composed together
of a semaphore limiting the number of persist tasks overall, and a
series of physical, unbounded queues - one to each worker & the global
queue. The overall system remains bounded by the
INFLUXDB_IOX_PERSIST_QUEUE_DEPTH value, and is now simpler to reason
about (it is independent of the number of workers, etc).
2022-12-15 18:30:51 +01:00
Dom Dwyer e24d21255b
refactor: inject persist started timestamp
Instead of recording the "enqueued_at" when initialising the
PersistRequest, inject the value in.

This lets us re-order the request construction while retaining accurate
timing.
2022-12-15 18:28:08 +01:00
Dom ede2627dcf
Merge branch 'main' into dom/deferred-load-peek 2022-12-15 17:16:02 +00:00
Dom 261eeacf3c
Merge branch 'main' into dom/ooo-parittion-persist 2022-12-15 17:07:53 +00:00
Dom Dwyer 7d7c8db334
feat(ingester2): out-of-order partition persist
Previously data within a partition had to be persisted in the order in
which the data was received. This was necessary for the correctness of
the query API, as it utilised the lower-bound sequence number to
determine what data was available in the object store.

With the changes to the parquet discovery protocol / query API made in
https://github.com/influxdata/influxdb_iox/pull/6365 this restriction
can be lifted, allowing out-of-order persistence within a partition for
increased parallelism / performance.

This commit changes the PartitionData to accept out-of-order persist
completion notifications, removing the ordering invariant from ingester2
(note that the persist ops currently remain ordered however).
2022-12-15 14:38:13 +01:00
Dom Dwyer b15aebbddc
feat(deferred_load): peek() immediate values
Adds a peek() method to the DeferredLoad construct, allowing a caller to
immediately read the resolved value, or "None" if the value is
unresolved or concurrently resolving.

This allows a caller to optimistically read the value without having to
block and wait for it to become available.
2022-12-15 14:33:44 +01:00
Dom Dwyer e76b107332
feat(ingester2): persist back-pressure
This commit causes an ingester2 instance to stop accepting new writes
when at least one persist queue is full. Writes continue to be rejected
until the persist workers have processed enough outstanding persist
tasks to drain the queues to half of their capacity, at which point
writes are accepted again.

When a write is rejected, the ingester returns a "resource exhausted"
RPC code to the caller.

Checking if the system is in a healthy state for writes is extremely
cheap, as it is on the hot path for all writes.
2022-12-14 17:17:17 +01:00
Paul Dix d9c72bb93f
feat: optimize wal with batching (#6399)
* feat: optimize wal with batching

Simplified the wal writer so that it batches up write operations. Currently it waits 10ms between fsync calls. We can pull this out to a config variable later if we want, but I think this is good enough for now.

Also updated the reader to be a more simple blocking reader without the extra tasks and channels as that wasn't really getting us anything that I know of.

* chore: cleanup wal code for PR feedback
2022-12-14 16:07:20 +00:00
kodiakhq[bot] 66c610f7b1
Merge branch 'main' into cn/ingester-persisted-file-count 2022-12-14 14:58:31 +00:00
Carol (Nichols || Goulding) f29bed86c0
fix: Improve log messages and docs as suggested in code review
Co-authored-by: Dom <dom@itsallbroken.com>
2022-12-14 09:52:09 -05:00
Dom Dwyer 8f0da90d76
docs: remove ref to PersistActor
Fix bad reflink to something that no longer exists.
2022-12-13 16:59:15 +01:00
Dom Dwyer 309386b828
chore: silence spurious lint
This is by design! Clippy just doesn't see the plan.
2022-12-13 16:59:14 +01:00
Dom Dwyer 1da9b63cce
fix(ingester2): persist deadlock
Removes the submission queue from the persist fan-out, instead the
PersistHandle now carries the shared state internally (cheaply cloned
via ref counts).

This also resolves the persist deadlock when under load.
2022-12-13 16:47:45 +01:00
kodiakhq[bot] 9e8ae1485f
Merge branch 'main' into dom/wal-bench 2022-12-13 15:19:32 +00:00
Dom Dwyer 5fa4e49098
feat(ingester2): persist active & queue timings
Adds more debug logging to the persist code paths, as well as capturing
& logging (at INFO) timing information tracking the time a persist task
spends in the queue, the active time spent actually persisting the data,
and the total duration of time since the request was created (sum of
both durations).
2022-12-13 11:06:09 +01:00
dependabot[bot] e108a8b6c9
chore(deps): Bump paste from 1.0.9 to 1.0.10 (#6384)
Bumps [paste](https://github.com/dtolnay/paste) from 1.0.9 to 1.0.10.
- [Release notes](https://github.com/dtolnay/paste/releases)
- [Commits](https://github.com/dtolnay/paste/compare/1.0.9...1.0.10)

---
updated-dependencies:
- dependency-name: paste
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-13 06:03:05 +00:00
Carol (Nichols || Goulding) 1c7f322a4e
feat: Keep track of and report number of Parquet files persisted
Per partition and starting over each time the ingester restarts.

Fixes #6334.
2022-12-12 11:45:00 -05:00
Dom Dwyer 7c28a30d1b
test(ingester2): WAL replay benchmark
This adds a simple WAL replay benchmark to ingester2 that executes a
replay of a single line of LP.

Unfortunately each file in the benches directory is compiled as it's own
binary/crate, and as such is restricted to importing only "pub" types.
This sucks, as it requires you to either benchmark at a high level
(macro, not microbenchmarks - i.e. benchmarking the ingester startup,
not just the WAL replay) or you are forced to mark the reliant types &
functions as "pub", as well as all the other types/traits they reference
in their signatures. Because the performance sensitive code is usually
towards the lower end of the call stack, this can quickly lead to an
explosion of "pub" types causing a large amount of internal code to be
exported.

Instead this commit uses a middle-ground; benchmarked types & fns are
conditionally marked as "pub" iff the "benches" feature is enabled. This
prevents them from being visible by default, but allows the benchmark
function to call them.

The benchmark itself is also restricted to only run when this feature is
enabled.
2022-12-12 15:02:36 +01:00
Carol (Nichols || Goulding) 2fd2d05ef6
feat: Identify each run of an ingester with a Uuid
And send that UUID in the Flight response for queries to that ingester
run.

Fixes #6333.
2022-12-08 17:22:52 -05:00
Carol (Nichols || Goulding) 6014c10866
test: Enable running ingester2/router RPC write servers in e2e tests
Add configuration and server types to be able to create server fixtures
for them.
2022-12-08 17:22:52 -05:00
dependabot[bot] 1d38d400f0
chore(deps): Bump object_store from 0.5.1 to 0.5.2 (#6339)
* chore(deps): Bump object_store from 0.5.1 to 0.5.2

Bumps [object_store](https://github.com/apache/arrow-rs) from 0.5.1 to 0.5.2.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/object_store_0.5.1...object_store_0.5.2)

---
updated-dependencies:
- dependency-name: object_store
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-06 07:53:54 +00:00
Marco Neumann cd6a8a1a82
refactor: DF-driven on-demand mem limit instead of ahead-of-time heuristics (#6313)
* refactor: DF-driven on-demand mem limit instead of ahead-of-time heuristics

Closes #6310.

* refactor: rename and tune default exec mem limits

* fix: ingester2 bits after rebase
2022-12-05 12:38:28 +00:00
kodiakhq[bot] 228c81c6fb
Merge branch 'main' into dom/remove-comment 2022-12-02 17:58:33 +00:00
kodiakhq[bot] fff6cdf951
Merge branch 'main' into dom/drop-empty-wal 2022-12-02 17:49:47 +00:00
Dom Dwyer de52a24751
chore: remove old comment
The ingester will replay the ops, and then immediately trigger a
persist! This comment is done :)

Outstanding is actually deleting the replayed files, which will come
with the ref-counted WAL segment change later.
2022-12-02 18:44:45 +01:00
Dom Dwyer a8f99d4593
feat(ingester2): drop empty WAL files
A shutdown of an ingester that has received no writes leaves an empty
WAL file - these can be deleted at startup.
2022-12-02 18:43:32 +01:00
Dom Dwyer c3a88f105d
test: track_caller for make_write_op
Changes the stack trace to point at the call site when a panic occurs
within make_write_op, and adds a unique message per panic explaining the
issue.
2022-12-02 18:42:28 +01:00
Dom Dwyer 26bf54d041
feat(ingester2): drop WAL segments after persist
Changes the WAL rotation task to drop the WAL segment after the
partition data has completely persisted.

This logic contains a race condition outlined (at length) in the code
comments, along with a plan to resolve it once I'm back from holiday.
For now, the extremely low likelihood and minor impact of the race
occurring is likely acceptable for testing purposes.
2022-12-02 18:29:35 +01:00
Dom Dwyer f145d6415f
feat(ingester2): persist partitions on WAL rotate
Changes the WAL rotation code to cause the current set of partitions to
be submitted for persistence.

Because rotating the WAL segment and triggering persistence is not
atomic (not under an exclusive lock preventing writes) the persisted
buffers MAY contain writes that appear in the new segment file.

In the happy/non-crash path, this will have no effect, however if the
ingester crashes and replays the WAL files, these writes will be
duplicated into object storage (where compaction will resolve the issue)
- this seems like a good trade-off, allowing us to avoid blocking write
requests for as long as it takes to rotate the WAL and mark all
partitions as persisting.

(though currently WAL files are not dropped and thus everything is
replayed all the time.)
2022-12-02 17:18:39 +01:00
Dom Dwyer 66aab55534
feat(ingester2): run persistence task
Configures the initialisation of an ingester2 instance to spawn a
persistence task (currently unused) and plumbs in various configuration
parameters.
2022-12-02 17:18:39 +01:00
kodiakhq[bot] 1137f2fc7e
Merge branch 'main' into dom/buffer-tree-iter 2022-12-02 16:10:38 +00:00
Dom Dwyer 85a41e34df
refactor(ingester2): BufferTree partition iterator
Expose an iterator of PartitionData within a BufferTree.
2022-12-02 17:01:34 +01:00
Dom Dwyer f524687602
feat(ingester2): parallel partition persistence
Implements actor-based, parallel persistence in ingester2 with
controllable fan-out parallelism and queue depths.

This implementation encapsulates the complexity of persistence, queuing
and parallelism - the caller simply uses the handle to persist a
partition, while the actor handles fan-out to a set of persistence
workers, compaction in a separate thread-pool, and optional completion
notifications.

By consistently hashing persist jobs onto workers, parallelism is
achieved across partitions, but serialisation of partition persists is
enforced so that the sort key update is correctly serialised.
2022-12-02 16:34:03 +01:00
Dom e25920c7c0
Merge branch 'main' into dom/partition-queue 2022-12-02 15:18:44 +00:00
kodiakhq[bot] 9e3d0fcefb
Merge branch 'main' into cn/ingester2 2022-12-02 13:39:55 +00:00
Dom Dwyer 4f928fbd0c
perf(ingester2): partition persist queue
This commit swaps the existing single "persisting batch" slot (field) in
a PartitionData for an ordered queue of outstanding partitions.

This decouples marking a partition buffer for persistence from the
actual persistence operation, allowing them to proceed at different
rates. This reduces the complexity of persistence management, but also
allows us to gracefully handle "hot" partitions; for example, this
problematic scenario in the ingester(1) implementation during recovery:

    * Writes come into a partition, reaching a size/row/hotness limit
    * Partition is enqueued for persistence
    * More writes come into the new buffer, exceeding the same limit
    * Cannot persist the hot buffer because of outstanding persist op

Without this change the only possibilities in this situation are:

    * stop ingest for the partition and error all writes that
      (partially!) write to the partition, or
    * continue accepting writes, allowing the partition to exceed the
      limit that marked it for persistence in the first place

The latter is what the ingester(1) implementation does today, which
results in partitions exceeding their row/size/age limits, which exist
to limit the cost of generating the Parquet file from the buffer - this
is a significant contributor to instability during recovery.

This strategy enforces configured the limits on the partition buffer,
but does not block / slow down recovery while persistence is completed.
2022-12-02 14:09:50 +01:00
Dom Dwyer 9c170dd506
refactor(ingester2): namespace name in partition
Pushes the (ref-counted/shared) deferred NamespaceName resolver into the
child TableData and PartitionData nodes in the BufferTree.

This allows the PartitionData to provide the name of the namespace to
which it belongs, in addition to the existing table name, and all
relevant IDs.
2022-12-01 21:13:05 +01:00
Carol (Nichols || Goulding) a3bc31aa82
fix: Have ingester2 GrpcDelegate hold onto the catalog and metrics
So that they don't have to be passed in again later for the services.
2022-12-01 11:39:27 -05:00
Dom Dwyer 81619fd5b3
refactor(ingester2): clean-up rotation task
When an ingester2 instance goes out of scope, automatically stop the WAL
rotation task.
2022-12-01 16:03:53 +01:00
Dom Dwyer c7f88d779a
feat(ingester2): rotate wal periodically
Spawn a background task to rotate the WAL file with the given internal.
2022-12-01 16:03:52 +01:00
Dom Dwyer f40885d4ca
refactor(wal): remove needless async/await
Obtaining a rotation handle isn't async.
2022-12-01 16:03:52 +01:00
kodiakhq[bot] 76e500cb31
Merge branch 'main' into dom/wal-replay 2022-12-01 14:34:51 +00:00
Andrew Lamb 14a9bc92e9
Revert "Revert "chore: Update Datafusion and arrow/arrow-flight/parquet to `28.0.0` (#6279)" (#6294)" (#6296)
This reverts commit b7e52c0d8d.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-01 14:20:43 +00:00
Dom Dwyer cb248c75d5
feat(ingester2): WAL entry replay at startup
Replay the WAL log, if any, at startup.

Op replay is performed synchronously, during initialisation of the
ingester2 instance, and passes all ops through the "normal" write path
that the system uses once replay is complete, minus the WAL writer layer
- this helps to DRY the write path and minimise different behaviours.
2022-12-01 15:01:43 +01:00
Dom Dwyer 1be9ffb409
refactor(ingester2): cache more hot partitions
Now partition cache entries are smaller, the number of entries held in
memory can be increased - this now uses ~2MiB of memory and drains the
cache during execution, amortising to 0.
2022-12-01 13:45:19 +01:00
kodiakhq[bot] 047bcc6e7e
Merge branch 'main' into dom/remove-ordering 2022-12-01 12:37:54 +00:00
Andrew Lamb b7e52c0d8d
Revert "chore: Update Datafusion and arrow/arrow-flight/parquet to `28.0.0` (#6279)" (#6294)
This reverts commit 039a45ddd1.
2022-12-01 11:38:42 +00:00
Dom Dwyer 6639568665
refactor: remove unimplemented!()
This lets us keep the existing test coverage for the new impl.
2022-12-01 11:35:00 +01:00
Dom Dwyer 464242ebc6
refactor: remove ordering asserts, sequence ranges
This commit removes the invariant asserts of monotonicity carried over
from the "ingester" crate - ingester2 does not define any ordering of
writes within the system.

This commit also removes the SequenceNumberRange as it is no longer
useful to indirectly check the equality of two sets of ops -
non-monotonic writes means overlapping ranges does not guarantee a full
set of overlapping operations (gaps may be present). Likewise bounding
values (such as "min persisted") provide no useful meaning in an
out-of-order system.
2022-12-01 10:38:20 +01:00
Dom Dwyer 50d5e4a6f1
docs: arbitrarily reordering & sequence numbers
Document the arbitrary reordering of concurrent writes in an ingester,
and the potential divergence of WAL entries / buffered state.

Also documents that in ingester2, sequence numbers only identify writes,
not their ordering.
2022-12-01 10:38:20 +01:00
Dom Dwyer f34d4db994
docs: reference WAL cancellation safety ticket
Reference https://github.com/influxdata/influxdb_iox/issues/6281
2022-11-30 16:37:06 +01:00
Dom Dwyer d2a3d0920b
feat(ingester2): commit writes to write-ahead log
Adds WalSink, an implementation of the DmlSink trait that commits DML
operations to the write-ahead log before passing the DML op into the
decorated inner DmlSink chain.

By structuring the WAL logic as a decorator, the chain of inner DmlSink
handlers within it are only ever invoked after the WAL commit has
successfully completed, which keeps all the WAL commit code in a
singly-responsible component. This also lets us layer on the WAL commit
logic to the DML sink chain after replaying any existing WAL files,
avoiding a circular WAL mess.

The application-logic level WAL code abstracts over the underlying WAL
implementation & codec through the WalAppender trait. This decouples the
business logic from the WAL implementation both for testing purposes,
and for trying different WAL implementations in the future.
2022-11-30 16:37:05 +01:00
Andrew Lamb 039a45ddd1
chore: Update Datafusion and arrow/arrow-flight/parquet to `28.0.0` (#6279)
* chore: Update Datafusion and arrow/arrow-flight/parquet to `28.0.0`

* chore: Update thrift to 0.17

* fix: use workspace arrow-flight in ingester2

* chore: Update for API changes

* fix: test

* chore: Update hakari

* chore: Update hakari again

* chore: Update trace_exporters to latest thrift

* fix: update test

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-30 14:12:30 +00:00
Dom Dwyer 5f1635bfc5
feat(ingester2): sequence write ops
Sequence all gRPC write requests to (internally) order the resulting DML
operations.

These sequence numbers are assigned from a timestamp oracle and passed
through to the downstream DmlSink implementers.
2022-11-30 10:40:26 +01:00
Dom Dwyer ace4b7f669
feat: operation timestamp sequencer
Adds a TimestampOracle to provide an ingester-internal ordering to
incoming DmlOperations using a logical clock.
2022-11-30 10:40:22 +01:00
Dom Dwyer 9648207f01
feat(ingester2): initialise an ingester2 instance
Adds a public constructor to initialise an ingester2 instance.
2022-11-29 17:05:42 +01:00
Dom 0b1449e908
Merge branch 'main' into dom/buffer-tree-query 2022-11-29 15:06:41 +00:00
Andrew Lamb fc520e0c0f
refactor: Remove unecessary optimize_record_batch (#6262)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-29 13:35:46 +00:00
Dom Dwyer 95216055d8
perf(ingester2): stream BufferTree partition data
This commit implements the QueryExec trait for the BufferTree, allow it
to be queried for the partition data it contains. With this change, the
BufferTree now provides "read your writes" functionality.

Notably the implementation streams the contents of individual partitions
to the caller on demand (pull-based execution), deferring acquiring the
partition lock until actually necessary and minimising the duration of
time a strong reference to a specific RecordBatch is held in order to
minimise the memory overhead.

During query execution a client sees a consistent snapshot of
partitions: once a client begins streaming the query response, incoming
writes that create new partitions do not become visible. However
incoming writes to an existing partition that forms part of the snapshot
set become visible iff they are ordered before the acquisition of the
partition lock when streaming that partition data to the client.
2022-11-29 12:01:47 +01:00
Dom Dwyer de6f0468d8
refactor: associated QueryExec return type
Allow the return type of the QueryExec trait's query_exec() method to be
parametrised by the implementer.

This allows the trait to be reused across different data sources that
return differing concrete types.
2022-11-29 12:01:36 +01:00
Dom Dwyer 1a379f5f16
perf(ingester2): streaming query data sourcing
Changes the query code (taken from the ingester crate) to stream data
for query execution, tidy up unnecessary Result types and removing
unnecessary indirection/boxing.

Previously the query data sourcing would collect the set of RecordBatch
for a query response during execution, prior to sending the data to the
caller. Any data that was dropped or modified during this time meant the
underlying ref-counted data could not be released from memory until all
outstanding queries referencing it completed. When faced with multiple
concurrent queries and ongoing ingest, this meant multiple copies of
data could be held in memory at any one time.

After this commit, data is streamed to the user, minimising the duration
of time a reference to specific partition data is held, and therefore
eliminating the memory overhead of holding onto all the data necessary
for a query for as long as the client takes to read the data.

When combined with an upcoming PR to stream RecordBatch out of the
BufferTree, this should provide performant query execution with minimal
memory overhead, even for a maliciously slow reading client.
2022-11-29 11:08:06 +01:00
Dom Dwyer 2ed9780f6b
refactor(ingester2): explicit PartitionStream type
Simplify the streaming types by introducing explicitly named wrappers to
improve visibility.
2022-11-29 11:08:02 +01:00
dependabot[bot] b5aa39db4b
chore(deps): Bump tonic from 0.8.2 to 0.8.3 (#6249)
Bumps [tonic](https://github.com/hyperium/tonic) from 0.8.2 to 0.8.3.
- [Release notes](https://github.com/hyperium/tonic/releases)
- [Changelog](https://github.com/hyperium/tonic/blob/master/CHANGELOG.md)
- [Commits](https://github.com/hyperium/tonic/compare/v0.8.2...v0.8.3)

---
updated-dependencies:
- dependency-name: tonic
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-29 09:53:30 +00:00
Dom Dwyer 443ec49f24
feat: query exec tracing spans
Implement a QueryExec decorator that emits named tracing spans covering
the inner delegate's query_exec() execution.

Captures the result, emitting the error string in the span on failure.
2022-11-28 13:29:18 +01:00
Dom Dwyer a54326d1ae
refactor: rename Response -> QueryResponse
Both more descriptive, and less conflict-y! This seems like a more
sensible name for a system with many Response's.
2022-11-28 13:29:18 +01:00
Dom Dwyer a0ab78298f
feat(ingester2): gRPC methods & type-erased init
This commit implements the gRPC direct-write RPC interface (largely
copied from the ingester crate), and adds a much improved RPC query
handler.

Compared to the ingester crate, the query API is now split into two
defined halves - the API handler side, and types necessary to support it
(server/grpc/query.rs) and the Ingester query execution side (a stub in
query/exec.rs). These two halves maintain a separation of concerns, and
are interfaced by an abstract QueryExec trait (in query/trait.rs).

I also added the catalog RPC interface as it is currently exposed on the
ingester, though I am unsure if it is used by anything.

This commit also introduces the "init" module, and the
IngesterRpcInterface trait within it. This trait forms the public
ingester2 crate API, defining the complete set of methods external
crates can expect to utilise in a stable, unchanging and decoupled way.

The IngesterRpcInterface trait also serves as a method of type-erasure
on the underlying handler implementations, avoiding the need to
expose/pub the types, abstractions, and internal implementation details
of the ingester to external crates.
2022-11-25 12:40:01 +01:00
Dom Dwyer 522ae6c2a3
docs: document possible FSM panic
Document a possible panic if the data within a partition FSM cannot be
converted to an Arrow RecordBatch.
2022-11-25 10:46:44 +01:00
CircleCI[bot] 44eeab7e2b chore: Run cargo hakari tasks 2022-11-24 14:51:21 +00:00
Dom Dwyer a66fc0b645
feat(ingester): ingester2 init
Adds an ingester2 crate to hold the MVP of the Kafkaless project.

This was necessary due to the tight coupling of the ingester internals
with tests in external crates, and eases the parallel development of two
version of the ingester.

This commit contains various changes from the "ingester" crate, mostly
removing the concept/references to a "shard" or "ShardId" where
possible.

This commit does not copy over all of the "ingester" crate - only those
components that are definitely needed. I will drag across more as
functionality is implemented.
2022-11-24 15:34:02 +01:00