* refactor: use new ingester<>querier wire protocol
Use and document the new and more flexible ingester<>querier wire
protocol.
Note that the ingester does NOT stream the response data yet, but the
internal data structures would allow that. A follow-up change will
adjust the ingester code to stream the data.
Ref #4849.
* fix: typos
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: clarify naming and public interface
* test: add schema assertion to `ingester_response_to_record_batches`
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: prepare new ingester<>querier protocol on the querier side
This changes the querier internals to work with the new protocol. The
wire protocol stays the same (for now). There's a (somewhat hackish)
adapter in place on the querier side that converts the old to the new
protocol on-the-fly. This is an intermediate step before we actually
change the wire protocol (and in a step after that also take advantage
of the new possibilities on the ingester side).
Ref #4849.
* docs: explain adapter
* chore: TEMP Update DataFusion to pre-release
* chore: update arrow et al to 16.0.0
* chore: Run cargo hakari tasks
* fix: update reader read_dictionary API
* chore: Update to real Datafusion release
* fix: Update parquet API
* fix: update test
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
This commit changes the code base to use a new reference-counted
PartitionKey type wrapper instead of passing a bare String around.
This allows the compiler to type check & verify usage of the partition
key. By reference counting the underlying string, we also reduce memory
usage for some use cases.
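A rough sketch of what such a reference-counted key wrapper could look like (the trait impls and names here are illustrative, not the exact IOx type):

```rust
use std::sync::Arc;

/// Illustrative sketch of a reference-counted partition key newtype.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PartitionKey(Arc<str>);

impl From<&str> for PartitionKey {
    fn from(s: &str) -> Self {
        Self(Arc::from(s))
    }
}

impl std::fmt::Display for PartitionKey {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str(&self.0)
    }
}

fn main() {
    let key = PartitionKey::from("2022-06-07");
    // Cloning only bumps a refcount instead of copying the string bytes.
    let cheap_copy = key.clone();
    assert_eq!(key, cheap_copy);
    println!("{key}");
}
```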
* feat: Change data type of catalog Postgres partition's sort_key from a string to an array of string
* test: add column with comma
* fix: use new protobuf field to avoid incompatibility
* fix: ensure sort_key is an empty array rather than NULL
* refactor: address review comments
* refactor: address more comments
* chore: clearer comments
* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql
* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql
* fix: Rename migration so it will be applied after
Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
* fix: do not return readable until a write is completely readable
* docs: Add diagram with partially buffered write
* refactor: account for actively buffering during update rather than fixup
* fix: fixup
* fix: use checked_sub
Co-authored-by: Marco Neumann <marco@crepererum.net>
* fix: checked_sub calculation
Co-authored-by: Marco Neumann <marco@crepererum.net>
Reduces memory usage in the ingester during persist operations by
streaming the results of the snapshot merge/sort/dedupe directly to
the parquet file.
Prior to this commit the output of the compaction was buffered in memory
before being written to the parquet file.
* test: "optimize" ingesterrecord batches in query tests
It seems that I had the right idea in #4656 but wasn't able to trigger
https://github.com/influxdata/conductor/issues/955 because the query
tests do not "optimize" the record batches in the same way the actual
gRPC implementation does. If we apply the same transformation we indeed
end up with the same error.
* fix: all batches within the ingester flight response must have same schema
* refactor: simplify and reuse code
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: enable debugging of failed querier->ingester requests
- extend `query-ingester` CLI to allow usage of predicates
- on failed requests: log all information that is required for the CLI
- test the "ingester fails" scenario
* test: explain
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: improve
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: move b64 pred. serde into a single crate
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Removes the min/max timestamp fields from the IoxMetadata proto
structure embedded within a Parquet file's metadata.
These values are redundant as they already exist within the Parquet
column statistics, and precluded streaming serialisation as these
removed min/max values were needed before serialising the file.
Remove the redundant row_count from the IoxMetadata structure that is
serialised into the Parquet file.
The reasoning is twofold:
* The Parquet file's native metadata already contains a row count
* Needing to know the number of rows up-front precludes streaming
Ok, so... this needed lots of... channels. Channels everywhere.
The stream method on TestWriteBufferStreamHandler previously assumed it
would only be called once. In a test where reset_to_earliest is called,
stream might be called again to get the reset stream.
We want to be able to control which of the streams gets which
operations, so that's why the macro now takes a vec of vec of
operations-- one vec of operations per expected call to stream, and the
stream will send all the operations in its vec.
The test thread needs to wait for the handler stream to consume the last
item from the last receiver stream, so when the
TestWriteBufferStreamHandler has set up the last expected call to
stream, pass back the last transmitter and have it wait until it's at
full expected capacity (which means all operations have been consumed by
the receiver).
The default behavior of the ingester is to panic if the min unpersisted
sequence number in the catalog is unknown to the write buffer due to the
retention policies having evicted that sequence number.
Specifying `--skip-to-oldest-available` changes this behavior to skip to
the oldest sequence number the write buffer does have available and go
from there.
Fixes #4624.
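A minimal sketch of that decision, with hypothetical names and types (the real ingester wiring is more involved):

```rust
/// Stand-in for the catalog / write buffer sequence number type.
#[derive(Debug, Clone, Copy)]
struct SequenceNumber(u64);

fn resolve_start_offset(
    min_unpersisted: SequenceNumber,
    oldest_available: SequenceNumber,
    skip_to_oldest_available: bool,
) -> SequenceNumber {
    if min_unpersisted.0 >= oldest_available.0 {
        // The catalog's min unpersisted sequence number is still retained
        // by the write buffer: start there as usual.
        min_unpersisted
    } else if skip_to_oldest_available {
        // Data was evicted by retention; skip forward instead of panicking.
        oldest_available
    } else {
        // Default behaviour: refuse to silently skip over lost data.
        panic!(
            "sequence number {} unknown to write buffer (oldest available: {})",
            min_unpersisted.0, oldest_available.0
        );
    }
}

fn main() {
    let start = resolve_start_offset(SequenceNumber(10), SequenceNumber(42), true);
    println!("starting replay at {}", start.0);
}
```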
Implements an upload() method on the ParquetStorage type, consuming a
stream of RecordBatch, serialising the Parquet file, and uploading the
result to object storage. Returns the IOx-specific file metadata.
Currently while the upload() method accepts a stream of RecordBatch, the
actual resulting Parquet file is buffered in memory before uploading to
object store, due to lack of streaming upload functionality in the
ObjectStore abstraction - this isn't the end of the world, as the files
tend to be relatively small with our current usage.
This impl should be easily modified to be fully streaming once streaming
object store puts are implemented:
https://github.com/influxdata/object_store_rs/issues/9
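A trimmed-down sketch of the upload() shape described above; the batch, metadata, and storage types below are stand-ins, not the real Arrow / object store APIs:

```rust
use futures::{stream, StreamExt};

/// Stand-in for arrow's RecordBatch; one byte per "row" for the sketch.
type RecordBatch = Vec<u8>;

#[derive(Debug)]
struct IoxMetadata {
    row_count: usize,
    file_size_bytes: usize,
}

struct ParquetStorage;

impl ParquetStorage {
    /// Drain the batch stream, serialise (here: concatenate) into an
    /// in-memory buffer, then "put" the buffer to the object store in one
    /// call. A fully streaming put would remove the intermediate buffer.
    async fn upload<S>(&self, mut batches: S) -> IoxMetadata
    where
        S: futures::Stream<Item = RecordBatch> + Unpin,
    {
        let mut buffer = Vec::new();
        let mut row_count = 0;
        while let Some(batch) = batches.next().await {
            row_count += batch.len();
            buffer.extend_from_slice(&batch);
        }
        // object_store.put(&path, buffer.into()).await would go here.
        IoxMetadata {
            row_count,
            file_size_bytes: buffer.len(),
        }
    }
}

#[tokio::main]
async fn main() {
    let batches = stream::iter(vec![vec![1u8, 2, 3], vec![4u8, 5]]);
    let meta = ParquetStorage.upload(batches).await;
    println!("{meta:?}");
}
```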
Changes the code paths that interact with Parquet files in the object
store to reference the ParquetStorage directly (DRY refactor).
This change takes us from a dependency graph of:
┌─────────────────┐
│ │
▼ │
Parquet Consumer │
│ ┌──────────────┐
├────────▶│ParquetStorage│
▼ └──────────────┘
┌──────────────┐
│ ObjectStore │
└──────────────┘
│
┌────┴────┐
▼ ▼
File s3
System (etc)
to:
Parquet Consumer
│
▼
┌──────────────┐
│ParquetStorage│
└──────────────┘
│
▼
┌──────────────┐
│ ObjectStore │
└──────────────┘
│
┌────┴────┐
▼ ▼
File s3
System (etc)
With the ParquetStorage being solely responsible for managing
interactions with the object store when dealing with Parquet files.
Renames the Storage type so the context is clear in usage (i.e. fn
args), rather than having to rely on knowing the fully-qualified import
path to know what the type stores.
* ci: fix cargo deny
* chore: downgrade `socket2`, version 0.4.5 was yanked
* chore: rename `query` to `iox_query`
`query` is already taken on crates.io and yanked and I am getting tired
of working around that.
Emit a TRACE level log containing the op offset & other helpful fields.
This will allow us to identify which messages were last successfully
decoded, and which caused errors so we can pull them for analysis.
Adds a histogram metric "flight_query_duration_ms" that records the
duration a flight RPC query takes to complete. Broken down by query
result (success/error).
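A toy sketch of the recording pattern (the real code uses the IOx `metric` crate, which is not shown here):

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hand-rolled stand-in for a duration histogram keyed by result label.
#[derive(Default)]
struct Histograms {
    samples_ms: HashMap<&'static str, Vec<u128>>,
}

impl Histograms {
    fn record(&mut self, result_label: &'static str, ms: u128) {
        self.samples_ms.entry(result_label).or_default().push(ms);
    }
}

fn run_flight_query() -> Result<usize, String> {
    Ok(3) // pretend three record batches were returned
}

fn main() {
    let mut hist = Histograms::default();

    let start = Instant::now();
    let result = run_flight_query();
    let elapsed_ms = start.elapsed().as_millis();

    // Broken down by query result (success/error), as described above.
    let label = if result.is_ok() { "success" } else { "error" };
    hist.record(label, elapsed_ms);

    println!("{:?}", hist.samples_ms);
}
```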
These were found by iterating over all of the dependencies of each
Cargo.toml, then grepping that crate for the dependency's name. If it
didn't show up, I attempted to remove it.
I left a few dependencies that this process flagged:
* generated_types
- `pbjson`, `serde`. Apparently used by the generated code.
* grpc-router-test-gen
- `prost`. Apparently used by the generated code.
* influxdb_iox
- `heappy`. Doesn't appear used, but is behind enough feature
flags that I don't care to reason about and it's already optional.
- `tikv_jemalloc_sys`. Appears to be setting a feature flag of an
indirect dependency.
* iox_gitops_adapter
- `k8s_openapi`. Appears to be setting a feature flag of an indirect
dependency.
* chore: Tool for automating arrow version update
* chore: Update datafusion and arrow/parquet/arrow-flight
* fix: update for changes in Arrow API
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: use stored sort key to deduplicate data
* refactor: verify if one is a super sort key of the other
* test: unit tests for scan and deduplication plans
* fix: typo
* refactor: refactor and add comments
* feat: cache partition sort key to read during planning as needed
* test: tests for query plans with different overlap groups
* chore: cleanup
* chore: resolve merge conflicts
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: improve `IngesterData` public interface
* feat: impl `Debug` for `Test{Namespace,Sequencer}`
* refactor: trait interface for `LifecyleHandle`
This is required to mock the lifecycle for query tests.
* refactor: trait for partitioner
* feat: add per kafka partition durability reporting to write info response
* fix: buf lint + test cleanup
* fix: clean up protobuf
* refactor: pull out conversion of KafkaPartitionStatus into a function
* fix: fmt
* fix: typo
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Attaching the "batch => partition" mapping via per-batch schema KV
metadata does NOT work because flight will transmit the schema once for
all batches (even though on the Rust side we have a schema ref attached
to every batch, probably for convenience). Instead we now use the same
global protobuf metadata that we also use for the "partition => max
sequence number" information. This somewhat limits our ability to create
record batches lazily on the ingester side (since the global metadata is
sent before any actual payload) but I think we should not modify the
usage of the flight protocol too much right now (e.g. by sending more
schema messages). If this becomes an issue, we can always find a more
complex solution in the future.
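For illustration only, a hypothetical and heavily simplified shape of such global metadata; the field names below are made up and not the actual protobuf:

```rust
/// The partition assignment for the whole batch stream travels once, up
/// front, instead of as per-batch schema key/value metadata (which Flight
/// only transmits once anyway).
#[derive(Debug)]
struct IngesterResponseMetadata {
    /// Partition IDs, in the order their record batches appear in the stream.
    batch_partition_ids: Vec<i64>,
    /// Per-partition max persisted sequence number.
    max_persisted: Vec<(i64, u64)>,
}

fn main() {
    let metadata = IngesterResponseMetadata {
        batch_partition_ids: vec![11, 11, 42],
        max_persisted: vec![(11, 1_000), (42, 998)],
    };
    // The n-th record batch in the Flight stream belongs to partition
    // metadata.batch_partition_ids[n].
    println!("{metadata:?}");
}
```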
* fix: return "not found" gRPC error instead of "internal" when ingester does not know table
* fix: properly handle "namespace not found" in ingester queries
* fix: make `initialize_db` work with async code
* test: add custom step for NG tests
* fix: handle "unknown table/namespace" resp. in querier
* docs: explain test setup
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: querier<>ingester flight protocol adjustments
This makes a few adjustments to the querier<>ingester flight protocol.
Query Scope
===========
The querier will request data for ALL sequencer IDs for now. There is
no reason to have a request per sequencer ID. We can add a range/set
filter later if we want, but this is not required for now.
Partition-level
===============
The only time when the querier cares about sequencer IDs (i.e. sharding)
at all is when it selects which ingesters to ask for unpersisted data
(this is currently not implemented, it just asks all ingesters).
Afterwards the querier only cares about partitions (which are bound to
specific sequencers anyways) because this is the level where parquet
file persistence and compaction as well as deduplication happen. So we
make partitions a first-class citizen in the ingester response.
Metadata VS RecordBatches
=========================
The global app-metadata will list all partitions and their max
persisted parquet files and tombstones (theoretically tombstones are at
table-level, but the ingester could in the future break them down to the
partition-level). Then it receives a stream of record batches. Each
record batch is tagged (via key-value metadata in its schema) so it can
be assigned to a partition. At the moment the ingester returns 0 or 1
batches per unpersisted partition (0 in case we've filtered out all the
data via the predicate), but in the future it is free to return multiple
batches. This setup gives the ingester more freedom over memory
management and (potentially parallel) query processing, while at the
same time keeps the set of duplicated information minimal and allows
easy extensions (since the global metadata is a full-blown protobuf
message).
Querier
=======
At the moment the querier ignores all the metadata. Follow-up PRs will
change that.
* docs: improve
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: make code clearer
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Removes the old stream_in_sequenced_entries() write buffer handler,
replacing it with the SequencedStreamHandler introduced in #4203.
This change will affect the metrics emitted by an ingester as outlined
in #4243.
Removes the Sync bound from the SequencedStreamHandler input stream type, as the
BoxStream returned by the WriteBufferStreamHandler is not Sync.
This change means the SequencedStreamHandler is not Sync either, but is
still Send and therefore can be moved into tokio tasks.
This commit adds an adaptor (IngestSinkAdaptor) that provides a DmlSink
implementation for the existing write path (IngesterData). With this,
the existing write path becomes compatible with the new
op stream handler (SequencedStreamHandler).
This commit adds the SinkInstrumentation type that decorates an inner
DmlSink with call latency and write buffer metrics.
The write buffer / sink call metrics may be split apart into two
separate responsibilities in the future if there are multiple DmlSink
implementations that need instrumentation, but we defer adding more
types until they are needed.
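A condensed sketch of that adaptor/decorator layering; the type names follow the description above, but the trait shape and bodies are illustrative only:

```rust
use std::time::Instant;

/// Simplified stand-in for the DmlSink trait.
trait DmlSink {
    fn apply(&self, op: &str) -> Result<(), String>;
}

/// Stand-in for the existing write path (`IngesterData`).
struct IngesterData;

/// IngestSinkAdaptor: makes the existing write path look like a DmlSink.
struct IngestSinkAdaptor {
    inner: IngesterData,
}

impl DmlSink for IngestSinkAdaptor {
    fn apply(&self, op: &str) -> Result<(), String> {
        let _ = &self.inner;
        println!("buffering op: {op}");
        Ok(())
    }
}

/// SinkInstrumentation: wraps any DmlSink and records call latency.
struct SinkInstrumentation<T> {
    inner: T,
}

impl<T: DmlSink> DmlSink for SinkInstrumentation<T> {
    fn apply(&self, op: &str) -> Result<(), String> {
        let start = Instant::now();
        let res = self.inner.apply(op);
        println!("sink call took {:?}", start.elapsed());
        res
    }
}

fn main() {
    let sink = SinkInstrumentation {
        inner: IngestSinkAdaptor { inner: IngesterData },
    };
    sink.apply("write table=cpu ...").unwrap();
}
```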
* feat: Add `SequencerProgress` reporting to ingester
* refactor: Use KafkaPartition in write_summary
* fix: Update docstrings
* refactor: Change ingester to use KafkaPartition everywhere
* refactor: add SequencerProgress::combine
* refactor: return new SequencerProgress rather than updating
* fix: distinguish between yes/no/unknown in WriteSummary
* docs: Update data_types2/src/lib.rs
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Pass the sort key from the catalog through to compact_persisting_batch.
If the sort key is Some, use that. If the sort key is None, compute it
from the data's cardinality with compute_sort_key.
Connects to #4196.
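A small sketch of that decision; `compute_sort_key` below is a placeholder for the real cardinality-based computation:

```rust
/// Placeholder: the real code orders columns by ascending cardinality,
/// with the time column last.
fn compute_sort_key(columns: &[&str]) -> Vec<String> {
    columns.iter().map(|c| c.to_string()).collect()
}

fn sort_key_for_persist(
    catalog_sort_key: Option<Vec<String>>,
    columns: &[&str],
) -> Vec<String> {
    match catalog_sort_key {
        // The catalog already knows the partition's sort key: use it.
        Some(key) => key,
        // Otherwise derive one from the data's cardinality.
        None => compute_sort_key(columns),
    }
}

fn main() {
    let key = sort_key_for_persist(None, &["host", "region", "time"]);
    println!("{key:?}");
}
```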
Adds the PeriodicWatermarkFetcher type responsible for querying write
buffer / Kafka for the maximum sequence number / offset, surfacing any
errors via both logs & metrics.
This high watermark / max offset value is used within the ingest
instrumentation metrics. This use case is tolerant of caching / stale
values, and as such the value is periodically updated to minimise load
on the write buffer.
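A rough sketch of the fetch-and-cache pattern (the real fetcher talks to the write buffer and reports errors via logs & metrics; the fetch below is faked):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

/// Poll the write buffer on an interval and cache the last observed high
/// watermark so readers never block on Kafka.
struct PeriodicWatermarkFetcher {
    cached: Arc<AtomicU64>,
}

impl PeriodicWatermarkFetcher {
    fn new(interval: Duration) -> Self {
        let cached = Arc::new(AtomicU64::new(0));
        let cloned = Arc::clone(&cached);
        tokio::spawn(async move {
            loop {
                // Stand-in for querying the write buffer's max offset.
                let watermark = 42;
                cloned.store(watermark, Ordering::Relaxed);
                tokio::time::sleep(interval).await;
            }
        });
        Self { cached }
    }

    /// Possibly stale, but cheap to read.
    fn watermark(&self) -> u64 {
        self.cached.load(Ordering::Relaxed)
    }
}

#[tokio::main]
async fn main() {
    let fetcher = PeriodicWatermarkFetcher::new(Duration::from_secs(10));
    tokio::time::sleep(Duration::from_millis(10)).await;
    println!("cached watermark: {}", fetcher.watermark());
}
```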
Instruments the SequencedStreamHandler with a series of new metrics that
record the various error classes observable in the stream handler.
These metrics are labelled with potential_data_loss=true where relevant
to surface potential data loss events for alerting & further review.
Refactors the stream_in_sequenced_entries() into a new impl in the
SequencedStreamHandler type, decoupling the reading / decoding of ops
from Kafka (and associated error handling) from the "what happens to
those ops" concern to ease testing, encapsulate the specifics of "how to
get an op" and improve flexibility.
This is intended to provide robust error handling within what is
reasonably possible (unexpected errors are always unexpected!) while
retaining the existing metrics and functionality. I've also separated
out code that exists in the current impl specifically to drive tests
from the prod code path, instead driving those behaviours through mocks.
As of this commit, the handler is not used - this commit simply adds the
new impl.
Fix the ingester to track the max persisted sequence number per partition.
Ensure replay takes in data from unpersisted partitions.
Simplify the table persist info to not return a max persisted sequence number for the table as that information isn't needed.
Min/max values and distinct counts are already optional, so let's make
the null counts optional as well. This will be helpful for NG to deal w/
partial statistics (e.g. we only populate stats for the time column).
Note that the total count is still mandatory, but we normally have the
chunk/file-level row count at hand.
Set to_delete to the time the file was marked as deleted rather than
true.
Fixes #4059.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The sort key is optional and currently only produced by `iox_tests`.
Writing it within the ingester/compactor is tracked by #3968. The sort
key is read by the querier (and this will be verified by the query tests
and is required to merge #4103).
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Removed some unnecessary tests as they no longer apply with the new buffer structure. This will hopefully reduce the memory footprint of the ingesters significantly.
Closes #4072
This makes it way easier to dyn-type database implementations. The only
real change is that we make `QueryChunk::Error` opaque. Nobody is going
to inspect that anyways, it's just printed to the user.
This is a follow-up of #4053.
Ref #3934.
Emit a counter metric "ingest_paused_duration_ms_total" that records the
duration of time an ingester stream is paused with millisecond
granularity.
This metric will allow us to measure the frequency and severity of, and
alert on, an ingester stopping ingest due to memory limits enforced by
the LifecycleManager. This will help us tune these config params.
Changes all consumers of the object store to use the dynamically
dispatched DynObjectStore type, instead of using a hardcoded concrete
implementation type.
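A minimal sketch of the dynamic-dispatch pattern; the real `ObjectStore` trait lives in the object_store crate and is async, unlike this stand-in:

```rust
use std::sync::Arc;

/// Stand-in for the object_store trait.
trait ObjectStore: Send + Sync {
    fn name(&self) -> &'static str;
}

/// Alias consumers depend on instead of a concrete implementation type.
type DynObjectStore = dyn ObjectStore;

struct InMemory;
impl ObjectStore for InMemory {
    fn name(&self) -> &'static str {
        "memory"
    }
}

fn consumer(store: Arc<DynObjectStore>) {
    println!("using object store: {}", store.name());
}

fn main() {
    let store: Arc<DynObjectStore> = Arc::new(InMemory);
    consumer(store);
}
```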
* feat: initial implementation of compact a given list of overlapped parquet files
* feat: Add QueryableParquetChunk and some refactoring
* feat: build queryable parquet chunks for parquet files with tombstones
* feat: second half the implementation for Compactor's compact. Tests will be next
* fix: comments for trait functions of QueryChunkMeta
* test: add tests for compactor's compact function
* fix: typos
* refactor: address Jake's review comments
* refactor: address Andrew's comments and add one more test for files in different order in the vector
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
When created in the catalog, parquet files should always have compaction
level 0. Updating the compaction level should always happen in the
compactor.
Only the catalog should need to know about the initial compaction level
value.
This has the advantages of:
- Not needing to create fake parquet file IDs or fake deleted_at
values that aren't used by create before insertion
- Not needing too many arguments for create
- Naming the arguments so it's easier to see what value is what
argument, especially in tests
- Easier to reuse arguments or parts of arguments by using copies of
params, which makes it easier to see differences, especially in tests
This commit splits the API of the LifecycleManager into two:
* LifecycleManager: singleton responsible for evaluating partitions
and running persist tasks.
* LifecycleHandle: a handle used by each sequencer's ingester(s) to update
the global LifecycleManager state when applying ops.
This keeps the accessible API & responsibilities of each caller distinct
and allows us to leverage the type system to enforce linearisation of
calls to LifecycleManager::maybe_persist() without resorting to an
(unnecessary) mutex guard for serialisation.
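A sketch of that split with illustrative internals (the real manager tracks per-partition state rather than a single byte counter):

```rust
use std::sync::{Arc, Mutex};

/// Shared state updated by handles and read by the manager.
#[derive(Default)]
struct LifecycleState {
    bytes_buffered: usize,
}

/// Cheap, cloneable handle: one per sequencer's ingest path.
#[derive(Clone)]
struct LifecycleHandle {
    state: Arc<Mutex<LifecycleState>>,
}

impl LifecycleHandle {
    /// Called when applying an op.
    fn log_write(&self, bytes: usize) {
        self.state.lock().unwrap().bytes_buffered += bytes;
    }
}

/// Singleton: only the manager evaluates partitions and triggers persists.
struct LifecycleManager {
    state: Arc<Mutex<LifecycleState>>,
    persist_threshold: usize,
}

impl LifecycleManager {
    fn new(persist_threshold: usize) -> Self {
        Self {
            state: Arc::default(),
            persist_threshold,
        }
    }

    fn handle(&self) -> LifecycleHandle {
        LifecycleHandle {
            state: Arc::clone(&self.state),
        }
    }

    /// Taking `&mut self` means callers cannot overlap maybe_persist() calls.
    fn maybe_persist(&mut self) -> bool {
        let mut state = self.state.lock().unwrap();
        if state.bytes_buffered >= self.persist_threshold {
            state.bytes_buffered = 0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut manager = LifecycleManager::new(1024);
    let handle = manager.handle();
    handle.log_write(2048);
    assert!(manager.maybe_persist());
}
```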
This includes a bit of a refactor in the locking structure of the buffer data. Locking at the partition collection and within the partition data was making things more complex than they needed to be. The partitions in the buffer are there only temporarily until they get persisted. Locking on the table simplifies things a bit and makes it more clear when the table state is being modified since it no longer has any interior mutability. Having access to separate partitions without the same lock isn't something we need because queries will hit all partitions and data is brought in sequentially, regardless of which partition it is hitting in a sequencer.
Fixes #3850
Uses the new ColumnRepo::create_or_get_many() catalog method to perform
a bulk upsert of (potentially) new columns to the catalog during schema
validation.
* refactor: wire execution context to Deduplicator
* feat: example trace to chunk read_filter
* refactor: make execution context required
* refactor: expose metadata API
* refactor: more span context for chunk read_filter
* refactor: fix build
* refactor: push context into result stream
* refactor: make executor optional
* feat: detach dedicated exec jobs
* feat: async `DedicatedExecutor::join`
Now `DedicatedExecutor` follows the system we use for other server
components:
- `shutdown`: a quick sync call that signals the shutdown but doesn't
drop
- `join`: async awaits until the executor has finished shutdown
- `drop`: warn but still try to shut down
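A rough sketch of those three behaviours, using a oneshot channel as the shutdown signal; the real DedicatedExecutor drives a separate thread/runtime, which is omitted here:

```rust
use std::sync::{Arc, Mutex};
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

struct DedicatedExecutor {
    shutdown_tx: Mutex<Option<oneshot::Sender<()>>>,
    worker: Mutex<Option<JoinHandle<()>>>,
}

impl DedicatedExecutor {
    fn new() -> Arc<Self> {
        let (tx, rx) = oneshot::channel();
        let worker = tokio::spawn(async move {
            // Run until the shutdown signal arrives.
            let _ = rx.await;
        });
        Arc::new(Self {
            shutdown_tx: Mutex::new(Some(tx)),
            worker: Mutex::new(Some(worker)),
        })
    }

    /// Quick, synchronous: only signals shutdown, does not wait.
    fn shutdown(&self) {
        if let Some(tx) = self.shutdown_tx.lock().unwrap().take() {
            let _ = tx.send(());
        }
    }

    /// Async: waits until the executor has actually finished shutting down.
    async fn join(&self) {
        let handle = self.worker.lock().unwrap().take();
        if let Some(handle) = handle {
            let _ = handle.await;
        }
    }
}

impl Drop for DedicatedExecutor {
    fn drop(&mut self) {
        if self.shutdown_tx.lock().unwrap().is_some() {
            // Dropped without an explicit shutdown: warn, but still signal.
            eprintln!("DedicatedExecutor dropped without shutdown");
            self.shutdown();
        }
    }
}

#[tokio::main]
async fn main() {
    let exec = DedicatedExecutor::new();
    exec.shutdown();
    exec.join().await;
}
```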
* test: improve `detach_receiver` test
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: changes needed to apply tombstones correctly on the life-cycle ingest batches
* refactor: adjust the design after discussing with Paul
* feat: apply the incoming tombstone on all data but the persisting one
* chore: fmt
* fix: build on buffer tombstone
* test: delete & write tests for a parition and some cleanup
* feat: No need to add processed tombstones for a newly created parquet file in the ingester because all deletes issued before that parquet file was created have already been applied
* chore: cleanup
* feat: initial implementation for preparing data to send back to the Querier
* feat: full implementation of prepare_data_to_querier
* fix: apply filters for the batches
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* chore: cleanup
* fix: typos in comments
* fix: typos in comments
* fix: typos in comments
* test: create different scenarios and test them
* chore: fix typos
* test: add tests with deletes
* chore: make pub pub(crate)
* chore: Apply suggestions from code review
Co-authored-by: Jake Goulding <jake.goulding@integer32.com>
* refactor: address review comments
* fix: keep batches in their arrival order
* refactor: do not assign unnecessary values to enum
* refactor: use bitflags enum
* fix: use bitflags correctly
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: avoid using `use` at the end of the function
* chore: merge main to branch
* fix: fix downgrade versions
* refactor: address review comments
* chore: remove unnecessary comments
* refactor: Make the whole test_utils module test-only and bring paths into module scope
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: Jake Goulding <jake.goulding@integer32.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
I'm seeing some panics in our test bench, but the ingester happily
continues and thinks it persisted tasks even though it didn't. Let's at
least bail out if a persist task fails.
A quick change to perform the ColumnRepo::create_or_get() calls in
parallel (up to a maximum of 3 in-flight at any one time) in order to
mitigate the latency of the call and reduce the overall schema
validation call duration.
The in-flight limit is enforced to avoid starving the DB connection pool
of connections.
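A sketch of that bounded-concurrency pattern; the catalog call below is a stand-in, not the real ColumnRepo API:

```rust
use futures::{stream, StreamExt};

/// Stand-in for the per-column catalog round trip.
async fn create_or_get(column: &str) -> Result<u64, String> {
    Ok(column.len() as u64)
}

/// Issue the per-column calls in parallel, but never more than 3 in flight,
/// so the DB connection pool is not starved.
async fn upsert_columns(columns: Vec<&str>) -> Result<Vec<u64>, String> {
    const MAX_IN_FLIGHT: usize = 3;

    stream::iter(columns)
        .map(|column| async move { create_or_get(column).await })
        .buffered(MAX_IN_FLIGHT) // preserves input order
        .collect::<Vec<_>>()
        .await
        .into_iter()
        .collect()
}

#[tokio::main]
async fn main() {
    let ids = upsert_columns(vec!["host", "region", "time"]).await.unwrap();
    println!("{ids:?}");
}
```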
It's a bit of a duck-type hack, but if we wanna just reuse `ParquetFileChunk`
in the new architecture, we somehow need it to accept new-gen paths.
Also path handling should be somewhat centralized since
ingester/compactor/querier all need to construct them. So having a
`ParquetFilePath` that supports both path styles seems to be a
not-too-bad solution. This should obviously be cleaned up in some
not-too-distant future.
Fixes #3702. This pulls the min sequence tracking into the LifecycleManager. Because the number requires looking at all other partitions in memory, this was the most efficient place to put it. The manager updates the sequencer state after it calls persist. The number is meant to be a lower bound on the sequence number. Issue #3783 will add functionality for the ingester to ignore replayed data that has already been persisted.
* feat: changes needed to apply tombstones correctly on the life-cycle ingest batches
* refactor: adjust the design after discussing with Paul
* feat: apply the incoming tombstone on all data but the persisting one
* chore: fmt
* fix: build on buffer tombstone
* test: delete & write tests for a parition and some cleanup
* feat: No need to add processed tombstones for a newly created parquet file in the ingester because all deletes issued before that parquet file was created have already been applied
* chore: cleanup
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: Adjust fields of IngesterQueryResponse
* feat: Adjust IngestHandler query method to call prepare_data_to_querier
* feat: Send ingest query result data back through Flight doGet
* feat: Send delete predicates and max sequencer number in metadata
* fix: greater_than_sequence_number should be of type SequenceNumber
* fix: Remove DeletePredicates from IngesterQueryResponse
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This adds persistence into the ingester with a lifecycle manager. The persist operation must still be updated to keep track of the min_unpersisted_sequence_number for each sequencer.
* feat: initial implementation of the Query Plan that queries QueryableBatch with filters
* fix: read_filter of QueryableBatch should provide the schema of the columns/projection it needs
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: address review comment
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: allow catalog access w/o a transaction
Now the caller has full control over whether they want to use a transaction or
not.
* fix: remove non-transaction-safe `create_many`
* fix: remove unnecessary transactions
* feat: projection pushdown for QueryableBatch
* chore: clean up and remove unwrap
* fix: Add Sync to a Snafu source to have the code compile
* chore: cleanup and add comments for tests
* refactor: Add tests for scanning non existing columns and fix related bugs
* chore: modify comment to trigger auto check in GitHub workflow
* feat: Add a way to run ingester with an in-memory catalog from the CLI
If you set the --catalog-dsn string to "mem", rather than using that as
a Postgres connection URL, create an in-memory catalog.
Planning on using this in tests, so not documenting.
* fix: Set default topic to the same value as SHARED_KAFKA_TOPIC
Namely, both should use an underscore. I don't think there's a way to
directly share these values between a constant and an annotation.
* feat: Add a flight API (handshake only) to ingester
* fix: Create partitions if using file-based write buffer
* fix: Change the server fixture to handle ingester server type
For now, the ingester doesn't implement the deployment API. Not sure if
it should or not.
* feat: Start implementing ingester do_get, namely decoding the query
Skip serialization of the predicate for the moment.
* refactor: Rename ingest protos to ingester to match crate name
* refactor: Rename QueryResults to QueryData
* feat: Move ingester flight client to new querier crate
* fix: Off by one error, different starting indexes in sequencers
* fix: Create new CLI argument to pick the catalog type
* fix: Create a CLI option to set the number of topics to auto-create in the write buffer
* fix: Check the arrow flight service's health to tell that the ingester gRPC is up
* fix: Set postgres as the default catalog type
* fix: Return an error rather than panicking if CLI args aren't right
This adds the lifecycle manager to the ingester. It will trigger based on a threshold for max partition size or age or based on keeping total memory under a certain threshold.
It defines a new interface for a persister, which is stubbed out for IngesterData. I'm not sure yet how persistence errors should be handled. The assumption here is that the persister continues to retry persistence forever until it succeeds.
There is one scenario I can think of that may cause this lifecycle manager problems. If a single partition is very high throughput, it could cause things to back up as persistence is not parallelized within a single partition. Any given partition can currently only run one persistence operation at a time. We can address this later.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: catalog Unit of Work (= transaction)
Set up an interface to handle Units of Work within our catalog. Previously
both the Postgres and the in-mem backend used "mini-transactions on
demand". Now the caller has a clear way to establish boundaries and
gets read and write isolation. A single `Arc<dyn Catalog>` can create as
many `Box<dyn UnitOfWork>` as you like, but note that depending on the
backend you may not scale infinitely (postgres will likely impose
certain limits and the in-mem backend limits concurrency to 1 to keep
things simple).
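A synchronous, simplified sketch of that catalog/transaction split (the real traits are async and fallible):

```rust
use std::sync::Arc;

/// Stand-in for the transaction (Unit of Work) interface.
trait Transaction {
    fn create_namespace(&mut self, name: &str);
    fn commit(self: Box<Self>);
    fn abort(self: Box<Self>);
}

/// A single shared catalog can hand out many boxed transactions,
/// each with its own read/write isolation.
trait Catalog: Send + Sync {
    fn start_transaction(&self) -> Box<dyn Transaction>;
}

struct MemTxn {
    staged: Vec<String>,
}

impl Transaction for MemTxn {
    fn create_namespace(&mut self, name: &str) {
        self.staged.push(name.to_string());
    }
    fn commit(self: Box<Self>) {
        println!("committing {} staged changes", self.staged.len());
    }
    fn abort(self: Box<Self>) {
        println!("dropping {} staged changes", self.staged.len());
    }
}

struct MemCatalog;

impl Catalog for MemCatalog {
    fn start_transaction(&self) -> Box<dyn Transaction> {
        Box::new(MemTxn { staged: Vec::new() })
    }
}

fn main() {
    let catalog: Arc<dyn Catalog> = Arc::new(MemCatalog);
    let mut txn = catalog.start_transaction();
    txn.create_namespace("my_org_my_bucket");
    txn.commit();
}
```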
* docs: improve wording
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: rename Unit of Work to Transaction
* test: improve `test_txn_isolation`
* feat: clarify transaction drop semantics
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
With this change write buffer ingestion metrics are showing up under
`/metrics`
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: improve write buffer consumer interface
The change looks huge but is actually rather simple. To
understand the interface change, let me first explain what we want:
- be able to fetch watermarks for any sequencer
- have streams:
- each stream tracks a sequencer and has an offset state (no read
multiplexing)
- we can seek a stream
- seeking and streaming cannot be done at the same time (that would be
weird and would likely lead to many bugs both in the write buffer and in the
user code)
- ideally we don't need to create streams of all sequencers but can
choose a subset
Before this change we had one mutable consumer struct where you can get
all streams and watermark functions (this mutable-borrows the consumer)
or you can seek a single stream (this also mutable-borrows the
consumer). This is a bit weird for multiple reasons:
- you cannot seek a single stream without dropping all of them
- the mutable-borrow construct makes it really difficult to pass the
streams into separate threads
- the consumer is boxed (because it's mutable) which makes it more
difficult to handle in a large-scale application
What this change does is the following:
- you have an immutable consumer (similar to the producer)
- the consumer offers the following methods:
- get the set of sequencer IDs
- get watermark for any sequencer
- get a stream handler (see next point) for any sequencer
- the stream handler captures the stream state (offset) and provides you
a standard `Stream<_>` interface as well as a seek function.
Mutable-borrows ensure that you cannot use both at the same time.
The stream handler provides you the stream via `handler.stream()`. It
doesn't implement `Stream<_>` itself because the way boxing, dynamic
dispatch work, and pinning interact (i.e. I couldn't get it to work
without the indirection).
As a bonus point (which we don't use however) you can now create
multiple streams for the same sequencer and they all have their own
offset.
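A condensed, synchronous sketch of that interface; the real consumer is async and the handler streams DML operations rather than plain offsets:

```rust
use std::collections::BTreeMap;

/// Immutable consumer (similar to the producer).
struct WriteBufferConsumer {
    // Watermark per sequencer, as a stand-in for talking to Kafka.
    watermarks: BTreeMap<u32, u64>,
}

/// Each handler owns its own offset state.
struct WriteBufferStreamHandler {
    sequencer_id: u32,
    offset: u64,
}

impl WriteBufferConsumer {
    fn sequencer_ids(&self) -> Vec<u32> {
        self.watermarks.keys().copied().collect()
    }

    /// Watermark lookups need only `&self`: the consumer stays immutable.
    fn fetch_watermark(&self, sequencer_id: u32) -> Option<u64> {
        self.watermarks.get(&sequencer_id).copied()
    }

    /// Multiple handlers per sequencer are possible and independent.
    fn stream_handler(&self, sequencer_id: u32) -> WriteBufferStreamHandler {
        WriteBufferStreamHandler { sequencer_id, offset: 0 }
    }
}

impl WriteBufferStreamHandler {
    /// Streaming mutable-borrows the handler; here the "stream" is just a
    /// few fake offsets.
    fn stream(&mut self) -> impl Iterator<Item = u64> {
        let start = self.offset;
        self.offset += 3;
        start..start + 3
    }

    /// Seeking also needs `&mut self`, keeping seek and stream mutually
    /// exclusive via the borrow checker.
    fn seek(&mut self, offset: u64) {
        self.offset = offset;
    }
}

fn main() {
    let consumer = WriteBufferConsumer {
        watermarks: BTreeMap::from([(0, 10), (1, 42)]),
    };
    println!("sequencers: {:?}", consumer.sequencer_ids());
    println!("watermark(1): {:?}", consumer.fetch_watermark(1));

    let mut handler = consumer.stream_handler(1);
    handler.seek(40);
    for op in handler.stream() {
        println!("sequencer {} -> offset {op}", handler.sequencer_id);
    }
}
```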
* fix: review comments
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This adds the scaffolding for the ingester server to consume data from Kafka. This ingests data in an in memory structure while creating records in the catalog for any partitions that don't yet exist.
I've removed catalog_update.rs in ingester for now. That was mostly a placeholder and will be going in a combination of handler.rs and data.rs on my next PR which will have some primitive lifecycle wired up.
There's one ugly bit here where the DML write is cloned because it's getting borrowed to output spans and metrics. I'll need to follow up with a refactor to make it so that the DML write's tables can be consumed without it gumming up the metrics stuff.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
If there aren't any record batches, there isn't any metadata, and vice
versa. Make this relationship clearer by putting the Option around both
the vec of record batches and the metadata.
* feat: Implement a snapshot method on DataBuffer
Fixes#3510.
* test: Add a test snapshotting batches with different but compatible schemas
* fix: Simplify min/max sequencer number collection
The first batch should always have the min sequencer number. The last
batch should always have the max sequencer number. The min should always
be less than (or equal to, in case there's only one batch) the max.
* refactor: have the deduplication work without chunk statistics
* test: more tests for duplicates data on different combinations of record batches
* refactor: address review comments
This updates the catalog API to make it easier to work with for consumers. I also found a bug in the MemCatalog implementation while refactoring the tests to work with the new API definition. Consumers will now be able to Arc wrap the catalog and use it across awaits.