* chore: remove references to perf_image in CI
* chore: adding gitops adapter image build in CI
* chore: gitops adapter bin now same as dir & package so docker build works
* fix: circle config package change after renaming gitops adapter package
* chore: update datafusion
* fix: Update to use new datafusion api
* chore: update expected plans
* fix: support zero output partitions
* fix: update test
* fix: Update for new DataFusion API
* fix: newly added system table
* fix: update cargo lock
* feat: Add a way to run ingester with an in-memory catalog from the CLI
If the --catalog-dsn string is set to "mem", an in-memory catalog is
created instead of treating the value as a Postgres connection URL.
This is intended for use in tests, so it isn't documented.
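A minimal sketch of what that DSN handling could look like; `MemCatalog`, `PostgresCatalog`, and `catalog_from_dsn` are illustrative names, not the actual IOx types:

```rust
// Hypothetical sketch of the --catalog-dsn handling; not the real IOx catalog types.
trait Catalog {}

struct MemCatalog;

#[allow(dead_code)]
struct PostgresCatalog {
    dsn: String,
}

impl Catalog for MemCatalog {}
impl Catalog for PostgresCatalog {}

fn catalog_from_dsn(dsn: &str) -> Box<dyn Catalog> {
    if dsn == "mem" {
        // The literal "mem" selects the in-memory catalog instead of being
        // treated as a Postgres connection URL.
        Box::new(MemCatalog)
    } else {
        Box::new(PostgresCatalog { dsn: dsn.to_string() })
    }
}

fn main() {
    // e.g. `ingester ... --catalog-dsn mem`
    let _catalog = catalog_from_dsn("mem");
}
```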
* fix: Set default topic to the same value as SHARED_KAFKA_TOPIC
Namely, both should use an underscore. I don't think there's a way to
directly share these values between a constant and an annotation.
* feat: Add a flight API (handshake only) to ingester
* fix: Create partitions if using file-based write buffer
* fix: Change the server fixture to handle ingester server type
For now, the ingester doesn't implement the deployment API. Not sure if
it should or not.
* feat: Start implementing ingester do_get, namely decoding the query
Skip serialization of the predicate for the moment.
* refactor: Rename ingest protos to ingester to match crate name
* refactor: Rename QueryResults to QueryData
* feat: Move ingester flight client to new querier crate
* fix: Off-by-one error, different starting indexes in sequencers
* fix: Create new CLI argument to pick the catalog type
* fix: Create a CLI option to set the number of topics to auto-create in the write buffer
* fix: Check the arrow flight service's health to tell that the ingester gRPC is up
* fix: Set postgres as the default catalog type
* fix: Return an error rather than panicking if CLI args aren't right
This adds the lifecycle manager to the ingester. It triggers persistence based on thresholds for maximum partition size or age, or in order to keep total memory use under a configured limit.
It defines a new interface for a persister, which is stubbed out for IngesterData. I'm not sure yet how persistence errors should be handled. The assumption here is that the persister continues to retry persistence forever until it succeeds.
There is one scenario I can think of that may cause this lifecycle manager problems. If a single partition is very high throughput, it could cause things to back up as persistence is not parallelized within a single partition. Any given partition can currently only run one persistence operation at a time. We can address this later.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
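A rough sketch of the trigger logic and persister interface described above, using made-up names (`Persister`, `PartitionState`, `LifecycleConfig`) and simplified synchronous code rather than the real ingester types:

```rust
use std::time::{Duration, Instant};

// Stand-in for the persister interface; the lifecycle manager assumes this
// keeps retrying internally until persistence succeeds.
trait Persister {
    fn persist(&self, partition_id: u64);
}

struct PartitionState {
    bytes: usize,
    first_write: Instant,
}

struct LifecycleConfig {
    max_partition_bytes: usize,
    max_partition_age: Duration,
    max_total_bytes: usize,
}

// Persist a partition when it is too big, too old, or when total memory
// use needs to be brought back under the configured limit.
fn should_persist(p: &PartitionState, total_bytes: usize, cfg: &LifecycleConfig) -> bool {
    p.bytes >= cfg.max_partition_bytes
        || p.first_write.elapsed() >= cfg.max_partition_age
        || total_bytes >= cfg.max_total_bytes
}

fn main() {
    let cfg = LifecycleConfig {
        max_partition_bytes: 300 * 1024 * 1024,
        max_partition_age: Duration::from_secs(30 * 60),
        max_total_bytes: 1024 * 1024 * 1024,
    };
    let p = PartitionState {
        bytes: 400 * 1024 * 1024,
        first_write: Instant::now(),
    };
    assert!(should_persist(&p, 400 * 1024 * 1024, &cfg));
}
```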
* refactor: catalog Unit of Work (= transaction)
Set up an interface to handle Units of Work within our catalog. Previously
both the Postgres and the in-mem backend used "mini-transactions on
demand". Now the caller has a clear way to establish boundaries and
gets read and write isolation. A single `Arc<dyn Catalog>` can create as
many `Box<dyn UnitOfWork>` as you like, but note that depending on the
backend you may not scale infinitely (postgres will likely impose
certain limits and the in-mem backend limits concurrency to 1 to keep
things simple).
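A minimal, synchronous sketch of the caller-facing shape this describes; the real catalog API is async and richer, and the names here are assumptions:

```rust
use std::sync::Arc;

#[derive(Debug)]
struct CatalogError;

trait Transaction {
    // Dropping a transaction without committing is treated as an abort.
    fn commit(self: Box<Self>) -> Result<(), CatalogError>;
}

trait Catalog {
    // Each call hands out an independent transaction with read/write isolation.
    fn start_transaction(&self) -> Box<dyn Transaction>;
}

struct MemTxn;
impl Transaction for MemTxn {
    fn commit(self: Box<Self>) -> Result<(), CatalogError> {
        Ok(())
    }
}

struct MemCatalog;
impl Catalog for MemCatalog {
    fn start_transaction(&self) -> Box<dyn Transaction> {
        // The in-mem backend would serialise these (concurrency of 1), while
        // Postgres would hand out real database transactions.
        Box::new(MemTxn)
    }
}

fn main() -> Result<(), CatalogError> {
    let catalog: Arc<dyn Catalog> = Arc::new(MemCatalog);
    // A single Arc<dyn Catalog> can create as many transactions as needed.
    let txn = catalog.start_transaction();
    txn.commit()
}
```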
* docs: improve wording
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: rename Unit of Work to Transaction
* test: improve `test_txn_isolation`
* feat: clarify transaction drop semantics
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix(REPL): Don't buffer lines until a trailing semicolon is found
The REPL would silently buffer all lines until a trailing semicolon was found, which
resulted in some very confusing error messages: I would input an invalid command followed
by a command I thought was valid, yet still get an error because the previous command was buffered.
This uses rustyline's helper feature to detect incomplete input (no trailing semicolon) and makes
it accept multiline input until the input is completed.
I also included some of rustyline's default hint and highlighting while I was at it.
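The validation piece could look roughly like this, assuming a reasonably recent rustyline; `SemicolonValidator` is a made-up name:

```rust
use rustyline::validate::{ValidationContext, ValidationResult, Validator};

struct SemicolonValidator;

impl Validator for SemicolonValidator {
    fn validate(&self, ctx: &mut ValidationContext) -> rustyline::Result<ValidationResult> {
        let input = ctx.input().trim_end();
        if input.is_empty() || input.ends_with(';') {
            Ok(ValidationResult::Valid(None))
        } else {
            // No trailing semicolon yet: tell rustyline to keep accepting
            // lines instead of submitting (and silently buffering) the input.
            Ok(ValidationResult::Incomplete)
        }
    }
}
```

In the actual REPL the validator would be part of a rustyline `Helper` that also provides the hinting and highlighting mentioned above; only the validation piece is sketched here.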
* chore: cargo clippy
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
All features are now covered by rskafka. This also removes the need to
specify a server ID for write buffer consumers. That was only needed for
rdkafka, where we had to specify a consumer group even though we did not
use any transactions.
* chore: upgrade rskafka + enable snappy support
* test: ensure that rskafka and rdkafka work together
Before removing rdkafka ensure that:
- rskafka can consume existing messages produced by rdkafka so we do not
need to drain existing topics
- rdkafka can consume new messages produced by rskafka so we can roll
back
I ran the whole `write_buffer` test suite (including the newly added
tests) using Apache Kafka as well as Redpanda.
* test: ensure we handle consumer offset in error case correctly
* docs: explain test setup
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This adds the scaffolding for the ingester server to consume data from Kafka. Data is ingested into an in-memory structure while records are created in the catalog for any partitions that don't yet exist.
I've removed catalog_update.rs in ingester for now. That was mostly a placeholder and will be going in a combination of handler.rs and data.rs on my next PR which will have some primitive lifecycle wired up.
There's one ugly bit here where the DML write is cloned because it's getting borrowed to output spans and metrics. I'll need to follow up with a refactor to make it so that the DML write's tables can be consumed without it gumming up the metrics stuff.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
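A very rough sketch of that ingest path with stand-in types; `DmlWrite`, `IngesterData`, and the plain `HashMap` standing in for the catalog are illustrative only:

```rust
use std::collections::HashMap;

// Stand-in for a decoded DML write coming off the write buffer.
struct DmlWrite {
    partition_key: String,
    rows: Vec<String>, // placeholder for the real column data
}

#[derive(Default)]
struct IngesterData {
    // Partition key -> rows buffered in memory until persistence runs.
    buffers: HashMap<String, Vec<String>>,
}

impl IngesterData {
    fn buffer_write(&mut self, write: DmlWrite, catalog_partitions: &mut HashMap<String, u64>) {
        // Create a catalog record the first time this partition is seen.
        let next_id = catalog_partitions.len() as u64;
        catalog_partitions
            .entry(write.partition_key.clone())
            .or_insert(next_id);

        // Buffer the data in memory.
        self.buffers
            .entry(write.partition_key)
            .or_default()
            .extend(write.rows);
    }
}

fn main() {
    let mut data = IngesterData::default();
    let mut catalog_partitions = HashMap::new();
    data.buffer_write(
        DmlWrite {
            partition_key: "2022-01-01".to_string(),
            rows: vec!["cpu,host=a usage=0.5".to_string()],
        },
        &mut catalog_partitions,
    );
    assert_eq!(catalog_partitions.len(), 1);
}
```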
Adds an in-memory cache of table schemas to the SchemaValidator DML
handler.
The cache pulls from the global catalog when observing a column for the
first time, and pushes the column type to the catalog if it does not yet
exist so that it is set for subsequent requests (this pull & push is done
atomically by the catalog in an "upsert" call).
The in-memory cache is sharded by namespace, with each shard guarded by
an individual lock to minimise contention between readers (the expected
average case) and writers (only when adding new columns/tables).
Relies on the catalog to serialise new column creation and validate
parallel creation requests.
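A sketch of the sharded layout this describes, with illustrative types rather than the real SchemaValidator internals:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Default)]
struct NamespaceSchema {
    // table name -> column name -> column type
    tables: HashMap<String, HashMap<String, String>>,
}

#[derive(Default)]
struct SchemaCache {
    // One shard per namespace, each behind its own lock so readers (the
    // common case) don't contend with writers adding new tables/columns.
    shards: RwLock<HashMap<String, Arc<RwLock<NamespaceSchema>>>>,
}

impl SchemaCache {
    fn namespace(&self, name: &str) -> Arc<RwLock<NamespaceSchema>> {
        if let Some(ns) = self.shards.read().unwrap().get(name) {
            return Arc::clone(ns);
        }
        // Miss: take the outer write lock and insert an empty shard; the
        // real cache would populate it from the catalog here.
        let mut shards = self.shards.write().unwrap();
        Arc::clone(shards.entry(name.to_string()).or_default())
    }
}

fn main() {
    let cache = SchemaCache::default();
    let ns = cache.namespace("my_namespace");
    ns.write().unwrap().tables.insert("cpu".to_string(), HashMap::new());
}
```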
Implements a write schema validation DML handler, denying requests that
conflict with the schema within the global catalog. Additive schema
changes are accepted, incrementally updating the global catalog schema.
Deletes are passed through unchanged and unvalidated.
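The validate-or-extend behaviour, sketched with made-up types (the real handler works on `MutableBatch` instances and the catalog API):

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Clone, Copy)]
enum ColumnType {
    Tag,
    F64,
    I64,
}

#[derive(Debug)]
struct SchemaConflict;

// Validate one table's columns against the known schema, extending the
// schema for columns that don't exist yet.
fn validate_write(
    schema: &mut HashMap<String, ColumnType>,
    write_columns: &[(String, ColumnType)],
) -> Result<(), SchemaConflict> {
    for (name, ty) in write_columns {
        match schema.get(name) {
            // Column already known: the type must match exactly.
            Some(existing) if existing != ty => return Err(SchemaConflict),
            Some(_) => {}
            // New column: an additive change, accepted and recorded (the
            // real handler upserts it into the global catalog).
            None => {
                schema.insert(name.clone(), *ty);
            }
        }
    }
    Ok(())
}

fn main() {
    let mut schema = HashMap::from([("host".to_string(), ColumnType::Tag)]);
    // An additive column is accepted...
    assert!(validate_write(&mut schema, &[("usage".to_string(), ColumnType::F64)]).is_ok());
    // ...but changing an existing column's type is rejected.
    assert!(validate_write(&mut schema, &[("usage".to_string(), ColumnType::I64)]).is_err());
}
```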
* refactor: Debug bounds on Catalog trait
* feat: validate MutableBatch schema
Changes the schema validation code to validate MutableBatch instances
(coming from pre-parsed LP writes and non-LP-based writes) instead of
parsed LP lines.
* refactor: Send bound on boxed errors
* refactor: clippy
Allow assert_eq!(bool, bool) for readability.
* refactor: no PartialEq<MB Column> for ColumnSchema
Remove the PartialEq<mutable_buffer::Column> for ColumnSchema - it's
definitely more readable as a method call.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* build: don't pull in all of tokio
We already specify the tokio features we need so "full" (all features)
is not necessary.
* build: remove chrono dependency
Appears unused.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>