Min/max values and distinct counts are already optional, so let's make
the null counts optional as well. This will be helpful for NG to deal
with partial statistics (e.g. when we only populate stats for the time
column).
Note that the total count is still mandatory, but we normally have the
chunk/file-level row count at hand.
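A minimal sketch of the resulting shape (names are illustrative assumptions, not the actual `data_types` structs):

```rust
// Hypothetical per-column statistics after this change.
struct StatValues<T> {
    min: Option<T>,              // already optional
    max: Option<T>,              // already optional
    distinct_count: Option<u64>, // already optional
    null_count: Option<u64>,     // now optional as well
    total_count: u64,            // still mandatory: the row count is at hand
}
```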
Set to_delete to the time the file was marked as deleted rather than
true.
Fixes #4059.
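Sketched roughly (the column and type names here are assumptions):

```rust
struct Timestamp(i64); // hypothetical stand-in for the catalog timestamp type

struct ParquetFile {
    // Was: to_delete: bool
    to_delete: Option<Timestamp>, // the time the file was marked as deleted
}
```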
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The sort key is optional and currently only produced by `iox_tests`.
Writing it within the ingester/compactor is tracked by #3968. The sort
key is read by the querier (and this will be verified by the query tests
and is required to merge #4103).
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Removed some unnecessary tests as they no longer apply with the new buffer structure. This will hopefully reduce the memory footprint of the ingesters significantly.
Closes #4072
This makes it way easier to dyn-type database implementations. The only
real change is that we make `QueryChunk::Error` opaque. Nobody is going
to inspect that anyway; it's just printed to the user.
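A minimal sketch of the "opaque error" idea, with a hypothetical method standing in for the real trait surface:

```rust
// With a boxed error type, the trait no longer needs an associated
// `Error` type, which makes it easy to use as `dyn QueryChunk`.
trait QueryChunk {
    fn read_filter(
        &self,
    ) -> Result<(), Box<dyn std::error::Error + Send + Sync>>;
}
```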
This is a follow-up of #4053.
Ref #3934.
Emit a counter metric "ingest_paused_duration_ms_total" that records the
duration of time an ingester stream is paused with millisecond
granularity.
This metric will allow us to measure the frequency and severity of, and
alert on, an ingester stopping ingest due to memory limits enforced by
the LifecycleManager. This will help us tune these config params.
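Roughly how the recording might look; the counter type here is a hypothetical stand-in for the real metric registry:

```rust
use std::time::Instant;

// Hypothetical stand-in for the registered counter.
struct PausedDurationCounter {
    total_ms: u64, // "ingest_paused_duration_ms_total"
}

impl PausedDurationCounter {
    fn inc(&mut self, ms: u64) {
        self.total_ms += ms;
    }
}

fn pause_ingest(counter: &mut PausedDurationCounter) {
    let paused_at = Instant::now();
    // ... blocked until the LifecycleManager frees enough memory ...
    counter.inc(paused_at.elapsed().as_millis() as u64);
}
```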
Changes all consumers of the object store to use the dynamically
dispatched DynObjectStore type, instead of using a hardcoded concrete
implementation type.
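A minimal sketch of the pattern (`ObjectStoreApi` is a stand-in here for the real trait):

```rust
use std::sync::Arc;

trait ObjectStoreApi: Send + Sync {
    // get / put / list ...
}

// The dynamically dispatched type all consumers now use.
type DynObjectStore = dyn ObjectStoreApi;

// Consumers accept any implementation behind dynamic dispatch:
fn new_component(object_store: Arc<DynObjectStore>) {
    let _ = object_store;
}
```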
* feat: initial implementation of compacting a given list of overlapped parquet files
* feat: Add QueryableParquetChunk and some refactoring
* feat: build queryable parquet chunks for parquet files with tombstones
* feat: second half of the implementation for Compactor's compact. Tests will be next
* fix: comments for trait functions of QueryChunkMeta
* test: add tests for compactor's compact function
* fix: typos
* refactor: address Jake's review comments
* refactor: address Andrew's comments and add one more test for files in different order in the vector
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
When created in the catalog, parquet files should always have compaction
level 0. Updating the compaction level should always happen in the
compactor.
Only the catalog should need to know about the initial compaction level
value.
This has the advantages of:
- Not needing to create fake parquet file IDs or fake deleted_at
values before insertion, since create doesn't use them
- Not needing too many arguments for create
- Naming the arguments so it's easier to see what value is what
argument, especially in tests
- Easier to reuse arguments or parts of arguments by using copies of
params, which makes it easier to see differences, especially in tests
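A hedged sketch of the named-params pattern (field names are illustrative assumptions):

```rust
#[derive(Clone)]
struct ParquetFileParams {
    table_id: i64,
    partition_id: i64,
    min_sequence_number: i64,
    max_sequence_number: i64,
    row_count: i64,
    // No `id`, `compaction_level`, or `deleted_at`: the catalog assigns
    // the ID and always inserts compaction level 0.
}

// In tests, copies of the params make differences easy to see:
fn next_partition(base: &ParquetFileParams) -> ParquetFileParams {
    ParquetFileParams {
        partition_id: base.partition_id + 1,
        ..base.clone()
    }
}
```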
This commit splits the API of the LifecycleManager into two:
* LifecycleManager: singleton responsible for evaluating partitions
and running persist tasks.
* LifecycleHandle: a handle the per-sequencer ingester(s) use to update
the global LifecycleManager state when applying ops.
This keeps the accessible API & responsibilities of each caller distinct
and allows us to leverage the type system to enforce linearisation of
calls to LifecycleManager::maybe_persist() without resorting to an
(unnecessary) mutex guard for serialisation.
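A minimal sketch of the split, assuming hypothetical internals (the names `LifecycleManager`, `LifecycleHandle`, and `maybe_persist` come from this message; everything else is illustrative):

```rust
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct LifecycleState {
    // partition sizes, total memory, last write times, ...
    bytes_buffered: usize,
}

/// Singleton: evaluates partitions and runs persist tasks.
struct LifecycleManager {
    state: Arc<Mutex<LifecycleState>>,
}

impl LifecycleManager {
    /// `&mut self` lets the type system linearise calls: two
    /// `maybe_persist` calls can never overlap, no guard mutex needed.
    fn maybe_persist(&mut self) {
        let _state = self.state.lock().unwrap();
        // evaluate partitions, run persist tasks ...
    }

    fn handle(&self) -> LifecycleHandle {
        LifecycleHandle {
            state: Arc::clone(&self.state),
        }
    }
}

/// Cheap per-sequencer handle: updates shared state when applying ops.
#[derive(Clone)]
struct LifecycleHandle {
    state: Arc<Mutex<LifecycleState>>,
}

impl LifecycleHandle {
    fn log_write(&self, bytes: usize) {
        self.state.lock().unwrap().bytes_buffered += bytes;
    }
}
```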
This includes a bit of a refactor in the locking structure of the buffer data. Locking at the partition collection and within the partition data was making things more complex than they needed to be. The partitions in the buffer are there only temporarily until they get persisted.

Locking on the table simplifies things a bit and makes it clearer when the table state is being modified, since it no longer has any interior mutability. Having access to separate partitions under different locks isn't something we need, because queries will hit all partitions and data is brought in sequentially, regardless of which partition it is hitting in a sequencer.
Fixes #3850
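A rough sketch of the simplified structure (names and nesting are assumptions):

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

struct PartitionData {
    // buffered batches, snapshots, persisting data ...
}

// No interior mutability inside the table: modifying any partition means
// taking the table's lock, which makes mutation points explicit.
struct TableData {
    partitions: BTreeMap<String, PartitionData>,
}

struct SequencerData {
    tables: RwLock<BTreeMap<String, RwLock<TableData>>>,
}
```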
Uses the new ColumnRepo::create_or_get_many() catalog method to perform
a bulk upsert of (potentially) new columns to the catalog during schema
validation.
* refactor: wire execution context to Deduplicator
* feat: example trace to chunk read_filter
* refactor: make execution context required
* refactor: expose metadata API
* refactor: more span context for chunk read_filter
* refactor: fix build
* refactor: push context into result stream
* refactor: make executor optional
* feat: detach dedicated exec jobs
* feat: async `DedicatedExecutor::join`
Now `DedicatedExecutor` follows the system we use for other server
components (a sketch follows at the end of this message):
- `shutdown`: a quick sync call that signals the shutdown but doesn't
drop
- `join`: async awaits until the executor has finished shutdown
- `drop`: warn but still try to shut down
* test: improve `detach_receiver` test
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
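The sketch below illustrates the three-step lifecycle with hypothetical internals; the real executor runs a dedicated Tokio runtime on its own thread:

```rust
use tokio::sync::watch;

struct DedicatedExecutor {
    shutdown_tx: watch::Sender<bool>,
    completed_rx: watch::Receiver<bool>,
}

impl DedicatedExecutor {
    /// `shutdown`: quick sync call that signals shutdown but doesn't drop.
    fn shutdown(&self) {
        let _ = self.shutdown_tx.send(true);
    }

    /// `join`: async; awaits until the executor has finished shutting down.
    async fn join(&self) {
        let mut rx = self.completed_rx.clone();
        while !*rx.borrow() {
            if rx.changed().await.is_err() {
                break; // sender gone: the executor finished and was dropped
            }
        }
    }
}

impl Drop for DedicatedExecutor {
    /// `drop`: warn but still try to shut down.
    fn drop(&mut self) {
        if !*self.shutdown_tx.borrow() {
            eprintln!("DedicatedExecutor dropped without shutdown");
            self.shutdown();
        }
    }
}
```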
* feat: changes needed to apply tombstones correctly on the lifecycle ingest batches
* refactor: adjust the design after discussing with Paul
* feat: apply the incoming tombstone to all data but the persisting one
* chore: fmt
* fix: build on buffer tombstone
* test: delete & write tests for a partition and some cleanup
* feat: No need to add processed tombstones for a newly created parquet file in the ingester because all deletes issued before that parquet file was created were already applied
* chore: cleanup
* feat: initial implementation for preparing data to send back to the Querier
* feat: full implementation of prepare_data_to_querier
* fix: apply filters for the batches
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* chore: cleanup
* fix: typos in comments
* fix: typos in comments
* fix: typos in comments
* test: create different scenarios and test them
* chore: fix typos
* test: add tests with deletes
* chore: make pub pub(crate)
* chore: Apply suggestions from code review
Co-authored-by: Jake Goulding <jake.goulding@integer32.com>
* refactor: address review comments
* fix: keep batches in their arrival order
* refactor: don't assign unnecessary values to the enum
* refactor: use bitflags enum
* fix: use bitflags correctly
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: avoid `use` statements at the end of the function
* chore: merge main to branch
* fix: fix downgrade versions
* refactor: address review comments
* chore: remove unnecessary comments
* refactor: Make the whole test_utils module test-only and bring paths into module scope
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: Jake Goulding <jake.goulding@integer32.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
I'm seeing some panics in our test bench, but the ingester happily
continues and thinks it persisted tasks even though it didn't. Let's at
least bail out if a persist task fails.
A quick change to perform the ColumnRepo::create_or_get() calls in
parallel (up to a maximum of 3 in-flight at any one time) in order to
mitigate the latency of the call and reduce the overall schema
validation call duration.
The in-flight limit is enforced to avoid starving the DB connection pool
of connections.
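One way to express this kind of bounded concurrency with the futures crate; `ColumnRepo` and `Error` below are hypothetical stand-ins for the catalog types:

```rust
use futures::stream::{self, StreamExt, TryStreamExt};

struct ColumnRepo;

#[derive(Debug)]
struct Error;

impl ColumnRepo {
    async fn create_or_get(&self, _name: &str) -> Result<(), Error> {
        Ok(())
    }
}

async fn upsert_columns(repo: &ColumnRepo, names: Vec<String>) -> Result<(), Error> {
    stream::iter(names)
        .map(|name| async move { repo.create_or_get(&name).await })
        // Cap in-flight calls at 3 so we don't starve the DB connection pool.
        .buffer_unordered(3)
        .try_collect::<Vec<()>>()
        .await
        .map(|_| ())
}
```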
It's a bit of a duck-type hack, but if we wanna just use
`ParquetFileChunk` in the new architecture, we somehow need it to accept
new-gen paths. Also, path handling should be somewhat centralized, since
ingester/compactor/querier all need to construct them. So having a
`ParquetFilePath` that supports both path styles seems to be a
not-too-bad solution. This should obviously be cleaned up in some
not-too-distant future.
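Sketched as an enum over both styles (the variant contents are assumptions, not the actual fields):

```rust
enum ParquetFilePath {
    /// Old-gen layout (server ID, database name, chunk addressing, ...).
    Old {
        db_name: String,
        chunk_id: u32,
    },
    /// New-gen layout shared by ingester/compactor/querier.
    New {
        namespace_id: i64,
        table_id: i64,
        object_store_id: String, // e.g. a UUID
    },
}
```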
Fixes #3702. This pulls the min sequence tracking into the LifecycleManager. Because the number requires looking at all other partitions in memory, this was the most efficient place to put it. The manager updates the sequencer state after it calls persist. The number is meant to be a lower bound on the sequence number. Issue #3783 will add functionality for the ingester to ignore replayed data that has already been persisted.
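The lower bound itself is just a minimum over the in-memory partitions; a sketch with hypothetical names:

```rust
struct PartitionBuffer {
    min_unpersisted_sequence_number: u64,
    // buffered data ...
}

/// Lower bound on the sequence number for a sequencer: nothing below the
/// smallest unpersisted number of any in-memory partition is safe to skip.
fn min_unpersisted(partitions: &[PartitionBuffer]) -> Option<u64> {
    partitions
        .iter()
        .map(|p| p.min_unpersisted_sequence_number)
        .min()
}
```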
* feat: changes needed to apply tombstones correctly on the lifecycle ingest batches
* refactor: adjust the design after discussing with Paul
* feat: apply the incoming tombstone to all data but the persisting one
* chore: fmt
* fix: build on buffer tombstone
* test: delete & write tests for a partition and some cleanup
* feat: No need to add processed tombstones for a newly created parquet file in the ingester because all deletes issued before that parquet file was created were already applied
* chore: cleanup
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: Adjust fields of IngesterQueryResponse
* feat: Adjust IngestHandler query method to call prepare_data_to_querier
* feat: Send ingest query result data back through Flight doGet
* feat: Send delete predicates and max sequencer number in metadata
* fix: greater_than_sequence_number should be of type SequenceNumber
* fix: Remove DeletePredicates from IngesterQueryResponse
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This adds persistence into the ingester with a lifecycle manager. The persist operation must still be updated to keep track of the min_unpersisted_sequence_number for each sequencer.
* feat: initial implementation of the query plan that queries QueryableBatch with filters
* fix: read_filter of QueryableBatch should provide the schema of the columns/projection it needs
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: address review comment
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: allow catalog access w/o a transaction
Now the caller has full control over whether to use a transaction or
not.
* fix: remove non-transaction-safe `create_many`
* fix: remove unnecessary transactions
* feat: projection pushdown for QueryableBatch
* chore: clean up and remove unwrap
* fix: Add Sync to a Snafu source to have the code compile
* chore: cleanup and add comments for tests
* refactor: Add tests for scanning non-existing columns and fix related bugs
* chore: modify comment to trigger auto check in GitHub workflow
* feat: Add a way to run ingester with an in-memory catalog from the CLI
If you set the --catalog-dsn string to "mem", an in-memory catalog is
created rather than treating the value as a Postgres connection URL.
Planning on using this in tests, so not documenting.
* fix: Set default topic to the same value as SHARED_KAFKA_TOPIC
Namely, both should use an underscore. I don't think there's a way to
directly share these values between a constant and an annotation.
* feat: Add a flight API (handshake only) to ingester
* fix: Create partitions if using file-based write buffer
* fix: Change the server fixture to handle ingester server type
For now, the ingester doesn't implement the deployment API. Not sure if
it should or not.
* feat: Start implementing ingester do_get, namely decoding the query
Skip serialization of the predicate for the moment.
* refactor: Rename ingest protos to ingester to match crate name
* refactor: Rename QueryResults to QueryData
* feat: Move ingester flight client to new querier crate
* fix: Off-by-one error; different starting indexes in sequencers
* fix: Create new CLI argument to pick the catalog type
* fix: Create a CLI option to set the number of topics to auto-create in the write buffer
* fix: Check the arrow flight service's health to tell that the ingester gRPC is up
* fix: Set postgres as the default catalog type
* fix: Return an error rather than panicking if CLI args aren't right
This adds the lifecycle manager to the ingester. It will trigger based on thresholds for max partition size or age, or on keeping total memory use under a certain threshold.
It defines a new interface for a persister, which is stubbed out for IngesterData. I'm not sure yet how persistence errors should be handled. The assumption here is that the persister continues to retry persistence forever until it succeeds.
There is one scenario I can think of that may cause this lifecycle manager problems. If a single partition is very high throughput, it could cause things to back up as persistence is not parallelized within a single partition. Any given partition can currently only run one persistence operation at a time. We can address this later.
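A minimal sketch of the persister interface described above; the trait shape and signature are assumptions:

```rust
use async_trait::async_trait;

#[derive(Debug, Clone, Copy)]
struct PartitionId(i64);

#[async_trait]
trait Persister: Send + Sync {
    /// Persist the partition's buffered data. Under the assumption above,
    /// implementations retry forever until persistence succeeds.
    async fn persist(&self, partition_id: PartitionId);
}
```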
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: catalog Unit of Work (= transaction)
Set up an interface to handle Units of Work within our catalog.
Previously both the Postgres and the in-mem backend used
"mini-transactions on demand". Now the caller has a clear way to
establish boundaries and gets read and write isolation. A single
`Arc<dyn Catalog>` can create as many `Box<dyn UnitOfWork>` as you like,
but note that depending on the backend you may not scale infinitely
(Postgres will likely impose certain limits and the in-mem backend
limits concurrency to 1 to keep things simple). A sketch of the intended
usage follows at the end of this message.
* docs: improve wording
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: rename Unit of Work to Transaction
* test: improve `test_txn_isolation`
* feat: clarify transaction drop semantics
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
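The intended usage, sketched with hypothetical stand-ins for the catalog traits (note the rename of UnitOfWork to Transaction above):

```rust
use std::sync::Arc;

#[derive(Debug)]
struct Error;

#[async_trait::async_trait]
trait Catalog: Send + Sync {
    async fn start_transaction(&self) -> Result<Box<dyn Transaction>, Error>;
}

#[async_trait::async_trait]
trait Transaction: Send {
    async fn commit(self: Box<Self>) -> Result<(), Error>;
    async fn abort(self: Box<Self>) -> Result<(), Error>;
}

/// The caller establishes the boundary explicitly and gets read/write
/// isolation for everything inside it.
async fn example(catalog: Arc<dyn Catalog>) -> Result<(), Error> {
    let txn = catalog.start_transaction().await?;
    // ... multiple catalog operations under one transaction ...
    txn.commit().await
}
```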
With this change, write buffer ingestion metrics show up under
`/metrics`.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: improve write buffer consumer interface
The change looks huge but is actually rather simple. To
understand the interface change, let me first explain what we want:
- be able to fetch watermarks for any sequencer
- have streams:
- each stream tracks a sequencer and has an offset state (no read
multiplexing)
- we can seek a stream
- seeking and streaming cannot be done at the same time (that would be
weird and would likely lead to many bugs both in the write buffer and
in the user code)
- ideally we don't need to create streams of all sequencers but can
choose a subset
Before this change we had one mutable consumer struct where you could
get all streams and watermark functions (this mutable-borrows the
consumer) or you could seek a single stream (this also mutable-borrows
the consumer). This is a bit weird for multiple reasons:
- you cannot seek a single stream without dropping all of them
- the mutable-borrow construct makes it really difficult to pass the
streams into separate threads
- the consumer is boxed (because it's mutable) which makes it more
difficult to handle in a large-scale application
What this change does is the following (a sketch follows at the end of
this message):
- you have an immutable consumer (similar to the producer)
- the consumer offers the following methods:
- get the set of sequencer IDs
- get watermark for any sequencer
- get a stream handler (see next point) for any sequencer
- the stream handler captures the stream state (offset) and provides you
a standard `Stream<_>` interface as well as a seek function.
Mutable-borrows ensure that you cannot use both at the same time.
The stream handler provides you the stream via `handler.stream()`. It
doesn't implement `Stream<_>` itself because of the way boxing, dynamic
dispatch, and pinning interact (i.e. I couldn't get it to work without
the indirection).
As a bonus point (which we don't use however) you can now create
multiple streams for the same sequencer and they all have their own
offset.
* fix: review comments
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
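A hedged sketch of the interface shape described above; everything beyond the bullet points (trait names, exact signatures) is an assumption:

```rust
use std::collections::BTreeSet;

use async_trait::async_trait;
use futures::stream::BoxStream;

#[derive(Debug)]
struct WriteBufferError;
struct DmlOperation; // stand-in for a decoded write/delete

/// Immutable consumer, similar to the producer side.
#[async_trait]
trait WriteBufferReading: Send + Sync {
    /// The set of sequencer IDs; create handlers only for the subset you need.
    fn sequencer_ids(&self) -> BTreeSet<u32>;

    /// Watermark for any sequencer.
    async fn fetch_high_watermark(&self, sequencer_id: u32) -> Result<u64, WriteBufferError>;

    /// A stream handler that captures the offset state for one sequencer.
    async fn stream_handler(
        &self,
        sequencer_id: u32,
    ) -> Result<Box<dyn WriteBufferStreamHandler>, WriteBufferError>;
}

#[async_trait]
trait WriteBufferStreamHandler: Send {
    /// `&mut self` on both methods means the borrow checker forbids
    /// streaming and seeking at the same time.
    fn stream(&mut self) -> BoxStream<'_, Result<DmlOperation, WriteBufferError>>;
    async fn seek(&mut self, offset: u64) -> Result<(), WriteBufferError>;
}
```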
This adds the scaffolding for the ingester server to consume data from Kafka. This ingests data into an in-memory structure while creating records in the catalog for any partitions that don't yet exist.
I've removed catalog_update.rs in ingester for now. That was mostly a placeholder and will be going in a combination of handler.rs and data.rs on my next PR which will have some primitive lifecycle wired up.
There's one ugly bit here where the DML write is cloned because it's getting borrowed to output spans and metrics. I'll need to follow up with a refactor to make it so that the DML write's tables can be consumed without it gumming up the metrics stuff.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
If there aren't any record batches, there isn't any metadata, and vice
versa. Make this relationship clearer by putting the Option around both
the vec of record batches and the metadata.
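Sketched (the type names are assumptions):

```rust
struct RecordBatch; // stand-in for arrow's RecordBatch
struct Metadata;    // stand-in for the response metadata

struct QueryResponse {
    /// `None` means "no data at all": batches and metadata can only
    /// appear together, never one without the other.
    data: Option<(Vec<RecordBatch>, Metadata)>,
}
```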