Adds a test that asserts (manually triggered) persistence generates a
file, uploads it to object storage, inserts metadata into the catalog,
and emits various persistence metrics.
Implements a PersistCompletionObserver that records various attributes
of the generated and persisted Parquet file as histogram metrics to
capture the distribution of values:
* File size
* Row count
* Column count
* Time range of data (max - min timestamp)
These metrics will give us insight into the generated files instead of
relying on intuition when tuning various configuration parameters.
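A minimal sketch of such an observer (the `Histogram` stand-in and the
`CompletedPersist` fields are assumptions; the real implementation records
into the IOx `metric` crate's histograms):

```rust
// Sketch only: records per-file attributes of each completed persist
// operation into histograms to capture their distributions.
#[derive(Default)]
struct Histogram {
    samples: Vec<u64>, // stand-in for a real bucketed histogram
}

impl Histogram {
    fn record(&mut self, v: u64) {
        self.samples.push(v);
    }
}

/// Attributes of a completed persist operation (hypothetical fields).
struct CompletedPersist {
    file_size_bytes: u64,
    row_count: u64,
    column_count: u64,
    min_timestamp_ns: i64,
    max_timestamp_ns: i64,
}

#[derive(Default)]
struct ParquetFileInstrumentation {
    file_size_bytes: Histogram,
    row_count: Histogram,
    column_count: Histogram,
    time_range_ns: Histogram,
}

impl ParquetFileInstrumentation {
    /// Record the attributes of a newly persisted Parquet file.
    fn observe(&mut self, c: &CompletedPersist) {
        self.file_size_bytes.record(c.file_size_bytes);
        self.row_count.record(c.row_count);
        self.column_count.record(c.column_count);
        // Time range of the data: max - min timestamp.
        self.time_range_ns
            .record((c.max_timestamp_ns - c.min_timestamp_ns) as u64);
    }
}

fn main() {
    let mut metrics = ParquetFileInstrumentation::default();
    metrics.observe(&CompletedPersist {
        file_size_bytes: 4_096,
        row_count: 1_000,
        column_count: 12,
        min_timestamp_ns: 0,
        max_timestamp_ns: 60_000_000_000,
    });
}
```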
* chore: Upgrade to Rust 1.68
* fix: Remove unnecessary into_iter, thanks Clippy!
* fix: Use the size of the type, not a reference to the type... oops.
Thanks clippy!
* fix: Return block directly instead of creating a variable
Thanks clippy!
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This commit adds initial support for "soft" namespace deletion, where
the actual records & data remain, but are no longer queryable /
writeable.
Soft deletion is eventually consistent - users can expect to continue
writing to and reading from a bucket after issuing a soft delete call,
until the various components either restart, or have their caches
flushed.
The components treat soft-deleted namespaces differently:
* router: ignore soft deleted namespaces
* ingester: accept soft deleted namespaces
* compactor: accept soft deleted namespaces
* querier: ignore soft deleted namespaces
* various gRPC services: ignore soft deleted namespaces
This ensures that the ingester & compactor do not see rows "vanishing"
from the database, and continue to make forward progress.
Writes for the deleted namespace that are buffered in the ingester will
be persisted as normal, allowing us to support "un-delete" operations
where the system is restored to the state at which the delete was
issued (rather than losing the buffered data).
Follow-on work is required to ensure GC drops the orphaned parquet files
after the configured GC time, and optimisations such as not compacting
parquet from soft-deleted namespaces seem like a trivial win.
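A minimal sketch of the soft-delete idea (the `deleted_at` marker field is
an assumption; the real catalog is SQL-backed): deletion only sets a marker,
and each component decides whether to filter marked namespaces out:

```rust
use std::time::SystemTime;

struct Namespace {
    name: String,
    /// `Some(t)` once a soft delete has been issued at time `t`.
    deleted_at: Option<SystemTime>,
}

/// Router / querier view: soft-deleted namespaces are invisible.
fn visible(all: &[Namespace]) -> impl Iterator<Item = &Namespace> {
    all.iter().filter(|ns| ns.deleted_at.is_none())
}

/// Ingester / compactor view: soft-deleted namespaces are still processed,
/// so already-buffered data keeps making forward progress.
fn processable(all: &[Namespace]) -> impl Iterator<Item = &Namespace> {
    all.iter()
}

fn main() {
    let namespaces = vec![
        Namespace { name: "live".into(), deleted_at: None },
        Namespace { name: "gone".into(), deleted_at: Some(SystemTime::now()) },
    ];
    for ns in visible(&namespaces) {
        println!("visible namespace: {}", ns.name);
    }
    assert_eq!(processable(&namespaces).count(), 2);
}
```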
The maximum number of tables is part of the Namespace, which is already
loaded in its entirety. This commit copies the value into the
NamespaceSchema, making it available for the router to utilise.
Rust 1.67 now says:
warning: `#[track_caller]` on async functions is a no-op
= note: see issue #87417 <https://github.com/rust-lang/rust/issues/87417> for more information
= note: `#[warn(ungated_async_fn_track_caller)]` on by default
* feat: introduce a new way of handling max_sequence_number for ingester, compactor and querier
* chore: cleanup
* feat: new column max_l0_created_at to order files for deduplication
* chore: cleanup
* chore: debug info for changing cpu.parquet
* fix: update test parquet file
Co-authored-by: Marco Neumann <marco@crepererum.net>
* feat: function to get partition candidates from partition table
* chore: cleanup
* fix: make new_file_at the same value as created_at
* chore: cleanup
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Updating the sort key is not commutative and MUST be serialised. The
correctness of the current catalog interface relies on the caller
serialising updates globally, something it cannot reasonably assert in a
distributed system.
This change of the catalog interface pushes this responsibility to the
catalog itself where it can be effectively enforced, and allows a caller
to detect parallel updates to the sort key.
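A sketch of a compare-and-swap style interface for this (names and signature
are illustrative, not the exact catalog API): the caller passes the sort key
it observed, and the update fails if another writer changed it concurrently,
letting the caller detect the race and retry:

```rust
#[derive(Debug, Clone, PartialEq)]
struct SortKey(Vec<String>);

#[derive(Debug)]
enum CasError {
    /// The stored sort key no longer matches the observed value; carries the
    /// current value so the caller can reconcile and retry.
    ValueMismatch(SortKey),
}

struct Partition {
    sort_key: SortKey,
}

impl Partition {
    /// Update the sort key only if it still equals `observed`.
    fn cas_sort_key(&mut self, observed: &SortKey, new: SortKey) -> Result<(), CasError> {
        if &self.sort_key != observed {
            return Err(CasError::ValueMismatch(self.sort_key.clone()));
        }
        self.sort_key = new;
        Ok(())
    }
}

fn main() {
    let mut p = Partition { sort_key: SortKey(vec!["host".into(), "time".into()]) };
    let observed = p.sort_key.clone();

    // Succeeds: nobody raced us.
    p.cas_sort_key(&observed, SortKey(vec!["host".into(), "region".into(), "time".into()]))
        .unwrap();

    // Fails: the key we observed is now stale; the error carries the current
    // value for the caller to merge before retrying.
    assert!(matches!(
        p.cas_sort_key(&observed, SortKey(vec![])),
        Err(CasError::ValueMismatch(_))
    ));
}
```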
* fix: check schemas in `pretty_print_batches`
I think most users of this function (and `assert_batches_eq`) assume
that all batches have the same schema. If not, `pretty_print_batches`
may either fail to produce an actual table (some rows may have more or
fewer columns) or silently produce a table that looks "alright".
* fix: equalize schemas where it is required/desired
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Have a single global test executor w/ reasonable defaults. Also don't
require tests to join/await executor shutdowns (most tests forget this
anyway and will get a runtime warning).
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Reorder all imports in the ingester to match a consistent order:
* stdlib
* external crates
* intra-crate imports
This helps prevent merge conflicts & keeps everything tidy.
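For illustration, the grouping looks like this (crate and module names are
hypothetical, not taken from the ingester):

```rust
// Illustration only: the three groups are separated by a blank line and
// ordered as listed above.

// stdlib
use std::{collections::HashMap, sync::Arc};

// external crates
use async_trait::async_trait;
use tokio::sync::Mutex;

// intra-crate imports
use crate::data::partition::PartitionData;
```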
Splits out the nested tree of namespace -> tables -> partitions
(referred to as the "buffer tree") from the Shard which previously held
the namespace map.
This allows the BufferTree to exist without a shard, or many trees to
exist within a shard, etc.
* fix: slice flight response batches
Same as #6094 but for the Apache Flight interface.
Ref https://github.com/influxdata/idpe/issues/16073.
* refactor: use `RecordBatch::slice`
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
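A sketch of the slicing approach, assuming the `arrow` crate: a large
RecordBatch is split into row-count-bounded chunks with the zero-copy
`RecordBatch::slice` before being streamed back:

```rust
use std::sync::Arc;

use arrow::{
    array::{ArrayRef, Int64Array},
    datatypes::{DataType, Field, Schema},
    record_batch::RecordBatch,
};

/// Split `batch` into chunks of at most `max_rows` rows each.
fn chunk_batch(batch: &RecordBatch, max_rows: usize) -> Vec<RecordBatch> {
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < batch.num_rows() {
        let len = max_rows.min(batch.num_rows() - offset);
        out.push(batch.slice(offset, len)); // zero-copy view of the rows
        offset += len;
    }
    out
}

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int64Array::from((0..10).collect::<Vec<i64>>())) as ArrayRef],
    )
    .unwrap();

    let chunks = chunk_batch(&batch, 4);
    assert_eq!(
        chunks.iter().map(|b| b.num_rows()).collect::<Vec<_>>(),
        vec![4, 4, 2]
    );
}
```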
Moves the SequenceNumberRange type out of "data" and into the root to be
reused outside of the data module. This construct is universally useful
across all the ingester code.
Allows different DmlSink implementations to return different error
types. This allows for small, concise errors that are local to the
DmlSink implementation and specific to it. This helps avoid bloated
"kitchen sink" error types.
* feat: create namespace API call in router
Co-authored-by: Nga Tran <nga-tran@live.com>
* chore: treat retention as ns except in CLI
* fix: overflow in nanosecond calc
* fix: retention test after changing it from hours to ns
* chore: comment clarification in cli; better response type for error in ns API
* fix: correct some rebase mistakes
* chore: merge namespace create & create_with_retention; renamed ns create test helper fn & const
* fix: ns autocreation test was wrong after rebase
* fix: mem catalog has default 1hr retention, accidentally removed in rebase
* chore: remove mem catalog's default 1hr retention; make it settable in tests & router
Co-authored-by: Luke Bond <luke.n.bond@gmail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: reject writes that are outside the retention period
* feat: add retention validator into handler stack
* chore: Apply suggestions from code review
Co-authored-by: Dom <dom@itsallbroken.com>
* refactor: address review comments
* test: unit tests for retention validation
* chore: address review comments
* test: more unit tests and integration tests
* refactor: make the time fall inside the retention period for the emphemeral_mode test
* fix: 2 hours
Co-authored-by: Dom <dom@itsallbroken.com>
Changes the TableData within the ingester to utilise a TableNameResolver
to fetch the TableName via the catalog on demand / in the background,
instead of using the table name sent over the write.
This change causes the ingester to perform a catalog query in the
background (or on demand) to resolve the table name. This is a
pre-requisite for removing the table name from the write wire format.
Like the NamespaceNameProvider, this commit adds a TableNameProvider to
provide decoupled initialisation of a DeferredLoad<TableName> instead of
hard-coding in a catalog instance / query code, and plumbs it into
position to be used when initialising a TableName.
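A sketch of the provider pattern (simplified: the real trait hands out a
DeferredLoad<TableName> and resolves asynchronously via the catalog). The
trait decouples how a table name is obtained from the buffer tree, so tests
can inject a mock while production queries the catalog by ID:

```rust
#[derive(Debug, Clone, PartialEq)]
struct TableName(String);

#[derive(Debug, Clone, Copy)]
struct TableId(i64);

trait TableNameProvider {
    /// Return a (lazily resolved) name for the given table ID.
    fn for_table(&self, id: TableId) -> TableName;
}

/// Production impl: resolves names from the catalog (stubbed out here).
struct TableNameResolver;

impl TableNameProvider for TableNameResolver {
    fn for_table(&self, id: TableId) -> TableName {
        // In the real implementation this is a background catalog query.
        TableName(format!("table_{}", id.0))
    }
}

/// Test impl: always returns a fixed name, no catalog required.
struct MockTableNameProvider(TableName);

impl TableNameProvider for MockTableNameProvider {
    fn for_table(&self, _id: TableId) -> TableName {
        self.0.clone()
    }
}

fn main() {
    let prod = TableNameResolver;
    let mock = MockTableNameProvider(TableName("bananas".into()));
    assert_eq!(prod.for_table(TableId(42)), TableName("table_42".into()));
    assert_eq!(mock.for_table(TableId(42)), TableName("bananas".into()));
}
```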
Changes the buffer tree to address TableData by their ID only (removing
support for addressing tables by their string names). This removes the
double-reference bookkeeping / twin indexes and associated overhead.
As part of this change, the TableName is now wrapped in a DeferredLoad
in preparation for removal of the names in the DmlOperation wire format.
This commit also switches the map of TableData within the NamespaceData
(the parent node) to use the ArcMap for faster lookups and DRY
exactly-once initialisation.
Removes the need to leak the PartitionProvider outside of the ingester
crate.
This will allow the PartitionProvider to utilise a
DeferredLoad<TableName> without having to make the DeferredLoad and
TableName pub.
Removes reliance on string name identifiers for namespaces in the
ingester buffer tree, reducing the memory usage of the namespace index
and associated overhead.
The namespace name is required (though unused by IOx) in the IoxMetadata
embedded within a parquet file, and therefore the name is necessary at
persist time. For this reason, a DeferredLoad is used to query the
catalog (by ID) for the name, at some uniformly random duration of time
after initialisation of the NamespaceData, up to a maximum of 1 minute
later. This ensures the query remains off the hot ingest path, and the
jitter prevents spikes in catalog load during replay/ingester startup.
As an additional / easy optimisation, the persist code causes a
pre-fetch of the name in the background while compacting, hiding the
query latency should it not have already been resolved.
In order to keep the ingester buffer & catalog decoupled / easily
testable, this commit uses a provider/factory trait
NamespaceNameProvider and corresponding implementation
(NamespaceNameResolver) in a similar fashion to the PartitionResolver,
allowing easy mocking for tests and composition for prod code, and
enabling future optimisations such as pre-fetching / caching the "hot"
namespace names at startup.
Internal string identifier removal is a pre-requisite for removing
string identifiers from the write wire format (#4880).
Changes the ingester's buffer tree to use the deferred loading primitive
to resolve the namespace name for NamespaceData.
Note that the loader is initialised with the name in the first place -
this commit just introduces the use of the deferred loading primitive,
and doesn't change where the name is sourced from.
This lets deferred loads be used in place of a non-deferred T, such as
log context fields.
If the value has not been resolved, the display impl returns
"<unresolved>".
Allow a caller to signal to the DeferredLoad that the value it may or
may not have to materialise will be used imminently, optimistically
hiding the latency of resolving the value (typically a catalog query).
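A minimal sketch of these two DeferredLoad behaviours (names and internals
are illustrative; the real type resolves the value in a background task, for
which a Mutex<Option<T>> stands in here):

```rust
use std::{fmt, sync::Mutex};

struct DeferredLoad<T> {
    value: Mutex<Option<T>>,
}

impl<T> DeferredLoad<T> {
    fn unresolved() -> Self {
        Self { value: Mutex::new(None) }
    }

    /// Hint that the value will be used imminently, optimistically starting
    /// resolution now to hide its latency (stubbed out in this sketch).
    fn prefetch_now(&self) {
        // The real implementation wakes the background resolver here.
    }
}

impl<T: fmt::Display> fmt::Display for DeferredLoad<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.value.lock().unwrap().as_ref() {
            Some(v) => v.fmt(f),
            None => "<unresolved>".fmt(f),
        }
    }
}

fn main() {
    let name: DeferredLoad<String> = DeferredLoad::unresolved();
    name.prefetch_now(); // e.g. called when persistence is about to need it
    println!("namespace = {name}"); // prints "namespace = <unresolved>"
}
```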
* refactor: generic deferred loader helper
Splits the DeferredSortKey loader introduced in #5807 into two parts - a
generic helper type that implements deferred/background loading of
values, and SortKey specific logic for use with it.
As this will be more widely used, this implementation features improved
behaviour of the deferred loader under concurrent demand requests
(multiple calls to get() do not attempt to concurrently resolve the
value), as well as complete cancellation safety (cancelling the get()
doesn't affect the liveness of the background task).
* docs: doc-link & minor comment amendments
Fixes naming, adds missing doc-links, and expands some code comments.
* test: bound wait times to avoid hangs
Adds timeouts to all .await of the code under test, ensuring tests don't
hang if something goes wrong.
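A usage-level sketch of the loader semantics, assuming the tokio crate
(names are illustrative; the real DeferredLoad also resolves in the
background after a delay): concurrent get() calls resolve the value at most
once, and tests bound every await with a timeout:

```rust
use std::{sync::Arc, time::Duration};

use tokio::{sync::OnceCell, time::timeout};

struct DeferredLoad {
    cell: OnceCell<String>,
}

impl DeferredLoad {
    /// Demand-resolve the value, performing the (expensive) load at most
    /// once regardless of how many callers race here.
    async fn get(&self) -> String {
        self.cell
            .get_or_init(|| async {
                // Stand-in for a catalog query.
                tokio::time::sleep(Duration::from_millis(10)).await;
                "ns_name".to_string()
            })
            .await
            .clone()
    }
}

#[tokio::main]
async fn main() {
    let load = Arc::new(DeferredLoad { cell: OnceCell::new() });

    // Two concurrent demands; the loader body runs only once.
    let (a, b) = tokio::join!(load.get(), load.get());
    assert_eq!(a, b);

    // Tests bound waits to avoid hangs if resolution never completes.
    let bounded = timeout(Duration::from_secs(5), load.get()).await.unwrap();
    assert_eq!(bounded, "ns_name");
}
```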
I tracked down the source of the size difference to the difference in
`mem::size_of::<mutable_batch::column::ColumnData>`. I believe this enum
is now able to take advantage of this niche-filling optimization:
<https://github.com/rust-lang/rust/pull/94075/>
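A small demonstration of the niche-filling layout optimization (illustrative
types, not the actual ColumnData enum): when a payload type has an invalid
bit pattern, the discriminant can be stored in that niche instead of an
extra tag word:

```rust
use std::mem::size_of;

fn main() {
    // `Box<u8>` can never be null, so `Option` stores `None` in that niche
    // and no extra discriminant word is required.
    assert_eq!(size_of::<Option<Box<u8>>>(), size_of::<Box<u8>>()); // 8 bytes on 64-bit
    // `usize` has no invalid bit pattern, so the tag needs its own space.
    assert!(size_of::<Option<usize>>() > size_of::<usize>()); // 16 bytes on 64-bit
}
```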
* feat: flag partition for delete
* fix: compare the right date and time
* chore: Run cargo hakari tasks
* chore: cleanup
* fix: typos
* chore: rust style tidy ups in catalog
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Luke Bond <luke.n.bond@gmail.com>
* refactor: NS+table ID (instead of name) in querier<>ingester
* feat(ingester): use IDs for query API
Changes the ingester to utilise the ID fields (instead of names) sent
over the query wire message wrapped within the Flight API.
BREAKING: this changes the "query-ingester" CLI command arguments, which
now expect the namespace & table IDs rather than their names.
* refactor(ingester): add more query logging context
Updates the log messages during query execution to include more context
fields.
* style: remove unused import
Co-authored-by: Marco Neumann <marco@crepererum.net>
Now that DML operations contain the table ID, the ingester has all necessary
data to initialise the TableData buffer node without having to query the
catalog.
This also removes the catalog from the buffer_operation() call path,
simplifying testing.
Now that DML operations contain the namespace ID, the ingester has all
necessary data to initialise the NamespaceData buffer node without
having to query the catalog.
Expose the Table and Namespace IDs encoded within the serialised DML
write (added in #6036).
This makes the IDs available for use in the consumers, ending the
transition period. This commit DOES NOT remove the strings sent over the
wire.
This commit pushes the existing table-level mutex down to the partition.
This allows the ingester to gather data from multiple partitions within
a single table in parallel, and reduces contention between ingest/query
workloads.
This moves the logic that skips operations that do not need to be
applied to a partition during shard replay from the table level, to the
partition level.
Changes the bounds on the ArcMap to accept an owned key, avoiding an
extra allocation.
Cleans up the bounds on other fn to ensure the borrowed key impl Eq and
is the ref type of K.
This commit changes the ArcMap HashBuilder to use the same instance as
the underlying HashMap hasher.
This prevents divergent hashing across threads that MAY initialise a
hasher with a different seed.
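A small illustration of why the hasher must be shared, using only std types
(not the ArcMap internals): independently created RandomState instances are
seeded differently, so the same key hashes to different values under each:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

fn hash_with(state: &RandomState, key: &str) -> u64 {
    let mut h = state.build_hasher();
    key.hash(&mut h);
    h.finish()
}

fn main() {
    let a = RandomState::new();
    let b = RandomState::new();

    // Same key, two independently seeded hashers: almost certainly different.
    println!("{} vs {}", hash_with(&a, "bananas"), hash_with(&b, "bananas"));

    // Reusing the same hasher instance keeps hashing consistent everywhere
    // it is used - hence sharing the HashBuilder with the inner HashMap.
    assert_eq!(hash_with(&a, "bananas"), hash_with(&a, "bananas"));
}
```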
It should always be clear from the context to which table a chunk
belongs.
I think having a table name bound to a chunk goes back to a time where
chunks had multiple tables.
Helps with #6049.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: delete duplicate character in metric
* fix: failure ci test case
* fix: failure ci test case
* fix: failure ci test case
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes the DmlDelete to contain the NamespaceId for which it should be
applied, propagating this value over the wire.
Like the existing IDs within the DmlWrite, these values are marked
unsafe to use in order to avoid consumers utilising them accidentally
during deployment. Unlike DmlWrite, the DmlDelete is completely unused,
so this is less of an issue.
Implements a map of K -> Arc<V> with exactly-once initialisation
semantics.
This map can be used to ensure a given key maps to singleton instances
of V; exactly what all the nodes in the ingester "buffer tree" of shard
-> namespace -> table -> partition require.
This impl contains unused funcs (silenced with an allow(dead_code))
because it was picked from a future branch.
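A simplified, single-lock sketch of the idea (the real ArcMap is optimised
for concurrent access): a map of K -> Arc<V> where the initialiser for a
given key runs exactly once, and every caller receives the same singleton
instance:

```rust
use std::{
    collections::HashMap,
    hash::Hash,
    sync::{Arc, Mutex},
};

struct ArcMap<K, V> {
    inner: Mutex<HashMap<K, Arc<V>>>,
}

impl<K: Eq + Hash, V> ArcMap<K, V> {
    fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    /// Return the value for `key`, initialising it with `init` exactly once.
    fn get_or_insert_with(&self, key: K, init: impl FnOnce() -> V) -> Arc<V> {
        let mut map = self.inner.lock().unwrap();
        Arc::clone(map.entry(key).or_insert_with(|| Arc::new(init())))
    }
}

fn main() {
    let map: ArcMap<&'static str, String> = ArcMap::new();

    let a = map.get_or_insert_with("bananas", || "first".to_string());
    // The second initialiser never runs - the singleton already exists.
    let b = map.get_or_insert_with("bananas", || "second".to_string());

    assert!(Arc::ptr_eq(&a, &b));
    assert_eq!(*a, "first");
}
```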
This commit is part of a two-part change in order to add the table &
namespace IDs to the write buffer wire format. This commit forms the
first half; changing the producer to send the IDs.
In this commit the new ID values are never read on the consumer side,
ensuring there is no consumer dependency on them. This ensures they
remain operational during a rollout, where the consumer may be updated
to the latest code dependent on the IDs before the producer is updated
to send them. This also ensures we have a window of time where the
consumers can be rolled back after being updated, and still handle
replaying messages in Kafka.
* refactor: simplify `QueryChunk` data access
We have only two types for chunks (now that the RUB is gone):
1. In-memory RecordBatches
2. Parquet files
Loads of logic is duplicated in the different `read_filter`
implementations. Also `read_filter` hides a solid amount of logic from
DataFusion, which will prevent certain (future) optimizations. To enable #5897
and to simplify the interface, let the chunks return the data (batches
or metadata for parquet files) directly and let `iox_query` perform the
actual heavy-lifting.
* docs: improve
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
* docs: improve
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
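A sketch of the shape of this change (type and variant names are
approximations, not the exact IOx API): a chunk exposes the data it holds,
and the query layer performs the filtering:

```rust
struct RecordBatch; // stand-in for arrow::record_batch::RecordBatch
struct ParquetFilePath(String); // stand-in for the file's object store path

/// The two kinds of chunk data that remain now that the RUB is gone.
enum QueryChunkData {
    /// Data already materialised in memory.
    RecordBatches(Vec<RecordBatch>),
    /// Data that still lives in a Parquet file in object storage.
    Parquet(ParquetFilePath),
}

trait QueryChunk {
    /// Return the underlying data; no per-chunk query logic.
    fn data(&self) -> QueryChunkData;
}

struct IngesterChunk;

impl QueryChunk for IngesterChunk {
    fn data(&self) -> QueryChunkData {
        QueryChunkData::RecordBatches(vec![RecordBatch])
    }
}

fn main() {
    let chunk = IngesterChunk;
    match chunk.data() {
        QueryChunkData::RecordBatches(batches) => {
            println!("{} in-memory batch(es)", batches.len())
        }
        QueryChunkData::Parquet(path) => println!("parquet file at {}", path.0),
    }
}
```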
With #5963 merged, all chunks now provide a summary (even though it may
not contain data for all columns). So let's make it mandatory, which
also removes a few 🙈-style `.expect(...)` calls.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Use the table summary instead. This allows us to have a single mechanism
that both IOx and DataFusion understand. This basically lifts the "basic
table summary" mechanism that the querier uses into `iox_query` and lets
the compactor and ingester use the same mechanism.
While not strictly necessary, simplifying the `QueryChunk[Meta]`
interface helps with #5897.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes the DmlWrite type to require a PartitionKey be specified,
instead of accepting an Option.
This requirement was already in place - the write buffer upheld an
invariant that all writes contained a partition key value (i.e. it was
never "None"), panicking at runtime when attempting to enqueue a write
without one.
It is now possible to encode this invariant in the type system, which is
what this change does.
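A sketch of encoding the invariant in the type system (the constructor is
heavily simplified; the real DmlWrite carries tables, sequence numbers,
etc.): the partition key is a required argument, so the previous runtime
panic becomes a compile-time error at the call site:

```rust
#[derive(Debug, Clone)]
struct PartitionKey(String);

struct DmlWrite {
    partition_key: PartitionKey,
}

impl DmlWrite {
    /// Before: `partition_key: Option<PartitionKey>` plus a runtime panic
    /// when enqueueing a write with `None`. Now "no partition key" is simply
    /// unrepresentable.
    fn new(partition_key: PartitionKey) -> Self {
        Self { partition_key }
    }

    fn partition_key(&self) -> &PartitionKey {
        &self.partition_key
    }
}

fn main() {
    let w = DmlWrite::new(PartitionKey("2022-11-01".to_string()));
    println!("write for partition {:?}", w.partition_key());
}
```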
This commit makes use of the partition buffer state machine introduced
in https://github.com/influxdata/influxdb_iox/pull/5943.
This commit significantly changes the buffering, and querying, of data
from a partition, swapping out the existing "DataBuffer" for the new
state machine implementation (itself simplified due to temporary lack of
incremental snapshot generation, see #5944).
This commit simplifies the query path, removing multiple types that
wrapped one-another to pass around various state necessary to perform a
query, with various query functions needing different types or
combinations of types. The query path now operates using a single type
(named "QueryAdaptor") that provides a queryable interface over the set
of RecordBatch returned from a partition.
There is significantly increased testing of the PartitionData itself,
covering data in various states and the ordering of returned RecordBatch
(to ensure correct materialisation of updates). There are also
invariants upheld by the type system / compiler to minimise the
complexities of working with empty batches & states, and many asserts
that ensure (mostly existing!) invariants are upheld.
This reverts commit c63312ce12.
The reverted change fixed a low-priority alert that fired when there was
no traffic flowing through the system, but the loss in TTBR value
fidelity due to bucketing is a greater concern: it affects live,
high-volume clusters and hinders operational insight.
This commit removes the on-demand, incremental snapshot generation
driven by queries.
This functionality is "on hold" due to concerns documented in:
https://github.com/influxdata/influxdb_iox/issues/5805
Incremental snapshots will be introduced alongside incremental
compactions of those same snapshots.
This commit introduces code that is intended to replace the current
implicit state machine used by PartitionData. The existing code is still
in use, the new code is NOT used in this commit. A follow-up commit will
switch over to minimise the diff.
This change has two main goals;
* encapsulation & simplification for callers
* robust implementation so developing correct additions is easier
This is a significant refactor of the partition buffering logic to
encapsulate the various states of data (buffering, snapshot, persisting
and the mixed states between them) within the Partition. This relieves
the rest of the system of having to be concerned with the differences
between "buffering" data, "unpersisted" data, "snapshot" data,
"persisting" data, "persisting with snapshots", etc. - callers now
invoke a method called get_query_data() and are provided with all the
relevant data for a partition. This abstraction change alone
significantly reduces code and test complexity in the rest of the
ingester.
For the second goal, the new implementation leverages an explicit state
machine, encoded using typestates. Typestate ensures compile-time
correctness of transitions and method calls, and the explicit FSM itself
helps ensure the system progresses in the desired manner - this fixes
and helps prevent bugs caused by implicit states such as:
https://github.com/influxdata/influxdb_iox/issues/5805
This state machine makes the system states explicit and
self-descriptive, helping to reduce the cost of developer on-boarding
(no prior knowledge of "how this bit works" is required) and the ongoing
developer burden. This explicit nature also de-risks adding new
functionality - it should be relatively easy to add concurrent snapshot
generation or incremental compaction without introducing bugs. The state
transition logic is abstracted away from callers, minimising the
overhead of this strategy.
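A sketch of the typestate pattern used here (states and methods are
illustrative, not the actual PartitionData FSM): each state is its own type
and a transition consumes the old state, so calling a method in the wrong
state fails to compile rather than failing at runtime:

```rust
struct Buffering {
    rows: Vec<String>,
}

struct Persisting {
    snapshot: Vec<String>,
}

struct Persisted;

impl Buffering {
    fn new() -> Self {
        Self { rows: Vec::new() }
    }

    fn write(&mut self, row: impl Into<String>) {
        self.rows.push(row.into());
    }

    /// Transition: buffering -> persisting. The buffer is consumed, so no
    /// caller can keep writing to a partition that is being persisted.
    fn begin_persist(self) -> Persisting {
        Persisting { snapshot: self.rows }
    }
}

impl Persisting {
    /// Transition: persisting -> persisted.
    fn complete(self) -> Persisted {
        println!("persisted {} rows", self.snapshot.len());
        Persisted
    }
}

fn main() {
    let mut state = Buffering::new();
    state.write("cpu,host=a usage=1");

    let persisting = state.begin_persist();
    // `state.write(...)` here would no longer compile - the value was moved.
    let _done = persisting.complete();
}
```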
Asserts write buffer seeking behaviour, including:
* Seeking past already persisted data correctly
* Skipping to next available op in non-contiguous offset stream
* Skipping to next available op for dropped ops due to retention
* Panicking when seeking beyond available data (into the future)
Removes a pair of tests that covered some of the above due to their
tight coupling with ingester internals.
This commit adds a new test that exercises all major external APIs of
the ingester:
* Writing data via the write buffer
* Waiting for data to be readable via the progress API
* Querying data and asserting the contents
This should provide basic integration coverage for the Ingester
internals. This commit also removes a similar test (though with less
coverage) that was tightly coupled to the existing buffering structures.
Adds a test helper type that maintains the in-memory state for a single
ingester integration test, and provides easy-to-use methods to
manipulate and inspect the ingester instance.
A function to map the complex IngesterQueryResponse type to a simple set
of RecordBatch already existed in test code - this has been lifted onto
an inherent method on the response type itself for reuse.
* fix: only emit ttbr metric for applied ops
* fix: move DmlApplyAction to s/w accessible
* chore: test for skipped ingest; comments and log improvements
* fix: fixed ingester test re skipping write
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: Avoid some allocations by collecting instead of inserting into a vec
* refactor: Encode that adding columns is for one table at a time
* test: Add another test of column limits
* test: Add below/above limit tests for create_or_get_many
* fix: Explicitly DO NOT check column limits when inserting many columns
* feat: Cache the max_columns_per_table on the NamespaceSchema
* feat: Add a function to validate column limits in-memory
* fix: Provide more useful information when over column limits
* fix: Swap types to remove intermediate allocation
* docs: Explain the interactions of the cache and the column limits
* test: Actually set up test that showcases column limit race condition
* fix: Allow writing to existing columns even if table is over column limit
Co-authored-by: Dom <dom@itsallbroken.com>
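A sketch of the in-memory check described above (field and function names
are illustrative): the cached max_columns_per_table is used to reject writes
that would add new columns over the limit, while writes touching only
existing columns are always allowed, even if the table is already over it:

```rust
use std::collections::HashSet;

struct TableSchema {
    columns: HashSet<String>,
}

struct NamespaceSchema {
    max_columns_per_table: usize,
}

#[derive(Debug)]
struct OverColumnLimit {
    existing: usize,
    attempted_new: usize,
    max: usize,
}

fn validate_column_limit(
    ns: &NamespaceSchema,
    table: &TableSchema,
    write_columns: &[&str],
) -> Result<(), OverColumnLimit> {
    let new: Vec<_> = write_columns
        .iter()
        .filter(|c| !table.columns.contains(**c))
        .collect();

    // Writes to existing columns never fail this check.
    if !new.is_empty() && table.columns.len() + new.len() > ns.max_columns_per_table {
        return Err(OverColumnLimit {
            existing: table.columns.len(),
            attempted_new: new.len(),
            max: ns.max_columns_per_table,
        });
    }
    Ok(())
}

fn main() {
    let ns = NamespaceSchema { max_columns_per_table: 3 };
    let table = TableSchema {
        columns: ["time", "host", "usage"].iter().map(|s| s.to_string()).collect(),
    };

    // Existing columns only: accepted, despite the table being at the limit.
    assert!(validate_column_limit(&ns, &table, &["time", "usage"]).is_ok());
    // A new column would exceed the limit: rejected with a useful error.
    assert!(validate_column_limit(&ns, &table, &["region"]).is_err());
}
```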