This commit adds initial support for "soft" namespace deletion, where
the underlying records & data remain, but are no longer queryable /
writable.
Soft deletion is eventually consistent - users can expect to continue
writing to and reading from a bucket after issuing a soft delete call,
until the various components either restart, or have their caches
flushed.
The components treat soft-deleted namespaces differently:
* router: ignore soft deleted namespaces
* ingester: accept soft deleted namespaces
* compactor: accept soft deleted namespaces
* querier: ignore soft deleted namespaces
* various gRPC services: ignore soft deleted namespaces
This ensures that the ingester & compactor do not see rows "vanishing"
from the database, and continue to make forward progress.
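As a rough illustration of that split, the sketch below models a
namespace record with an optional deletion marker and two lookup paths:
a router/querier-style lookup that treats a soft-deleted namespace as
absent, and an ingester/compactor-style lookup that does not. The types
and function names are hypothetical, not the actual catalog or cache
API.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// Hypothetical, simplified namespace record; the real catalog schema differs.
#[derive(Debug)]
struct Namespace {
    /// Set when a soft delete is issued; the data itself is retained.
    deleted_at: Option<SystemTime>,
}

/// Router/querier view: a soft-deleted namespace resolves as if it did not exist.
fn resolve_for_write<'a>(
    cache: &'a HashMap<String, Namespace>,
    name: &str,
) -> Option<&'a Namespace> {
    cache.get(name).filter(|ns| ns.deleted_at.is_none())
}

/// Ingester/compactor view: soft-deleted namespaces remain visible, so
/// buffered data keeps flowing through persistence and compaction.
fn resolve_for_persist<'a>(
    cache: &'a HashMap<String, Namespace>,
    name: &str,
) -> Option<&'a Namespace> {
    cache.get(name)
}

fn main() {
    let mut cache = HashMap::new();
    cache.insert(
        "bananas".to_string(),
        Namespace {
            deleted_at: Some(SystemTime::now()), // soft-deleted
        },
    );

    assert!(resolve_for_write(&cache, "bananas").is_none()); // router ignores it
    assert!(resolve_for_persist(&cache, "bananas").is_some()); // ingester accepts it
}
```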
Writes for the deleted namespace that are buffered in the ingester will
be persisted as normal, allowing us to support "un-delete" operations
where the system is restored to the state at which the delete was
issued (rather than losing the buffered data).
Follow-on work is required to ensure GC drops the orphaned parquet files
after the configured GC time, and optimisations such as not compacting
parquet from soft-deleted namespaces seem like a trivial win.
This fixes an issue where persistence that never completes blocks the
periodic enqueuing of persist tasks - this causes the amount of buffered
data in the buffer tree to grow while the persist queue depth stays the
same, instead of the buffer being drained.
This is an issue as the queue depth is designed to act as the
back-pressure mechanism of the ingester - once the depth exceeds a
configurable limit, further writes are rejected until the queue has
drained sufficiently (to 50% of the limit).
After this commit, stalled persistence (e.g. an object store outage)
will not prevent the queue depth from growing, which should enable the
saturation protection to kick in.
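For illustration, here is a minimal sketch of the queue-depth hysteresis
described above - reject writes once the depth reaches a limit, and
accept them again once it drains back to 50%. The type and method names
are invented for the example, not the ingester's actual types.

```rust
/// A minimal sketch of queue-depth based back-pressure, assuming a
/// configurable limit and a "drained to 50%" resume threshold.
#[derive(Debug)]
struct PersistQueueGate {
    /// Writes are rejected once the queue depth reaches this limit.
    max_depth: usize,
    /// Current number of enqueued (not yet completed) persist jobs.
    depth: usize,
    /// Set while saturated; cleared once the queue drains to 50%.
    saturated: bool,
}

impl PersistQueueGate {
    fn new(max_depth: usize) -> Self {
        Self { max_depth, depth: 0, saturated: false }
    }

    /// Called when a persist job is enqueued.
    fn enqueue(&mut self) {
        self.depth += 1;
        if self.depth >= self.max_depth {
            self.saturated = true;
        }
    }

    /// Called when a persist job completes.
    fn complete(&mut self) {
        self.depth = self.depth.saturating_sub(1);
        if self.saturated && self.depth <= self.max_depth / 2 {
            self.saturated = false;
        }
    }

    /// Writes are rejected while saturated, providing back-pressure.
    fn writes_allowed(&self) -> bool {
        !self.saturated
    }
}

fn main() {
    let mut gate = PersistQueueGate::new(4);
    for _ in 0..4 {
        gate.enqueue();
    }
    assert!(!gate.writes_allowed()); // limit hit: reject further writes

    gate.complete();
    gate.complete(); // depth back to 2 == 50% of the limit
    assert!(gate.writes_allowed()); // drained sufficiently: accept writes again
}
```

Note the queue depth must keep growing even when no persist job ever
completes, otherwise the saturation check above never trips - which is
exactly the behaviour this commit restores.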
Adds two metrics:
* Number of files replayed (counted at the start of replay, not at completion)
* Number of applied ops
This will help identify when WAL replay is happening (an indication of
an ungraceful shutdown & potential temporary read unavailability).
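A minimal sketch of the two counters is shown below, using plain atomics
in place of the project's metric registry; the names are illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical holder for the two WAL replay counters.
#[derive(Debug, Default)]
struct WalReplayMetrics {
    /// Incremented when replay of a WAL segment file *starts*, so a crash
    /// mid-replay is still visible in the metrics.
    files_replayed: AtomicU64,
    /// Incremented for every operation applied to the buffer during replay.
    ops_applied: AtomicU64,
}

impl WalReplayMetrics {
    fn file_replay_started(&self) {
        self.files_replayed.fetch_add(1, Ordering::Relaxed);
    }
    fn op_applied(&self) {
        self.ops_applied.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let metrics = WalReplayMetrics::default();

    // Hypothetical replay loop: one segment file containing three ops.
    metrics.file_replay_started();
    for _op in 0..3 {
        metrics.op_applied();
    }

    assert_eq!(metrics.files_replayed.load(Ordering::Relaxed), 1);
    assert_eq!(metrics.ops_applied.load(Ordering::Relaxed), 3);
}
```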
* feat: introduce a new way of handling max_sequence_number for ingester, compactor and querier
* chore: cleanup
* feat: new column max_l0_created_at to order files for deduplication
* chore: cleanup
* chore: debug info for changing cpu.parquet
* fix: update test parquet file
Co-authored-by: Marco Neumann <marco@crepererum.net>
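As a rough sketch of how the new max_l0_created_at column can be used,
the example below orders a set of (heavily simplified, hypothetical)
file records by that timestamp so that rows from later-created files
take precedence during deduplication; the real catalog types and query
logic differ.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical, trimmed-down file metadata; the real catalog row has
/// many more fields.
#[derive(Debug)]
struct FileForDedupe {
    id: u64,
    /// New column: when the data in this file was first created at L0.
    max_l0_created_at: SystemTime,
}

/// Order files by max_l0_created_at so deduplication sees older data
/// first and later writes win, regardless of compaction order.
fn sort_for_dedupe(files: &mut [FileForDedupe]) {
    files.sort_by_key(|f| f.max_l0_created_at);
}

fn main() {
    let base = SystemTime::UNIX_EPOCH;
    let mut files = vec![
        FileForDedupe { id: 2, max_l0_created_at: base + Duration::from_secs(20) },
        FileForDedupe { id: 1, max_l0_created_at: base + Duration::from_secs(10) },
    ];
    sort_for_dedupe(&mut files);
    assert_eq!(files[0].id, 1); // oldest data first; newest wins on duplicates
}
```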
Record latency histograms for DmlSink::apply() calls, configuring
ingester2 to report the overall write path latency, and separately the
buffer apply latency.
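The layering can be pictured with the simplified sketch below: a timing
decorator wraps an inner sink, and two decorators are stacked to measure
the overall write path and the buffer apply separately. The trait here
is a synchronous stand-in for DmlSink, and a Vec<Duration> stands in for
a real histogram.

```rust
use std::time::{Duration, Instant};

/// Simplified, synchronous stand-in for the DmlSink trait.
trait Sink {
    fn apply(&mut self, op: &str);
}

/// Decorator recording the latency of each apply() call.
struct SinkInstrumentation<T> {
    inner: T,
    latencies: Vec<Duration>,
}

impl<T: Sink> Sink for SinkInstrumentation<T> {
    fn apply(&mut self, op: &str) {
        let start = Instant::now();
        self.inner.apply(op);
        self.latencies.push(start.elapsed());
    }
}

/// A no-op sink standing in for the buffer tree.
struct Buffer;
impl Sink for Buffer {
    fn apply(&mut self, _op: &str) {}
}

fn main() {
    // Layered as described: the outer decorator measures the overall write
    // path, the inner one measures just the buffer apply.
    let buffer_apply = SinkInstrumentation { inner: Buffer, latencies: Vec::new() };
    let mut write_path = SinkInstrumentation { inner: buffer_apply, latencies: Vec::new() };

    write_path.apply("cpu,host=a usage=1");
    assert_eq!(write_path.latencies.len(), 1);
    assert_eq!(write_path.inner.latencies.len(), 1);
}
```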
Adds metrics to track the distribution of the duration spent actively
persisting a batch of partition data (compacting, generating parquet,
uploading, making DB entries, etc.) and another tracking the duration an
entry spends in the persist queue.
Together these provide a measurement of the latency of persist requests,
and because they contain event counters, they also provide the
throughput and the number of outstanding jobs.
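A simplified sketch of the two measurements, assuming hypothetical types
and using Vec<Duration> in place of real duration histograms: the queue
time runs from enqueue to dequeue, and the active time covers the
persist work itself.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-job timing data.
struct PersistJob {
    enqueued_at: Instant,
}

struct PersistTimings {
    /// Time spent waiting in the persist queue.
    queue_durations: Vec<Duration>,
    /// Time spent actively persisting (compact, parquet, upload, catalog).
    active_durations: Vec<Duration>,
}

impl PersistTimings {
    fn record(&mut self, job: PersistJob, persist: impl FnOnce()) {
        let dequeued_at = Instant::now();
        self.queue_durations.push(dequeued_at - job.enqueued_at);

        persist();
        self.active_durations.push(dequeued_at.elapsed());
    }
}

fn main() {
    let mut timings = PersistTimings {
        queue_durations: Vec::new(),
        active_durations: Vec::new(),
    };

    let job = PersistJob { enqueued_at: Instant::now() };
    timings.record(job, || {
        // Stand-in for compaction + parquet generation + upload + catalog update.
    });

    assert_eq!(timings.queue_durations.len(), 1);
    assert_eq!(timings.active_durations.len(), 1);
}
```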
Changes the persist system to call into an abstract
PersistCompletionObserver after the persist task has completed, but
before releasing the job permit / notifying the enqueuer.
This call happens synchronously, driven to completion by the persist
worker. A sync construct can easily be made async (by enqueuing work
into a channel), but not the other way around, so this gives the best
flexibility.
This trait allows pluggable logic to be inserted into the persist
system, without tightly coupling it to the implementer's logic (for
example, replication). One or more observers may be chained together to
construct an arbitrary sequence of actors.
This commit uses a no-op observer, causing no functional change to the
system.
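A minimal sketch of the shape of this abstraction is below - a
trimmed-down, synchronous version with invented field names; the real
trait and completion type carry more information.

```rust
/// Hypothetical summary of a completed persist job.
struct CompletedPersist {
    namespace_id: i64,
    table_id: i64,
    partition_id: i64,
}

trait PersistCompletionObserver {
    /// Called by the persist worker after the persist task completes, but
    /// before the job permit is released / the enqueuer is notified.
    fn persist_complete(&self, note: &CompletedPersist);
}

/// The default: observe nothing, change nothing.
struct NopObserver;
impl PersistCompletionObserver for NopObserver {
    fn persist_complete(&self, _note: &CompletedPersist) {}
}

/// An example observer - e.g. the hook a replication implementation could use.
struct LoggingObserver;
impl PersistCompletionObserver for LoggingObserver {
    fn persist_complete(&self, note: &CompletedPersist) {
        println!(
            "persisted namespace={} table={} partition={}",
            note.namespace_id, note.table_id, note.partition_id
        );
    }
}

/// Observers can be chained to build an arbitrary sequence of actors.
struct Chain<A, B>(A, B);
impl<A, B> PersistCompletionObserver for Chain<A, B>
where
    A: PersistCompletionObserver,
    B: PersistCompletionObserver,
{
    fn persist_complete(&self, note: &CompletedPersist) {
        self.0.persist_complete(note);
        self.1.persist_complete(note);
    }
}

fn main() {
    let observer = Chain(NopObserver, LoggingObserver);
    observer.persist_complete(&CompletedPersist {
        namespace_id: 1,
        table_id: 2,
        partition_id: 3,
    });
}
```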
Adds an integration test of the persist system, covering:
* Node A starts a persist operation
* Node B starts a persist operation for the same partition
* Node A completes, setting the catalog sort key to a new value
* Node B attempts to update the catalog, observing the new sort key
* Node B re-compacts the data, re-uploads, and drives to completion
This scenario is/was tracked in:
https://github.com/influxdata/influxdb_iox/issues/6439
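The conflict-handling step the test exercises can be sketched as a
compare-and-swap on the partition's sort key, as below; the types and
error shape are illustrative rather than the catalog's actual API.

```rust
/// Illustrative sort key: an ordered list of column names.
#[derive(Debug, Clone, PartialEq)]
struct SortKey(Vec<String>);

enum CasError {
    /// The catalog held a different sort key than expected; the caller must
    /// re-compact with the observed key and try again.
    ValueMismatch(SortKey),
}

struct Partition {
    sort_key: SortKey,
}

impl Partition {
    /// Update the sort key only if it still matches what the caller observed.
    fn cas_sort_key(&mut self, observed: &SortKey, new: SortKey) -> Result<(), CasError> {
        if &self.sort_key != observed {
            return Err(CasError::ValueMismatch(self.sort_key.clone()));
        }
        self.sort_key = new;
        Ok(())
    }
}

fn main() {
    let empty = SortKey(vec![]);
    let mut partition = Partition { sort_key: empty.clone() };

    // Node A completes first and sets a new sort key.
    let a_key = SortKey(vec!["host".into(), "time".into()]);
    assert!(partition.cas_sort_key(&empty, a_key.clone()).is_ok());

    // Node B, still holding the stale (empty) key, observes the mismatch and
    // must re-compact against the key node A wrote before retrying.
    match partition.cas_sort_key(&empty, SortKey(vec!["region".into(), "time".into()])) {
        Err(CasError::ValueMismatch(observed)) => assert_eq!(observed, a_key),
        _ => unreachable!("expected a sort key conflict"),
    }
}
```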
The persist::Context struct carries the data to be persisted, a
reference to the partition from which it came, and various cached fields
to avoid re-acquiring the partition read lock all the time.
Prior to this commit, the Context also had the full persist logic as
methods, invoked by the persist worker. This tightly couples the data &
logic - it's fairly clear a worker should implement the work and operate
on the data, not commingle the two. I even knew the mess I was making
when I wrote it, but effectively copy-pasted it from ingester1 because
of deadlines.
This commit decouples the persist logic from the Context.
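In sketch form (with hypothetical names), the change moves from methods
on the Context to worker-driven functions that merely read from it:

```rust
/// Data carrier only: what to persist and where it came from.
struct Context {
    namespace_id: i64,
    table_id: i64,
    partition_id: i64,
}

/// The persist steps live with the worker and operate on the Context data,
/// rather than being methods on Context itself.
fn compact(ctx: &Context) {
    println!(
        "compacting namespace={} table={} partition={}",
        ctx.namespace_id, ctx.table_id, ctx.partition_id
    );
}

fn upload(ctx: &Context) {
    println!("uploading parquet for partition={}", ctx.partition_id);
}

fn update_catalog(ctx: &Context) {
    println!("updating catalog for partition={}", ctx.partition_id);
}

/// The persist worker drives the sequence to completion.
fn persist_worker(ctx: Context) {
    compact(&ctx);
    upload(&ctx);
    update_catalog(&ctx);
}

fn main() {
    persist_worker(Context { namespace_id: 1, table_id: 2, partition_id: 3 });
}
```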
The query API exposes a unique-per-instance UUID to allow callers to
detect a crash of the ingester process - this was initialised directly
in the query RPC handler.
This commit turns the bare UUID into a type, and initialises it in the
top-level initialisation of the ingester, plumbing it down into the
query RPC handler.
This allows the UUID to be reused by other components/handlers.
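A minimal sketch of the newtype, assuming the uuid crate (with the v4
feature) and invented names for the surrounding types:

```rust
use uuid::Uuid;

/// A newtype over the process-lifetime UUID, generated once at ingester
/// initialisation and plumbed down to the query RPC handler.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct IngesterId(Uuid);

impl IngesterId {
    fn new() -> Self {
        Self(Uuid::new_v4())
    }
}

struct QueryHandler {
    /// Returned with every query response; a change in value tells the
    /// caller the ingester restarted and its previously-read state is gone.
    ingester_id: IngesterId,
}

fn main() {
    // Initialised once, at the top level, then shared with any
    // component/handler that needs it.
    let id = IngesterId::new();
    let handler = QueryHandler { ingester_id: id };
    assert_eq!(handler.ingester_id, id);
}
```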
The ingester no longer needs to access a specific PartitionData by ID
(they are addressed either via an iterator over the BufferTree, or
shared by Arc reference).
This allows us to remove the extra map maintaining ID -> PartitionData
references, and the shared access lock protecting it.
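As a rough sketch (with simplified, hypothetical types), partitions are
now reached either by iterating the tree or via an Arc handle obtained
earlier, so no ID-keyed map or lock is required:

```rust
use std::sync::{Arc, Mutex};

/// Simplified stand-ins; the real BufferTree nests namespaces and tables.
struct PartitionData {
    rows: usize,
}

struct BufferTree {
    partitions: Vec<Arc<Mutex<PartitionData>>>,
}

impl BufferTree {
    /// Callers either iterate over all partitions...
    fn partition_iter(&self) -> impl Iterator<Item = Arc<Mutex<PartitionData>>> + '_ {
        self.partitions.iter().map(Arc::clone)
    }
}

fn main() {
    let tree = BufferTree {
        partitions: vec![Arc::new(Mutex::new(PartitionData { rows: 42 }))],
    };

    // ...or hold onto an Arc they were previously handed - no ID ->
    // PartitionData map (and its lock) is needed for lookups.
    let by_ref: Arc<Mutex<PartitionData>> = tree.partition_iter().next().unwrap();
    assert_eq!(by_ref.lock().unwrap().rows, 42);
}
```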