The query API exposes a unique-per-instance UUID to allow callers to
detect a crash of the ingester process - this was initialised directly
in the query RPC handler.
This commit turns the bare UUID into a type, and initialises it in the
top-level initialisation of the ingester, plumbing it down into the
query RPC handler.
This allows the UUID to be reused by other components/handlers.
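A minimal sketch of the shape this takes, assuming a hypothetical
IngesterUuid newtype and QueryHandler struct (the real IOx identifiers
differ):

```rust
use uuid::Uuid;

/// Generated once per ingester process at start-up; a caller observing
/// a different value than before can infer the process has restarted.
/// (Illustrative type name - the real IOx identifier may differ.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IngesterUuid(Uuid);

impl IngesterUuid {
    pub fn new() -> Self {
        Self(Uuid::new_v4())
    }
}

/// The query RPC handler now receives the value instead of creating it.
pub struct QueryHandler {
    ingester_uuid: IngesterUuid,
}

/// Top-level initialisation: create the UUID once and plumb it down
/// into every component/handler that needs it.
pub fn init() -> QueryHandler {
    let ingester_uuid = IngesterUuid::new();
    QueryHandler { ingester_uuid }
}
```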
The ingester no longer needs to access a specific PartitionData by ID
(they are addressed either via an iterator over the BufferTree, or
shared by Arc reference).
This allows us to remove the extra map maintaining ID -> PartitionData
references, and the shared access lock protecting it.
Prior to this commit, the (happy path) shutdown sequence of an IOx
process was hard coded to:
1. Stop gRPC & HTTP servers
2. Stop backend server (i.e. ingester2)
After this commit, the execution of step 1 is delegated to the handler
for step 2; the server implementation (router / ingester / querier /
etc.) now chooses when to shut down the RPC & HTTP servers.
This allows the server shutdown delegate to correctly sequence the
shutdown of all components of the IOx server. In particular, ingester2
can now correctly sequence the shutdown of the query RPC server w.r.t.
the graceful stop & persist, ensuring queries continue to be serviced.
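To illustrate the inversion, a hedged sketch with hypothetical
FrontendShutdown / ServerType names standing in for the real IOx
plumbing:

```rust
/// Hypothetical handle that stops the gRPC & HTTP frontends when
/// invoked (a stand-in for whatever the real shutdown plumbing passes).
pub struct FrontendShutdown;

impl FrontendShutdown {
    pub fn stop(self) {
        // signal the RPC & HTTP servers to stop accepting requests
    }
}

/// Each server implementation is handed the frontend shutdown handle
/// and chooses when to invoke it during its own teardown.
pub trait ServerType {
    fn shutdown(&self, frontend: FrontendShutdown);
}

pub struct Ingester2;

impl ServerType for Ingester2 {
    fn shutdown(&self, frontend: FrontendShutdown) {
        // 1. block new writes and persist all buffered data, leaving
        //    the query RPC server running so queries keep being served
        // 2. only once the graceful stop & persist completes, stop the
        //    RPC & HTTP frontends
        frontend.stop();
    }
}
```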
Persist all buffered data when gracefully stopping an ingester2
instance.
This implementation accounts for both late-arriving writes and
concurrent persist tasks - it's carefully constructed such that it can
discover the presence of, and wait for, outstanding persist tasks
started by other code without having to know about all the possible
places a persist task can be started (currently WAL rotation & hot
partition persistence, but later also an RPC endpoint).
There exists a small race that seems so incredibly unlikely to occur
that I didn't cover it off (doing so would add a cost to the RPC write
path for little gain). This is documented in the code comments.
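One possible shape for discovering and waiting on those tasks is
sketched below; the Partition type and the Arc-strong-count marker are
illustrative assumptions, not a description of the exact mechanism used:

```rust
use std::sync::{Arc, Mutex};

/// Illustrative stand-in for a partition in the buffer tree. Any code
/// that starts a persist task for this partition clones `persisting`
/// and drops the clone when the task completes, so the Arc strong count
/// reveals in-flight persists without the shutdown code knowing which
/// component (WAL rotation, hot partition persistence, ...) started
/// them.
pub struct Partition {
    buffered_rows: Mutex<usize>,
    persisting: Arc<()>,
}

impl Partition {
    /// True if this partition holds unpersisted data, or a persist task
    /// for it is still in flight (strong count > 1 means a task holds a
    /// clone of the marker).
    pub fn is_dirty(&self) -> bool {
        *self.buffered_rows.lock().unwrap() > 0
            || Arc::strong_count(&self.persisting) > 1
    }
}

/// The graceful-stop loop enqueues persists for any buffered data, then
/// waits until a full pass observes no dirty partitions, repeating to
/// catch late-arriving writes.
pub fn all_drained(partitions: &[Partition]) -> bool {
    partitions.iter().all(|p| !p.is_dirty())
}
```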
Prior to this commit, when initialising the persist system it would
return a PersistState instance, used to communicate the saturation
status of the persistence system. The RPC write path used this
information to accept or deny write requests accordingly.
This was unfortunate in that it tightly coupled the ingest handler to
the persist system - in order to initialise the RPC handler, you had to
provide a PersistState; this required us to initialise a persist system
when testing only the RPC handler (which had nothing to do with
persisting). This smells!
This commit inverts the dependency, and decouples the subsystems via a
shared type (IngestState). Instead of the persist system telling ingest
to stop, the ingest system provides a means to be told to stop - this
subtle difference decouples the ingest handler from all components that
need to block ingest. This allows a fast O(1) error state read for N
components, and prevents us from having to start N components just to
test an RPC handler.
Additionally this commit introduces an unused ingest error state
(GracefulStop) as part of figuring out the API (to be used shortly).
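A rough sketch of the decoupled type, with the variant and method names
approximated from this description (the real IngestState differs in
detail):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Reasons ingest may be blocked. Any component can set one of these;
/// the RPC write handler only ever reads the shared state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IngestStateError {
    /// The persist system is saturated and cannot accept more work.
    PersistSaturated,
    /// The ingester is gracefully stopping (currently unused).
    GracefulStop,
}

#[derive(Debug, Default)]
pub struct IngestState {
    // 0 = ok, 1 = persist saturated, 2 = graceful stop
    state: AtomicUsize,
}

impl IngestState {
    /// Mark ingest as blocked with the given error.
    pub fn set(&self, err: IngestStateError) {
        self.state.store(err as usize + 1, Ordering::Relaxed);
    }

    /// Clear the error state, re-enabling ingest.
    pub fn unset(&self) {
        self.state.store(0, Ordering::Relaxed);
    }

    /// O(1) read performed by the RPC write path for every request.
    pub fn read(&self) -> Result<(), IngestStateError> {
        match self.state.load(Ordering::Relaxed) {
            0 => Ok(()),
            1 => Err(IngestStateError::PersistSaturated),
            2 => Err(IngestStateError::GracefulStop),
            _ => unreachable!(),
        }
    }
}
```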
Include the number of DML operations applied to the persisted buffer
in the "persisted partition" message.
Partly because I'm intrigued / it's useful information, and partly to
ensure LLVM doesn't get snazzy and dead-code the sequence number
tracking because it was never read.
Changes the ingester2 buffer FSM to track the sequence numbers that have
been applied to it.
This is a pre-requisite for replication & correct WAL segment dropping.
Previously the ingester (v1) required writes to be applied in order;
this requirement has been relaxed, and the corresponding asserts have
already been removed in ingester2.
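A minimal sketch of the tracking, assuming a simple set-based
representation (the real FSM and its sequence-number types differ):

```rust
use std::collections::BTreeSet;

/// Illustrative sketch only, not the actual ingester2 FSM: the
/// buffering state records every sequence number applied to it, without
/// requiring that writes arrive in order. The resulting set is what
/// replication and WAL segment dropping need to know which operations a
/// buffer (and later a persisted file) covers.
#[derive(Debug, Default)]
struct Buffering {
    /// Sequence numbers applied to this buffer, in any order.
    applied: BTreeSet<u64>,
}

impl Buffering {
    fn apply(&mut self, sequence_number: u64 /*, data: ... */) {
        // note: no assert that sequence_number > max(applied) - ordered
        // application is no longer required in ingester2
        self.applied.insert(sequence_number);
    }
}
```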
* feat: function to get partition candidates from partition table
* chore: cleanup
* fix: make new_file_at the same value as created_at
* chore: cleanup
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes the WAL replay logic to:
* Replay a segment file
* Persist all replayed data
* Drop segment file
* ...repeat...
This ensures old WAL segments are removed once their contents have been
made durable, fixing #6461.
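The same sequence sketched with placeholder types - none of these names
are the real wal / ingester2 APIs:

```rust
/// Illustrative placeholders used to sketch the new replay sequence.
struct Segment {
    id: u64,
    ops: Vec<String>,
}

struct Wal {
    segments: Vec<Segment>,
}

impl Wal {
    fn closed_segments(&self) -> &[Segment] {
        &self.segments
    }
    fn delete(&self, _id: u64) {
        // remove the segment file from disk
    }
}

fn apply_to_buffer(_op: &str) {
    // replay one DML op into the buffer tree
}

fn persist_all_buffered_data() {
    // block until the replayed data is durable in object storage
}

/// Replay a segment, persist the replayed data, and only then drop the
/// segment file, so old segments are removed once their contents are
/// durable (#6461).
fn replay(wal: &Wal) {
    for segment in wal.closed_segments() {
        for op in &segment.ops {
            apply_to_buffer(op);
        }
        persist_all_buffered_data();
        wal.delete(segment.id);
    }
}
```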
Removes the Clone bound from PersistQueue, also removing the Clone impl
from the PersistHandle.
Instead of wrapping each piece of internal PersistHandle state in its
own Arc, this commit changes the system to share a single Arc-wrapped
PersistHandle.
Multiple components of the ingester depend on being able to enqueue a
partition's data for persistence. This commit decouples those components
from the concrete PersistHandle by introducing a PersistQueue trait that
defines the desired behaviour, on which the components depend.
This is a much needed clean-up of something I knowingly punted on for
the MVP, and I feel much better about the situation now!
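A sketch of the decoupling, using hypothetical type and method names in
place of the real IOx ones:

```rust
use std::sync::Arc;

/// Illustrative placeholders for the sketch below.
pub struct PartitionData;
pub struct CompletionHandle;

/// Components depend on this behaviour rather than on the concrete
/// PersistHandle, which is now shared as a single Arc instead of being
/// Clone.
pub trait PersistQueue {
    /// Enqueue a partition's data for persistence, returning a handle
    /// that resolves when the persist job completes.
    fn enqueue(&self, partition: Arc<PartitionData>) -> CompletionHandle;
}

pub struct PersistHandle {
    // worker channels, semaphores, etc. - no longer individually
    // Arc-wrapped
}

impl PersistQueue for PersistHandle {
    fn enqueue(&self, _partition: Arc<PartitionData>) -> CompletionHandle {
        // hand the job to a persist worker (elided)
        CompletionHandle
    }
}

/// Callers depend on the trait, so tests can substitute a mock queue.
pub fn enqueue_all(persist: &impl PersistQueue, partitions: Vec<Arc<PartitionData>>) {
    for p in partitions {
        let _done = persist.enqueue(p);
    }
}
```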
The persist_buffer() fn iterates over all the partitions in a BufferTree
and persists them - however it only depends on one behaviour: obtaining
an iterator of partitions.
This commit introduces the PartitionIter, an abstraction over anything
that can produce an iterator of PartitionData, decoupling the
persist_buffer() helper (and the callers!) from the concrete BufferTree
type.
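A sketch of such an abstraction (the item type and exact signature are
assumptions):

```rust
use std::sync::{Arc, Mutex};

/// Illustrative placeholder for a partition's buffered data.
pub struct PartitionData;

/// Anything that can yield the partitions it holds - a BufferTree, or a
/// plain Vec in a test - can be handed to persist_buffer().
pub trait PartitionIter {
    fn partition_iter(&self) -> Box<dyn Iterator<Item = Arc<Mutex<PartitionData>>> + Send>;
}

/// Tests can use a simple Vec instead of building a full BufferTree.
impl PartitionIter for Vec<Arc<Mutex<PartitionData>>> {
    fn partition_iter(&self) -> Box<dyn Iterator<Item = Arc<Mutex<PartitionData>>> + Send> {
        Box::new(self.clone().into_iter())
    }
}

pub fn persist_buffer(buffer: &impl PartitionIter) {
    for _partition in buffer.partition_iter() {
        // enqueue each partition for persistence (elided)
    }
}
```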
The sort-key conflict path invalidated the cached sort key in the
PartitionData, but not the cached sort key in the persist's Context. Now
both are invalidated.
Expose a metric ("ingester_persist_saturated_duration_ns") that records
the cumulative duration of time the persist system has spent in the
"saturated" state.
Move the persist worker types & routines into a separate worker module
rather than commingling them with the persist handle, and rename the
unimaginatively named "inner" to reflect its actual usage.
As an optimisation, allow a persist task to progress if it observes a
concurrent catalog sort key update that exactly matches the sort key it
was committing.
Allow an ingester2 instance to tolerate concurrent partition sort key
updates in the catalog.
A persist job is optimistically executed with the locally cached sort
key. If an ingester2 instance observes a concurrent update, it aborts
both the sort key update, and the overall persist operation (before
making the parquet file visible) and retries the operation with the
newly observed sort key. Concurrent sort key updates are theorised to be
relatively rare overall.
Any orphaned parquet files uploaded as part of a persist job that aborts
due to a concurrent sort key update are eventually removed by the
(external) object store GC task.
See https://github.com/influxdata/influxdb_iox/issues/6439
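The optimistic retry shape, including the exact-match optimisation
above, sketched with invented types and a stubbed CAS call:

```rust
/// Illustrative placeholders - not the real IOx catalog API.
#[derive(Debug, Clone, PartialEq)]
pub struct SortKey(Vec<String>);

pub enum CasError {
    /// Another process updated the sort key first; the value it wrote
    /// is returned so the caller can react to it.
    Observed(SortKey),
}

/// Stub of a compare-and-swap sort key update against the catalog.
fn cas_sort_key(_expected: &Option<SortKey>, new: &SortKey) -> Result<SortKey, CasError> {
    Ok(new.clone())
}

/// Commit the sort key computed from the locally cached value; on
/// observing a concurrent update, either accept it (if it matches
/// exactly) or retry against it. The surrounding persist job aborts
/// before making the parquet file visible and re-runs with the newly
/// observed key (elided here).
pub fn commit_sort_key(mut cached: Option<SortKey>, wanted: SortKey) -> SortKey {
    loop {
        match cas_sort_key(&cached, &wanted) {
            Ok(applied) => return applied,
            Err(CasError::Observed(observed)) if observed == wanted => {
                // a concurrent update that wrote exactly the key we
                // were about to write is not a conflict
                return observed;
            }
            Err(CasError::Observed(observed)) => {
                // retry the compare-and-swap with the newly observed
                // key as the expected current value
                cached = Some(observed);
            }
        }
    }
}
```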
Updating the sort key is not commutative and MUST be serialised. The
correctness of the current catalog interface relies on the caller
serialising updates globally, something the caller cannot reasonably
guarantee in a distributed system.
This change of the catalog interface pushes this responsibility to the
catalog itself where it can be effectively enforced, and allows a caller
to detect parallel updates to the sort key.
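A sketch of what such a compare-and-swap style interface can look like;
the signature and type names are approximations, not the actual IOx
catalog trait:

```rust
/// The update was not applied, or the query failed.
pub enum CasFailure<T> {
    /// The value currently stored in the catalog, returned so the
    /// caller can detect (and react to) the parallel update.
    ValueMismatch(T),
    /// The underlying catalog query failed.
    QueryError(String),
}

pub trait PartitionRepo {
    /// Apply `new_sort_key` only if `old_sort_key` is what the catalog
    /// currently holds; otherwise return the observed value.
    fn cas_sort_key(
        &mut self,
        partition_id: i64,
        old_sort_key: Option<Vec<String>>,
        new_sort_key: Vec<String>,
    ) -> Result<(), CasFailure<Vec<String>>>;
}
```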