docs: ingester overview documentation

Adds "overview" documentation for the ingester, including the high-level
purpose & design. Each subsystem is briefly documented, with links for
jumping-off points into more specific documentation.
pull/24376/head
Dom Dwyer 2023-03-01 13:52:44 +01:00
parent e19ce98407
commit 2a8731dd90
No known key found for this signature in database
GPG Key ID: E4C40DBD9157879A
1 changed files with 157 additions and 2 deletions

View File

@ -1,12 +1,159 @@
//! IOx Ingester V2 implementation.
//!
//! # Overview
//!
//! The purpose of the ingester is to receive RPC write requests, and batch them
//! up until ready to be persisted. This buffered data must be durable (tolerant
//! of a crash) and queryable via the Flight RPC query interface. Data is
//! organised in a hierarchical structure (the [`BufferTree`]) to allow
//! table-scoped queries, and partition-scoped persistence / parquet file
//! generation.
//!
//! All buffered data is periodically persisted via a WAL file rotation, or
//! selectively when a single partition has grown too large ("hot partition
//! persistence").
//!
//! ```text
//!
//! ┌──────────────┐ ┌──────────────┐
//! │ RPC Write │ │ RPC Query │
//! └──────────────┘ └──────────────┘
//! │ ▲
//! ▼ │
//! ┌──────────────┐ │
//! │ WAL │ │
//! └──────────────┘ │
//! │ │
//! │ ┌──────────────┐ │
//! └──▶│ BufferTree │────┘
//! └──────────────┘
//! │
//! ▼
//! ┌──────────────┐
//! │ Persist │
//! └──────────────┘
//! │
//! ▼
//!
//! Object Storage
//!
//!
//!
//! (arrows show data flow)
//!```
//!
//! The write path is composed together of implementers of the [`DmlSink`]
//! abstraction, and likewise queries stream out of the [`QueryExec`]
//! abstraction.
//!
//!
//! ## Subsystems
//!
//! The Ingester is composed of multiple smaller, distinct subsystems that
//! communicate / work together to provide the full ingester functionality.
//!
//! Each subsystem has its own documentation further describing the behaviours
//! and problems it solves in more detail.
//!
//!
//! ### [`BufferTree`]
//!
//! Perhaps not a "system" as such, but a single instance of the [`BufferTree`]
//! is the central point of the ingester - all writes are buffered in the tree,
//! and all queries execute against it. Persist operations persist data in the
//! buffer tree, removing it once complete.
//!
//! All other systems either directly or indirectly operate on, or control
//! operations against the [`BufferTree`].
//!
//!
//! ### RPC Server
//!
//! External services communicate with the ingester via gRPC service calls. All
//! gRPC handlers can be found in the [`grpc`] modules.
//!
//! The two main RPC endpoints are:
//!
//! * Write: buffer new data in the ingester, issued by the router
//! * Query: return buffered data scoped by {namespace, table}, issued by the
//! queriers
//!
//! Both endpoints are latency sensitive, and the call durations are directly
//! observable by end users.
//!
//! Writes commit to the [`wal`] for durability / crash tolerance and are
//! buffered into the [`BufferTree`] in parallel. Queries execute against the
//! [`BufferTree`], lazily streaming the results back to the queriers (lock
//! acquisition and data copying is deferred until "pulled" by the querier).
//!
//! If the [`IngestState`] is marked as unhealthy/shutting down, then write
//! requests are rejected with a "resource exhausted" error message until it
//! becomes healthy again.
//!
//!
//! ### Persist System
//!
//! The persist system is responsible for durably writing data to object
//! storage; that includes compacting it (to remove duplicate/overwrote rows),
//! generating a parquet file, uploading it to the configured object store, and
//! inserting the necessary catalog state to make the new file queryable.
//!
//! Data is typically sourced from a [`BufferTree`], and once the new file is
//! queryable, the data it contains is removed from the [`BufferTree`].
//!
//! Code that uses the persist system does so through the [`PersistQueue`]
//! abstraction, implemented by the [`PersistHandle`].
//!
//! The persist system provides a logical queue of outstanding persist jobs, and
//! a configurable number of worker tasks to execute them - see
//! [`PersistHandle`] for detailed documentation. If the persist system is
//! "saturated" (queue depth reached maximum) then further writes are rejected
//! by setting the [`IngestState`] to [`IngestStateError::PersistSaturated`]
//! until the queue depth is reduced.
//!
//! The persist system is driven mainly by WAL rotation (below), and interacts
//! with the [`IngestState`] to provide back-pressure, indirectly controlling
//! the memory utilisation of the Ingester (to prevent OOMs). Partitions with
//! large volumes of writes (or otherwise problematic data) are prematurely
//! persisted independently of the WAL rotation ("hot partition persistence").
//!
//!
//! ### WAL
//!
//! The write-ahead log ([`wal`]) is used to durably record each operation
//! against the [`BufferTree`] into a replay log, with a partial order. This
//! happens synchronously in the hot write path, so WAL performance is a
//! critical consideration.
//!
//! Once a write has been buffered in the [`BufferTree`] AND flushed to disk in
//! the [`wal`], the request is ACKed to the user.
//!
//! If the ingester crashes / stops un-cleanly, then the WAL must be replayed to
//! rebuild the in-memory state (the [`BufferTree`]) to prevent data loss. This
//! has some quirks, discussed in "Write Reordering" below.
//!
//!
//! #### WAL Rotation
//!
//! The WAL file is periodically rotated at a configurable interval, which
//! triggers a full persistence of all buffered data in the ingester at the same
//! time. This "full persist" indirectly affects the ingester in many ways:
//!
//! * Limits the size of a single WAL file
//! * Limits the amount of buffered data, which in turn
//! * Limits the amount of WAL data to replay after a crash
//! * Limits the amount of data a querier must read & dedupe per query
//! * Limits the largest source of memory utilisation in the ingester
//! * Limits the amount of data in a partition that must be persisted
//!
//!
//! ## Write Reordering
//!
//! A write that enters an `ingester2` instance can be reordered arbitrarily
//! with concurrent write requests.
//!
//! For example, two gRPC writes can race to be committed to the WAL, and then
//! race again to be buffered into the [`BufferTree`]. Writes to a
//! For example, two gRPC writes can race to be committed to the [`wal`], and
//! then race again to be buffered into the [`BufferTree`]. Writes to a
//! [`BufferTree`] may arrive out-of-order w.r.t their assigned sequence
//! numbers.
//!
@ -29,6 +176,14 @@
//!
//! [`BufferTree`]: crate::buffer_tree::BufferTree
//! [`SequenceNumber`]: data_types::SequenceNumber
//! [`PersistQueue`]: crate::persist::queue::PersistQueue
//! [`PersistHandle`]: crate::persist::handle::PersistHandle
//! [`IngestState`]: crate::ingest_state::IngestState
//! [`grpc`]: crate::server::grpc
//! [`DmlSink`]: crate::dml_sink::DmlSink
//! [`QueryExec`]: crate::query::QueryExec
//! [`IngestStateError::PersistSaturated`]:
//! crate::ingest_state::IngestStateError
#![allow(dead_code)] // Until ingester2 is used.
#![deny(rustdoc::broken_intra_doc_links, rust_2018_idioms)]