docs: document ingest flow (#3418)
* docs: document ingest flow * chore: review feedback Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>pull/24376/head
parent
80c048528c
commit
479276b8cd
|
|
@ -0,0 +1,127 @@
|
|||
# Ingest
|
||||
|
||||
A user expects to be able to write/delete data from an IOx database with an API call to an IOx server. Critically the user expects:
|
||||
|
||||
* An API response in a matter of seconds or less
|
||||
* A successful API call to eventually be visible to query
|
||||
|
||||
This document aims to provide a high-level overview of how a number of related, but independent, subsystems come together to achieve this functionality within IOx:
|
||||
|
||||
* _In-Memory Catalog_ - metadata about the data contained within this server
|
||||
* Optional _Preserved Catalog_ - preserved metadata used to rebuild the in-memory catalog and start ingestion on restart
|
||||
* Optional _Write Buffer_ - a write-ahead log for modifications that haven't reached the Preserved Catalog
|
||||
* _Lifecycle Policy_ - determines when to persist data for a partition
|
||||
* _Persistence Windows_ - keeps track of writes to unpersisted chunks in the in-memory catalog
|
||||
* _Delete Mailbox_ - keeps track of unpersisted deletes applied to persisted chunks in the in-memory catalog
|
||||
|
||||
## 1. Produce to Write Buffer
|
||||
|
||||
When an API call arrives it is synchronously written to the _Write Buffer_, if any. If successful, success is returned to the client.
|
||||
|
||||
The _Write Buffer_ must therefore:
|
||||
|
||||
* Respond in a matter of seconds or less
|
||||
* Provide sufficient durability for the given application
|
||||
* Establish an ordering of any potentially conflicting operations, to ensure consistent recovery (and replication)
|
||||
|
||||
_If there is no _Write Buffer_ configured, writes are written directly to the in-memory data structures as described in the section below._
|
||||
|
||||
## 2. Consume from Write Buffer
|
||||
|
||||
Ingest servers consume from the _Write Buffer_.
|
||||
|
||||
As durability is provided by the _Write Buffer_, modifications are initially only applied to in-memory data structures:
|
||||
|
||||
* Writes are written to the _In-Memory Catalog_ and the _Persistence Windows_
|
||||
* Deletes are written to the _In-Memory Catalog_ and the _Delete Mailbox_
|
||||
|
||||
It is assumed that the _Write Buffer_ is configured in such a way as to provide sufficient ordering to provide eventual consistency.
|
||||
|
||||
This ordering is not only preserved by this ingest process, but by all lifecycle actions. This means:
|
||||
|
||||
* The order of chunks in a partition preserves any ordering established by the _Write Buffer_
|
||||
* Query, persisted compaction, etc... only care about chunk ordering, and are unaware of the _Write Buffer_
|
||||
* **The _Write Buffer_ is an implementation detail of ingest**
|
||||
|
||||
## 3. In-Memory Compaction
|
||||
|
||||
As more data is ingested, the [lifecycle](./data_organization_lifecycle.md) may re-arrange how the unpersisted data is stored within the _In-Memory Catalog_.
|
||||
|
||||
As the _Persistence Windows_ tracks the unpersisted data separately, compaction can occur independently of it
|
||||
|
||||
## 4. Persist
|
||||
|
||||
If IOx is configured with an object store, and therefore has a _Preserved Catalog_, eventually the [lifecycle](./data_organization_lifecycle.md) will decide to make durable some data for a partition.
|
||||
|
||||
**Identify data to persist**
|
||||
|
||||
First the process obtains a `FlushHandle` for the partition from the _Persistence Windows_. This handle:
|
||||
|
||||
* Is a transaction on the _Persistence Windows_ that:
|
||||
* On commit marks the data as persisted
|
||||
* Prevents concurrent conflicting operations
|
||||
* Identifies a range of data from the _In-Memory Catalog_ to persist
|
||||
* Contains information about the unpersisted data outside this range
|
||||
|
||||
**Split data to persist**
|
||||
|
||||
The chunks containing data to persist within the _In-Memory Catalog_ are identified, and marked to prevent modification by another lifecycle action - e.g. compaction.
|
||||
|
||||
Snapshots of these chunks are taken, and two new chunks are created containing the data to persist, and what is left over.
|
||||
|
||||
These two new chunks are then atomically swapped into the _In-Memory Catalog_ replacing the chunks from which they were sourced.
|
||||
|
||||
Any delete predicates that landed on the source chunks after the snapshots were taken, will be carried across to the two new chunks.
|
||||
|
||||
_Much like compaction, this must process a contiguous sequence of chunks within the partition to ensure ordering is preserved_
|
||||
|
||||
**Write data to parquet**
|
||||
|
||||
The data to persist is now located in a single chunk in the _In-Memory Catalog_, and is written to a single parquet file in object storage.
|
||||
|
||||
**Preserved Catalog**
|
||||
|
||||
The parquet file is now written, but it needs to be recorded in the _Preserved Catalog_, along with the position reached in the _Write Buffer_. On restart this will enable IOx to skip over data that has already been persisted.
|
||||
|
||||
It would be perfectly correct to simply record the min unpersisted positions in the _Write Buffer_ and on restart resume playback from there. However, as we persist data for different partitions separately, and out-of-order within a partition, this would likely result in re-ingesting lots of already persisted data.
|
||||
|
||||
As such a more [sophisticated approach](../persistence_windows/src/checkpoint.rs) is used that more accurately describes what remains unpersisted.
|
||||
|
||||
As deletes are relatively small we opt to simply flush the entire _Delete Mailbox_ for all partitions and tables on persist. This means we only need to be concerned with recording writes that remain unpersisted.
|
||||
|
||||
The final _Preserved Catalog_ transaction therefore contains:
|
||||
|
||||
* The newly created parquet file
|
||||
* All unpersisted deletes
|
||||
* Checkpoint information about the _Write Buffer_ position
|
||||
|
||||
**Commit**
|
||||
|
||||
If the _Preserved Catalog_ transaction is successful we can then:
|
||||
|
||||
* Commit the `FlushHandle`
|
||||
* Mark the chunk as persisted in the _In-Memory Catalog_
|
||||
|
||||
_There is a background worker that cleans up parquet files unreferenced by the Preserved Catalog. As such this process holds a shared "cleanup lock" to prevent this task from running and deleting the parquet file before it has been inserted into the Preserved Catalog._
|
||||
|
||||
## 5. Compact Persisted
|
||||
|
||||
As queries, deletes and writes land the [lifecycle](./data_organization_lifecycle.md) may decide to compact chunks that are persisted.
|
||||
|
||||
As this data is already durable, this occurs largely independently of the ingest flow - no new data is being persisted.
|
||||
|
||||
However, there are some subtleties worth mentioning:
|
||||
|
||||
* Like all other lifecycle actions, this must compact a contiguous sequence of chunks to preserve ordering within a partition
|
||||
* The new chunk should preserve the maximum checkpoint information from the source chunks <sup>1.</sup>
|
||||
* Deletes that land in the _Delete Mailbox_ whilst this operation is in-flight need to be also enqueued for the future compacted chunk
|
||||
|
||||
_<sup>1.</sup> Checkpoint information is only associated with chunks as a mechanism to recover from loss of the Preserved Catalog, conceptually the checkpoint is really a property of the partition_
|
||||
|
||||
## 6. Unload/Drop Data
|
||||
|
||||
Eventually memory pressure will trigger the [lifecycle](./data_organization_lifecycle.md) to free up memory.
|
||||
|
||||
If persistence is enabled, it will do this by unloading persisted chunks from the _In-Memory Catalog_. Following this the chunks will still exist, and their data can be retrieved from the persisted parquet files
|
||||
|
||||
If persistence is disabled, it will drop the chunks from the _In-Memory Catalog_. Following this the chunks will no longer exist, and will not be queryable
|
||||
Loading…
Reference in New Issue