# Ingest

A user expects to be able to write data to, and delete data from, an IOx database with an API call to an IOx server. Critically, the user expects:

* An API response in a matter of seconds or less
* A successful API call to eventually be visible to query

This document aims to provide a high-level overview of how a number of related, but independent, subsystems come together to achieve this functionality within IOx:

* _In-Memory Catalog_ - metadata about the data contained within this server
* Optional _Preserved Catalog_ - preserved metadata used to rebuild the in-memory catalog and start ingestion on restart
* Optional _Write Buffer_ - a write-ahead log for modifications that haven't reached the _Preserved Catalog_
* _Lifecycle Policy_ - determines when to persist data for a partition
* _Persistence Windows_ - keeps track of writes to unpersisted chunks in the in-memory catalog
* _Delete Mailbox_ - keeps track of unpersisted deletes applied to persisted chunks in the in-memory catalog

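As a rough mental model, these subsystems might be composed along the following lines. This is a minimal sketch with illustrative names (`Db`, `InMemoryCatalog`, etc. are assumptions for this document, not IOx's actual types):

```rust
use std::collections::BTreeMap;

// Illustrative stand-ins for the subsystems; these are assumptions,
// not IOx's real types.
pub struct InMemoryCatalog;
pub struct WriteBuffer;
pub struct PersistenceWindows;
pub struct DeletePredicate;

/// A database instance wires the subsystems together. The Write Buffer
/// (and, with it, the Preserved Catalog) is optional; the rest always exist.
pub struct Db {
    /// Metadata about the data contained within this server.
    pub catalog: InMemoryCatalog,
    /// Write-ahead log for modifications not yet in the Preserved Catalog.
    pub write_buffer: Option<WriteBuffer>,
    /// Per-partition tracking of writes to unpersisted chunks.
    pub persistence_windows: BTreeMap<String, PersistenceWindows>,
    /// Unpersisted deletes that apply to already-persisted chunks.
    pub delete_mailbox: Vec<DeletePredicate>,
}

impl Db {
    /// A server with no Write Buffer applies writes directly in memory.
    pub fn new(write_buffer: Option<WriteBuffer>) -> Self {
        Self {
            catalog: InMemoryCatalog,
            write_buffer,
            persistence_windows: BTreeMap::new(),
            delete_mailbox: Vec::new(),
        }
    }
}
```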
## 1. Produce to Write Buffer

When an API call arrives, it is synchronously written to the _Write Buffer_, if any. Only once this succeeds is success returned to the client.

The _Write Buffer_ must therefore:

* Respond in a matter of seconds or less
* Provide sufficient durability for the given application
* Establish an ordering of any potentially conflicting operations, to ensure consistent recovery (and replication)

*If there is no _Write Buffer_ configured, writes are written directly to the in-memory data structures as described in the section below.*

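The produce path can be sketched as follows. This toy version keeps the log in memory purely for illustration; a real _Write Buffer_ (e.g. Kafka-backed) would replicate to durable storage before acknowledging, and the method name `produce` is an assumption:

```rust
use std::sync::Mutex;

/// Toy Write Buffer: appends each operation to a log and assigns it a
/// sequence number that totally orders potentially conflicting
/// operations. In-memory only here; durability is the whole point of
/// a real implementation.
pub struct WriteBuffer {
    log: Mutex<Vec<String>>,
}

impl WriteBuffer {
    pub fn new() -> Self {
        Self { log: Mutex::new(Vec::new()) }
    }

    /// Synchronously append an operation. Success is only reported to
    /// the client once this returns; the returned sequence number
    /// establishes the ordering used for recovery and replication.
    pub fn produce(&self, operation: &str) -> u64 {
        let mut log = self.log.lock().unwrap();
        log.push(operation.to_string());
        (log.len() - 1) as u64
    }
}
```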
## 2. Consume from Write Buffer

Ingest servers consume from the _Write Buffer_.

As durability is provided by the _Write Buffer_, modifications are initially only applied to in-memory data structures:

* Writes are written to the _In-Memory Catalog_ and the _Persistence Windows_
* Deletes are written to the _In-Memory Catalog_ and the _Delete Mailbox_

It is assumed that the _Write Buffer_ is configured in such a way as to provide sufficient ordering to provide eventual consistency.

This ordering is preserved not only by this ingest process, but by all lifecycle actions. This means:

* The order of chunks in a partition preserves any ordering established by the _Write Buffer_
* Query, persisted compaction, etc. only care about chunk ordering, and are unaware of the _Write Buffer_
* **The _Write Buffer_ is an implementation detail of ingest**

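The consume step above can be sketched as applying each sequenced operation to in-memory state only. The names (`IngestState`, `Op`) and the flat representations are illustrative assumptions:

```rust
/// In-memory targets of ingest; durability comes from the Write Buffer.
#[derive(Default)]
pub struct IngestState {
    /// Stand-in for rows in the In-Memory Catalog.
    pub catalog_rows: Vec<String>,
    /// Stand-in for the Persistence Windows: highest unpersisted write.
    pub max_unpersisted_seq: Option<u64>,
    /// Stand-in for the Delete Mailbox.
    pub delete_mailbox: Vec<String>,
}

pub enum Op {
    Write(String),
    Delete(String),
}

impl IngestState {
    /// Apply one operation consumed from the Write Buffer. Operations
    /// are applied in sequence order, so chunk ordering preserves the
    /// ordering the buffer established.
    pub fn apply(&mut self, sequence: u64, op: Op) {
        match op {
            Op::Write(row) => {
                self.catalog_rows.push(row);
                self.max_unpersisted_seq = Some(sequence);
            }
            Op::Delete(predicate) => self.delete_mailbox.push(predicate),
        }
    }
}
```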
## 3. In-Memory Compaction

As more data is ingested, the [lifecycle](./data_organization_lifecycle.md) may re-arrange how the unpersisted data is stored within the _In-Memory Catalog_.

As the _Persistence Windows_ tracks the unpersisted data separately, compaction can occur independently of it.

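In-memory compaction only ever merges a contiguous run of chunks, leaving the result where the run started, so the partition's chunk ordering is preserved. A sketch, with chunks modelled as plain row vectors for illustration:

```rust
use std::ops::Range;

/// Merge a contiguous range of unpersisted chunks into a single chunk,
/// inserted where the range started so chunk ordering is preserved.
/// Chunks are modelled as plain vectors of rows for illustration.
pub fn compact_chunks(chunks: &mut Vec<Vec<i64>>, range: Range<usize>) {
    let start = range.start;
    let merged: Vec<i64> = chunks.drain(range).flatten().collect();
    chunks.insert(start, merged);
}
```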
## 4. Persist

If IOx is configured with an object store, and therefore has a _Preserved Catalog_, eventually the [lifecycle](./data_organization_lifecycle.md) will decide to make durable some data for a partition.

**Identify data to persist**

First the process obtains a `FlushHandle` for the partition from the _Persistence Windows_. This handle:

* Is a transaction on the _Persistence Windows_ that:
  * On commit marks the data as persisted
  * Prevents concurrent conflicting operations
* Identifies a range of data from the _In-Memory Catalog_ to persist
* Contains information about the unpersisted data outside this range

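The `FlushHandle` behaviour described above might look something like the following sketch. The field and method names here are assumptions, not the real API of the `persistence_windows` crate:

```rust
/// Illustrative Persistence Windows for one partition: tracks the
/// sequence numbers of writes that have not yet been persisted.
pub struct PersistenceWindows {
    pub unpersisted: Vec<u64>,
    flush_in_progress: bool,
}

/// A transaction against the Persistence Windows: identifies the range
/// of data to persist and blocks conflicting concurrent flushes.
pub struct FlushHandle {
    pub to_persist: Vec<u64>,
}

impl PersistenceWindows {
    pub fn new(unpersisted: Vec<u64>) -> Self {
        Self { unpersisted, flush_in_progress: false }
    }

    /// Begin a flush of all writes with sequence number <= `max_seq`.
    /// Returns `None` if a conflicting flush is already in progress.
    pub fn flush_handle(&mut self, max_seq: u64) -> Option<FlushHandle> {
        if self.flush_in_progress {
            return None;
        }
        self.flush_in_progress = true;
        let to_persist =
            self.unpersisted.iter().copied().filter(|&s| s <= max_seq).collect();
        Some(FlushHandle { to_persist })
    }

    /// Commit: mark the handle's range as persisted.
    pub fn commit(&mut self, handle: FlushHandle) {
        self.unpersisted.retain(|s| !handle.to_persist.contains(s));
        self.flush_in_progress = false;
    }
}
```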
**Split data to persist**

The chunks containing data to persist within the _In-Memory Catalog_ are identified, and marked to prevent modification by another lifecycle action - e.g. compaction.

Snapshots of these chunks are taken, and two new chunks are created: one containing the data to persist, and one containing what is left over.

These two new chunks are then atomically swapped into the _In-Memory Catalog_, replacing the chunks from which they were sourced.

Any delete predicates that landed on the source chunks after the snapshots were taken will be carried across to the two new chunks.

_Much like compaction, this must process a contiguous sequence of chunks within the partition to ensure ordering is preserved._

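The snapshot / split / swap sequence can be sketched as follows, with chunks modelled as vectors of row timestamps and the flush boundary as a timestamp. All names here are illustrative assumptions:

```rust
use std::ops::Range;

/// Snapshot a contiguous range of chunks, split the rows into "to
/// persist" (strictly before `flush_ts`) and the remainder, then swap
/// the two new chunks into place (atomic in the real, concurrent code;
/// trivially so in this single-threaded sketch).
pub fn split_for_persist(
    chunks: &mut Vec<Vec<i64>>, // each chunk is a vector of row timestamps
    range: Range<usize>,        // contiguous chunks being persisted
    flush_ts: i64,              // rows strictly before this are persisted
) -> Vec<i64> {
    let start = range.start;
    let snapshot: Vec<i64> = chunks.drain(range).flatten().collect();
    let (to_persist, remainder): (Vec<i64>, Vec<i64>) =
        snapshot.into_iter().partition(|&ts| ts < flush_ts);
    // The persisting chunk precedes the remainder, preserving ordering.
    chunks.insert(start, remainder);
    chunks.insert(start, to_persist.clone());
    to_persist
}
```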
**Write data to parquet**

The data to persist is now located in a single chunk in the _In-Memory Catalog_, and is written to a single parquet file in object storage.

**Preserved Catalog**

The parquet file is now written, but it needs to be recorded in the _Preserved Catalog_, along with the position reached in the _Write Buffer_. On restart this will enable IOx to skip over data that has already been persisted.

It would be perfectly correct to simply record the minimum unpersisted positions in the _Write Buffer_ and on restart resume playback from there. However, as we persist data for different partitions separately, and out-of-order within a partition, this would likely result in re-ingesting lots of already persisted data.

As such, a more [sophisticated approach](../persistence_windows/src/checkpoint.rs) is used that more accurately describes what remains unpersisted.

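As a sketch of the idea, loosely modelled on `checkpoint.rs` but with assumed field names: rather than a single minimum position, the checkpoint records per-sequencer positions describing exactly what remains unpersisted:

```rust
use std::collections::BTreeMap;

/// Illustrative checkpoint: for each Write Buffer sequencer, the
/// minimum unpersisted and maximum seen sequence numbers. Everything
/// below the minimum is known persisted and can be skipped on replay.
#[derive(Debug, Clone, PartialEq)]
pub struct Checkpoint {
    /// sequencer id -> (min unpersisted, max seen) sequence numbers.
    pub sequencers: BTreeMap<u32, (u64, u64)>,
}

impl Checkpoint {
    /// Replay for a sequencer resumes at its own minimum unpersisted
    /// position, rather than a single global minimum, avoiding
    /// re-ingest of data already persisted out-of-order elsewhere.
    pub fn replay_start(&self, sequencer: u32) -> Option<u64> {
        self.sequencers.get(&sequencer).map(|&(min, _max)| min)
    }
}
```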
As deletes are relatively small we opt to simply flush the entire _Delete Mailbox_ for all partitions and tables on persist. This means we only need to be concerned with recording writes that remain unpersisted.

As deletes are relatively small, we opt to simply flush the entire _Delete Mailbox_ for all partitions and tables on persist. This means we only need to be concerned with recording writes that remain unpersisted.

The final _Preserved Catalog_ transaction therefore contains:

* The newly created parquet file
* All unpersisted deletes
* Checkpoint information about the _Write Buffer_ position

**Commit**

If the _Preserved Catalog_ transaction is successful we can then:

* Commit the `FlushHandle`
* Mark the chunk as persisted in the _In-Memory Catalog_

_There is a background worker that cleans up parquet files unreferenced by the Preserved Catalog. As such, this process holds a shared "cleanup lock" to prevent this task from running and deleting the parquet file before it has been inserted into the Preserved Catalog._

## 5. Compact Persisted

As queries, deletes and writes land, the [lifecycle](./data_organization_lifecycle.md) may decide to compact chunks that are persisted.

As this data is already durable, this occurs largely independently of the ingest flow - no new data is being persisted.

However, there are some subtleties worth mentioning:

* Like all other lifecycle actions, this must compact a contiguous sequence of chunks to preserve ordering within a partition
* The new chunk should preserve the maximum checkpoint information from the source chunks <sup>1.</sup>
* Deletes that land in the _Delete Mailbox_ whilst this operation is in-flight also need to be enqueued for the future compacted chunk

_<sup>1.</sup> Checkpoint information is only associated with chunks as a mechanism to recover from loss of the Preserved Catalog; conceptually the checkpoint is really a property of the partition._

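Preserving the "maximum checkpoint information" when compacting persisted chunks might look like this sketch, where a checkpoint is reduced to per-sequencer positions (an assumed representation, not the real one):

```rust
use std::collections::BTreeMap;

/// When persisted chunks are compacted, the new chunk carries, per
/// sequencer, the greatest checkpoint position of any source chunk,
/// so no recovery information is lost. Representation is illustrative.
pub fn merge_checkpoints(
    a: &BTreeMap<u32, u64>,
    b: &BTreeMap<u32, u64>,
) -> BTreeMap<u32, u64> {
    let mut merged = a.clone();
    for (&sequencer, &position) in b {
        let entry = merged.entry(sequencer).or_insert(position);
        if position > *entry {
            *entry = position;
        }
    }
    merged
}
```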
## 6. Unload/Drop Data

Eventually memory pressure will trigger the [lifecycle](./data_organization_lifecycle.md) to free up memory.

If persistence is enabled, it will do this by unloading persisted chunks from the _In-Memory Catalog_. Following this the chunks will still exist, and their data can be retrieved from the persisted parquet files.

If persistence is disabled, it will drop the chunks from the _In-Memory Catalog_. Following this the chunks will no longer exist, and will not be queryable.