From 479276b8cd1b429cc2c7fd34928f5fd1d04fdfbc Mon Sep 17 00:00:00 2001
From: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Date: Fri, 7 Jan 2022 17:37:30 +0000
Subject: [PATCH] docs: document ingest flow (#3418)
* docs: document ingest flow
* chore: review feedback
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
---
docs/ingest.md | 127 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)
create mode 100644 docs/ingest.md
diff --git a/docs/ingest.md b/docs/ingest.md
new file mode 100644
index 0000000000..ebfc1d98a4
--- /dev/null
+++ b/docs/ingest.md
@@ -0,0 +1,127 @@
+# Ingest
+
+A user expects to be able to write and delete data in an IOx database with an API call to an IOx server. Critically, the user expects:
+
+* An API response in a matter of seconds or less
+* A successful API call to eventually be visible to query
+
+This document aims to provide a high-level overview of how a number of related, but independent, subsystems come together to achieve this functionality within IOx:
+
+* _In-Memory Catalog_ - metadata about the data contained within this server
+* Optional _Preserved Catalog_ - preserved metadata used to rebuild the in-memory catalog and start ingestion on restart
+* Optional _Write Buffer_ - a write-ahead log for modifications that haven't reached the Preserved Catalog
+* _Lifecycle Policy_ - determines when to persist data for a partition
+* _Persistence Windows_ - keeps track of writes to unpersisted chunks in the in-memory catalog
+* _Delete Mailbox_ - keeps track of unpersisted deletes applied to persisted chunks in the in-memory catalog
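
The subsystems above can be pictured as fields hanging off one server. The following is a minimal sketch; all type and field names here are illustrative stand-ins, not the actual IOx types:

```rust
// Illustrative stand-ins for the subsystems; the real IOx types differ.
struct InMemoryCatalog;
struct PreservedCatalog;
struct WriteBuffer;
struct LifecyclePolicy;
struct PersistenceWindows;
struct DeleteMailbox;

// Hypothetical sketch of how the subsystems fit together on one server.
struct IngestServer {
    in_memory_catalog: InMemoryCatalog,          // metadata about data on this server
    preserved_catalog: Option<PreservedCatalog>, // only when an object store is configured
    write_buffer: Option<WriteBuffer>,           // optional write-ahead log
    lifecycle_policy: LifecyclePolicy,           // decides when to persist a partition
    persistence_windows: PersistenceWindows,     // tracks unpersisted writes
    delete_mailbox: DeleteMailbox,               // tracks unpersisted deletes
}

fn main() {
    // A server without a Write Buffer or Preserved Catalog is still valid:
    let server = IngestServer {
        in_memory_catalog: InMemoryCatalog,
        preserved_catalog: None,
        write_buffer: None,
        lifecycle_policy: LifecyclePolicy,
        persistence_windows: PersistenceWindows,
        delete_mailbox: DeleteMailbox,
    };
    println!("write buffer configured: {}", server.write_buffer.is_some());
}
```

Note that only the _Preserved Catalog_ and _Write Buffer_ are optional; the remaining subsystems are always present.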
+
+## 1. Produce to Write Buffer
+
+When an API call arrives, the operation is synchronously written to the _Write Buffer_, if one is configured. If this write succeeds, success is returned to the client.
+
+The _Write Buffer_ must therefore:
+
+* Respond in a matter of seconds or less
+* Provide sufficient durability for the given application
+* Establish an ordering of any potentially conflicting operations, to ensure consistent recovery (and replication)
+
+_If there is no Write Buffer configured, writes are applied directly to the in-memory data structures as described in the section below._
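
The ordering requirement can be sketched as follows. This is a minimal in-memory stand-in, not the actual IOx API: every produced operation receives a monotonically increasing sequence number, giving conflicting operations a total order to recover (and replicate) against:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Operation {
    Write(String),
    Delete(String),
}

// Illustrative stand-in for the Write Buffer, showing only the ordering
// property; durability is out of scope for this sketch.
struct WriteBuffer {
    next_sequence: u64,
    log: Vec<(u64, Operation)>,
}

impl WriteBuffer {
    fn new() -> Self {
        Self { next_sequence: 0, log: Vec::new() }
    }

    /// Append an operation, returning its assigned sequence number.
    fn produce(&mut self, op: Operation) -> u64 {
        let sequence = self.next_sequence;
        self.next_sequence += 1;
        self.log.push((sequence, op));
        sequence
    }
}

fn main() {
    let mut buffer = WriteBuffer::new();
    let write = buffer.produce(Operation::Write("cpu,host=a usage=1".into()));
    let delete = buffer.produce(Operation::Delete("host=a".into()));
    // the delete is ordered after the write it conflicts with
    assert!(write < delete);
}
```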
+
+## 2. Consume from Write Buffer
+
+Ingest servers consume from the _Write Buffer_.
+
+As durability is provided by the _Write Buffer_, modifications are initially only applied to in-memory data structures:
+
+* Writes are written to the _In-Memory Catalog_ and the _Persistence Windows_
+* Deletes are written to the _In-Memory Catalog_ and the _Delete Mailbox_
+
+It is assumed that the _Write Buffer_ is configured in such a way as to provide sufficient ordering to achieve eventual consistency.
+
+This ordering is not only preserved by this ingest process, but by all lifecycle actions. This means:
+
+* The order of chunks in a partition preserves any ordering established by the _Write Buffer_
+* Query, persisted compaction, etc. only care about chunk ordering, and are unaware of the _Write Buffer_
+* **The _Write Buffer_ is an implementation detail of ingest**
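
The consume step above can be sketched roughly as follows, with illustrative stand-ins for the in-memory structures (not the actual IOx types): writes land in the catalog and are tracked by the persistence windows, while deletes are queued in the delete mailbox.

```rust
// Hedged sketch of step 2: operations consumed from the Write Buffer are
// applied to in-memory state only; durability already came from the buffer.
enum Operation {
    Write { line: String },
    Delete { predicate: String },
}

#[derive(Default)]
struct IngestState {
    catalog_writes: Vec<String>, // stand-in for the In-Memory Catalog
    window_max_seq: Option<u64>, // stand-in for the Persistence Windows
    delete_mailbox: Vec<String>, // stand-in for the Delete Mailbox
}

fn apply(state: &mut IngestState, sequence: u64, op: Operation) {
    match op {
        Operation::Write { line } => {
            // writes go to the catalog and are tracked as unpersisted
            state.catalog_writes.push(line);
            state.window_max_seq = Some(sequence);
        }
        Operation::Delete { predicate } => {
            // deletes go to the catalog and are queued for persistence
            state.delete_mailbox.push(predicate);
        }
    }
}

fn main() {
    let mut state = IngestState::default();
    apply(&mut state, 0, Operation::Write { line: "cpu usage=1".into() });
    apply(&mut state, 1, Operation::Delete { predicate: "host=a".into() });
    assert_eq!(state.window_max_seq, Some(0));
    assert_eq!(state.delete_mailbox.len(), 1);
}
```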
+
+## 3. In-Memory Compaction
+
+As more data is ingested, the [lifecycle](./data_organization_lifecycle.md) may re-arrange how the unpersisted data is stored within the _In-Memory Catalog_.
+
+As the _Persistence Windows_ tracks the unpersisted data separately, compaction can occur independently of it.
+
+## 4. Persist
+
+If IOx is configured with an object store, and therefore has a _Preserved Catalog_, eventually the [lifecycle](./data_organization_lifecycle.md) will decide to make durable some data for a partition.
+
+**Identify data to persist**
+
+First the process obtains a `FlushHandle` for the partition from the _Persistence Windows_. This handle:
+
+* Is a transaction on the _Persistence Windows_ that:
+ * On commit marks the data as persisted
+ * Prevents concurrent conflicting operations
+* Identifies a range of data from the _In-Memory Catalog_ to persist
+* Contains information about the unpersisted data outside this range
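
The `FlushHandle` contract can be sketched as follows. This is an assumption-laden simplification (the real API lives in the persistence windows code and differs): the handle pins a range of sequence numbers to persist, blocks conflicting flushes while held, and marks the range persisted on commit.

```rust
// Illustrative sketch of the FlushHandle transaction, not the actual IOx API.
struct PersistenceWindows {
    persisted_up_to: Option<u64>, // highest sequence number known persisted
    flush_in_progress: bool,
}

struct FlushHandle {
    max_sequence: u64, // upper bound of the range being persisted
}

impl PersistenceWindows {
    fn new() -> Self {
        Self { persisted_up_to: None, flush_in_progress: false }
    }

    /// Begin a flush; refuses a second handle while one is outstanding.
    fn flush_handle(&mut self, max_sequence: u64) -> Option<FlushHandle> {
        if self.flush_in_progress {
            return None; // prevents concurrent conflicting operations
        }
        self.flush_in_progress = true;
        Some(FlushHandle { max_sequence })
    }

    /// Commit the transaction: the range is now marked persisted.
    fn commit(&mut self, handle: FlushHandle) {
        self.persisted_up_to = Some(handle.max_sequence);
        self.flush_in_progress = false;
    }
}

fn main() {
    let mut windows = PersistenceWindows::new();
    let handle = windows.flush_handle(41).expect("no flush in progress");
    assert!(windows.flush_handle(99).is_none()); // conflicting flush rejected
    windows.commit(handle);
    assert_eq!(windows.persisted_up_to, Some(41));
}
```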
+
+**Split data to persist**
+
+The chunks containing data to persist within the _In-Memory Catalog_ are identified, and marked to prevent modification by another lifecycle action - e.g. compaction.
+
+Snapshots of these chunks are taken, and two new chunks are created containing the data to persist, and what is left over.
+
+These two new chunks are then atomically swapped into the _In-Memory Catalog_ replacing the chunks from which they were sourced.
+
+Any delete predicates that landed on the source chunks after the snapshots were taken will be carried across to the two new chunks.
+
+_Much like compaction, this must process a contiguous sequence of chunks within the partition to ensure ordering is preserved_
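
The split itself can be sketched as a partition of a chunk snapshot into a "to persist" half and a "remainder" half. Here rows are modelled as `(timestamp, value)` pairs purely for illustration; the real split operates on full row groups:

```rust
// Hedged sketch of the split step: rows inside the flush range form the chunk
// to persist, the rest form the remainder chunk that is swapped in alongside.
fn split_for_persist(
    rows: Vec<(i64, f64)>,
    persist_before: i64,
) -> (Vec<(i64, f64)>, Vec<(i64, f64)>) {
    // partition preserves the relative order of rows in both halves
    rows.into_iter().partition(|(time, _)| *time < persist_before)
}

fn main() {
    let snapshot = vec![(10, 1.0), (20, 2.0), (30, 3.0)];
    let (to_persist, remainder) = split_for_persist(snapshot, 25);
    assert_eq!(to_persist, vec![(10, 1.0), (20, 2.0)]);
    assert_eq!(remainder, vec![(30, 3.0)]);
}
```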
+
+**Write data to parquet**
+
+The data to persist is now located in a single chunk in the _In-Memory Catalog_, and is written to a single parquet file in object storage.
+
+**Preserved Catalog**
+
+The parquet file is now written, but it needs to be recorded in the _Preserved Catalog_, along with the position reached in the _Write Buffer_. On restart this will enable IOx to skip over data that has already been persisted.
+
+It would be perfectly correct to simply record the minimum unpersisted position in the _Write Buffer_ and, on restart, resume playback from there. However, as we persist data for different partitions separately, and out-of-order within a partition, this would likely result in re-ingesting lots of already persisted data.
+
+As such, a more [sophisticated approach](../persistence_windows/src/checkpoint.rs) is used that more accurately describes what remains unpersisted.
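
The intuition behind that richer checkpoint can be sketched as follows. This is a deliberate simplification (the real types in `checkpoint.rs` carry more detail): per sequencer, record both the minimum unpersisted and maximum seen positions, so replay can skip anything already persisted rather than starting from a single global minimum.

```rust
use std::collections::BTreeMap;

// Simplified sketch of per-sequencer checkpoint information; illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SequencerProgress {
    min_unpersisted: u64,
    max_seen: u64,
}

fn needs_replay(checkpoint: &BTreeMap<u32, SequencerProgress>, sequencer: u32, seq: u64) -> bool {
    match checkpoint.get(&sequencer) {
        // anything at or beyond the minimum unpersisted position must be replayed
        Some(progress) => seq >= progress.min_unpersisted,
        // unknown sequencer: replay everything to be safe
        None => true,
    }
}

fn main() {
    let mut checkpoint = BTreeMap::new();
    checkpoint.insert(0, SequencerProgress { min_unpersisted: 100, max_seen: 150 });
    assert!(!needs_replay(&checkpoint, 0, 99)); // already persisted: skip
    assert!(needs_replay(&checkpoint, 0, 120)); // unpersisted: replay
    assert!(needs_replay(&checkpoint, 1, 0));   // unknown sequencer: replay
}
```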
+
+As deletes are relatively small, we opt to simply flush the entire _Delete Mailbox_ for all partitions and tables on persist. This means we only need to be concerned with recording writes that remain unpersisted.
+
+The final _Preserved Catalog_ transaction therefore contains:
+
+* The newly created parquet file
+* All unpersisted deletes
+* Checkpoint information about the _Write Buffer_ position
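
As a hedged sketch, the transaction payload might be modelled like this (all field names and the path format are illustrative, not the actual IOx types):

```rust
// Illustrative model of a Preserved Catalog transaction's contents.
struct CatalogTransaction {
    new_parquet_file: String,                  // object store path of the file just written
    flushed_deletes: Vec<String>,              // entire Delete Mailbox contents
    write_buffer_checkpoint: Vec<(u32, u64)>,  // per-sequencer position information
}

fn main() {
    let txn = CatalogTransaction {
        new_parquet_file: "db/table/partition/chunk.parquet".to_string(),
        flushed_deletes: vec!["host=a".to_string()],
        write_buffer_checkpoint: vec![(0, 42)],
    };
    // the transaction commits all three pieces atomically
    assert!(!txn.flushed_deletes.is_empty());
    println!("committing {}", txn.new_parquet_file);
}
```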
+
+**Commit**
+
+If the _Preserved Catalog_ transaction is successful we can then:
+
+* Commit the `FlushHandle`
+* Mark the chunk as persisted in the _In-Memory Catalog_
+
+_There is a background worker that cleans up parquet files unreferenced by the Preserved Catalog. As such this process holds a shared "cleanup lock" to prevent this task from running and deleting the parquet file before it has been inserted into the Preserved Catalog._
+
+## 5. Compact Persisted
+
+As queries, deletes and writes land the [lifecycle](./data_organization_lifecycle.md) may decide to compact chunks that are persisted.
+
+As this data is already durable, this occurs largely independently of the ingest flow - no new data is being persisted.
+
+However, there are some subtleties worth mentioning:
+
+* Like all other lifecycle actions, this must compact a contiguous sequence of chunks to preserve ordering within a partition
+* The new chunk should preserve the maximum checkpoint information from the source chunks¹
+* Deletes that land in the _Delete Mailbox_ whilst this operation is in-flight need to be also enqueued for the future compacted chunk
+
+_¹ Checkpoint information is only associated with chunks as a mechanism to recover from loss of the Preserved Catalog; conceptually the checkpoint is really a property of the partition_
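
The "preserve the maximum checkpoint information" rule can be sketched as a merge that keeps, per sequencer, the highest position recorded by any source chunk. The map-of-positions representation here is an assumption for illustration:

```rust
use std::collections::BTreeMap;

// Illustrative checkpoint-merge rule for persisted compaction: the compacted
// chunk carries the most advanced position seen by any of its source chunks.
fn merge_checkpoints(sources: &[BTreeMap<u32, u64>]) -> BTreeMap<u32, u64> {
    let mut merged = BTreeMap::new();
    for checkpoint in sources {
        for (&sequencer, &position) in checkpoint {
            let entry = merged.entry(sequencer).or_insert(position);
            *entry = (*entry).max(position);
        }
    }
    merged
}

fn main() {
    let a = BTreeMap::from([(0u32, 10u64), (1, 5)]);
    let b = BTreeMap::from([(0, 7), (1, 9)]);
    let merged = merge_checkpoints(&[a, b]);
    assert_eq!(merged[&0], 10);
    assert_eq!(merged[&1], 9);
}
```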
+
+## 6. Unload/Drop Data
+
+Eventually memory pressure will trigger the [lifecycle](./data_organization_lifecycle.md) to free up memory.
+
+If persistence is enabled, it will do this by unloading persisted chunks from the _In-Memory Catalog_. Following this the chunks will still exist, and their data can be retrieved from the persisted parquet files.
+
+If persistence is disabled, it will drop the chunks from the _In-Memory Catalog_. Following this the chunks will no longer exist, and will not be queryable.