# Ingest
A user expects to be able to write and delete data in an IOx database with an API call to an IOx server. Critically, the user expects:
- An API response in a matter of seconds or less
- A successful API call to eventually be visible to query
This document aims to provide a high-level overview of how a number of related, but independent, subsystems come together to achieve this functionality within IOx (a rough sketch of how they might fit together follows the list):
- In-Memory Catalog - metadata about the data contained within this server
- Optional Preserved Catalog - preserved metadata used to rebuild the in-memory catalog and start ingestion on restart
- Optional Write Buffer - a write-ahead log for modifications that haven't reached the Preserved Catalog
- Lifecycle Policy - determines when to persist data for a partition
- Persistence Windows - keeps track of writes to unpersisted chunks in the in-memory catalog
- Delete Mailbox - keeps track of unpersisted deletes applied to persisted chunks in the in-memory catalog
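The sketch below uses hypothetical type names (not the actual IOx structs) to illustrate how these subsystems might be composed on a single server, and which of them are optional.

```rust
// Hypothetical type names; the real IOx types differ. This only illustrates
// which subsystems exist and which of them are optional.

struct InMemoryCatalog;    // metadata about the data on this server
struct PreservedCatalog;   // durable catalog stored in object storage
struct WriteBuffer;        // write-ahead log for not-yet-persisted operations
struct LifecyclePolicy;    // decides when to compact and persist
struct PersistenceWindows; // tracks unpersisted writes per partition
struct DeleteMailbox;      // tracks unpersisted deletes against persisted chunks

#[allow(dead_code)]
struct IngestSubsystems {
    in_memory_catalog: InMemoryCatalog,
    preserved_catalog: Option<PreservedCatalog>, // only with an object store
    write_buffer: Option<WriteBuffer>,           // optional durability layer
    lifecycle: LifecyclePolicy,
    persistence_windows: PersistenceWindows,
    delete_mailbox: DeleteMailbox,
}
```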
## 1. Produce to Write Buffer
When an API call arrives it is synchronously written to the Write Buffer, if one is configured. If this succeeds, success is returned to the client.
The Write Buffer must therefore:
- Respond in a matter of seconds or less
- Provide sufficient durability for the given application
- Establish an ordering of any potentially conflicting operations, to ensure consistent recovery (and replication)
If there is no Write Buffer configured, writes are written directly to the in-memory data structures as described in the section below.
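As an illustration, the write path might look like the sketch below; the `WriteBufferWriting` trait, error type and function names are hypothetical stand-ins rather than the actual IOx API.

```rust
// Sketch only: the trait, error type and flow are illustrative, not the IOx API.

struct SequenceNumber(u64);

#[derive(Debug)]
struct WriteBufferError(String);

trait WriteBufferWriting {
    /// Durably store the encoded operation and return its assigned position.
    fn store_operation(&self, op: &[u8]) -> Result<SequenceNumber, WriteBufferError>;
}

/// Handle an incoming write API call.
fn handle_write(
    write_buffer: Option<&dyn WriteBufferWriting>,
    op: &[u8],
) -> Result<(), WriteBufferError> {
    match write_buffer {
        // Durability and ordering come from the Write Buffer; acknowledge the
        // client as soon as the buffer has stored the operation.
        Some(buffer) => buffer.store_operation(op).map(|_sequence| ()),
        // No Write Buffer configured: apply directly to the in-memory
        // structures described in the next section.
        None => apply_directly(op),
    }
}

fn apply_directly(_op: &[u8]) -> Result<(), WriteBufferError> {
    // Placeholder for the direct ingest path (section 2).
    Ok(())
}
```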
## 2. Consume from Write Buffer
Ingest servers consume from the Write Buffer.
As durability is provided by the Write Buffer, modifications are initially only applied to in-memory data structures (see the sketch at the end of this section):
- Writes are written to the In-Memory Catalog and the Persistence Windows
- Deletes are written to the In-Memory Catalog and the Delete Mailbox
It is assumed that the Write Buffer is configured in such a way as to establish sufficient ordering to provide eventual consistency.
This ordering is not only preserved by this ingest process, but by all lifecycle actions. This means:
- The order of chunks in a partition preserves any ordering established by the Write Buffer
- Query, persisted compaction, etc. only care about chunk ordering, and are unaware of the Write Buffer
- The Write Buffer is an implementation detail of ingest
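To make this concrete, here is a rough sketch of such a consume loop; all the types and method names below are illustrative, not the actual IOx code.

```rust
// Sketch only: all types and methods are illustrative, not the IOx code.

enum DmlOperation {
    Write { table: String, line: String },
    Delete { table: String, predicate: String },
}

struct InMemoryCatalog;
struct PersistenceWindows;
struct DeleteMailbox;

impl InMemoryCatalog {
    fn apply(&mut self, _op: &DmlOperation) { /* update open chunks */ }
}
impl PersistenceWindows {
    fn record_write(&mut self, _sequence: u64) { /* track unpersisted range */ }
}
impl DeleteMailbox {
    fn enqueue(&mut self, _op: &DmlOperation) { /* flushed on next persist */ }
}

/// Apply operations in the order established by the Write Buffer, touching
/// only in-memory state.
fn consume(
    ops: impl IntoIterator<Item = (u64, DmlOperation)>, // (sequence, operation)
    catalog: &mut InMemoryCatalog,
    windows: &mut PersistenceWindows,
    mailbox: &mut DeleteMailbox,
) {
    for (sequence, op) in ops {
        match &op {
            // Writes land in the catalog and are tracked as unpersisted.
            DmlOperation::Write { .. } => {
                catalog.apply(&op);
                windows.record_write(sequence);
            }
            // Deletes are applied to the catalog and queued for the next
            // Preserved Catalog transaction.
            DmlOperation::Delete { .. } => {
                catalog.apply(&op);
                mailbox.enqueue(&op);
            }
        }
    }
}
```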
## 3. In-Memory Compaction
As more data is ingested, the lifecycle may re-arrange how the unpersisted data is stored within the In-Memory Catalog.
As the Persistence Windows tracks the unpersisted data separately, compaction can occur independently of it.
## 4. Persist
If IOx is configured with an object store, and therefore has a Preserved Catalog, eventually the lifecycle will decide to make durable some data for a partition.
### Identify data to persist
First the process obtains a `FlushHandle` for the partition from the Persistence Windows. This handle (sketched below):
- Is a transaction on the Persistence Windows that:
  - On commit marks the data as persisted
  - Prevents concurrent conflicting operations
- Identifies a range of data from the In-Memory Catalog to persist
- Contains information about the unpersisted data outside this range
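A hedged sketch of what such a handle could hold is shown here; the fields and names are illustrative, not the actual `FlushHandle` type.

```rust
// Sketch only: names and fields are illustrative, not the actual FlushHandle.

struct SequenceNumber(u64);

struct FlushHandle {
    /// Partition and timestamp range identifying the data to persist.
    partition: String,
    timestamp_range: (i64, i64),
    /// Describes what remains unpersisted outside this range, so the
    /// Preserved Catalog checkpoint can be built when the flush commits.
    min_unpersisted: Option<SequenceNumber>,
}

impl FlushHandle {
    /// Committing marks the covered range as persisted in the Persistence
    /// Windows. While the handle is live, conflicting flushes for the same
    /// partition are prevented; dropping it without committing aborts.
    fn commit(self) { /* mark range persisted, release the reservation */ }
}
```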
### Split data to persist
The chunks containing data to persist within the In-Memory Catalog are identified, and marked to prevent modification by another lifecycle action - e.g. compaction.
Snapshots of these chunks are taken, and two new chunks are created: one containing the data to persist, and one containing what is left over.
These two new chunks are then atomically swapped into the In-Memory Catalog replacing the chunks from which they were sourced.
Any delete predicates that landed on the source chunks after the snapshots were taken will be carried across to the two new chunks.
Much like compaction, this must process a contiguous sequence of chunks within the partition to ensure ordering is preserved.
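The following sketch, with made-up types, illustrates the snapshot-and-split step under the assumption that the flush boundary is a timestamp; the real split logic is more involved.

```rust
// Sketch only, with made-up types: snapshot the source chunks and split
// their rows at the flush boundary into the data to persist and the rest.

#[derive(Clone)]
struct Chunk {
    rows: Vec<(i64, String)>, // (timestamp, row payload)
}

/// Split a contiguous run of chunks into (to_persist, remainder); the two
/// results are then atomically swapped into the catalog in place of the
/// sources, with any late-arriving delete predicates re-applied to both.
fn split_for_persist(sources: &[Chunk], flush_timestamp: i64) -> (Chunk, Chunk) {
    let mut to_persist = Chunk { rows: Vec::new() };
    let mut remainder = Chunk { rows: Vec::new() };
    for chunk in sources {
        // Operate on a snapshot of each source chunk.
        for row in chunk.rows.clone() {
            if row.0 <= flush_timestamp {
                to_persist.rows.push(row);
            } else {
                remainder.rows.push(row);
            }
        }
    }
    (to_persist, remainder)
}
```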
### Write data to parquet
The data to persist is now located in a single chunk in the In-Memory Catalog, and is written to a single parquet file in object storage.
### Preserved Catalog
The parquet file is now written, but it needs to be recorded in the Preserved Catalog, along with the position reached in the Write Buffer. On restart this will enable IOx to skip over data that has already been persisted.
It would be perfectly correct to simply record the minimum unpersisted positions in the Write Buffer and on restart resume playback from there. However, as we persist data for different partitions separately, and out-of-order within a partition, this would likely result in re-ingesting lots of already persisted data.
As such, a more sophisticated approach is used that more accurately describes what remains unpersisted.
As deletes are relatively small we opt to simply flush the entire Delete Mailbox for all partitions and tables on persist. This means we only need to be concerned with recording writes that remain unpersisted.
The final Preserved Catalog transaction therefore contains (sketched below):
- The newly created parquet file
- All unpersisted deletes
- Checkpoint information about the Write Buffer position
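A rough sketch of such a transaction payload follows; the field names, and the shape of the checkpoint, are assumptions for illustration only.

```rust
// Sketch only: field names and the checkpoint shape are assumptions.

struct ParquetFileInfo {
    object_store_path: String,
    row_count: u64,
}

struct DeletePredicate {
    table: String,
    predicate: String,
}

/// Per-sequencer checkpoint: positions below `min_unpersisted` are known to
/// be persisted; `max_seen` is the highest position consumed so far.
struct SequencerCheckpoint {
    sequencer_id: u32,
    min_unpersisted: u64,
    max_seen: u64,
}

/// The contents committed to the Preserved Catalog in a single transaction.
struct CatalogTransaction {
    new_parquet_file: ParquetFileInfo,
    flushed_deletes: Vec<DeletePredicate>,
    write_buffer_checkpoint: Vec<SequencerCheckpoint>,
}
```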
### Commit
If the Preserved Catalog transaction is successful we can then:
- Commit the `FlushHandle`
- Mark the chunk as persisted in the In-Memory Catalog
There is a background worker that cleans up parquet files unreferenced by the Preserved Catalog. As such this process holds a shared "cleanup lock" to prevent this task from running and deleting the parquet file before it has been inserted into the Preserved Catalog.
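One way to picture this interaction is with a reader-writer lock, as in the hedged sketch below; the actual IOx locking primitives differ.

```rust
// Sketch only: a std RwLock stands in for the "cleanup lock"; the real IOx
// locking primitives differ.

use std::sync::RwLock;

struct Server {
    cleanup_lock: RwLock<()>,
}

impl Server {
    fn persist(&self) {
        // Shared lock: several persist operations may run concurrently, but
        // the cleanup worker cannot run while any of them hold it, so a
        // freshly written parquet file cannot be deleted before the
        // Preserved Catalog transaction referencing it commits.
        let _guard = self.cleanup_lock.read().unwrap();
        // 1. write the parquet file to object storage
        // 2. commit the Preserved Catalog transaction
        // 3. commit the FlushHandle and mark the chunk persisted
    }

    fn cleanup_worker(&self) {
        // Exclusive lock: only runs when no persist is in flight.
        let _guard = self.cleanup_lock.write().unwrap();
        // delete parquet files not referenced by the Preserved Catalog
    }
}
```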
## 5. Compact Persisted
As queries, deletes and writes land the lifecycle may decide to compact chunks that are persisted.
As this data is already durable, this occurs largely independently of the ingest flow - no new data is being persisted.
However, there are some subtleties worth mentioning:
- Like all other lifecycle actions, this must compact a contiguous sequence of chunks to preserve ordering within a partition
- The new chunk should preserve the maximum checkpoint information from the source chunks[^1]
- Deletes that land in the Delete Mailbox whilst this operation is in-flight need to be also enqueued for the future compacted chunk
[^1]: Checkpoint information is only associated with chunks as a mechanism to recover from loss of the Preserved Catalog; conceptually the checkpoint is really a property of the partition.
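Preserving the maximum checkpoint information could look roughly like the following sketch; the types are illustrative only.

```rust
// Sketch only: illustrative types showing how per-sequencer checkpoint
// information from the source chunks could be merged by taking maxima.

use std::collections::BTreeMap;

#[derive(Clone, Copy)]
struct Checkpoint {
    min_unpersisted: u64,
    max_seen: u64,
}

/// Merge the per-sequencer checkpoints of the source chunks into the
/// checkpoint stored on the newly compacted chunk.
fn merge_checkpoints(sources: &[BTreeMap<u32, Checkpoint>]) -> BTreeMap<u32, Checkpoint> {
    let mut merged: BTreeMap<u32, Checkpoint> = BTreeMap::new();
    for source in sources {
        for (&sequencer_id, &checkpoint) in source {
            merged
                .entry(sequencer_id)
                .and_modify(|existing| {
                    existing.min_unpersisted =
                        existing.min_unpersisted.max(checkpoint.min_unpersisted);
                    existing.max_seen = existing.max_seen.max(checkpoint.max_seen);
                })
                .or_insert(checkpoint);
        }
    }
    merged
}
```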
## 6. Unload/Drop Data
Eventually memory pressure will trigger the lifecycle to free up memory.
If persistence is enabled, it will do this by unloading persisted chunks from the In-Memory Catalog. Following this, the chunks will still exist, and their data can be retrieved from the persisted parquet files.
If persistence is disabled, it will drop the chunks from the In-Memory Catalog. Following this, the chunks will no longer exist, and will not be queryable.
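A minimal sketch of this choice, with hypothetical types, is shown below.

```rust
// Sketch only: hypothetical types showing unload vs. drop under memory pressure.

enum ChunkStorage {
    InMemory,               // data only in the In-Memory Catalog
    InMemoryAndObjectStore, // also written to a parquet file
    ObjectStoreOnly,        // unloaded: re-read from parquet on demand
}

/// Free memory for a chunk when the lifecycle demands it.
fn free_memory(persistence_enabled: bool, chunk: &mut Option<ChunkStorage>) {
    if persistence_enabled {
        // Unload: the chunk stays in the catalog and remains queryable; its
        // data is reloaded from the parquet file when needed.
        *chunk = Some(ChunkStorage::ObjectStoreOnly);
    } else {
        // Drop: the chunk is removed entirely and is no longer queryable.
        *chunk = None;
    }
}
```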