Continue work on storage engine doc
parent 761220c90a
commit fe88a9d6e4

@@ -30,6 +30,7 @@ Major topics include:
The storage engine handles data from the point an API request is received through writing it to the physical disk.

Data is written to InfluxDB using [line protocol](/v2.0/reference/line-protocol) sent via an HTTP POST request to the `/write` endpoint.

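The following is a minimal Go sketch of such a write, not the documented client API; the host, port, query parameters, and token are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Two points in line protocol: measurement,tag_set field_set timestamp
	batch := "cpu,host=server01 usage=64.2 1567029600000000000\n" +
		"cpu,host=server01 usage=51.9 1567029610000000000\n"

	// The URL and auth header below are illustrative; adjust them
	// to your InfluxDB instance and credentials.
	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:9999/write?org=my-org&bucket=my-bucket",
		strings.NewReader(batch))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Token my-token")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect 204 No Content on success
}
```
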
Batches of [points](/v2.0/reference/glossary/#point) are sent to InfluxDB, compressed, and written to a WAL (Write Ahead Log) for immediate durability.

(A *point* is a series key, field value, and timestamp.)

The points are also written to an in-memory cache and become immediately queryable.

The cache is periodically written to disk in the form of [TSM](#time-structured-merge-tree-tsm) files.

As TSM files accumulate, they are combined and compacted into higher-level TSM files.

@@ -50,21 +51,11 @@ When a client sends a write request, the following occurs:
4. Return success to caller.

`fsync()` takes the file and pushes pending writes all the way through any buffers and caches to disk.
As a system call, `fsync()` requires a kernel context switch, which is expensive in terms of time but guarantees your data is safe on disk.

{{% note %}}
To `fsync()` less frequently, batch your points.
{{% /note %}}

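A minimal Go sketch of this durability step, assuming a hypothetical WAL segment file and an already-compressed batch; `(*os.File).Sync` is Go's wrapper around `fsync()`:

```go
package main

import "os"

func main() {
	// A hypothetical, already-compressed batch of points.
	compressedBatch := []byte("<compressed block of points>")

	// Append the batch to a WAL segment file (the name is illustrative).
	f, err := os.OpenFile("_000001.wal", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if _, err := f.Write(compressedBatch); err != nil {
		panic(err)
	}

	// Sync wraps fsync(): it pushes the write through OS buffers and
	// caches to the physical disk. Batching many points per call
	// amortizes this expensive step.
	if err := f.Sync(); err != nil {
		panic(err)
	}
}
```
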
When the storage engine restarts, the WAL file is read back into the in-memory database.
InfluxDB then answers requests to the `/read` endpoint.

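A sketch of what that replay could look like, assuming sequentially numbered `_*.wal` segment files in a single directory (per the draft notes below); decompression and cache inserts are elided:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// replayWAL reads WAL segment files back in write order on startup.
// The _*.wal naming is an assumption taken from the notes below.
func replayWAL(dir string) error {
	segments, err := filepath.Glob(filepath.Join(dir, "_*.wal"))
	if err != nil {
		return err
	}
	// Segment numbers increase monotonically, so for zero-padded
	// names, lexical order is write order.
	sort.Strings(segments)

	for _, seg := range segments {
		data, err := os.ReadFile(seg)
		if err != nil {
			return err
		}
		// Here the engine would decompress each block and insert
		// its points into the in-memory cache.
		fmt.Printf("replayed %s (%d bytes)\n", seg, len(data))
	}
	return nil
}

func main() {
	if err := replayWAL("."); err != nil {
		panic(err)
	}
}
```
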
<!-- TODO is this still true? -->
<!-- On the file system, the WAL is made up of sequentially numbered files (`_000001.wal`). -->
<!-- The file numbers are monotonically increasing and referred to as WAL segments. -->
<!-- When a segment reaches 10MB in size, it is closed and a new one is opened. Each WAL segment stores multiple compressed blocks of writes and deletes. -->
<!-- Each entry in the WAL follows a [TLV standard](https://en.wikipedia.org/wiki/Type-length-value) with a single byte representing the type of entry (write or delete), a 4 byte `uint32` for the length of the compressed block, and then the compressed block. -->

{{% note %}}
Once you receive a response to a write request, your data is on disk!
{{% /note %}}

@@ -78,22 +69,20 @@ Data is not compressed in the cache.
The cache is recreated on restart by re-reading the WAL files on disk back into memory.
The cache is queried at runtime and merged with the data stored in TSM files.

Queries to the storage engine will merge data from the cache with data from the TSM files.
Queries execute on a copy of the data that is made from the cache at query processing time.
This way, writes that come in while a query is running do not affect the result.

Deletes sent to the cache will clear out the given key or the specific time range for the given key.

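A toy Go sketch of these cache semantics (query-time copies, plus deletes by key and time range), assuming a bare map from series key to timestamped values; the real cache is considerably more involved:

```go
package main

import (
	"fmt"
	"sync"
)

type value struct {
	ts int64   // timestamp
	v  float64 // field value
}

// cache is a toy stand-in for the in-memory cache: series key -> values.
type cache struct {
	mu     sync.RWMutex
	series map[string][]value
}

func (c *cache) write(key string, vals ...value) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.series[key] = append(c.series[key], vals...)
}

// snapshot returns a copy of a series, so writes that arrive while a
// query is running do not affect its result.
func (c *cache) snapshot(key string) []value {
	c.mu.RLock()
	defer c.mu.RUnlock()
	cp := make([]value, len(c.series[key]))
	copy(cp, c.series[key])
	return cp
}

// deleteRange clears values for a key within [min, max].
func (c *cache) deleteRange(key string, min, max int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	kept := c.series[key][:0]
	for _, v := range c.series[key] {
		if v.ts < min || v.ts > max {
			kept = append(kept, v)
		}
	}
	c.series[key] = kept
}

func main() {
	c := &cache{series: map[string][]value{}}
	c.write("cpu,host=server01", value{1, 64.2}, value{2, 51.9})
	snap := c.snapshot("cpu,host=server01")
	c.write("cpu,host=server01", value{3, 70.1}) // arrives mid-query
	fmt.Println(len(snap))                       // still 2
}
```
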
## Time-Structured Merge Tree (TSM)

To efficiently compact and store data,
the storage engine groups field values by series key,
and then orders those field values by time.

(A *series key* is defined by measurement, tag key and value, and field key.)

The storage engine uses a **Time-Structured Merge Tree** (TSM) data format.
TSM files store compressed series data in a columnar format.

@@ -101,13 +90,7 @@ To improve efficiency, the storage engine only stores differences (or *deltas*)
Column-oriented storage lets the storage engine read by series key and ignore anything it doesn't need.

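As a rough illustration of the delta idea mentioned above, this Go sketch delta-encodes a timestamp column; the real encoders are type-specific and considerably more sophisticated:

```go
package main

import "fmt"

// deltaEncode stores the first value of a column, then only the
// difference from each value to the previous one. Regular intervals
// produce long runs of identical small deltas, which compress well.
func deltaEncode(ts []int64) []int64 {
	if len(ts) == 0 {
		return nil
	}
	out := make([]int64, len(ts))
	out[0] = ts[0]
	for i := 1; i < len(ts); i++ {
		out[i] = ts[i] - ts[i-1]
	}
	return out
}

func main() {
	ts := []int64{1567029600, 1567029610, 1567029620, 1567029630}
	fmt.Println(deltaEncode(ts)) // [1567029600 10 10 10]
}
```
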
After fields are stored safely in TSM files, the WAL is truncated and the cache is cleared.

<!-- TODO what next? -->

There's a lot of logic and sophistication in the TSM compaction code.

@@ -116,7 +99,7 @@ organize values for a series together into long runs to best optimize compression
## Time Series Index (TSI)

As data cardinality (the number of series) grows, queries read more series keys and become slower.

The **Time Series Index** (TSI) ensures queries remain fast as data cardinality grows.

@@ -127,28 +110,3 @@ The TSI stores series keys grouped by measurement, tag, and field.
TSI answers two questions well:

1. What measurements, tags, and fields exist?
2. Given a measurement, tags, and fields, what series keys exist?

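A toy Go sketch of an index shaped to answer those two questions, using a simple nested map; the on-disk TSI format is far more elaborate:

```go
package main

import "fmt"

// tagIndex maps measurement -> tag key -> tag value -> series keys.
type tagIndex map[string]map[string]map[string][]string

func (ix tagIndex) add(measurement, tagKey, tagValue, seriesKey string) {
	if ix[measurement] == nil {
		ix[measurement] = map[string]map[string][]string{}
	}
	if ix[measurement][tagKey] == nil {
		ix[measurement][tagKey] = map[string][]string{}
	}
	ix[measurement][tagKey][tagValue] = append(ix[measurement][tagKey][tagValue], seriesKey)
}

func main() {
	ix := tagIndex{}
	ix.add("cpu", "host", "server01", "cpu,host=server01")
	ix.add("cpu", "host", "server02", "cpu,host=server02")

	// 1. What measurements and tags exist?
	for m, tags := range ix {
		for t := range tags {
			fmt.Println(m, t) // cpu host
		}
	}

	// 2. Which series keys match measurement cpu and tag host=server01?
	fmt.Println(ix["cpu"]["host"]["server01"]) // [cpu,host=server01]
}
```
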
<!-- ## Shards -->
<!-- A shard contains: -->
<!-- WAL files -->
<!-- TSM files -->
<!-- TSI files -->
<!-- Shards are time-bounded -->
<!-- Retention policies have properties: duration and shard duration -->
<!-- colder shards get more compacted -->

<!-- =========== QUESTIONS -->
<!-- Which parts of cache and WAL are configurable? -->
<!-- Should we even mention shards? -->

<!-- =========== OTHER -->
<!-- V1 -->
<!-- - FileStore - The FileStore mediates access to all TSM files on disk. -->
<!-- It ensures that TSM files are installed atomically when existing ones are replaced as well as removing TSM files that are no longer used. -->
<!-- - Compactor - The Compactor is responsible for converting less optimized Cache and TSM data into more read-optimized formats. -->
<!-- It does this by compressing series, removing deleted data, optimizing indices and combining smaller files into larger ones. -->
<!-- - Compaction Planner - The Compaction Planner determines which TSM files are ready for a compaction and ensures that multiple concurrent compactions do not interfere with each other. -->
<!-- - Compression - Compression is handled by various Encoders and Decoders for specific data types. -->
<!-- Some encoders are fairly static and always encode the same type the same way; -->
<!-- others switch their compression strategy based on the shape of the data. -->
<!-- - Writers/Readers - Each file type (WAL segment, TSM files, tombstones, etc..) has Writers and Readers for working with the formats. -->