Continue work on storage engine doc

pull/556/head
pierwill 2019-11-12 09:40:45 -08:00
parent 761220c90a
commit fe88a9d6e4
1 changed file with 8 additions and 50 deletions


@@ -30,6 +30,7 @@ Major topics include:
The storage engine handles data from the point an API request is received through writing it to the physical disk.
Data is written to InfluxDB using [line protocol](/v2.0/reference/line-) sent via HTTP POST request to the `/write` endpoint.
Batches of [points](/v2.0/reference/glossary/#point) are sent to InfluxDB, compressed, and written to a write ahead log (WAL) for immediate durability.
(A *point* is a series key, field value, and timestamp.)
The points are also written to an in-memory cache and become immediately queryable.
The cache is periodically written to disk in the form of [TSM](#time-structured-merge-tree-tsm) files.
As TSM files accumulate, they are combined and compacted into higher level TSM files.
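As a concrete illustration of the `/write` request described above, here is a minimal Go sketch that POSTs a small batch of line protocol. The URL path, org, bucket, and token are placeholder assumptions (a v2-style `/api/v2/write` endpoint), not values taken from this doc:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// A small batch of points in line protocol:
	// measurement,tag_key=tag_value field_key=field_value timestamp
	batch := []byte("cpu,host=server01 usage_user=42.1 1572600000000000000\n" +
		"cpu,host=server01 usage_user=42.5 1572600010000000000\n")

	// Hypothetical endpoint, org, bucket, and token for illustration only.
	url := "http://localhost:8086/api/v2/write?org=my-org&bucket=my-bucket"
	req, err := http.NewRequest("POST", url, bytes.NewReader(batch))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Token my-token")
	req.Header.Set("Content-Type", "text/plain; charset=utf-8")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A 2xx status means the batch has been written to the WAL and is durable.
	fmt.Println("status:", resp.Status)
}
```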
@@ -50,21 +51,11 @@ When a client sends a write request, the following occurs:
4. Return success to caller.
`fsync()` takes the file and pushes pending writes all the way through any buffers and caches to disk.
As a system call, `fsync()` has a kernel context switch which is computationally expensive, but guarantees your data is safe on disk.
{{% note %}}
To `fsync()` less frequently, batch your points.
{{% /note %}}
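To make that cost model concrete, here is a minimal Go sketch of the append-then-`fsync()` pattern. It is not InfluxDB's actual WAL code, just an illustration of why batching points means paying for `fsync()` once per batch instead of once per point:

```go
package wal

import "os"

// appendToWAL sketches the write path's durability step: append a compressed
// batch to the WAL file, then fsync before acknowledging the write.
func appendToWAL(f *os.File, compressedBatch []byte) error {
	if _, err := f.Write(compressedBatch); err != nil {
		return err
	}
	// os.File.Sync wraps fsync(2): push pending writes through kernel
	// buffers and caches to the physical disk.
	if err := f.Sync(); err != nil {
		return err
	}
	// Only now is it safe to return success to the caller.
	return nil
}
```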
When the storage engine restarts, the WAL file is read back into the in-memory cache.
InfluxDB then answers requests to the `/read` endpoint.
<!-- TODO is this still true? -->
<!-- On the file system, the WAL is made up of sequentially numbered files (`_000001.wal`). -->
<!-- The file numbers are monotonically increasing and referred to as WAL segments. -->
<!-- When a segment reaches 10MB in size, it is closed and a new one is opened. Each WAL segment stores multiple compressed blocks of writes and deletes. -->
<!-- Each entry in the WAL follows a [TLV standard](https://en.wikipedia.org/wiki/Type-length-value) with a single byte representing the type of entry (write or delete), a 4 byte `uint32` for the length of the compressed block, and then the compressed block. -->
{{% note %}}
Once you receive a response to a write request, your data is on disk!
{{% /note %}}
@@ -78,22 +69,20 @@ Data is not compressed in the cache.
The cache is recreated on restart: WAL files on disk are re-read into the in-memory cache.
The cache is queried at runtime and merged with the data stored in TSM files.
Queries to the storage engine will merge data from the cache with data from the TSM files.
Queries execute on a copy of the data that is made from the cache at query processing time.
This way, writes that come in while a query is running do not affect the result.
Deletes sent to the cache will clear out the given key or the specific time range for the given key.
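A toy Go sketch of the cache behavior described above (not InfluxDB's implementation): values are keyed by series key, queries take a copy of the cached values, and deletes clear a key or a time range within it:

```go
package cache

import "sync"

// value pairs a timestamp with a field value, as in a point.
type value struct {
	Timestamp int64
	V         float64
}

// Cache is a toy stand-in for the in-memory cache: field values keyed by series key.
type Cache struct {
	mu     sync.RWMutex
	series map[string][]value
}

// NewCache returns an empty cache.
func NewCache() *Cache {
	return &Cache{series: make(map[string][]value)}
}

// Write appends a value for a series key, as happens when a batch of points
// is added to the cache.
func (c *Cache) Write(seriesKey string, v value) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.series[seriesKey] = append(c.series[seriesKey], v)
}

// Snapshot copies the values for one series key so a running query is not
// affected by writes that arrive after the query starts.
func (c *Cache) Snapshot(seriesKey string) []value {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return append([]value(nil), c.series[seriesKey]...)
}

// DeleteRange drops values for a series key within [min, max].
func (c *Cache) DeleteRange(seriesKey string, min, max int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	kept := c.series[seriesKey][:0]
	for _, v := range c.series[seriesKey] {
		if v.Timestamp < min || v.Timestamp > max {
			kept = append(kept, v)
		}
	}
	c.series[seriesKey] = kept
}
```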
## Time-Structured Merge Tree (TSM)
To efficiently compact and store data,
the storage engine groups field values by series key,
and then orders those field values by time.
(A *series key* is defined by measurement, tag key and value, and field key.)
The storage engine uses a **Time-Structured Merge Tree** (TSM) data format.
TSM files store compressed series data in a columnar format.
@@ -101,13 +90,7 @@ To improve efficiency, the storage engine only stores differences (or *deltas*)
Storing data in columns lets the storage engine read by series key and ignore data it doesn't need.
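As a rough sketch of the delta idea mentioned above (storing differences between successive values instead of absolute values), assuming a simple int64 timestamp column; the real TSM encoders layer run-length encoding and bit packing on top of this:

```go
package tsm

// deltaEncode stores the first value as-is and every later value as the
// difference from its predecessor. Time-ordered columns produce small,
// highly compressible deltas.
func deltaEncode(timestamps []int64) []int64 {
	if len(timestamps) == 0 {
		return nil
	}
	deltas := make([]int64, len(timestamps))
	deltas[0] = timestamps[0]
	for i := 1; i < len(timestamps); i++ {
		deltas[i] = timestamps[i] - timestamps[i-1]
	}
	return deltas
}

// deltaDecode reverses deltaEncode by summing the deltas back up.
func deltaDecode(deltas []int64) []int64 {
	if len(deltas) == 0 {
		return nil
	}
	timestamps := make([]int64, len(deltas))
	timestamps[0] = deltas[0]
	for i := 1; i < len(deltas); i++ {
		timestamps[i] = timestamps[i-1] + deltas[i]
	}
	return timestamps
}
```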
After fields are stored safely in TSM files, the WAL is truncated and the cache is cleared.
<!-- TODO what next? -->
There's a lot of logic and sophistication in the TSM compaction code.
@@ -116,7 +99,7 @@ organize values for a series together into long runs to best optimize compression
## Time Series Index (TSI)
As data cardinality (the number of series) grows, queries read more series keys and become slower.
The **Time Series Index** (TSI) ensures queries remain fast as data cardinality grows.
@@ -127,28 +110,3 @@ The TSI stores series keys grouped by measurement, tag, and field.
TSI answers two questions well (see the sketch after this list):
1) What measurements, tags, fields exist?
2) Given a measurement, tags, and fields, what series keys exist?
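A toy inverted index in Go, not the on-disk TSI format, just to show the shape of those two lookups (assumed structure: measurement → tag pair → series keys):

```go
package tsi

// Index is a toy inverted index illustrating the two questions above.
type Index struct {
	// measurement -> tag "key=value" pair -> set of series keys
	byTag map[string]map[string]map[string]struct{}
}

// NewIndex returns an empty index.
func NewIndex() *Index {
	return &Index{byTag: make(map[string]map[string]map[string]struct{})}
}

// Add registers a series key under its measurement and one tag pair.
func (ix *Index) Add(measurement, tagPair, seriesKey string) {
	tags, ok := ix.byTag[measurement]
	if !ok {
		tags = make(map[string]map[string]struct{})
		ix.byTag[measurement] = tags
	}
	keys, ok := tags[tagPair]
	if !ok {
		keys = make(map[string]struct{})
		tags[tagPair] = keys
	}
	keys[seriesKey] = struct{}{}
}

// Measurements answers part of question 1: which measurements exist?
func (ix *Index) Measurements() []string {
	var out []string
	for m := range ix.byTag {
		out = append(out, m)
	}
	return out
}

// SeriesKeys answers question 2: given a measurement and a tag pair,
// which series keys exist?
func (ix *Index) SeriesKeys(measurement, tagPair string) []string {
	var out []string
	for k := range ix.byTag[measurement][tagPair] {
		out = append(out, k)
	}
	return out
}
```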
<!-- ## Shards -->
<!-- A shard contains: -->
<!-- WAL files -->
<!-- TSM files -->
<!-- TSI files -->
<!-- Shards are time-bounded -->
<!-- Retention policies have properties: duration and shard duration -->
<!-- colder shards get more compacted -->
<!-- =========== QUESTIONS -->
<!-- Which parts of cache and WAL are configurable? -->
<!-- Should we even mention shards? -->
<!-- =========== OTHER -->
<!-- V1 -->
<!-- - FileStore - The FileStore mediates access to all TSM files on disk. -->
<!-- It ensures that TSM files are installed atomically when existing ones are replaced as well as removing TSM files that are no longer used. -->
<!-- - Compactor - The Compactor is responsible for converting less optimized Cache and TSM data into more read-optimized formats. -->
<!-- It does this by compressing series, removing deleted data, optimizing indices and combining smaller files into larger ones. -->
<!-- - Compaction Planner - The Compaction Planner determines which TSM files are ready for a compaction and ensures that multiple concurrent compactions do not interfere with each other. -->
<!-- - Compression - Compression is handled by various Encoders and Decoders for specific data types. -->
<!-- Some encoders are fairly static and always encode the same type the same way; -->
<!-- others switch their compression strategy based on the shape of the data. -->
<!-- - Writers/Readers - Each file type (WAL segment, TSM files, tombstones, etc..) has Writers and Readers for working with the formats. -->