Storage engine doc: continue PR review work

pull/556/head
pierwill 2020-01-08 10:41:29 -08:00
parent 46803fc3ee
commit 09e6cef6d3
1 changed files with 12 additions and 12 deletions

@@ -36,7 +36,7 @@ The in-memory cache is periodically written to disk in the form of [TSM](#time-s
As TSM files accumulate, the storage engine combines and compacts them into higher-level TSM files.
{{% note %}}
Points can be sent individually; however, for efficiency, most applications send points in batches.
While points can be sent individually, for efficiency, most applications send points in batches.
Points in a POST body can be from an arbitrary number of series, measurements, and tag sets.
Points in a batch do not have to be from the same measurement or tagset.
{{% /note %}}
@@ -62,29 +62,29 @@ InfluxDB then answers requests to the `/read` endpoint.
## Cache
The **cache** is an in-memory copy of data points currently stored in the WAL.
Points are organized by key, which is the measurement, tag set, and unique field.
Each field is stored in its own time-ordered range.
Data is not compressed in the cache.
The cache is recreated on restart by re-reading the WAL files on disk back into memory.
The cache is queried at runtime and merged with the data stored in TSM files.
When the storage engine restarts, WAL files are re-read into the in-memory cache.
The cache:
- Organizes points by key (measurement, tag set, and unique field).
Each field is stored in its own time-ordered range.
- Stores uncompressed data.
- Gets updates from the WAL each time the storage engine restarts.
The cache is queried at runtime and merged with the data stored in TSM files.
Queries to the storage engine merge data from the cache with data from the TSM files.
Queries execute on a copy of the data that is made from the cache at query processing time.
This way writes that come in while a query is running do not affect the result.
Deletes sent to the cache will clear out the given key or the specific time range for the given key.
Deletes sent to the cache clear the specified key or time range for a specified key.
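The cache behavior described above (points organized by key, each field kept in time order, queries running on a copy, and deletes by key or time range) can be sketched roughly as follows. This is an illustrative toy, not InfluxDB's implementation: the `Cache` class, its method names, and the tuple-shaped key are all assumptions made for the example.

```python
import bisect
import copy


class Cache:
    """Toy in-memory cache (hypothetical): one time-ordered list per key."""

    def __init__(self):
        # key -> list of (timestamp, value), kept sorted by timestamp
        self._entries = {}

    def write(self, key, timestamp, value):
        # Insert in sorted position so each field stays in time order.
        bisect.insort(self._entries.setdefault(key, []), (timestamp, value))

    def snapshot(self):
        # Queries read from a copy, so later writes do not affect the result.
        return copy.deepcopy(self._entries)

    def delete(self, key, start=None, stop=None):
        # With no range, clear the whole key; otherwise clear only the
        # points whose timestamps fall in [start, stop] for that key.
        if start is None and stop is None:
            self._entries.pop(key, None)
            return
        lo = float("-inf") if start is None else start
        hi = float("inf") if stop is None else stop
        kept = [(t, v) for t, v in self._entries.get(key, []) if not (lo <= t <= hi)]
        if kept:
            self._entries[key] = kept
        else:
            self._entries.pop(key, None)


# Hypothetical key: (measurement, tag set, field)
key = ("cpu", "host=a", "usage")
cache = Cache()
cache.write(key, 20, 0.7)
cache.write(key, 10, 0.5)           # out-of-order write is placed in time order
snap = cache.snapshot()             # a query would read from this copy
cache.write(key, 30, 0.9)           # arrives while the "query" is running
cache.delete(key, start=10, stop=10)
print(snap[key])                    # the copy is unaffected by the later write and delete
print(cache.snapshot()[key])
```

A deep copy is the simplest way to show the isolation property; a real engine would likely use a cheaper snapshot mechanism.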
## Time-Structured Merge Tree (TSM)
To efficiently compact and store data,
the storage engine groups field values by series key, and then orders those field values by time.
(A [series key](/v2/) is defined by measurement, tag key and value, and field key.)
(A [series key](/v2.0/reference/glossary/#series-key) is defined by measurement, tag key and value, and field key.)
The storage engine uses a **Time-Structured Merge Tree** (TSM) data format.
TSM files store compressed series data in a columnar format.
To improve efficiency, the storage engine only stores differences (or *deltas*) between values in a series.
Column-oriented storage means we can read by series key and ignore what it doesn't need.
Storing data in columns lets the storage engine read by series key.
Column-oriented storage lets the engine read by series key and omit extraneous data.
After fields are stored safely in TSM files, the WAL is truncated and the cache is cleared.
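As a rough illustration of storing deltas rather than full values, the sketch below encodes a series as its first value followed by successive differences, then restores it. The function names are hypothetical and this is plain delta encoding only; it is not the actual TSM encoder, which the text above does not specify in detail.

```python
from itertools import accumulate


def delta_encode(values):
    """Return the first value followed by the differences between neighbors."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original series with a running sum over the deltas."""
    return list(accumulate(deltas))


# Regularly spaced timestamps collapse into small, repetitive numbers,
# which is what makes the encoded form cheaper to store.
timestamps = [1000, 1010, 1020, 1030]
encoded = delta_encode(timestamps)
print(encoded)                 # [1000, 10, 10, 10]
print(delta_decode(encoded))   # [1000, 1010, 1020, 1030]
```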
The TSM compaction code is quite complex.
@@ -98,7 +98,7 @@ The **Time Series Index** ensures queries remain fast as data cardinality grows.
To keep queries fast as we have more data, we use a **Time Series Index**.
TSI stores series keys grouped by measurement, tag, and field.
In data with high cardinality (a large quantity of series), it becomes slower to search through all series keys.
In data with high cardinality (a large quantity of series), queries become slower.
The TSI stores series keys grouped by measurement, tag, and field.
TSI answers two questions well: