From 3c6164953ea05727fa21910b17c513ddd6e5d126 Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Mon, 21 Oct 2019 11:18:48 -0700 Subject: [PATCH 01/18] Add initial outline and notes for storage engine documentation Uses material based on v1 docs and: - https://www.influxdata.com/blog/influxdb-internals-101-part-one/ - https://www.youtube.com/watch?v=8dOZGnSwKmo --- content/v2.0/reference/storage-engine.md | 202 +++++++++++++++++++++++ 1 file changed, 202 insertions(+) create mode 100644 content/v2.0/reference/storage-engine.md diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/storage-engine.md new file mode 100644 index 000000000..12dbb9e8f --- /dev/null +++ b/content/v2.0/reference/storage-engine.md @@ -0,0 +1,202 @@ +--- +title: InfluxDB storage engine +description: > + Concepts related to InfluxDB storage engine. +weight: 7 +menu: + v2_0_ref: + name: Storage engine +v2.0/tags: [storage engine, internals, platform] +--- + +## Introduction + +The InfluxDB 2.0 platform includes: + +- query engine +- internal API +- storage engine + +The query engine sends requests through the internal API to the storage engine. +The storage engine ensures the following three things: + +- Data is safely written to disk +- Queried data is returned complete and correct +- Data is accurate (first) and performant (second) + +This document details the internal workings of the storage engine. +This information is presented both as a reference and to aid those looking to maximize performance. + +{{% note %}} +##### At a glance: Changes in InfluxDB 2.0 +- The InfluxDB 2.0 storage engine no longer partitions data into shards by time. +- **Buckets** replace databases and retention policies. +- Only TSI is used. There is no more in-memory index. +- The configuration interface and options have changed. Configuration options are not currently exposed, but will be. 
+ +[Read about the v1 storage engine](https://docs.influxdata.com/influxdb/v1.7/concepts/storage_engine). +{{% /note %}} + +## Summary + +In summary, batches of points are POSTed to InfluxDB. +Those batches are snappy compressed and written to a WAL for immediate durability. +The points are also written to an in-memory cache so that newly written points are immediately queryable. +The cache is periodically flushed to TSM files. +As TSM files accumulate, they are combined and compacted into higher level TSM files. +TSM data is organized into shards. +The time range covered by a shard and the replication factor of a shard in a clustered deployment are configured by the retention policy. + + + + + + + + + + + + + + + + + + + +## /write endpoint + +Data is written to InfluxDB using [Line protocol](/) sent via HTTP POST request to the `/write` endpoint. +Points can be sent individually; however, for efficiency, most applications send points in batches. +A typical batch ranges in size from hundreds to thousands of points. +Points in a POST body can be from an arbitrary number of series. +Points in a batch do not have to be from the same measurement or tagset. + +## Durability: Write Ahead Log (WAL) + + + + +To ensure durability, we use a Write Ahead Log (WAL). + +WAL is a data structure and algorithm that is super simple and powerful. +It ensures that written data does not disappear when storage engine restarts. +When a client sends a /write request, the following occurs: + +1. Write request is appended to the end of the WAL file. +2. fsync() the data to the file. +3. Update the in-memory database. +4. Return success to caller. + +fsync() is a system call, so it has a kernel context switch which costs something. +fsync() takes the file and pushes pending writes all the way through any buffers and caches to disk. +fsync() is expensive _in terms of time_ but guarantees your data is safe on disk. 
+**Important** to batch your points (send in ~2000 points at a time), to fsync() less frequently. + + +When the storage engine restarts, open WAL file and read it back into the in-memory database. +Answer requests to the /read endpoint. + + + + + + + + + + + + + + + + + + + +{{% note%}} +Once you receive a response to a write request, your data is on disk! +{{% /note %}} + +## Cache + + +The Cache is an in-memory copy of all data points current stored in the WAL. +The points are organized by the key, which is the measurement, [tag set](/influxdb/v1.7/concepts/glossary/#tag-set), and unique [field](/influxdb/v1.7/concepts/glossary/#field). +Each field is kept as its own time-ordered range. +The Cache data is not compressed while in memory. + +Queries to the storage engine will merge data from the Cache with data from the TSM files. +Queries execute on a copy of the data that is made from the cache at query processing time. +This way writes that come in while a query is running won't affect the result. + +Deletes sent to the Cache will clear out the given key or the specific time range for the given key. + + + + + + + + + + + +The cache is recreated on restart by re-reading the WAL files on disk back into memory. + +## Time-Structured Merge Tree + +Now let's handle more Data! +Queries get slower as data grows +Service terminates if data size exceeds memory... + +**Time-Structured Merge Tree** (TSM) is our data format +We group field values grouped by series key, then order field values by time + + +series key = measurement, tag key+value, field key +point = series key, field value, timestamp + +Within a series, we store only differences between values, which is more efficient. + +Column-Oriented storage means we can read by series key and ignore what it doesn't need. + +Compression helps with performance. + +After fields are stored safely in TSM files, WAL is truncated... 
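
The point about storing only differences between values can be illustrated with a minimal delta encoding in Python. This is a simplified sketch of the idea only; the encodings TSM actually applies are more elaborate and type-specific, and the function names here are invented for illustration.

```python
def delta_encode(values):
    """Keep the first value, then only the difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values

# Regularly spaced timestamps collapse into small, repetitive deltas,
# which later compression stages handle very well.
timestamps = [1571858400, 1571858410, 1571858420, 1571858430]
encoded = delta_encode(timestamps)  # [1571858400, 10, 10, 10]
```

A run of identical small deltas compresses far better than the raw timestamps, which is why ordering values by time within a series pays off.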
+ +(This stuff is configurable) + +There’s a lot of logic and sophistication in the TSM compaction code. +However, the high-level goal is quite simple: +organize values for a series together into long runs to best optimize compression and scanning queries. + +## TSI + +To keep queries fast as we have more data, we use a **Time Series Index**. +cardinality = quantity of series keys +With high cardinality, we have to search through all series keys. +So how to quickly find and match series keys? +We use Time Series Index (TSI), which stores series keys grouped by measurement,tag,field +TSI answers question what measurements, tags, fields exist? + + + + + + + + + + + + + + + + + + + From 61ce3d47ff839b6b9a7d5ff6e74f9181793968fd Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Thu, 24 Oct 2019 10:41:56 -0700 Subject: [PATCH 02/18] Re-structure storage engine doc --- content/v2.0/reference/storage-engine.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/storage-engine.md index 12dbb9e8f..0cfaf5a0e 100644 --- a/content/v2.0/reference/storage-engine.md +++ b/content/v2.0/reference/storage-engine.md @@ -11,14 +11,7 @@ v2.0/tags: [storage engine, internals, platform] ## Introduction -The InfluxDB 2.0 platform includes: - -- query engine -- internal API -- storage engine - -The query engine sends requests through the internal API to the storage engine. -The storage engine ensures the following three things: +The InfluxDB storage engine ensures the following three things: - Data is safely written to disk - Queried data is returned complete and correct @@ -27,17 +20,24 @@ The storage engine ensures the following three things: This document details the internal workings of the storage engine. This information is presented both as a reference and to aid those looking to maximize performance. 
+Major topics include: + +* [Write Ahead Log (WAL)](#) +* [Time-Structed Merge Tree (TSM)](#) +* [Time Series Index (TSI)](#) + {{% note %}} -##### At a glance: Changes in InfluxDB 2.0 +##### At a glance: changes to the storage engine in InfluxDB 2.0 - The InfluxDB 2.0 storage engine no longer partitions data into shards by time. - **Buckets** replace databases and retention policies. - Only TSI is used. There is no more in-memory index. - The configuration interface and options have changed. Configuration options are not currently exposed, but will be. -[Read about the v1 storage engine](https://docs.influxdata.com/influxdb/v1.7/concepts/storage_engine). +Read about the [v1 storage engine](https://docs.influxdata.com/influxdb/v1.7/concepts/storage_engine). {{% /note %}} -## Summary + +## Writing data: from API to disk In summary, batches of points are POSTed to InfluxDB. Those batches are snappy compressed and written to a WAL for immediate durability. @@ -183,7 +183,7 @@ We use Time Series Index (TSI), which stores series keys grouped by measurement, TSI answers question what measurements, tags, fields exist? - + From 496673f6a44f2e04f4b2b9c347c8e7f96abcb547 Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Thu, 24 Oct 2019 11:13:35 -0700 Subject: [PATCH 03/18] Redistribute v1 storage engine information --- content/v2.0/reference/storage-engine.md | 60 ++++++++++++------------ 1 file changed, 31 insertions(+), 29 deletions(-) diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/storage-engine.md index 0cfaf5a0e..aae389564 100644 --- a/content/v2.0/reference/storage-engine.md +++ b/content/v2.0/reference/storage-engine.md @@ -1,7 +1,7 @@ --- title: InfluxDB storage engine description: > - Concepts related to InfluxDB storage engine. + An overview of the InfluxDB storage engine architecture. 
weight: 7 menu: v2_0_ref: @@ -36,7 +36,6 @@ Major topics include: Read about the [v1 storage engine](https://docs.influxdata.com/influxdb/v1.7/concepts/storage_engine). {{% /note %}} - ## Writing data: from API to disk In summary, batches of points are POSTed to InfluxDB. @@ -44,28 +43,10 @@ Those batches are snappy compressed and written to a WAL for immediate durabilit The points are also written to an in-memory cache so that newly written points are immediately queryable. The cache is periodically flushed to TSM files. As TSM files accumulate, they are combined and compacted into higher level TSM files. -TSM data is organized into shards. -The time range covered by a shard and the replication factor of a shard in a clustered deployment are configured by the retention policy. + + - - - - - - - - - - - - - - - - - - -## /write endpoint + Data is written to InfluxDB using [Line protocol](/) sent via HTTP POST request to the `/write` endpoint. Points can be sent individually; however, for efficiency, most applications send points in batches. @@ -78,6 +59,8 @@ Points in a batch do not have to be from the same measurement or tagset. + + To ensure durability, we use a Write Ahead Log (WAL). WAL is a data structure and algorithm that is super simple and powerful. @@ -107,7 +90,7 @@ Answer requests to the /read endpoint. - + @@ -122,16 +105,19 @@ Once you receive a response to a write request, your data is on disk! ## Cache +Queries to the storage engine will merge data from the Cache with data from the TSM files. +Queries execute on a copy of the data that is made from the cache at query processing time. +This way writes that come in while a query is running won’t affect the result. + + + + The Cache is an in-memory copy of all data points current stored in the WAL. The points are organized by the key, which is the measurement, [tag set](/influxdb/v1.7/concepts/glossary/#tag-set), and unique [field](/influxdb/v1.7/concepts/glossary/#field). 
Each field is kept as its own time-ordered range. The Cache data is not compressed while in memory. -Queries to the storage engine will merge data from the Cache with data from the TSM files. -Queries execute on a copy of the data that is made from the cache at query processing time. -This way writes that come in while a query is running won't affect the result. - Deletes sent to the Cache will clear out the given key or the specific time range for the given key. @@ -148,6 +134,9 @@ The cache is recreated on restart by re-reading the WAL files on disk back into ## Time-Structured Merge Tree + + + Now let's handle more Data! Queries get slower as data grows Service terminates if data size exceeds memory... @@ -183,8 +172,8 @@ We use Time Series Index (TSI), which stores series keys grouped by measurement, TSI answers question what measurements, tags, fields exist? + - @@ -200,3 +189,16 @@ TSI answers question what measurements, tags, fields exist? + + + + + + + + + + + + + From d67081b98a8e0d50b5c85fb5f9b6b6854540ca7e Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Thu, 24 Oct 2019 12:58:32 -0700 Subject: [PATCH 04/18] Work on drafting storage engine doc --- content/v2.0/reference/storage-engine.md | 102 +++++++++-------------- 1 file changed, 41 insertions(+), 61 deletions(-) diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/storage-engine.md index aae389564..2d03092b8 100644 --- a/content/v2.0/reference/storage-engine.md +++ b/content/v2.0/reference/storage-engine.md @@ -9,8 +9,6 @@ menu: v2.0/tags: [storage engine, internals, platform] --- -## Introduction - The InfluxDB storage engine ensures the following three things: - Data is safely written to disk @@ -22,12 +20,12 @@ This information is presented both as a reference and to aid those looking to ma Major topics include: -* [Write Ahead Log (WAL)](#) -* [Time-Structed Merge Tree (TSM)](#) -* [Time Series Index (TSI)](#) +* [Write Ahead Log 
(WAL)](#write-ahead-log-wal) +* [Time-Structed Merge Tree (TSM)](#time-structured-merge-tree-tsm) +* [Time Series Index (TSI)](#time-series-index-tsi) {{% note %}} -##### At a glance: changes to the storage engine in InfluxDB 2.0 +##### At a glance: changes to the InfluxDB storage engine from 1.x to 2.0 - The InfluxDB 2.0 storage engine no longer partitions data into shards by time. - **Buckets** replace databases and retention policies. - Only TSI is used. There is no more in-memory index. @@ -38,38 +36,32 @@ Read about the [v1 storage engine](https://docs.influxdata.com/influxdb/v1.7/con ## Writing data: from API to disk -In summary, batches of points are POSTed to InfluxDB. -Those batches are snappy compressed and written to a WAL for immediate durability. -The points are also written to an in-memory cache so that newly written points are immediately queryable. -The cache is periodically flushed to TSM files. -As TSM files accumulate, they are combined and compacted into higher level TSM files. - - - - - +The storage engine handles data from the point an API request is received through writing it to the physical disk. Data is written to InfluxDB using [Line protocol](/) sent via HTTP POST request to the `/write` endpoint. +Batches of points are sent to InfluxDB. +Those batches are compressed and written to a WAL for immediate durability. +The points are also written to an in-memory cache so that newly written points are immediately queryable. +The cache is periodically written to disk as TSM files. +As TSM files accumulate, they are combined and compacted into higher level TSM files. + Points can be sent individually; however, for efficiency, most applications send points in batches. A typical batch ranges in size from hundreds to thousands of points. Points in a POST body can be from an arbitrary number of series. Points in a batch do not have to be from the same measurement or tagset. 
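
To illustrate the batch shape, here is a small Python sketch that assembles points from two different measurements and tag sets into a single POST body. The `to_line_protocol` helper is hypothetical, written only for this example; it is not part of any InfluxDB client library.

```python
def to_line_protocol(measurement, tags, fields, timestamp):
    """Hypothetical helper: render one point as a line of line protocol."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp}"

# Points from different measurements and tag sets can share a single batch;
# the newline-joined string below is the body of one POST to /write.
batch = "\n".join([
    to_line_protocol("cpu", {"host": "a"}, {"usage": 0.64}, 1571858400000000000),
    to_line_protocol("mem", {"host": "b"}, {"free": 2048}, 1571858400000000000),
])
```

Sending the two points together means one HTTP request and, on the server side, one WAL append and one `fsync()` instead of two.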
-## Durability: Write Ahead Log (WAL) - - - - - +## Write Ahead Log (WAL) To ensure durability, we use a Write Ahead Log (WAL). + WAL is a data structure and algorithm that is super simple and powerful. It ensures that written data does not disappear when storage engine restarts. -When a client sends a /write request, the following occurs: +When a client sends a write request, the following occurs: 1. Write request is appended to the end of the WAL file. 2. fsync() the data to the file. 3. Update the in-memory database. + 4. Return success to caller. fsync() is a system call, so it has a kernel context switch which costs something. @@ -78,25 +70,16 @@ fsync() is expensive _in terms of time_ but guarantees your data is safe on disk **Important** to batch your points (send in ~2000 points at a time), to fsync() less frequently. -When the storage engine restarts, open WAL file and read it back into the in-memory database. +When the storage engine restarts, + +open WAL file and read it back into the in-memory database. Answer requests to the /read endpoint. - - - - - - - - + - - - - {{% note%}} @@ -109,34 +92,21 @@ Queries to the storage engine will merge data from the Cache with data from the Queries execute on a copy of the data that is made from the cache at query processing time. This way writes that come in while a query is running won’t affect the result. - - - - The Cache is an in-memory copy of all data points current stored in the WAL. The points are organized by the key, which is the measurement, [tag set](/influxdb/v1.7/concepts/glossary/#tag-set), and unique [field](/influxdb/v1.7/concepts/glossary/#field). Each field is kept as its own time-ordered range. The Cache data is not compressed while in memory. -Deletes sent to the Cache will clear out the given key or the specific time range for the given key. +It is queried at runtime and merged with the data stored in TSM files. 
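
The copy-and-merge behavior described above can be sketched as follows. The names and data layout are invented for illustration; this is not InfluxDB's internal structure.

```python
def query(series_key, cache, tsm_files):
    """Merge TSM data with a snapshot of the cache for one series key."""
    # Copy the cached points first, so writes that arrive while the
    # query runs cannot affect this query's result.
    snapshot = list(cache.get(series_key, []))
    merged = []
    for tsm in tsm_files:
        merged.extend(tsm.get(series_key, []))
    merged.extend(snapshot)
    return sorted(merged)  # (timestamp, value) pairs in time order

# Older points live in TSM files; the newest point is still only in the cache.
cache = {"cpu,host=a#usage": [(30, 0.7)]}
tsm_files = [{"cpu,host=a#usage": [(10, 0.5), (20, 0.6)]}]
result = query("cpu,host=a#usage", cache, tsm_files)
```

The query sees one continuous, time-ordered series even though the newest point has not yet been flushed to a TSM file.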
- - - - - - - - - +Deletes sent to the Cache will clear out the given key or the specific time range for the given key. The cache is recreated on restart by re-reading the WAL files on disk back into memory. -## Time-Structured Merge Tree +## Time-Structured Merge Tree (TSM) - Now let's handle more Data! Queries get slower as data grows Service terminates if data size exceeds memory... @@ -162,18 +132,20 @@ There’s a lot of logic and sophistication in the TSM compaction code. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries. -## TSI +## Time Series Index (TSI) + +TSI stores series keys grouped by measure, tag, field. To keep queries fast as we have more data, we use a **Time Series Index**. cardinality = quantity of series keys With high cardinality, we have to search through all series keys. So how to quickly find and match series keys? We use Time Series Index (TSI), which stores series keys grouped by measurement,tag,field -TSI answers question what measurements, tags, fields exist? - +TSI answers two questions well: +1) What measurements, tags, fields exist? +2) Given a measurement, tags, and fields, what series keys exist? - - + @@ -184,10 +156,9 @@ TSI answers question what measurements, tags, fields exist? - - - - + + + @@ -202,3 +173,12 @@ TSI answers question what measurements, tags, fields exist? 
+ + + + + + + + + From 239192638bbc1bd793d422f42f8d7d4cb495a488 Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Thu, 24 Oct 2019 13:20:20 -0700 Subject: [PATCH 05/18] Continued drafting storage engine doc --- content/v2.0/reference/storage-engine.md | 42 +++++++++--------------- 1 file changed, 16 insertions(+), 26 deletions(-) diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/storage-engine.md index 2d03092b8..d4657a5e0 100644 --- a/content/v2.0/reference/storage-engine.md +++ b/content/v2.0/reference/storage-engine.md @@ -79,7 +79,7 @@ Answer requests to the /read endpoint. - + {{% note%}} @@ -93,40 +93,31 @@ Queries execute on a copy of the data that is made from the cache at query proce This way writes that come in while a query is running won’t affect the result. The Cache is an in-memory copy of all data points current stored in the WAL. -The points are organized by the key, which is the measurement, [tag set](/influxdb/v1.7/concepts/glossary/#tag-set), and unique [field](/influxdb/v1.7/concepts/glossary/#field). +The points are organized by the key, which is the measurement, tag set, and unique field. Each field is kept as its own time-ordered range. The Cache data is not compressed while in memory. - -It is queried at runtime and merged with the data stored in TSM files. - +The cache is recreated on restart by re-reading the WAL files on disk back into memory. Deletes sent to the Cache will clear out the given key or the specific time range for the given key. -The cache is recreated on restart by re-reading the WAL files on disk back into memory. + ## Time-Structured Merge Tree (TSM) - - -Now let's handle more Data! -Queries get slower as data grows -Service terminates if data size exceeds memory... 
- -**Time-Structured Merge Tree** (TSM) is our data format -We group field values grouped by series key, then order field values by time - - -series key = measurement, tag key+value, field key -point = series key, field value, timestamp - +In order to efficiently compact and store data, +We group field values grouped by series key, then order field values by time. +**Time-Structured Merge Tree** (TSM) is our data format. +TSM files store compressed series data in a columnar format. Within a series, we store only differences between values, which is more efficient. - Column-Oriented storage means we can read by series key and ignore what it doesn't need. -Compression helps with performance. + +Some terminology: + +- a *series key* is defined by measurement, tag key+value, and field key. +- a *point* is a series key, field value, and timestamp. After fields are stored safely in TSM files, WAL is truncated... -(This stuff is configurable) There’s a lot of logic and sophistication in the TSM compaction code. However, the high-level goal is quite simple: @@ -137,10 +128,9 @@ organize values for a series together into long runs to best optimize compressio TSI stores series keys grouped by measure, tag, field. To keep queries fast as we have more data, we use a **Time Series Index**. -cardinality = quantity of series keys -With high cardinality, we have to search through all series keys. -So how to quickly find and match series keys? -We use Time Series Index (TSI), which stores series keys grouped by measurement,tag,field +In data with high cardinality (a large quantity of series), it becomes slower to search through all series keys. + +We use Time Series Index (TSI), which stores series keys grouped by measurement, tag, and field. TSI answers two questions well: 1) What measurements, tags, fields exist? 2) Given a measurement, tags, and fields, what series keys exist? 
From 413a299bc040a7f715c0a42a61c41cdf0857339c Mon Sep 17 00:00:00 2001
From: pierwill <19642016+pierwill@users.noreply.github.com>
Date: Thu, 24 Oct 2019 15:05:36 -0700
Subject: [PATCH 06/18] Begin addressing PR feedback

---
 content/v2.0/reference/storage-engine.md | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/storage-engine.md
index d4657a5e0..8cfc3d2b1 100644
--- a/content/v2.0/reference/storage-engine.md
+++ b/content/v2.0/reference/storage-engine.md
@@ -24,23 +24,12 @@ Major topics include:
 * [Time-Structured Merge Tree (TSM)](#time-structured-merge-tree-tsm)
 * [Time Series Index (TSI)](#time-series-index-tsi)
 
-{{% note %}}
-##### At a glance: changes to the InfluxDB storage engine from 1.x to 2.0
-- The InfluxDB 2.0 storage engine no longer partitions data into shards by time.
-- **Buckets** replace databases and retention policies.
-- Only TSI is used. There is no more in-memory index.
-- The configuration interface and options have changed. Configuration options are not currently exposed, but will be.
-
-Read about the [v1 storage engine](https://docs.influxdata.com/influxdb/v1.7/concepts/storage_engine).
-{{% /note %}}
-
 ## Writing data: from API to disk
 
 The storage engine handles data from the point an API request is received through writing it to the physical disk.
-Data is written to InfluxDB using [Line protocol](/) sent via HTTP POST request to the `/write` endpoint.
-Batches of points are sent to InfluxDB.
-Those batches are compressed and written to a WAL for immediate durability.
-The points are also written to an in-memory cache and become immediately queryable.
+Data is written to InfluxDB using [line protocol](/v2.0/reference/line-) sent via HTTP POST request to the `/write` endpoint.
+Batches of [points](/v2.0/reference/glossary/#point) are sent to InfluxDB, compressed, and written to a WAL for immediate durability.
+The points are also written to an in-memory cache and become immediately queryable.
The cache is periodically written to disk as TSM files. As TSM files accumulate, they are combined and compacted into higher level TSM files. From 2c16b89008b61062c5390a191231b12c12278047 Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Thu, 24 Oct 2019 15:59:24 -0700 Subject: [PATCH 07/18] Edit storage engine doc Addressing PR feedback --- content/v2.0/reference/storage-engine.md | 27 ++++++++++++------------ 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/storage-engine.md index 8cfc3d2b1..8f3e5ea0f 100644 --- a/content/v2.0/reference/storage-engine.md +++ b/content/v2.0/reference/storage-engine.md @@ -35,16 +35,13 @@ As TSM files accumulate, they are combined and compacted into higher level TSM f Points can be sent individually; however, for efficiency, most applications send points in batches. A typical batch ranges in size from hundreds to thousands of points. -Points in a POST body can be from an arbitrary number of series. +Points in a POST body can be from an arbitrary number of series, measurements, and tag sets. Points in a batch do not have to be from the same measurement or tagset. ## Write Ahead Log (WAL) -To ensure durability, we use a Write Ahead Log (WAL). - - -WAL is a data structure and algorithm that is super simple and powerful. -It ensures that written data does not disappear when storage engine restarts. +The Write Ahead Log (WAL) ensures durability by retaining data when the storage engine restarts. +It ensures that written data does not disappear in an unexpected failure. When a client sends a write request, the following occurs: 1. Write request is appended to the end of the WAL file. @@ -53,16 +50,18 @@ When a client sends a write request, the following occurs: 4. Return success to caller. -fsync() is a system call, so it has a kernel context switch which costs something. 
fsync() takes the file and pushes pending writes all the way through any buffers and caches to disk. -fsync() is expensive _in terms of time_ but guarantees your data is safe on disk. -**Important** to batch your points (send in ~2000 points at a time), to fsync() less frequently. +As a system call, fsync() has a kernel context switch which is expensive _in terms of time_ but guarantees your data is safe on disk. + +{{% note%}} +To fsync() less frequently, batch your points (send in ~2000 points at a time). +{{% /note %}} When the storage engine restarts, open WAL file and read it back into the in-memory database. -Answer requests to the /read endpoint. +InfluxDB then snswer requests to the `/read` endpoint. @@ -82,9 +81,9 @@ Queries execute on a copy of the data that is made from the cache at query proce This way writes that come in while a query is running won’t affect the result. The Cache is an in-memory copy of all data points current stored in the WAL. -The points are organized by the key, which is the measurement, tag set, and unique field. -Each field is kept as its own time-ordered range. -The Cache data is not compressed while in memory. +Points are organized by the key, which is the measurement, tag set, and unique field. +Each field is stored in its own time-ordered range. +Data is not compressed in the cache. The cache is recreated on restart by re-reading the WAL files on disk back into memory. Deletes sent to the Cache will clear out the given key or the specific time range for the given key. @@ -114,7 +113,7 @@ organize values for a series together into long runs to best optimize compressio ## Time Series Index (TSI) -TSI stores series keys grouped by measure, tag, field. +TSI stores series keys grouped by measurement, tag, and field. To keep queries fast as we have more data, we use a **Time Series Index**. In data with high cardinality (a large quantity of series), it becomes slower to search through all series keys. 
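
The series-key lookup that TSI enables can be illustrated with a toy inverted index in Python. This is a sketch of the lookup idea only — the grouping of series keys by measurement and tag — not TSI's actual on-disk structure.

```python
from collections import defaultdict

class ToySeriesIndex:
    """Invented structure: maps measurements and tag pairs to series keys
    so matching series are found without scanning every key."""

    def __init__(self):
        self.by_measurement = defaultdict(set)
        self.by_tag = defaultdict(set)

    def add(self, measurement, tags, field):
        key = (measurement, tuple(sorted(tags.items())), field)
        self.by_measurement[measurement].add(key)
        for pair in tags.items():
            self.by_tag[pair].add(key)
        return key

    def series(self, measurement, tag=None):
        """Answer: given a measurement (and optionally a tag), what series keys exist?"""
        keys = self.by_measurement[measurement]
        if tag is not None:
            keys = keys & self.by_tag[tag]
        return keys

idx = ToySeriesIndex()
idx.add("cpu", {"host": "a"}, "usage")
idx.add("cpu", {"host": "b"}, "usage")
matches = idx.series("cpu", tag=("host", "a"))  # set intersection, not a scan
```

With high cardinality, the cost of a lookup in such an index depends on the size of the matching groups rather than on the total number of series keys.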
From cf1d2906fdfe3052602785b8a0a034484040c82e Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Fri, 25 Oct 2019 08:53:43 -0700 Subject: [PATCH 08/18] Continue editing storage enginge doc --- content/v2.0/reference/storage-engine.md | 41 +++++++++++------------- 1 file changed, 18 insertions(+), 23 deletions(-) diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/storage-engine.md index 8f3e5ea0f..f1d215a69 100644 --- a/content/v2.0/reference/storage-engine.md +++ b/content/v2.0/reference/storage-engine.md @@ -27,10 +27,10 @@ Major topics include: ## Writing data: from API to disk The storage engine handles data from the point an API request is received through writing it to the physical disk. -Data is written to InfluxDB using [Line protocol](/) sent via HTTP POST request to the `/write` endpoint. +Data is written to InfluxDB using [line protocol](/v2.0/reference/line-) sent via HTTP POST request to the `/write` endpoint. Batches of [points](/v2.0/reference/glossary/#point) are sent to InfluxDB, compressed, and written to a WAL for immediate durability. The points are also written to an in-memory cache and become immediately queryable. -The cache is periodically written to disk as TSM files. +The cache is periodically written to disk in the form of [TSM](#time-structured-merge-tree-tsm) files. As TSM files accumulate, they are combined and compacted into higher level TSM files. Points can be sent individually; however, for efficiency, most applications send points in batches. @@ -45,22 +45,19 @@ It ensures that written data does not disappear in an unexpected failure. When a client sends a write request, the following occurs: 1. Write request is appended to the end of the WAL file. -2. fsync() the data to the file. +2. `fsync()` the data to the file. 3. Update the in-memory database. 4. Return success to caller. 
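
The four-step sequence above can be sketched in Python. This is a toy illustration of the append-then-fsync idea — `SimpleWAL` and its file layout are invented for this example, not InfluxDB's actual Go implementation.

```python
import os
import tempfile

class SimpleWAL:
    """Toy write-ahead log showing the append-then-fsync sequence."""

    def __init__(self, path):
        self.path = path
        self.cache = []  # stands in for the in-memory database

    def write(self, batch):
        # 1. Append the write request to the end of the WAL file.
        with open(self.path, "ab") as f:
            f.write(batch + b"\n")
            f.flush()
            # 2. fsync() pushes the data through OS buffers down to disk.
            os.fsync(f.fileno())
        # 3. Update the in-memory database.
        self.cache.append(batch)
        # 4. Return success to the caller.
        return True

wal = SimpleWAL(os.path.join(tempfile.mkdtemp(), "segment.wal"))
ok = wal.write(b"cpu,host=a usage=0.64 1571858400000000000")
```

Because `fsync()` runs once per write call, sending many points per call amortizes its cost — the same reason the docs recommend batching points.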
-fsync() takes the file and pushes pending writes all the way through any buffers and caches to disk. -As a system call, fsync() has a kernel context switch which is expensive _in terms of time_ but guarantees your data is safe on disk. +`fsync()` takes the file and pushes pending writes all the way through any buffers and caches to disk. +As a system call, `fsync()` has a kernel context switch which is expensive _in terms of time_ but guarantees your data is safe on disk. {{% note%}} -To fsync() less frequently, batch your points (send in ~2000 points at a time). +To `fsync()` less frequently, batch your points (send in ~2000 points at a time). {{% /note %}} - -When the storage engine restarts, - -open WAL file and read it back into the in-memory database. +When the storage engine restarts, the WAL file is read back into the in-memory database. InfluxDB then snswer requests to the `/read` endpoint. @@ -80,23 +77,26 @@ Queries to the storage engine will merge data from the Cache with data from the Queries execute on a copy of the data that is made from the cache at query processing time. This way writes that come in while a query is running won’t affect the result. -The Cache is an in-memory copy of all data points current stored in the WAL. +The cache is an in-memory copy of data points current stored in the WAL. Points are organized by the key, which is the measurement, tag set, and unique field. Each field is stored in its own time-ordered range. Data is not compressed in the cache. The cache is recreated on restart by re-reading the WAL files on disk back into memory. +The cache is queried at runtime and merged with the data stored in TSM files. Deletes sent to the Cache will clear out the given key or the specific time range for the given key. - - ## Time-Structured Merge Tree (TSM) -In order to efficiently compact and store data, -We group field values grouped by series key, then order field values by time. 
-**Time-Structured Merge Tree** (TSM) is our data format. +To efficiently compact and store data, +the storage engine groups field values by [series](/v2.0/reference/key-concepts/data-elements/#series) key, +and then orders those field values by time. + +The storage engine uses a **Time-Structured Merge Tree** (TSM) data format. TSM files store compressed series data in a columnar format. -Within a series, we store only differences between values, which is more efficient. +Within a series, we store only , which is more efficient. +To improve efficiency, the storage engine only stores differences between values in a series. Column-Oriented storage means we can read by series key and ignore what it doesn't need. +Storing data in columns lets the storage engine read by series key. Some terminology: @@ -113,25 +113,21 @@ organize values for a series together into long runs to best optimize compressio ## Time Series Index (TSI) -TSI stores series keys grouped by measurement, tag, and field. - To keep queries fast as we have more data, we use a **Time Series Index**. +TSI stores series keys grouped by measurement, tag, and field. In data with high cardinality (a large quantity of series), it becomes slower to search through all series keys. - We use Time Series Index (TSI), which stores series keys grouped by measurement, tag, and field. TSI answers two questions well: 1) What measurements, tags, fields exist? 2) Given a measurement, tags, and fields, what series keys exist? 
- - @@ -139,7 +135,6 @@ TSI answers two questions well: - From 024eb95096b287f1a500b24fb2dfe700984fba51 Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Mon, 28 Oct 2019 14:27:35 -0700 Subject: [PATCH 09/18] Work on storage engine docs --- content/v2.0/reference/storage-engine.md | 31 +++++++++++++++--------- 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/storage-engine.md index f1d215a69..ef259cebf 100644 --- a/content/v2.0/reference/storage-engine.md +++ b/content/v2.0/reference/storage-engine.md @@ -6,6 +6,7 @@ weight: 7 menu: v2_0_ref: name: Storage engine + parent: Internals v2.0/tags: [storage engine, internals, platform] --- @@ -34,13 +35,13 @@ The cache is periodically written to disk in the form of [TSM](#time-structured- As TSM files accumulate, they are combined and compacted into higher level TSM files. Points can be sent individually; however, for efficiency, most applications send points in batches. -A typical batch ranges in size from hundreds to thousands of points. + Points in a POST body can be from an arbitrary number of series, measurements, and tag sets. Points in a batch do not have to be from the same measurement or tagset. ## Write Ahead Log (WAL) -The Write Ahead Log (WAL) ensures durability by retaining data when the storage engine restarts. +The **Write Ahead Log** (WAL) ensures durability by retaining data when the storage engine restarts. It ensures that written data does not disappear in an unexpected failure. When a client sends a write request, the following occurs: @@ -54,7 +55,7 @@ When a client sends a write request, the following occurs: As a system call, `fsync()` has a kernel context switch which is expensive _in terms of time_ but guarantees your data is safe on disk. {{% note%}} -To `fsync()` less frequently, batch your points (send in ~2000 points at a time). 
+To `fsync()` less frequently, batch your points. {{% /note %}} When the storage engine restarts, the WAL file is read back into the in-memory database. @@ -73,16 +74,19 @@ Once you receive a response to a write request, your data is on disk! ## Cache -Queries to the storage engine will merge data from the Cache with data from the TSM files. -Queries execute on a copy of the data that is made from the cache at query processing time. -This way writes that come in while a query is running won’t affect the result. - -The cache is an in-memory copy of data points current stored in the WAL. +The **cache** is an in-memory copy of data points current stored in the WAL. Points are organized by the key, which is the measurement, tag set, and unique field. Each field is stored in its own time-ordered range. Data is not compressed in the cache. The cache is recreated on restart by re-reading the WAL files on disk back into memory. The cache is queried at runtime and merged with the data stored in TSM files. + + + +Queries to the storage engine will merge data from the cache with data from the TSM files. +Queries execute on a copy of the data that is made from the cache at query processing time. +This way writes that come in while a query is running won’t affect the result. + Deletes sent to the Cache will clear out the given key or the specific time range for the given key. ## Time-Structured Merge Tree (TSM) @@ -93,15 +97,14 @@ and then orders those field values by time. The storage engine uses a **Time-Structured Merge Tree** (TSM) data format. TSM files store compressed series data in a columnar format. -Within a series, we store only , which is more efficient. To improve efficiency, the storage engine only stores differences between values in a series. -Column-Oriented storage means we can read by series key and ignore what it doesn't need. +Column-oriented storage means we can read by series key and ignore what it doesn't need. 
Storing data in columns lets the storage engine read by series key. Some terminology: -- a *series key* is defined by measurement, tag key+value, and field key. +- a *series key* is defined by measurement, tag key and value, and field key. - a *point* is a series key, field value, and timestamp. After fields are stored safely in TSM files, WAL is truncated... @@ -113,10 +116,14 @@ organize values for a series together into long runs to best optimize compressio ## Time Series Index (TSI) +As data cardinality (number of series) grows, queries read more series keys and become slower. + +The **Time Series Index** ensures queries fast as data cardinality of data grows... To keep queries fast as we have more data, we use a **Time Series Index**. + TSI stores series keys grouped by measurement, tag, and field. In data with high cardinality (a large quantity of series), it becomes slower to search through all series keys. -We use Time Series Index (TSI), which stores series keys grouped by measurement, tag, and field. +The TSI stores series keys grouped by measurement, tag, and field. TSI answers two questions well: 1) What measurements, tags, fields exist? 2) Given a measurement, tags, and fields, what series keys exist? 
From 7e22fa8ac5de199e29949650671c928854c8e0aa Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Mon, 28 Oct 2019 14:28:13 -0700 Subject: [PATCH 10/18] Create "Internals" reference section Move storage engine docs Edit storage engine page tags --- content/v2.0/reference/{ => internals}/storage-engine.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename content/v2.0/reference/{ => internals}/storage-engine.md (99%) diff --git a/content/v2.0/reference/storage-engine.md b/content/v2.0/reference/internals/storage-engine.md similarity index 99% rename from content/v2.0/reference/storage-engine.md rename to content/v2.0/reference/internals/storage-engine.md index ef259cebf..8bf604cff 100644 --- a/content/v2.0/reference/storage-engine.md +++ b/content/v2.0/reference/internals/storage-engine.md @@ -7,7 +7,7 @@ menu: v2_0_ref: name: Storage engine parent: Internals -v2.0/tags: [storage engine, internals, platform] +v2.0/tags: [storage, internals] --- The InfluxDB storage engine ensures the following three things: From 761220c90a22bb3168771a1f44e26e6b476f092e Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Wed, 30 Oct 2019 14:05:00 -0700 Subject: [PATCH 11/18] Add internals landing page --- content/v2.0/reference/internals/_index.md | 9 ++++++ .../reference/internals/storage-engine.md | 32 +++++++------------ 2 files changed, 20 insertions(+), 21 deletions(-) create mode 100644 content/v2.0/reference/internals/_index.md diff --git a/content/v2.0/reference/internals/_index.md b/content/v2.0/reference/internals/_index.md new file mode 100644 index 000000000..1f063cdf8 --- /dev/null +++ b/content/v2.0/reference/internals/_index.md @@ -0,0 +1,9 @@ +--- +title: InfluxDB Internals +menu: + v2_0_ref: + name: InfluxDB Internals + weight: 8 +--- + +{{< children >}} diff --git a/content/v2.0/reference/internals/storage-engine.md b/content/v2.0/reference/internals/storage-engine.md index 
8bf604cff..744bda9d1 100644 --- a/content/v2.0/reference/internals/storage-engine.md +++ b/content/v2.0/reference/internals/storage-engine.md @@ -6,7 +6,7 @@ weight: 7 menu: v2_0_ref: name: Storage engine - parent: Internals + parent: InfluxDB Internals v2.0/tags: [storage, internals] --- @@ -35,7 +35,6 @@ The cache is periodically written to disk in the form of [TSM](#time-structured- As TSM files accumulate, they are combined and compacted into higher level TSM files. Points can be sent individually; however, for efficiency, most applications send points in batches. - Points in a POST body can be from an arbitrary number of series, measurements, and tag sets. Points in a batch do not have to be from the same measurement or tagset. @@ -47,8 +46,7 @@ When a client sends a write request, the following occurs: 1. Write request is appended to the end of the WAL file. 2. `fsync()` the data to the file. -3. Update the in-memory database. - +3. Update the in-memory cache. 4. Return success to caller. `fsync()` takes the file and pushes pending writes all the way through any buffers and caches to disk. @@ -61,7 +59,6 @@ To `fsync()` less frequently, batch your points. When the storage engine restarts, the WAL file is read back into the in-memory database. InfluxDB then snswer requests to the `/read` endpoint. - @@ -81,11 +78,14 @@ Data is not compressed in the cache. The cache is recreated on restart by re-reading the WAL files on disk back into memory. The cache is queried at runtime and merged with the data stored in TSM files. + + + Queries to the storage engine will merge data from the cache with data from the TSM files. Queries execute on a copy of the data that is made from the cache at query processing time. -This way writes that come in while a query is running won’t affect the result. +This way writes that come in while a query is running do not affect the result. Deletes sent to the Cache will clear out the given key or the specific time range for the given key. 
@@ -97,15 +97,15 @@ and then orders those field values by time. The storage engine uses a **Time-Structured Merge Tree** (TSM) data format. TSM files store compressed series data in a columnar format. -To improve efficiency, the storage engine only stores differences between values in a series. +To improve efficiency, the storage engine only stores differences (or *deltas*) between values in a series. Column-oriented storage means we can read by series key and ignore what it doesn't need. Storing data in columns lets the storage engine read by series key. -Some terminology: + -- a *series key* is defined by measurement, tag key and value, and field key. -- a *point* is a series key, field value, and timestamp. + + After fields are stored safely in TSM files, WAL is truncated... @@ -118,7 +118,7 @@ organize values for a series together into long runs to best optimize compressio As data cardinality (number of series) grows, queries read more series keys and become slower. -The **Time Series Index** ensures queries fast as data cardinality of data grows... +The **Time Series Index** ensures queries remain fast as data cardinality of data grows... To keep queries fast as we have more data, we use a **Time Series Index**. TSI stores series keys grouped by measurement, tag, and field. 
@@ -152,13 +152,3 @@ TSI answers two questions well: - - - - - - - - - - From fe88a9d6e424e57fca836060782db365e4d0e437 Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Tue, 12 Nov 2019 09:40:45 -0800 Subject: [PATCH 12/18] Continue work on storage engine doc --- .../reference/internals/storage-engine.md | 58 +++---------------- 1 file changed, 8 insertions(+), 50 deletions(-) diff --git a/content/v2.0/reference/internals/storage-engine.md b/content/v2.0/reference/internals/storage-engine.md index 744bda9d1..25f188d90 100644 --- a/content/v2.0/reference/internals/storage-engine.md +++ b/content/v2.0/reference/internals/storage-engine.md @@ -30,6 +30,7 @@ Major topics include: The storage engine handles data from the point an API request is received through writing it to the physical disk. Data is written to InfluxDB using [line protocol](/v2.0/reference/line-) sent via HTTP POST request to the `/write` endpoint. Batches of [points](/v2.0/reference/glossary/#point) are sent to InfluxDB, compressed, and written to a WAL for immediate durability. +(A *point* is a series key, field value, and timestamp.) The points are also written to an in-memory cache and become immediately queryable. The cache is periodically written to disk in the form of [TSM](#time-structured-merge-tree-tsm) files. As TSM files accumulate, they are combined and compacted into higher level TSM files. @@ -50,21 +51,11 @@ When a client sends a write request, the following occurs: 4. Return success to caller. `fsync()` takes the file and pushes pending writes all the way through any buffers and caches to disk. -As a system call, `fsync()` has a kernel context switch which is expensive _in terms of time_ but guarantees your data is safe on disk. - -{{% note%}} -To `fsync()` less frequently, batch your points. -{{% /note %}} +As a system call, `fsync()` has a kernel context switch which is computationally expensive, but guarantees your data is safe on disk. 
When the storage engine restarts, the WAL file is read back into the in-memory database. InfluxDB then snswer requests to the `/read` endpoint. - - - - - - {{% note%}} Once you receive a response to a write request, your data is on disk! {{% /note %}} @@ -78,22 +69,20 @@ Data is not compressed in the cache. The cache is recreated on restart by re-reading the WAL files on disk back into memory. The cache is queried at runtime and merged with the data stored in TSM files. - - - - +When the storage engine restarts, WAL files are re-read into the in-memory cache. Queries to the storage engine will merge data from the cache with data from the TSM files. Queries execute on a copy of the data that is made from the cache at query processing time. This way writes that come in while a query is running do not affect the result. -Deletes sent to the Cache will clear out the given key or the specific time range for the given key. +Deletes sent to the cache will clear out the given key or the specific time range for the given key. ## Time-Structured Merge Tree (TSM) To efficiently compact and store data, -the storage engine groups field values by [series](/v2.0/reference/key-concepts/data-elements/#series) key, +the storage engine groups field values by series key, and then orders those field values by time. +(A *series key* is defined by measurement, tag key and value, and field key.) The storage engine uses a **Time-Structured Merge Tree** (TSM) data format. TSM files store compressed series data in a columnar format. @@ -101,13 +90,7 @@ To improve efficiency, the storage engine only stores differences (or *deltas*) Column-oriented storage means we can read by series key and ignore what it doesn't need. Storing data in columns lets the storage engine read by series key. - - - - - - -After fields are stored safely in TSM files, WAL is truncated... +After fields are stored safely in TSM files, the WAL is truncated and the cache is cleared. 
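The flush described above — fields are stored safely in TSM files before the WAL is truncated and the cache is cleared — can be sketched as follows. This is a toy model: real TSM files use a compressed, columnar binary format rather than JSON, and the `snapshot` name is invented for the example:

```python
import json
import os

def snapshot(cache, wal_path, tsm_path):
    """Sketch: persist the in-memory cache as a (toy) TSM file, then truncate the WAL."""
    # TSM data is time-ordered, so sort each series' (timestamp, value) pairs.
    ordered = {key: sorted(values) for key, values in cache.items()}
    with open(tsm_path, "w") as tsm:
        json.dump(ordered, tsm)
        tsm.flush()
        os.fsync(tsm.fileno())  # fields are now safely on disk
    # Only after the TSM file is durable: truncate the WAL and clear the cache.
    open(wal_path, "w").close()
    cache.clear()
```

Because the WAL is truncated only after the TSM write is durable, a crash at any point leaves the data recoverable from either the WAL or the new TSM file.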
There’s a lot of logic and sophistication in the TSM compaction code. @@ -116,7 +99,7 @@ organize values for a series together into long runs to best optimize compressio ## Time Series Index (TSI) -As data cardinality (number of series) grows, queries read more series keys and become slower. +As data cardinality (the number of series) grows, queries read more series keys and become slower. The **Time Series Index** ensures queries remain fast as data cardinality of data grows... To keep queries fast as we have more data, we use a **Time Series Index**. @@ -127,28 +110,3 @@ The TSI stores series keys grouped by measurement, tag, and field. TSI answers two questions well: 1) What measurements, tags, fields exist? 2) Given a measurement, tags, and fields, what series keys exist? - - - - - - - - - - - - - - - - - - - - - - - - - From be5238b5ce99dc8dab9abcd39127ee5035bd1d1d Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Mon, 9 Dec 2019 10:56:36 -0800 Subject: [PATCH 13/18] Work on storage engine doc --- .../reference/internals/storage-engine.md | 36 +++++++++---------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/content/v2.0/reference/internals/storage-engine.md b/content/v2.0/reference/internals/storage-engine.md index 25f188d90..71a062ed1 100644 --- a/content/v2.0/reference/internals/storage-engine.md +++ b/content/v2.0/reference/internals/storage-engine.md @@ -10,28 +10,28 @@ menu: v2.0/tags: [storage, internals] --- -The InfluxDB storage engine ensures the following three things: +The InfluxDB storage engine ensures that - Data is safely written to disk - Queried data is returned complete and correct - Data is accurate (first) and performant (second) -This document details the internal workings of the storage engine. +This document outlines the internal workings of the storage engine. This information is presented both as a reference and to aid those looking to maximize performance. 
Major topics include: * [Write Ahead Log (WAL)](#write-ahead-log-wal) +* [Cache](#cache) * [Time-Structured Merge Tree (TSM)](#time-structured-merge-tree-tsm) * [Time Series Index (TSI)](#time-series-index-tsi) ## Writing data: from API to disk -The storage engine handles data from the point an API request is received through writing it to the physical disk. -Data is written to InfluxDB using [line protocol](/v2.0/reference/line-) sent via HTTP POST request to the `/write` endpoint. +The storage engine handles data from the point an API write request is received through writing data to the physical disk. +Data is written to InfluxDB using [line protocol](/v2.0/reference/line-protocol/) sent via HTTP POST request to the `/write` endpoint. Batches of [points](/v2.0/reference/glossary/#point) are sent to InfluxDB, compressed, and written to a WAL for immediate durability. -(A *point* is a series key, field value, and timestamp.) -The points are also written to an in-memory cache and become immediately queryable. +Points are also written to an in-memory cache and become immediately queryable. The cache is periodically written to disk in the form of [TSM](#time-structured-merge-tree-tsm) files. As TSM files accumulate, they are combined and compacted into higher level TSM files. @@ -43,27 +43,27 @@ Points in a batch do not have to be from the same measurement or tagset. The **Write Ahead Log** (WAL) ensures durability by retaining data when the storage engine restarts. It ensures that written data does not disappear in an unexpected failure. -When a client sends a write request, the following occurs: +When a client sends a write request, the following steps occur: -1. Write request is appended to the end of the WAL file. -2. `fsync()` the data to the file. +1. Append write request to the end of the WAL file. +2. Write data to disk using `fsync()`. 3. Update the in-memory cache. 4. Return success to caller.
-`fsync()` takes the file and pushes pending writes all the way through any buffers and caches to disk. -As a system call, `fsync()` has a kernel context switch which is computationally expensive, but guarantees your data is safe on disk. - -When the storage engine restarts, the WAL file is read back into the in-memory database. -InfluxDB then snswer requests to the `/read` endpoint. +`fsync()` takes the file and pushes pending writes all the way to the disk. +As a system call, `fsync()` has a kernel context switch which is computationally expensive, but guarantees that data is safe on disk. {{% note%}} Once you receive a response to a write request, your data is on disk! {{% /note %}} +When the storage engine restarts, the WAL file is read back into the in-memory database. +InfluxDB then answers requests to the `/read` endpoint. + ## Cache The **cache** is an in-memory copy of data points current stored in the WAL. -Points are organized by the key, which is the measurement, tag set, and unique field. +Points are organized by key, which is the measurement, tag set, and unique field. Each field is stored in its own time-ordered range. Data is not compressed in the cache. The cache is recreated on restart by re-reading the WAL files on disk back into memory. @@ -82,7 +82,7 @@ Deletes sent to the cache will clear out the given key or the specific time rang To efficiently compact and store data, the storage engine groups field values by series key, and then orders those field values by time. -(A *series key* is defined by measurement, tag key and value, and field key.) +(A [series key](/v2/) is defined by measurement, tag key and value, and field key.) The storage engine uses a **Time-Structured Merge Tree** (TSM) data format. TSM files store compressed series data in a columnar format. @@ -93,7 +93,7 @@ Storing data in columns lets the storage engine read by series key. After fields are stored safely in TSM files, the WAL is truncated and the cache is cleared. 
-There’s a lot of logic and sophistication in the TSM compaction code. +The TSM compaction code is quite complex. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries. @@ -101,7 +101,7 @@ organize values for a series together into long runs to best optimize compressio As data cardinality (the number of series) grows, queries read more series keys and become slower. -The **Time Series Index** ensures queries remain fast as data cardinality of data grows... +The **Time Series Index** ensures queries remain fast as data cardinality grows. To keep queries fast as we have more data, we use a **Time Series Index**. TSI stores series keys grouped by measurement, tag, and field. From 60f1bc7d9300d6f90f39e7f06d8f9cecf914063a Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Mon, 6 Jan 2020 10:08:29 -0800 Subject: [PATCH 14/18] Edit paragraphing in storage engine doc --- content/v2.0/reference/internals/storage-engine.md | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/content/v2.0/reference/internals/storage-engine.md b/content/v2.0/reference/internals/storage-engine.md index 71a062ed1..bca7a8a2b 100644 --- a/content/v2.0/reference/internals/storage-engine.md +++ b/content/v2.0/reference/internals/storage-engine.md @@ -62,26 +62,23 @@ InfluxDB then answers requests to the `/read` endpoint. ## Cache -The **cache** is an in-memory copy of data points current stored in the WAL. +The **cache** is an in-memory copy of data points currently stored in the WAL. Points are organized by key, which is the measurement, tag set, and unique field. Each field is stored in its own time-ordered range. Data is not compressed in the cache. The cache is recreated on restart by re-reading the WAL files on disk back into memory. The cache is queried at runtime and merged with the data stored in TSM files. 
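The runtime behavior described above — cache data merged with TSM data, with the query working on a copy so concurrent writes do not change results — might look like this sketch, where `tsm_files` is a hypothetical list of per-file mappings from series key to (timestamp, value) pairs:

```python
def query(series_key, cache, tsm_files):
    """Sketch: answer a query by merging the cache with persisted TSM data."""
    # Copy the cached values at query-processing time, so writes that arrive
    # while the query is running do not affect this result set.
    results = list(cache.get(series_key, []))
    # Merge in the values already persisted to TSM files.
    for tsm in tsm_files:
        results.extend(tsm.get(series_key, []))
    # Return the series time-ordered.
    return sorted(results)
```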
- When the storage engine restarts, WAL files are re-read into the in-memory cache. Queries to the storage engine will merge data from the cache with data from the TSM files. Queries execute on a copy of the data that is made from the cache at query processing time. This way writes that come in while a query is running do not affect the result. - Deletes sent to the cache will clear out the given key or the specific time range for the given key. ## Time-Structured Merge Tree (TSM) To efficiently compact and store data, -the storage engine groups field values by series key, -and then orders those field values by time. +the storage engine groups field values by series key, and then orders those field values by time. (A [series key](/v2/) is defined by measurement, tag key and value, and field key.) The storage engine uses a **Time-Structured Merge Tree** (TSM) data format. @@ -91,8 +88,6 @@ Column-oriented storage means we can read by series key and ignore what it doesn Storing data in columns lets the storage engine read by series key. After fields are stored safely in TSM files, the WAL is truncated and the cache is cleared. - - The TSM compaction code is quite complex. However, the high-level goal is quite simple: organize values for a series together into long runs to best optimize compression and scanning queries. @@ -100,7 +95,6 @@ organize values for a series together into long runs to best optimize compressio ## Time Series Index (TSI) As data cardinality (the number of series) grows, queries read more series keys and become slower. - The **Time Series Index** ensures queries remain fast as data cardinality grows. To keep queries fast as we have more data, we use a **Time Series Index**. 
From f19ecd70dca4755b4fddac2ad5433e49643b0b1f Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Mon, 6 Jan 2020 14:04:20 -0800 Subject: [PATCH 15/18] Storage engine doc: begin addressing PR feedback --- .../reference/internals/storage-engine.md | 36 +++++++++---------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/content/v2.0/reference/internals/storage-engine.md b/content/v2.0/reference/internals/storage-engine.md index bca7a8a2b..e495c6d15 100644 --- a/content/v2.0/reference/internals/storage-engine.md +++ b/content/v2.0/reference/internals/storage-engine.md @@ -10,7 +10,7 @@ menu: v2.0/tags: [storage, internals] --- -The InfluxDB storage engine ensures that +The InfluxDB storage engine ensures that: - Data is safely written to disk - Queried data is returned complete and correct @@ -19,43 +19,42 @@ The InfluxDB storage engine ensures that This document outlines the internal workings of the storage engine. This information is presented both as a reference and to aid those looking to maximize performance. -Major topics include: +The storage engine includes the following components: * [Write Ahead Log (WAL)](#write-ahead-log-wal) * [Cache](#cache) * [Time-Structured Merge Tree (TSM)](#time-structured-merge-tree-tsm) * [Time Series Index (TSI)](#time-series-index-tsi) -## Writing data: from API to disk +## Writing data from API to disk The storage engine handles data from the point an API write request is received through writing data to the physical disk. Data is written to InfluxDB using [line protocol](/v2.0/reference/line-protocol/) sent via HTTP POST request to the `/write` endpoint. Batches of [points](/v2.0/reference/glossary/#point) are sent to InfluxDB, compressed, and written to a WAL for immediate durability. Points are also written to an in-memory cache and become immediately queryable. -The cache is periodically written to disk in the form of [TSM](#time-structured-merge-tree-tsm) files.
+The in-memory cache is periodically written to disk in the form of [TSM](#time-structured-merge-tree-tsm) files. As TSM files accumulate, they are combined and compacted into higher level TSM files. +{{% note %}} Points can be sent individually; however, for efficiency, most applications send points in batches. Points in a POST body can be from an arbitrary number of series, measurements, and tag sets. Points in a batch do not have to be from the same measurement or tagset. +{{% /note %}} ## Write Ahead Log (WAL) -The **Write Ahead Log** (WAL) ensures durability by retaining data when the storage engine restarts. -It ensures that written data does not disappear in an unexpected failure. -When a client sends a write request, the following steps occur: +The **Write Ahead Log** (WAL) retains InfluxDB data when the storage engine restarts. +The WAL ensures data is durable in case of an unexpected failure. -1. Append write request to the end of the WAL file. -2. Write data to disk using `fsync()`. +When the storage engine receives a write request, the following steps occur: + +1. The write request is appended to the end of the WAL file. +2. Data is written to disk using `fsync()`. +3. The in-memory cache is updated. +4. When data is successfully written to disk, a response confirms the write request was successful. `fsync()` takes the file and pushes pending writes all the way to the disk. -As a system call, `fsync()` has a kernel context switch which is computationally expensive, but guarantees that data is safe on disk. +As a system call, `fsync()` has a kernel context switch that's computationally expensive, but guarantees that data is safe on disk. -{{% note%}} -Once you receive a response to a write request, your data is on disk! -{{% /note %}} When the storage engine restarts, the WAL file is read back into the in-memory database. InfluxDB then answers requests to the `/read` endpoint.
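A batched write to the `/write` endpoint can be sketched by building a line protocol body, one point per line; points from different measurements and tag sets share a single POST body. The helper name `to_line_protocol` is invented for the example, and a real client would also compress the body before sending it over HTTP:

```python
def to_line_protocol(measurement, tags, fields, timestamp):
    """Render one point as line protocol: measurement,tags fields timestamp."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp}"

# One batch may mix measurements and tag sets; each point is one line.
batch = "\n".join([
    to_line_protocol("cpu", {"host": "server01"}, {"usage": 12.5}, 1577836800000000000),
    to_line_protocol("mem", {"host": "server01"}, {"free": 4096}, 1577836800000000001),
])
```

Sending the whole batch as one request means one append and one `fsync()` cover many points, which is why batching is more efficient than writing points individually.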
@@ -102,5 +101,6 @@ TSI stores series keys grouped by measurement, tag, and field. In data with high cardinality (a large quantity of series), it becomes slower to search through all series keys. The TSI stores series keys grouped by measurement, tag, and field. TSI answers two questions well: -1) What measurements, tags, fields exist? -2) Given a measurement, tags, and fields, what series keys exist? + +- What measurements, tags, fields exist? +- Given a measurement, tags, and fields, what series keys exist? From 46803fc3ee50ccf16b20d81bb247b2e1cbf8efd8 Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Tue, 7 Jan 2020 09:25:24 -0800 Subject: [PATCH 16/18] Storage engine doc: more review work --- content/v2.0/reference/internals/storage-engine.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/v2.0/reference/internals/storage-engine.md b/content/v2.0/reference/internals/storage-engine.md index e495c6d15..6e26fa11a 100644 --- a/content/v2.0/reference/internals/storage-engine.md +++ b/content/v2.0/reference/internals/storage-engine.md @@ -33,7 +33,7 @@ Data is written to InfluxDB using [line protocol](/v2.0/reference/line-protocol/ Batches of [points](/v2.0/reference/glossary/#point) are sent to InfluxDB, compressed, and written to a WAL for immediate durability. Points are also written to an in-memory cache and become immediately queryable. The in-memory cache is periodically written to disk in the form of [TSM](#time-structured-merge-tree-tsm) files. -As TSM files accumulate, they are combined and compacted into higher level TSM files. +As TSM files accumulate, the storage engine combines and compacts them into higher level TSM files.
@@ -69,7 +69,7 @@ The cache is recreated on restart by re-reading the WAL files on disk back into The cache is queried at runtime and merged with the data stored in TSM files. When the storage engine restarts, WAL files are re-read into the in-memory cache. -Queries to the storage engine will merge data from the cache with data from the TSM files. +Queries to the storage engine merge data from the cache with data from the TSM files. Queries execute on a copy of the data that is made from the cache at query processing time. This way writes that come in while a query is running do not affect the result. Deletes sent to the cache will clear out the given key or the specific time range for the given key. From 09e6cef6d3bd9532fa70a9baf48397d0dd0d2533 Mon Sep 17 00:00:00 2001 From: pierwill <19642016+pierwill@users.noreply.github.com> Date: Wed, 8 Jan 2020 10:41:29 -0800 Subject: [PATCH 17/18] Storage engine doc: continue PR review work --- .../reference/internals/storage-engine.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/content/v2.0/reference/internals/storage-engine.md b/content/v2.0/reference/internals/storage-engine.md index 6e26fa11a..047c7b1d2 100644 --- a/content/v2.0/reference/internals/storage-engine.md +++ b/content/v2.0/reference/internals/storage-engine.md @@ -36,7 +36,7 @@ The in-memory cache is periodically written to disk in the form of [TSM](#time-s As TSM files accumulate, the storage engine combines and compacts accumulated them into higher level TSM files. {{% note %}} -Points can be sent individually; however, for efficiency, most applications send points in batches. +While points can be sent individually, for efficiency, most applications send points in batches. Points in a POST body can be from an arbitrary number of series, measurements, and tag sets. Points in a batch do not have to be from the same measurement or tagset. 
{{% /note %}}
@@ -62,29 +62,29 @@ InfluxDB then answers requests to the `/read` endpoint.
## Cache

The **cache** is an in-memory copy of data points currently stored in the WAL.
-Points are organized by key, which is the measurement, tag set, and unique field.
-Each field is stored in its own time-ordered range.
-Data is not compressed in the cache.
-The cache is recreated on restart by re-reading the WAL files on disk back into memory.
-The cache is queried at runtime and merged with the data stored in TSM files.
-When the storage engine restarts, WAL files are re-read into the in-memory cache.
+The cache:
+
+- Organizes points by key (measurement, tag set, and unique field).
+  Each field is stored in its own time-ordered range.
+- Stores uncompressed data.
+- Is recreated from the WAL each time the storage engine restarts.
+  The cache is queried at runtime and merged with the data stored in TSM files.

Queries to the storage engine merge data from the cache with data from the TSM files.
Queries execute on a copy of the data that is made from the cache at query processing time.
This way writes that come in while a query is running do not affect the result.
-Deletes sent to the cache will clear out the given key or the specific time range for the given key.
+Deletes sent to the cache clear the specified key or the specified time range within that key.

## Time-Structured Merge Tree (TSM)

To efficiently compact and store data, the storage engine groups field values by series key, and then orders those field values by time.
-(A [series key](/v2/) is defined by measurement, tag key and value, and field key.)
+(A [series key](/v2.0/reference/glossary/#series-key) is defined by measurement, tag key and value, and field key.)
The storage engine uses a **Time-Structured Merge Tree** (TSM) data format.
TSM files store compressed series data in a columnar format.
To improve efficiency, the storage engine only stores differences (or *deltas*) between values in a series.
-Column-oriented storage means we can read by series key and ignore what it doesn't need.
-Storing data in columns lets the storage engine read by series key.
+Column-oriented storage lets the engine read by series key and omit extraneous data.
After fields are stored safely in TSM files, the WAL is truncated and the cache is cleared.

The TSM compaction code is quite complex.
@@ -98,7 +98,7 @@ The **Time Series Index** ensures queries remain fast as data cardinality grows.
To keep queries fast as we have more data, we use a **Time Series Index**.

TSI stores series keys grouped by measurement, tag, and field.
-In data with high cardinality (a large quantity of series), it becomes slower to search through all series keys.
+In data with high cardinality (a large quantity of series), queries become slower.
The TSI stores series keys grouped by measurement, tag, and field.
TSI answers two questions well:

From e5d273f71f67d59ff23289525f552f75b50b25e3 Mon Sep 17 00:00:00 2001
From: pierwill <19642016+pierwill@users.noreply.github.com>
Date: Wed, 8 Jan 2020 10:55:47 -0800
Subject: [PATCH 18/18] Storage engine: edit TSI section

---
 content/v2.0/reference/internals/storage-engine.md | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/content/v2.0/reference/internals/storage-engine.md b/content/v2.0/reference/internals/storage-engine.md
index 047c7b1d2..1aabe5995 100644
--- a/content/v2.0/reference/internals/storage-engine.md
+++ b/content/v2.0/reference/internals/storage-engine.md
@@ -95,12 +95,9 @@ organize values for a series together into long runs to best optimize compression
As data cardinality (the number of series) grows, queries read more series keys and become slower.
The **Time Series Index** ensures queries remain fast as data cardinality grows.
-To keep queries fast as we have more data, we use a **Time Series Index**.
-
-TSI stores series keys grouped by measurement, tag, and field.
-In data with high cardinality (a large quantity of series), queries become slower.
The TSI stores series keys grouped by measurement, tag, and field.
-TSI answers two questions well:
+This allows the database to answer two questions well:

- What measurements, tags, fields exist?
+  (This is what meta queries ask.)
- Given a measurement, tags, and fields, what series keys exist?
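The two questions in the final hunk can be illustrated with a toy inverted index over series keys. The class and method names below are invented for illustration; the real TSI is a compact on-disk format, not an in-memory dictionary:

```python
from collections import defaultdict

class ToySeriesIndex:
    """Toy index: series keys grouped by measurement and by tag pair."""

    def __init__(self):
        self.by_measurement = defaultdict(set)  # measurement -> series keys
        self.by_tag = defaultdict(set)          # (tag key, tag value) -> series keys

    def add(self, measurement, tags):
        # Build a series key from the measurement and sorted tag set.
        key = measurement + "," + ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
        self.by_measurement[measurement].add(key)
        for pair in tags.items():
            self.by_tag[pair].add(key)

    def measurements(self):
        # Question 1: what measurements exist?
        return sorted(self.by_measurement)

    def series_keys(self, measurement, tags):
        # Question 2: given a measurement and tags, what series keys exist?
        keys = set(self.by_measurement[measurement])
        for pair in tags.items():
            keys &= self.by_tag[pair]  # intersect the posting sets
        return sorted(keys)
```

Here `measurements()` stands in for a meta query, while `series_keys()` narrows a read to matching series without scanning every key — the property that keeps queries fast as cardinality grows.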