/*
Package tsi1 provides a memory-mapped index implementation that supports
high-cardinality series.

Overview

The top-level object in tsi1 is the Index. It is the primary access point from
the rest of the system. The Index is composed of LogFile and IndexFile objects.

Log files are small write-ahead log files that record new series immediately
in the order that they are received. The data within the file is indexed
in-memory so it can be quickly accessed. When the system is restarted, this log
file is replayed and the in-memory representation is rebuilt.

Index files also contain series information; however, they are highly indexed
so that reads can be performed quickly. Index files are built through a process
called compaction, in which a log file or multiple index files are merged together.

Operations

The index can perform many tasks related to series, measurement, and tag data.
All data is inserted by adding a series to the index. When adding a series,
the measurement, tag keys, and tag values are all extracted and indexed
separately.

Once a series has been added, it can be removed in two ways. First, the
individual series can be removed. Second, it can be removed as part of a bulk
operation by deleting the entire measurement.

The query engine needs to be able to look up series in a variety of ways, such
as by measurement name, by tag value, or by using regular expressions. The
index provides an API to iterate over subsets of series and perform set
operations such as unions and intersections.

Log File Layout

The write-ahead log file into which series are initially inserted appends all
new operations sequentially. It is composed of a series of log entries. An
entry contains a flag that specifies the operation type, the measurement
name, the tag set, and a checksum.

	┏━━━━━━━━━LogEntry━━━━━━━━━┓
	┃ ┌──────────────────────┐ ┃
	┃ │         Flag         │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │     Measurement      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │      Key/Value       │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │      Key/Value       │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │      Key/Value       │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │       Checksum       │ ┃
	┃ └──────────────────────┘ ┃
	┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛

When the log file is replayed, if the checksum is incorrect or the entry is
incomplete (because of a partially failed write) then the log is truncated.
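
The replay rule above can be illustrated with a minimal sketch. It is not the
tsi1 wire format: it assumes a hypothetical encoding (a flag byte, uvarint
length-prefixed strings, a uvarint tag count, and a trailing big-endian CRC-32
over the preceding bytes), and the names entry, encode, decode, and replay are
invented for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// entry is a hypothetical, simplified log entry (an assumption, not tsi1's type).
type entry struct {
	Flag byte
	Name string
	Tags [][2]string // key/value pairs
}

// appendUvarint appends v to b as an unsigned varint.
func appendUvarint(b []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(b, tmp[:n]...)
}

// appendString appends a uvarint length prefix followed by the bytes of s.
func appendString(b []byte, s string) []byte {
	return append(appendUvarint(b, uint64(len(s))), s...)
}

// encode serializes e and appends a CRC-32 of everything written before it.
func encode(e entry) []byte {
	b := appendString([]byte{e.Flag}, e.Name)
	b = appendUvarint(b, uint64(len(e.Tags)))
	for _, kv := range e.Tags {
		b = appendString(appendString(b, kv[0]), kv[1])
	}
	var sum [4]byte
	binary.BigEndian.PutUint32(sum[:], crc32.ChecksumIEEE(b))
	return append(b, sum[:]...)
}

// decode reads one entry from the front of b. It reports ok=false when the
// entry is incomplete or its checksum does not match.
func decode(b []byte) (e entry, n int, ok bool) {
	if len(b) == 0 {
		return e, 0, false
	}
	pos, bad := 1, false
	uvarint := func() uint64 {
		v, w := binary.Uvarint(b[pos:])
		if w <= 0 {
			bad = true
		} else {
			pos += w
		}
		return v
	}
	str := func() string {
		l := uvarint()
		if bad || l > uint64(len(b)-pos) {
			bad = true
			return ""
		}
		pos += int(l)
		return string(b[pos-int(l) : pos])
	}
	e.Flag, e.Name = b[0], str()
	for i, count := uint64(0), uvarint(); i < count && !bad; i++ {
		k := str()
		e.Tags = append(e.Tags, [2]string{k, str()})
	}
	if bad || len(b)-pos < 4 {
		return e, 0, false
	}
	if crc32.ChecksumIEEE(b[:pos]) != binary.BigEndian.Uint32(b[pos:]) {
		return e, 0, false
	}
	return e, pos + 4, true
}

// replay decodes entries until the first incomplete or corrupt one. It returns
// the decoded entries and the number of valid bytes, which is the length the
// log would be truncated to.
func replay(log []byte) (entries []entry, valid int) {
	for valid < len(log) {
		e, n, ok := decode(log[valid:])
		if !ok {
			break
		}
		entries = append(entries, e)
		valid += n
	}
	return entries, valid
}

func main() {
	log := encode(entry{Name: "cpu", Tags: [][2]string{{"host", "a"}}})
	log = append(log, encode(entry{Flag: 1, Name: "mem"})...)
	log = log[:len(log)-2] // simulate a partially failed write
	entries, valid := replay(log)
	fmt.Println(len(entries), valid)
}
```

In this sketch the second entry loses its last two bytes, so replay keeps only
the first entry and reports the offset at which the log would be cut.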

Index File Layout

The index file is composed of three main block types: one series block, one or more
tag blocks, and one measurement block. At the end of the index file is a
trailer that records metadata such as the offsets to these blocks.

Series Block Layout

The series block stores raw series keys in sorted order. It also provides hash
indexes so that series can be looked up quickly. Hash indexes are inserted
periodically so that memory size is limited at write time. Once all the series
and hash indexes have been written, a list of index entries is written
so that hash indexes can be looked up via binary search.

The end of the block contains two HyperLogLog++ sketches, which track the
estimated number of created series and deleted series. After the sketches is
a trailer which contains metadata about the block.

	┏━━━━━━━SeriesBlock━━━━━━━━┓
	┃ ┌──────────────────────┐ ┃
	┃ │      Series Key      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │      Series Key      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │      Series Key      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │                      │ ┃
	┃ │      Hash Index      │ ┃
	┃ │                      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │      Series Key      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │      Series Key      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │      Series Key      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │                      │ ┃
	┃ │      Hash Index      │ ┃
	┃ │                      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │    Index Entries     │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │     HLL Sketches     │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │       Trailer        │ ┃
	┃ └──────────────────────┘ ┃
	┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Tag Block Layout

After the series block is one or more tag blocks. One of these blocks exists
for every measurement in the index file. The block is structured as a sorted
list of values for each key and then a sorted list of keys. Each of these lists
has its own hash index for fast direct lookups.

	┏━━━━━━━━Tag Block━━━━━━━━━┓
	┃ ┌──────────────────────┐ ┃
	┃ │        Value         │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │        Value         │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │        Value         │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │                      │ ┃
	┃ │      Hash Index      │ ┃
	┃ │                      │ ┃
	┃ └──────────────────────┘ ┃
	┃ ┌──────────────────────┐ ┃
	┃ │        Value         │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │        Value         │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │                      │ ┃
	┃ │      Hash Index      │ ┃
	┃ │                      │ ┃
	┃ └──────────────────────┘ ┃
	┃ ┌──────────────────────┐ ┃
	┃ │         Key          │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │         Key          │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │                      │ ┃
	┃ │      Hash Index      │ ┃
	┃ │                      │ ┃
	┃ └──────────────────────┘ ┃
	┃ ┌──────────────────────┐ ┃
	┃ │       Trailer        │ ┃
	┃ └──────────────────────┘ ┃
	┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Each entry for values contains a sorted list of offsets for series keys that use
that value. Series iterators can be built around a single tag key value or
multiple iterators can be merged with set operators such as union or
intersection.
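
Because the offsets for each tag value are sorted, union and intersection can
be computed in a single linear pass. The following is a minimal sketch that
uses plain slices rather than the package's series iterators; the function
names and the sample offsets are assumptions made for illustration.

```go
package main

import "fmt"

// intersect returns the offsets present in both sorted, duplicate-free slices.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// union returns the offsets present in either sorted, duplicate-free slice.
func union(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	hostA := []uint64{8, 24, 40, 72} // hypothetical offsets for host=a
	regionW := []uint64{24, 56, 72}  // hypothetical offsets for region=west
	fmt.Println(intersect(hostA, regionW)) // series with both tag values
	fmt.Println(union(hostA, regionW))     // series with either tag value
}
```

An iterator-based version would follow the same comparison logic while pulling
one offset at a time from each input instead of indexing into a slice.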

Measurement block

The measurement block stores a sorted list of measurements, their associated
series offsets, and the offset to their tag block. This allows all series for
a measurement to be traversed quickly and it allows fast direct lookups of
measurements and their tags.

This block also contains HyperLogLog++ sketches for new and deleted
measurements.

	┏━━━━Measurement Block━━━━━┓
	┃ ┌──────────────────────┐ ┃
	┃ │     Measurement      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │     Measurement      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │     Measurement      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │                      │ ┃
	┃ │      Hash Index      │ ┃
	┃ │                      │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │     HLL Sketches     │ ┃
	┃ ├──────────────────────┤ ┃
	┃ │       Trailer        │ ┃
	┃ └──────────────────────┘ ┃
	┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Manifest file

The index is simply an ordered set of log and index files. These files can be
merged together or rewritten but their order must always be the same. This is
because series, measurements, and tags can be marked as deleted (aka tombstoned)
and this action needs to be tracked in time order.

Whenever the set of active files is changed, a manifest file is written to
track the set. The manifest specifies the ordering of files and, on startup,
all files not in the manifest are removed from the index directory.
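
The startup cleanup can be sketched as a set difference between the directory
listing and the manifest. This is only an illustration: staleFiles and the file
names are hypothetical, and the sketch assumes the directory listing passed in
already excludes the manifest file itself.

```go
package main

import (
	"fmt"
	"sort"
)

// staleFiles returns, in sorted order, the names in present that are not
// listed in manifest. Both arguments are plain file names.
func staleFiles(manifest, present []string) []string {
	keep := make(map[string]struct{}, len(manifest))
	for _, name := range manifest {
		keep[name] = struct{}{}
	}
	var stale []string
	for _, name := range present {
		if _, ok := keep[name]; !ok {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Hypothetical file names, chosen only for this example.
	manifest := []string{"L0-00000003.tsl", "L1-00000002.tsi"}
	present := []string{"L0-00000001.tsl", "L0-00000003.tsl", "L1-00000002.tsi"}
	fmt.Println(staleFiles(manifest, present)) // candidates for removal
}
```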

Compacting index files

Compaction is the process of taking files and merging them together into a
single file. There are two stages of compaction within TSI.

First, once log files exceed a size threshold, they are compacted into an
index file. This threshold is relatively small because log files must maintain
their index in the heap, which TSI tries to avoid. Small log files are also very
quick to convert into an index file so this is done aggressively.

Second, once a contiguous set of index files exceeds a factor (e.g. 10x),
they are all merged together into a single index file and the old files are
discarded. Because all blocks are written in sorted order, the new index file
can be streamed, which minimizes memory use.
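
The streaming property can be sketched as a k-way merge over sorted inputs.
This is an assumption-laden illustration, not the package's compactor: it
merges in-memory string slices instead of file blocks, ignores tombstones, and
the names cursor, cursorHeap, and mergeSorted are invented for the example.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor points at the next unread key of one sorted input.
type cursor struct {
	keys []string
	pos  int
}

// cursorHeap is a min-heap of cursors ordered by their current key.
type cursorHeap []*cursor

func (h cursorHeap) Len() int            { return len(h) }
func (h cursorHeap) Less(i, j int) bool  { return h[i].keys[h[i].pos] < h[j].keys[h[j].pos] }
func (h cursorHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeSorted passes every distinct key from the sorted inputs to emit, in
// sorted order. Only one cursor per input is tracked, so the merged output
// never has to be held in memory as a whole.
func mergeSorted(inputs [][]string, emit func(key string)) {
	h := &cursorHeap{}
	for _, keys := range inputs {
		if len(keys) > 0 {
			heap.Push(h, &cursor{keys: keys})
		}
	}
	first := true
	var last string
	for h.Len() > 0 {
		c := (*h)[0] // cursor with the smallest current key
		key := c.keys[c.pos]
		if first || key != last {
			emit(key)
			first, last = false, key
		}
		c.pos++
		if c.pos < len(c.keys) {
			heap.Fix(h, 0) // restore heap order after advancing
		} else {
			heap.Pop(h) // this input is exhausted
		}
	}
}

func main() {
	files := [][]string{
		{"cpu,host=a", "cpu,host=c"},
		{"cpu,host=b", "cpu,host=c", "mem,host=a"},
	}
	mergeSorted(files, func(key string) { fmt.Println(key) })
}
```

In a real compaction the emit callback would append to the new index file, so
the writer sees keys in sorted order and can write each block sequentially.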

Concurrency

Index files are immutable so they do not require fine-grained locks. However,
compactions require that we track which files are in use so they are not
discarded too soon. This is done by using reference counting with file sets.

A file set is simply an ordered list of index files. When the current file set
is obtained from the index, a counter is incremented to track its usage. Once
the user is done with the file set, it is released and the counter is
decremented. A file cannot be removed from the file system until this counter
returns to zero.

Besides the reference counting, there are no other locking mechanisms when
reading or writing index files. Log files, however, do require a lock whenever
they are accessed. This is another reason to minimize log file size.
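
The reference counting described above can be sketched with an atomic counter.
This is a minimal illustration under stated assumptions: the fileSet type, its
Retain, Release, and InUse methods, and the file names are hypothetical and do
not describe the package's actual API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fileSet is a hypothetical reference-counted, ordered list of file names.
type fileSet struct {
	files []string
	refs  int32
}

// Retain records one more user of the file set.
func (fs *fileSet) Retain() { atomic.AddInt32(&fs.refs, 1) }

// Release records that a user is done. It panics on an unbalanced call.
func (fs *fileSet) Release() {
	if atomic.AddInt32(&fs.refs, -1) < 0 {
		panic("fileSet: Release without matching Retain")
	}
}

// InUse reports whether any user still holds the file set. Files would only
// be removed from disk once this returns false.
func (fs *fileSet) InUse() bool { return atomic.LoadInt32(&fs.refs) > 0 }

func main() {
	fs := &fileSet{files: []string{"L0-00000003.tsl", "L1-00000002.tsi"}}
	fs.Retain()             // a query starts reading the files
	fmt.Println(fs.InUse()) // true
	fs.Release()            // the query is finished
	fmt.Println(fs.InUse()) // false
}
```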
*/
package tsi1