/*

Package tsi1 provides a memory-mapped index implementation that supports
high-cardinality series.

Overview

The top-level object in tsi1 is the Index. It is the primary access point from
the rest of the system. The Index is composed of LogFile and IndexFile objects.

Log files are small write-ahead log files that record new series immediately
in the order that they are received. The data within the file is indexed
in memory so it can be quickly accessed. When the system is restarted, this log
file is replayed and the in-memory representation is rebuilt.

Index files also contain series information; however, they are highly indexed
so that reads can be performed quickly. Index files are built through a process
called compaction where a log file or multiple index files are merged together.


Operations

The index can perform many tasks related to series, measurement, and tag data.
All data is inserted by adding a series to the index. When adding a series,
the measurement, tag keys, and tag values are all extracted and indexed
separately.

Once a series has been added, it can be removed in two ways. First, the
individual series can be removed. Second, it can be removed as part of a bulk
operation by deleting the entire measurement.

The query engine needs to be able to look up series in a variety of ways such
as by measurement name, by tag value, or by using regular expressions. The
index provides an API to iterate over subsets of series and perform set
operations such as unions and intersections.


Log File Layout

The write-ahead log file into which series are initially inserted appends all
new operations sequentially. It is composed of a series of log entries. An
entry contains a flag to specify the operation type, the measurement name, the
tag set, and a checksum.

    ┏━━━━━━━━━LogEntry━━━━━━━━━┓
    ┃ ┌──────────────────────┐ ┃
    ┃ │         Flag         │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │     Measurement      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │      Key/Value       │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │      Key/Value       │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │      Key/Value       │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │       Checksum       │ ┃
    ┃ └──────────────────────┘ ┃
    ┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛

When the log file is replayed, if the checksum is incorrect or the entry is
incomplete (because of a partially failed write), the log is truncated.
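
The replay rule above can be sketched as follows. This is a minimal
illustration, not the package's actual encoding: the byte layout (uvarint
length prefixes, a one-byte tag count, a big-endian CRC-32 over the entry) and
the names entry, appendEntry, and replay are all assumptions made for the
example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// entry is a hypothetical in-memory form of one log entry.
type entry struct {
	Flag        byte
	Measurement string
	Tags        [][2]string // key/value pairs
}

// appendString appends a uvarint length prefix followed by the bytes of s.
func appendString(buf []byte, s string) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(s)))
	return append(append(buf, tmp[:n]...), s...)
}

// appendEntry encodes e onto buf: flag, measurement, tag count, tags, and a
// CRC-32 of all preceding bytes of the entry (assumed layout).
func appendEntry(buf []byte, e entry) []byte {
	start := len(buf)
	buf = append(buf, e.Flag)
	buf = appendString(buf, e.Measurement)
	buf = append(buf, byte(len(e.Tags))) // sketch only: at most 255 tags
	for _, kv := range e.Tags {
		buf = appendString(buf, kv[0])
		buf = appendString(buf, kv[1])
	}
	var sum [4]byte
	binary.BigEndian.PutUint32(sum[:], crc32.ChecksumIEEE(buf[start:]))
	return append(buf, sum[:]...)
}

// reader consumes bytes from b and records whether it ran out of data.
type reader struct {
	b   []byte
	pos int
	bad bool
}

func (r *reader) readByte() byte {
	if r.bad || r.pos >= len(r.b) {
		r.bad = true
		return 0
	}
	r.pos++
	return r.b[r.pos-1]
}

func (r *reader) readString() string {
	if r.bad {
		return ""
	}
	l, n := binary.Uvarint(r.b[r.pos:])
	if n <= 0 || l > uint64(len(r.b)-r.pos-n) {
		r.bad = true
		return ""
	}
	s := string(r.b[r.pos+n : r.pos+n+int(l)])
	r.pos += n + int(l)
	return s
}

// replay decodes entries from data and returns them with the number of
// leading bytes that hold valid entries. It stops at the first incomplete or
// corrupt entry, so a caller can truncate the file to validLen.
func replay(data []byte) (entries []entry, validLen int) {
	for validLen < len(data) {
		r := &reader{b: data[validLen:]}
		var e entry
		e.Flag = r.readByte()
		e.Measurement = r.readString()
		for n := int(r.readByte()); n > 0 && !r.bad; n-- {
			k := r.readString()
			v := r.readString()
			e.Tags = append(e.Tags, [2]string{k, v})
		}
		if r.bad || len(r.b)-r.pos < 4 ||
			binary.BigEndian.Uint32(r.b[r.pos:]) != crc32.ChecksumIEEE(r.b[:r.pos]) {
			break // incomplete or corrupt: drop everything from validLen on
		}
		entries = append(entries, e)
		validLen += r.pos + 4
	}
	return entries, validLen
}

func main() {
	var log []byte
	log = appendEntry(log, entry{Flag: 1, Measurement: "cpu", Tags: [][2]string{{"host", "a"}}})
	log = appendEntry(log, entry{Flag: 1, Measurement: "mem"})
	log = log[:len(log)-2] // simulate a partially failed write

	entries, validLen := replay(log)
	fmt.Println(len(entries), validLen)
}
```

In this sketch the caller would truncate the file to validLen after replay, so
the next append starts directly after the last valid entry.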


Index File Layout

The index file is composed of 3 main block types: one series block, one or more
tag blocks, and one measurement block. At the end of the index file is a
trailer that records metadata such as the offsets to these blocks.


Series Block Layout

The series block stores raw series keys in sorted order. It also provides hash
indexes so that series can be looked up quickly. Hash indexes are inserted
periodically so that memory size is limited at write time. Once all the series
and hash indexes have been written, a list of index entries is written so that
hash indexes can be looked up via binary search.

The end of the block contains two HyperLogLog++ sketches which track the
estimated number of created series and deleted series. After the sketches is
a trailer that contains metadata about the block.

    ┏━━━━━━━SeriesBlock━━━━━━━━┓
    ┃ ┌──────────────────────┐ ┃
    ┃ │      Series Key      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │      Series Key      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │      Series Key      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │                      │ ┃
    ┃ │      Hash Index      │ ┃
    ┃ │                      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │      Series Key      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │      Series Key      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │      Series Key      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │                      │ ┃
    ┃ │      Hash Index      │ ┃
    ┃ │                      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │    Index Entries     │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │     HLL Sketches     │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │       Trailer        │ ┃
    ┃ └──────────────────────┘ ┃
    ┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛
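
The binary search over index entries could look roughly like the sketch below.
The indexEntry type and its MinKey and Offset fields are hypothetical; the
text above does not specify what an index entry stores, so the example assumes
each entry records the smallest series key its hash index covers.

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry locates one hash index inside a series block. Both fields are
// assumptions made for this sketch.
type indexEntry struct {
	MinKey string // smallest series key covered by the hash index
	Offset int64  // byte offset of the hash index within the block
}

// findHashIndex returns the entry whose key range contains key. Entries must
// be sorted by MinKey. It reports false if key sorts before every entry.
func findHashIndex(entries []indexEntry, key string) (indexEntry, bool) {
	// Find the first entry that starts after key; the one before it covers key.
	i := sort.Search(len(entries), func(i int) bool { return entries[i].MinKey > key })
	if i == 0 {
		return indexEntry{}, false
	}
	return entries[i-1], true
}

func main() {
	entries := []indexEntry{
		{MinKey: "cpu,host=a", Offset: 4096},
		{MinKey: "mem,host=a", Offset: 8192},
	}
	e, ok := findHashIndex(entries, "disk,host=b")
	fmt.Println(e.Offset, ok)
}
```

Only the selected hash index then needs to be probed for the key, which keeps
each lookup to one binary search plus one hash probe in this model.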


Tag Block Layout

After the series block is one or more tag blocks. One of these blocks exists
for every measurement in the index file. The block is structured as a sorted
list of values for each key and then a sorted list of keys. Each of these lists
has its own hash index for fast direct lookups.

    ┏━━━━━━━━Tag Block━━━━━━━━━┓
    ┃ ┌──────────────────────┐ ┃
    ┃ │        Value         │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │        Value         │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │        Value         │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │                      │ ┃
    ┃ │      Hash Index      │ ┃
    ┃ │                      │ ┃
    ┃ └──────────────────────┘ ┃
    ┃ ┌──────────────────────┐ ┃
    ┃ │        Value         │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │        Value         │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │                      │ ┃
    ┃ │      Hash Index      │ ┃
    ┃ │                      │ ┃
    ┃ └──────────────────────┘ ┃
    ┃ ┌──────────────────────┐ ┃
    ┃ │         Key          │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │         Key          │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │                      │ ┃
    ┃ │      Hash Index      │ ┃
    ┃ │                      │ ┃
    ┃ └──────────────────────┘ ┃
    ┃ ┌──────────────────────┐ ┃
    ┃ │       Trailer        │ ┃
    ┃ └──────────────────────┘ ┃
    ┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Each entry for values contains a sorted list of offsets for series keys that use
that value. Series iterators can be built around a single tag key value or
multiple iterators can be merged with set operators such as union or
intersection.
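
Because the offset lists are sorted, both set operators can be computed in a
single pass over the two inputs. The sketch below works on plain slices to
keep it short; the function names are hypothetical and the real iterators are
not shown here.

```go
package main

import "fmt"

// intersect returns the offsets present in both a and b. Both inputs must be
// sorted in ascending order without duplicates.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// union returns the offsets present in a, in b, or in both, in ascending
// order. Both inputs must be sorted in ascending order without duplicates.
func union(a, b []uint64) []uint64 {
	out := make([]uint64, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			out = append(out, a[i])
			i++
		case a[i] > b[j]:
			out = append(out, b[j])
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	hostA := []uint64{1, 3, 5, 7}   // offsets of series with host=a (made up)
	regionW := []uint64{3, 4, 7, 9} // offsets of series with region=west (made up)
	fmt.Println(intersect(hostA, regionW))
	fmt.Println(union(hostA, regionW))
}
```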


Measurement block

The measurement block stores a sorted list of measurements, their associated
series offsets, and the offset to their tag block. This allows all series for
a measurement to be traversed quickly and it allows fast direct lookups of
measurements and their tags.

This block also contains HyperLogLog++ sketches for new and deleted
measurements.

    ┏━━━━Measurement Block━━━━━┓
    ┃ ┌──────────────────────┐ ┃
    ┃ │     Measurement      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │     Measurement      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │     Measurement      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │                      │ ┃
    ┃ │      Hash Index      │ ┃
    ┃ │                      │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │     HLL Sketches     │ ┃
    ┃ ├──────────────────────┤ ┃
    ┃ │       Trailer        │ ┃
    ┃ └──────────────────────┘ ┃
    ┗━━━━━━━━━━━━━━━━━━━━━━━━━━┛


Manifest file

The index is simply an ordered set of log and index files. These files can be
merged together or rewritten but their order must always be the same. This is
because series, measurements, and tags can be marked as deleted (that is,
tombstoned) and this action needs to be tracked in time order.

Whenever the set of active files is changed, a manifest file is written to
track the set. The manifest specifies the ordering of files and, on startup,
all files not in the manifest are removed from the index directory.
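
The startup cleanup can be illustrated with a small sketch. The JSON shape of
the manifest, the staleFiles helper, and the file names are assumptions made
for the example; the directory listing is assumed to exclude the manifest
itself.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifest lists the active files in order. The JSON shape is an assumption.
type manifest struct {
	Files []string `json:"files"`
}

// staleFiles returns the names in present that the manifest does not list.
// These are the files that would be removed on startup.
func staleFiles(m manifest, present []string) []string {
	active := make(map[string]bool, len(m.Files))
	for _, name := range m.Files {
		active[name] = true
	}
	var stale []string
	for _, name := range present {
		if !active[name] {
			stale = append(stale, name)
		}
	}
	return stale
}

func main() {
	m := manifest{Files: []string{"L0-00000003.tsl", "L1-00000002.tsi"}}
	data, err := json.Marshal(m)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))

	// Hypothetical directory listing: L1-00000001.tsi was left behind by a
	// compaction that finished before its inputs were deleted.
	present := []string{"L0-00000003.tsl", "L1-00000001.tsi", "L1-00000002.tsi"}
	fmt.Println(staleFiles(m, present))
}
```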


Compacting index files

Compaction is the process of taking files and merging them together into a
single file. There are two stages of compaction within TSI.

First, once a log file exceeds a size threshold, it is compacted into an index
file. This threshold is relatively small because log files must maintain their
index in the heap, which TSI tries to avoid. Small log files are also very
quick to convert into an index file so this is done aggressively.

Second, once a contiguous set of index files exceeds a factor (e.g. 10x), they
are all merged together into a single index file and the old files are
discarded. Because all blocks are written in sorted order, the new index file
can be streamed, which minimizes memory use.
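
The streaming property follows from merging sorted inputs: only the current
key of each input has to be held at once. A minimal sketch of such a k-way
merge using container/heap is shown below. The cursor type and mergeSorted
function are hypothetical, and string slices stand in for iterators over the
files being compacted.

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor tracks the read position within one sorted input.
type cursor struct {
	keys []string
	pos  int
}

// cursorHeap orders cursors by the key at their current position.
type cursorHeap []*cursor

func (h cursorHeap) Len() int            { return len(h) }
func (h cursorHeap) Less(i, j int) bool  { return h[i].keys[h[i].pos] < h[j].keys[h[j].pos] }
func (h cursorHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeSorted calls emit once for each distinct key across all inputs, in
// ascending order. Every input must already be sorted.
func mergeSorted(inputs [][]string, emit func(string)) {
	h := &cursorHeap{}
	for _, in := range inputs {
		if len(in) > 0 {
			*h = append(*h, &cursor{keys: in})
		}
	}
	heap.Init(h)

	first := true
	var last string
	for h.Len() > 0 {
		c := (*h)[0] // cursor with the smallest current key
		key := c.keys[c.pos]
		if first || key != last {
			emit(key)
			first, last = false, key
		}
		c.pos++
		if c.pos < len(c.keys) {
			heap.Fix(h, 0) // cursor advanced; restore heap order
		} else {
			heap.Pop(h) // input exhausted
		}
	}
}

func main() {
	inputs := [][]string{
		{"cpu", "mem"},
		{"cpu", "disk"},
	}
	mergeSorted(inputs, func(key string) { fmt.Println(key) })
}
```

This sketch only deduplicates keys. A real compaction would also have to apply
tombstones in file order, which is omitted here.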


Concurrency

Index files are immutable so they do not require fine-grained locks; however,
compactions require that we track which files are in use so they are not
discarded too soon. This is done by using reference counting with file sets.

A file set is simply an ordered list of index files. When the current file set
is obtained from the index, a counter is incremented to track its usage. Once
the user is done with the file set, it is released and the counter is
decremented. A file cannot be removed from the file system until this counter
returns to zero.

Besides the reference counting, there are no other locking mechanisms when
reading or writing index files. Log files, however, do require a lock whenever
they are accessed. This is another reason to minimize log file size.
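
The reference counting described above can be sketched as follows. The types
and method names are hypothetical, and the sketch assumes one counter per file
rather than one per file set; the text above does not say which the package
uses.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// file stands in for an index file. refs counts the retained file sets that
// still hold it.
type file struct {
	name string
	refs int32
}

// removable reports whether no retained file set still references the file.
func (f *file) removable() bool { return atomic.LoadInt32(&f.refs) == 0 }

// fileSet is an ordered list of files.
type fileSet struct{ files []*file }

// release decrements the counter of every file in the set.
func (fs *fileSet) release() {
	for _, f := range fs.files {
		atomic.AddInt32(&f.refs, -1)
	}
}

// index holds the currently active file set.
type index struct {
	mu     sync.RWMutex
	active *fileSet
}

// retainFileSet returns the current file set after incrementing the counter
// of every file in it. The caller must call release when done.
func (idx *index) retainFileSet() *fileSet {
	idx.mu.RLock()
	defer idx.mu.RUnlock()
	for _, f := range idx.active.files {
		atomic.AddInt32(&f.refs, 1)
	}
	return idx.active
}

// replace swaps in a new active file set, as a finished compaction would.
func (idx *index) replace(fs *fileSet) {
	idx.mu.Lock()
	idx.active = fs
	idx.mu.Unlock()
}

func main() {
	old := &file{name: "L1-00000001.tsi"}
	idx := &index{active: &fileSet{files: []*file{old}}}

	fs := idx.retainFileSet() // a query starts reading
	idx.replace(&fileSet{files: []*file{&file{name: "L2-00000002.tsi"}}})

	fmt.Println(old.removable()) // the query still holds the old file
	fs.release()
	fmt.Println(old.removable()) // no readers remain
}
```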
*/
package tsi1