Scanner objects and iterators often need a ValuerEval. This
object is created, often with a function call, and has at
least one interface in it, so it allocates storage. Then it's
dropped again right away. The only part of it that might be
subject to change is usually a map. While the map's contents
change over time, the actual map doesn't change for the
lifetime of the object.
So, in both iterators and scanners, stash the ValuerEval
and continue reusing it. On a query returning a fair number
of data points, this produces a small (<5% in practice)
improvement in observed performance, visible as a significant
reduction in time spent in runtime (mallocgc, newobject,
etcetera).
The performance improvement isn't big, but it's reasonably
easy to evaluate it and establish that it's a safe change
to make.
Signed-off-by: seebs <seebs@seebs.net>
In the case of caching TSI bitmaps belonging to immutable .tsi files,
the underlying bitset data can be mmapped. It is possible, though rare,
for this data to be unmapped (e.g., via a TSI compaction) but for the
cached bitmap to be subsequently read. This leads to a segfault.
This only happens when copy-on-write is set to true on the roaring
bitmap, because in that case only the internal pointers are cloned.
This change will reduce the TSI cache performance by around 10%, which I
have deemed to account for only a few microseconds typically.
This commit adds a config option to the tsdb Config allowing the size of
the bitset cached in the TSI index to be specified.
Setting the cache size to 0 will disable the cache.
This commit limits the number of files that can be compacted in
a single group when forcing a full compaction or when a shard
becomes cold. This is to prevent too many files being compacted
at the same time.
Before this, if you deleted everything with `delete where true`
for example, then you would be left with all of your measurements
in the fields index. That would cause ghost fields to reappear
if someone reinserted to the measurement.
This fixes that by making it so the deepest most delete code
checks if the measurement was removed from the index, and if so
cleaning it up out of the fields index.
Additionally, it fixes bugs in that cleanup code where if you had
a measurement like "m1" and "m10", when iterating over the cache
or file store, "m1" would match "m10" due to it only checking the
prefix. This also has it check the character right after the
measurement to be either a comma because tags started, or the first
character of the field separator.
This change fixes#10511 that manifests when a shard is considered cold
faster than its cache is snapshotted. This can happen if WAL is enabled
because previously the code only considered the last modification of
compacted tsm1 files. Instead Engine.LastModified() also takes the WAL
into account if necessary.
There are some problematic races that occur when deletes happen
against writes to the same points at the same time. This change
introduces guards and an epoch based system to coordinate these
modifications.
A guard matches a point based on the time, measurement name, and
some conditions loaded from an influxql expression. The intent
is to be as precise as possible without allowing any false
neagatives: if a point would be deleted, the guard must match it.
We are allowed to match more points than necessary, at the cost
of slowing down writes.
The epoch based system keeps track of outstanding writes and
deletes and their associated guards. When a delete operation
is going to start, it waits until all current writes are
done, and installs its guard, blocking all future writes that
contain points that may conflict with the delete. This allows
writes to disjoint points to proceed uncontended, and the
implementation is optimized for assuming there are few
outstanding deletes. For example, in the case that there are no
deletes, a write just has to take a mutex, bump a counter, and
compare a value against zero. The epoch trackers are per shard,
so that different shards never have to contend with one another.
TSI1 and inmem indexes have different properties during deletes.
Specifically, inmem shares a global index across all shards, where
every tsi1 index is contained to a specific shard. When deleting
a series, it may cause the last reference to the series across all
shards to be dropped, necessitating a removal from the series file.
Since the inmem index shares the index across all shards, removing
the series when it's removed from the series file is sufficient.
However, in the case of a mixed index database, if the last shard
is a TSI1 shard, the other inmem indexes are not available when we
discover that it was the last reference to the series. This ends
up leaving the series in the inmem index without a series id in
the series file, causing all sorts of misbehavior.
Rather than continue curling ourselves into a ball to try to fix
this unsupported mode, give a helpful error message to the user
that they must run their database in a non-mixed index mode to
allow deletes.
Removes cloning measurement fields on writes, instead atomically swaps out
measurement field sets when fields are added (with new overhead of copying
existing fields whenever a new one is added).
We already make copies when no expression is provided, because
the backing slices may go away if the shard they came from is
closed. This fixes the other spot where some backing slices
would be returned.
This change makes the digest reader read and discard the manifest if
needed. Not all readers of a digest are interested in the manifest.
This change also makes it a requirement for the writer to write a
manifest because it is a non-optional part of a digest file.
This commit adds an `indexType` key to the shard sections of the
`/debug/vars` endpoint, as well as the `_internal` shard statistics.
The tag will be reported as `"indexType": "inmem"` or `"indexType":
"tsi1"`.
Encode the compressed data at the start internal buffer. This ensures
the returned slice maintains the entire capacity and is available for
subsequent use.
When we pool / reuse string buffers, this will help considerably.
Improvements over previous commit:
```
name old time/op new time/op delta
EncodeStrings/10/batch-8 542ns ± 1% 355ns ± 2% -34.53% (p=0.008 n=5+5)
EncodeStrings/100/batch-8 5.29µs ± 1% 3.58µs ± 2% -32.20% (p=0.008 n=5+5)
EncodeStrings/1000/batch-8 48.6µs ± 0% 36.2µs ± 2% -25.40% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
EncodeStrings/10/batch-8 704B ± 0% 0B -100.00% (p=0.008 n=5+5)
EncodeStrings/100/batch-8 9.47kB ± 0% 0.00kB -100.00% (p=0.008 n=5+5)
EncodeStrings/1000/batch-8 90.1kB ± 0% 0.0kB -100.00% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
EncodeStrings/10/batch-8 0.00 0.00 ~ (all equal)
EncodeStrings/100/batch-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
EncodeStrings/1000/batch-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
```
This commit adds a tsm1 function for encoding a batch of booleans into a
provided buffer.
The following benchmarks compare the performance of the existing
iterator based encoders, and the new batch oriented encoders using
randomly generated sets of booleans.
This commit adds a tsm1 function for encoding a batch of strings into a
provided buffer. The new function also shares the buffer between the
input data and the snappy encoded output, reducing allocations.
The following benchmarks compare the performance of the existing
iterator based encoders, and the new batch oriented encoders using
randomly generated strings.
name old time/op new time/op delta
EncodeStrings/10 2.14µs ± 4% 1.42µs ± 4% -33.56% (p=0.000 n=10+10)
EncodeStrings/100 12.7µs ± 3% 10.9µs ± 2% -14.46% (p=0.000 n=10+10)
EncodeStrings/1000 132µs ± 2% 114µs ± 2% -13.88% (p=0.000 n=10+9)
name old alloc/op new alloc/op delta
EncodeStrings/10 657B ± 0% 704B ± 0% +7.15% (p=0.000 n=10+10)
EncodeStrings/100 6.14kB ± 0% 9.47kB ± 0% +54.14% (p=0.000 n=10+10)
EncodeStrings/1000 61.4kB ± 0% 90.1kB ± 0% +46.66% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
EncodeStrings/10 3.00 ± 0% 0.00 -100.00% (p=0.000 n=10+10)
EncodeStrings/100 3.00 ± 0% 1.00 ± 0% -66.67% (p=0.000 n=10+10)
EncodeStrings/1000 3.00 ± 0% 1.00 ± 0% -66.67% (p=0.000 n=10+10)
This commit adds a tsm1 function for encoding a batch of floats into a
buffer. Further, it replaces the `bitstream` library used in the
existing encoders (and all the current decoders) with inlined bit
expressions within the encoder, significantly reducing the function call
overhead for larger batches.
The following benchmarks compare the performance of the existing
iterator based encoders, and the new batch oriented encoders. They look
at a sequential input slice and a randomly generated input slice.
name old time/op new time/op delta
EncodeFloats/10_seq 1.14µs ± 3% 0.24µs ± 3% -78.94% (p=0.000 n=10+10)
EncodeFloats/10_ran 1.69µs ± 2% 0.21µs ± 3% -87.43% (p=0.000 n=10+10)
EncodeFloats/100_seq 7.07µs ± 1% 1.72µs ± 1% -75.62% (p=0.000 n=7+9)
EncodeFloats/100_ran 15.8µs ± 4% 1.8µs ± 1% -88.60% (p=0.000 n=10+9)
EncodeFloats/1000_seq 50.2µs ± 3% 16.2µs ± 2% -67.66% (p=0.000 n=10+10)
EncodeFloats/1000_ran 174µs ± 2% 16µs ± 2% -90.77% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
EncodeFloats/10_seq 0.00B 0.00B ~ (all equal)
EncodeFloats/10_ran 0.00B 0.00B ~ (all equal)
EncodeFloats/100_seq 0.00B 0.00B ~ (all equal)
EncodeFloats/100_ran 0.00B 0.00B ~ (all equal)
EncodeFloats/1000_seq 0.00B 0.00B ~ (all equal)
EncodeFloats/1000_ran 0.00B 0.00B ~ (all equal)
name old allocs/op new allocs/op delta
EncodeFloats/10_seq 0.00 0.00 ~ (all equal)
EncodeFloats/10_ran 0.00 0.00 ~ (all equal)
EncodeFloats/100_seq 0.00 0.00 ~ (all equal)
EncodeFloats/100_ran 0.00 0.00 ~ (all equal)
EncodeFloats/1000_seq 0.00 0.00 ~ (all equal)
EncodeFloats/1000_ran 0.00 0.00 ~ (all equal)
This commit deletes most of the code to service reads from influxdb
and pulls it in from platform instead.
Of note, the models.Tag and models.Tags types are now aliases to the
platform models.Tag and models.Tags types. Additionally, many types
in the tsdb package relating to cursors are also aliases to the same
types in the platform cursors package.
This updates the platform and flux repos to the current master in the
Gopkg.lock.
This commit fixes an issue with the series file compaction process
where tombstones are lost after compaction and series existence
checks are not correct. This commit also fixes some smaller flushing
issues within the series file that mainly related to testing.
If there was an error after the cache has been snapshotted to one or
more TSM files, but before the cache and WAL are cleaned up, then the
cache would be repeatedly snapshotted, generated duplicate level 1 TSM
files.
This commit attempts to clean those files up by removing the temporary
TSM file(s). The snapshot will be retried.
Since all tag sets are materialised to strings before this method
returns, a large number of allocations can be avoided by carefully
resuing buffers and containers.
This commit reduces allocations by about 75%, which can be very
significant for high cardinality workloads.
The benchmark results shown below are for a benchmark that asks for all
series keys matching `tag5=value0'. There are 100K matching series keys.
benchmark old ns/op new ns/op delta
BenchmarkIndexSet_TagSets/1M_series/inmem-8 10959963 11144345 +1.68%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 23632757 18768888 -20.58%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 10496303 10380551 -1.10%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 24344359 19020234 -21.87%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 10359864 10818296 +4.43%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 23453357 19027445 -18.87%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 10479519 10400619 -0.75%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 26364965 19023749 -27.84%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 10437794 10557066 +1.14%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 23126946 19196955 -16.99%
benchmark old allocs new allocs delta
BenchmarkIndexSet_TagSets/1M_series/inmem-8 51 51 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 80067 20071 -74.93%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 51 51 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 80067 20071 -74.93%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 51 51 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 80067 20071 -74.93%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 51 51 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 80067 20071 -74.93%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 51 51 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 80067 20071 -74.93%
benchmark old bytes new bytes delta
BenchmarkIndexSet_TagSets/1M_series/inmem-8 3556728 3556728 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 12677328 5157992 -59.31%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 3556728 3556728 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 12677328 5157992 -59.31%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 3556728 3556728 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 12677328 5157992 -59.31%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 3556728 3556728 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 12677328 5157992 -59.31%
BenchmarkIndexSet_TagSets/1M_series/inmem-8 3556728 3556728 +0.00%
BenchmarkIndexSet_TagSets/1M_series/tsi1-8 12677328 5157992 -59.31%
This commit sets the copy-on-write feature of the SeriesIDSets, such
that we can make immutable clones of underlying bitmaps efficiently. If
the original bitmap is modified then a copy will be made, which won't
affect the clone.
This commit ensures that cached bitset results at the Index level are
updated whenever new series ids are created that would belong in those
bitsets.
For example, if we have a cached bitset for the tuple {mem, region,
west}, and we add the series mem,host=prod,region=west then we would
update the cached bitset for {mem, region, west} with the series id of
the newly written series.
This commit removes the HLL sketches on each `tsi1.LogFile` and
`tsi1.IndexFile` and instead caches the data at the `tsi1.Index`
level. This reduces the heap size significantly for servers with
many TSI-enabled shards.
FloatBatchDecodeAll behaves the same as the iterator-based float
decoder, returning an empty slice and no error when passed a buffer
with no encoded float values.
Fixes#10270
Since we append to the file itself, once we have read the file in, we
can be done with the mmap'd data.
Ideally we can rework UnmarshalBinary and do away with the mmap
completely. That is future work.
This commit ensures that any orphaned series (series that are to be
removed and no longer are referenced anywhere in the database) are
removed from the `inmem` index when a shard is dropped.
Since all tag sets are materialised to strings before this method
returns, a large number of allocations can be avoided by carefully
resuing buffers and containers.
This commit reduces allocations by about 75%, which can be very
significant for high cardinality workloads.
The benchmark results shown below are for a benchmark that asks for all
series keys matching `tag5=value0'.
name old time/op new time/op delta
Index_ConcurrentWriteQuery/inmem/queries_100000-8 5.66s ± 4% 5.70s ± 5% ~ (p=0.739 n=10+10)
Index_ConcurrentWriteQuery/tsi1/queries_100000-8 26.5s ± 8% 26.8s ±12% ~ (p=0.579 n=10+10)
IndexSet_TagSets/1M_series/inmem-8 11.9ms ±18% 10.4ms ± 2% -12.81% (p=0.000 n=10+10)
IndexSet_TagSets/1M_series/tsi1-8 23.4ms ± 5% 18.9ms ± 1% -19.07% (p=0.000 n=10+9)
name old alloc/op new alloc/op delta
Index_ConcurrentWriteQuery/inmem/queries_100000-8 2.50GB ± 0% 2.50GB ± 0% ~ (p=0.315 n=10+10)
Index_ConcurrentWriteQuery/tsi1/queries_100000-8 32.6GB ± 0% 32.6GB ± 0% ~ (p=0.247 n=10+10)
IndexSet_TagSets/1M_series/inmem-8 3.56MB ± 0% 3.56MB ± 0% ~ (all equal)
IndexSet_TagSets/1M_series/tsi1-8 12.7MB ± 0% 5.2MB ± 0% -59.02% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
Index_ConcurrentWriteQuery/inmem/queries_100000-8 24.0M ± 0% 24.0M ± 0% ~ (p=0.353 n=10+10)
Index_ConcurrentWriteQuery/tsi1/queries_100000-8 96.6M ± 0% 96.7M ± 0% ~ (p=0.579 n=10+10)
IndexSet_TagSets/1M_series/inmem-8 51.0 ± 0% 51.0 ± 0% ~ (all equal)
IndexSet_TagSets/1M_series/tsi1-8 80.4k ± 0% 20.4k ± 0% -74.65% (p=0.000 n=10+10)
The internals of `newSeriesCursor` returned a struct pointer that
implicitly got turned into the interface. Unfortunately, Go treats this
type of interface conversion as a nil pointer to the struct rather than
as just nil so if you attempted to compare the returned cursor to nil,
they would not be equal and it would think it was non-nil and attempt to
use the cursor.
This PR adds a configuration option that can be used to inform the
kernel that we intent to page in much of the TSM files.
This madvise value has been problematic in the past when its been set,
so this option defaults to off. It may be useful to some users with slow
disks.
PR #9204 introduced a maximum default concurrent compaction limit of 4.
The idea was to reduce IO utilisation on large systems with many cores,
and high write load. Often on these systems, disks were not scaled
appropriately to to the write volume, and while the write path could
keep up, compactions would saturate disks.
In #9225 work was done to reduce IO saturation by limiting the
compaction throughput. To some extent, both #9204 and #9225 work towards
solving the same problem.
We have recently begun to notice larger clusters to suffer from
situations where compactions are not keeping up because they have been
scaled up, but the limit of 4 has stayed in place. While users can
manually override the setting, it seems more user friendly if we remove
the limit by default, and set it manually in cases where compactions are
causing too much IO on large boxes.
If it's known that the read request only needs to use a single
measurement, then we can avoid the need to get field keys via the query
engine.
However, that means that a new method of getting the field keys for a
measurement would be needed. This commit exposes a method to efficiently
get field key names for a measurement across multiple shards.
name
Array cursors are enabled for storage RPC calls
tsm1:
* Implemented cursors that utilize Array decoders
storage:
* Abstractions to easily switch to Array cursors
* introduced tmpl from Arrow, which allows existing templates to be
reused with additional command-line properties to control output.
* duplicated suite of ReadFloatBlock tests for ReadFloatArrayBlock
* only the float data type is tested as the Read APIs are generated
from a single template.
* separate slices for time and values
* structured to be Arrow ready
* batch decoders fill time and value slices independently that
vastly improves performance (benchmarks linked in PR)
* APIs decode an entire byte slice of encoded data into the provided
`dst` slice
* APIs are stateless and in almost all cases avoid any allocations
* Intended to be used future batch-oriented TSM block decode APIs
* duplicated tests from original iterator-based APIs
This commit swaps out map[uint64]struct{} implementations for roaring
bitmaps, which in turn improves memory usage and read performance.
The bitmap implementation is abstracted such that for low cardinality
sets a simple slice of ids is used, to reduce in-use memory.
When adding many series using offline tooling, it's likely that every
series involves an entry being appended to a LogFile. Typically an entry
is 11 or 12 bytes, but the default bufio.Writer buffer size is only 4K.
This means by default a write of 10,000 new series would involve ~30
buffer flushes.
This commit makes the buffer configurable, and sets the value in
`buildtsi` such that it reflects the number of series being written to
the LogFile.
When running offline tooling, flushing buffers and syncing files on
every write to a `LogFile` is not necessary. Were a hard exit
with data loss to occur, the tooling can simply be run again.
TSI LogFile compactions occasionally race with insert and delete
operations because the index partition FileSet is retained needlessly by
the method that calls Partition.CheckLogFile.
In this change:
- TSI LogFile compaction respects enable/disable compactions
- Partition FileSet.Release before log compaction is triggered
An alternative to the second step is to handle log file compaction in a
new goroutine. Log file compaction errors would be logged and not
returned to the caller.
After this change, `DELETE FROM /regex/` does not deadlock; performance:
- 30s to delete 100 measurements
- 5m30s to delete 1000 measurements