Fixes the `tsm1.BlockIterator` so that it returns the current
key if there are still additional entries remaining. This previously
caused multiple entries not to be merged together during compaction
because the iterator would check if the next key matched the current
key but the key for the next set of entries was returned.
Adds export tooling to `influxd inspect export-blocks` so that we
can dump out block data in SQL format for better analysis during
the debugging process.
Adds a total cursor counter and seek location counter to a new
`readMetrics` that is added to each `Engine`. Default labels group
by `engine_id` and `node_id`.
Adds the ability to set the current generation to use when compacting
the cache only. Previously, we used the current generation for all
files, but this caused issues; the current generation should only be
used for level 1 compaction.
There exists a possibility for an in-flight read on a TSMReader to read
a stale reference to an mmapped TSM file index, which has become
unmapped.
This commit resolves that issue by simply renaming the file, leaving the
original file handle open and the data mapped. The path is updated so
that if any callers need to refer to the name of the TSM file after it's
renamed, the new name will be reflected.
The orphaned file handle will be closed when the TSM file is closed.
Previously the series file did not include tombstones in the total
count. This commit now includes tombstones in the count as well as
fixes an issue where replayed tombstone records could exist but
their underlying ID did not exist. This caused the count to be
decremented below zero, and because the count is a `uint64` it
wrapped around to `math.MaxUint64`.
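The wraparound is ordinary unsigned arithmetic. A minimal sketch of the
failure and of one possible guard follows; `decrement` is a hypothetical
helper for illustration, not the series file's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// decrement is a hypothetical guarded update: it refuses to go below
// zero instead of wrapping around.
func decrement(n uint64) uint64 {
	if n == 0 {
		return 0
	}
	return n - 1
}

func main() {
	var count uint64 // zero: no live series recorded

	wrapped := count - 1                   // unsigned subtraction wraps
	fmt.Println(wrapped == math.MaxUint64) // true
	fmt.Println(decrement(count))          // 0
}
```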
StringArrayEncodeAll will panic if the total length of strings
contained in the src slice is > 0xffffffff. This change adds a unit
test to replicate the issue and an associated fix to return an error.
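As a rough illustration of returning an error instead of panicking, the
sketch below sums the string lengths up front. `checkTotalLen`,
`maxTotalLen` and `errTooLarge` are made-up names, and the check ignores
any per-string framing overhead the real encoder adds.

```go
package main

import (
	"errors"
	"fmt"
)

// maxTotalLen is the limit quoted in the text above.
const maxTotalLen = 0xffffffff

var errTooLarge = errors.New("string block exceeds maximum encodable size")

// checkTotalLen is a hypothetical pre-flight check. It takes lengths
// rather than strings so it can be exercised without allocating 4 GiB.
func checkTotalLen(lens []int) error {
	var total uint64
	for _, n := range lens {
		total += uint64(n)
		if total > maxTotalLen {
			return errTooLarge
		}
	}
	return nil
}

func main() {
	fmt.Println(checkTotalLen([]int{10, 20}))
	fmt.Println(checkTotalLen([]int{0x7fffffff, 0x7fffffff, 2}))
}
```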
This also raises an issue that compactions will be unable to make
progress under the following condition:
* multiple string blocks are to be merged to a single block and
* the total length of all strings exceeds the maximum block size that
snappy will encode (0xffffffff)
The observable effect of this is errors in the logs indicating a
compaction failure.
Fixes #13687
This commit teaches the storage schema APIs how to track statistics
and make them available via the returned `cursors.StringIterator`.
Statistics are only tracked when decoding TSM blocks or when scanning
the in-memory cache.
Closes #13541
The TagKeys API will perform a linear scan if there is no predicate;
otherwise, it will use the index to find a list of candidate series
keys.
TagKeys expects the predicate to be transformed such that
`_measurement` and `_field` are remapped to `\x00` and `\xff`
respectively.
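The remapping itself is a simple substitution. A sketch, assuming a
standalone helper named `remapTagKey` (hypothetical; the real
transformation is applied to predicate nodes, not bare strings):

```go
package main

import "fmt"

// remapTagKey is a hypothetical helper that maps the user-facing names
// to the single-byte keys the schema APIs expect.
func remapTagKey(key string) string {
	switch key {
	case "_measurement":
		return "\x00"
	case "_field":
		return "\xff"
	default:
		return key
	}
}

func main() {
	for _, k := range []string{"_measurement", "_field", "host"} {
		fmt.Printf("%q -> %q\n", k, remapTagKey(k))
	}
}
```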
There is one TODO marked to analyze the predicate for a
`\x00 = '<measurement>'` pattern. If found, the predicate can be
eliminated and fall back to a linear prefix scan by combining the org,
bucket and measurement. This is tracked by issue #13497.
When a tsi1 partition closes, it waits on the wait group for compactions
and then acquires the lock. Unfortunately, a compaction may start in the
meantime, holding on to some resources. Then, close will attempt to
close those resources while holding the lock. That will block until
the compaction has finished, but it also needs to acquire the lock
in order to finish, leading to deadlock.
One cannot just move the wait group wait into the lock because, once
again, the compaction must acquire the lock before finishing. Compaction
can't finish before acquiring the lock because then it might be operating
on an invalid resource.
This change splits the locks into two: one to protect just against
concurrent Open and Close calls, and one to protect all of the other
state. We then just close the partition, acquire the lock, then free
the resources. Starting a compaction requires acquiring a reference
to the partition itself, so a compaction cannot start once the
partition has started closing.
This change also introduces a cancellation channel into a reference
to a resource that is closed when the resource is being closed, allowing
processes that have acquired a reference to clean up quicker if someone
is trying to close the resource.
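The shape of this can be sketched with a small reference-counted type.
Everything below (`resource`, `reference`, `Closing`) is illustrative
and assumed for the example; it is not the partition's real code. The
points it tries to show are that Close waits for references without
holding the lock, and that holders can watch a channel to bail out early.

```go
package main

import (
	"fmt"
	"sync"
)

type resource struct {
	mu      sync.Mutex
	closed  bool
	closing chan struct{} // closed as soon as Close begins
	refs    sync.WaitGroup
}

func newResource() *resource {
	return &resource{closing: make(chan struct{})}
}

type reference struct {
	r    *resource
	once sync.Once
}

// Acquire fails once the resource has started closing.
func (r *resource) Acquire() (*reference, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.closed {
		return nil, false
	}
	r.refs.Add(1)
	return &reference{r: r}, true
}

// Closing lets a holder notice that someone wants to close the resource.
func (ref *reference) Closing() <-chan struct{} { return ref.r.closing }

// Release is safe to call more than once.
func (ref *reference) Release() { ref.once.Do(ref.r.refs.Done) }

// Close marks the resource closed, signals holders, then waits for
// them to release. The mutex is not held while waiting.
func (r *resource) Close() {
	r.mu.Lock()
	if r.closed {
		r.mu.Unlock()
		return
	}
	r.closed = true
	close(r.closing)
	r.mu.Unlock()
	r.refs.Wait()
}

func main() {
	r := newResource()
	ref, ok := r.Acquire()
	fmt.Println(ok) // true

	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ref.Closing() // stand-in for a compaction noticing the shutdown
		ref.Release()
	}()

	r.Close()
	<-done
	_, ok = r.Acquire()
	fmt.Println(ok) // false
}
```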
The TagValues API will perform a linear scan if there is no predicate;
otherwise, it will use the index to find a list of candidate series
keys.
TagValues expects the predicate to be transformed such that
`_measurement` and `_field` are remapped to `\x00` and `\xff`
respectively.
There is one TODO marked to analyze the predicate for a
`\x00 = '<measurement>'` pattern. If found, the predicate can be
eliminated and fall back to a linear prefix scan by combining the org,
bucket and measurement.
The TimeRangeIterator permits linear or random index scans and
can answer whether the current key has data for the specified time
interval, considering any tombstones.
When there are no tombstones there are some opportunities for
optimization to skip decoding blocks. Specifically, decoding can be
skipped if the queried time interval overlaps any boundary of the TSM
index entries.
Add a Contains API which is a peer to the TimestampArray.Contains
function. This is used by the schema APIs to determine if data exists
in the cache for a given key and time interval.
Permits random access of the iterator, correctly maintaining state,
so that Next may be called to iterate from a given key.
This API will be used by the schema APIs when a predicate is specified,
typically requiring random access.
TimestampArray.Contains(min,max) API performs a binary search to
determine if timestamps exist for the given time interval.
It also implements Exclude to drop timestamps that have been tombstoned.
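For reference, a binary-search containment check over a sorted slice can
be written in a few lines. This is a sketch under the assumption that
the interval is inclusive on both ends; the free function `contains` is
illustrative, not the method itself.

```go
package main

import (
	"fmt"
	"sort"
)

// contains reports whether the sorted slice ts holds at least one
// timestamp in the inclusive interval [lo, hi].
func contains(ts []int64, lo, hi int64) bool {
	if lo > hi {
		return false
	}
	// Index of the first timestamp >= lo, or len(ts) if there is none.
	i := sort.Search(len(ts), func(i int) bool { return ts[i] >= lo })
	return i < len(ts) && ts[i] <= hi
}

func main() {
	ts := []int64{10, 20, 30}
	fmt.Println(contains(ts, 11, 19)) // false
	fmt.Println(contains(ts, 15, 25)) // true
}
```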
DecodeTimestampArrayBlock decodes only the timestamps of the provided
block.
Removes the `STATS` file generated during TSI compaction as it had
potential for becoming inconsistent with the index data. Instead,
stats are recalculated on start up and on each compaction on a
per-partition basis.
Computing stats for 10M series across 10K measurements takes
approximately 0.171s.
The storer interface isn't necessary once the init/Free logic is
removed, and that logic is itself unnecessary in a world with only one
shard.
Additionally, there were some cases where an init/Free call could
race and cause data loss in the cache. Not doing it at all fixes
all of those races.
This change fixes #10511 that manifests when a shard is considered cold
faster than its cache is snapshotted. Previously the code only looked at
the last modification of compacted tsm1 files. Instead the (restored)
Engine.lastModified() also takes the cache into account.
Ports #10522 to master where engine.go has moved and Engine.LastModified()
was deleted because it was unused.
This commit adds a reason label to the total compaction metric. For
snapshots, the reason will indicate why the cache was snapshotted. For
other compactions, the reason label will be blank.
This commit adds a new Cache option, via the
`tsm1.CacheConfig.SnapshotAgeDuration` field, which controls the maximum
age the cache can reach before it is snapshotted to a TSM file.
The default value for this option is `0`, which means that the cache
will never be snapshotted based only on age. Setting this value to, for
example, 10 seconds, would result in the cache snapshotting every 10
seconds.
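The rule described here could be expressed roughly as follows.
`shouldSnapshotByAge` is a hypothetical function written for this note;
only the meaning of `0` (age-based snapshotting disabled) comes from the
text above.

```go
package main

import (
	"fmt"
	"time"
)

// shouldSnapshotByAge is a hypothetical predicate: a maxAge of 0
// disables age-based snapshotting entirely.
func shouldSnapshotByAge(age, maxAge time.Duration) bool {
	if maxAge <= 0 {
		return false
	}
	return age >= maxAge
}

func main() {
	age := 15 * time.Second
	fmt.Println(shouldSnapshotByAge(age, 0))              // false
	fmt.Println(shouldSnapshotByAge(age, 10*time.Second)) // true
}
```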
Snapshotting the cache more frequently can provide better durability
guarantees in some circumstances, though more, smaller TSM files will
lead to more work needed to compact them down to larger, more dense
files.
When using InfluxDB with a WAL there isn't really a strong reason to
alter `tsm1.CacheConfig.SnapshotAgeDuration` from `0`.
When the WAL was moved up, the validation that happened at the cache
was skipped. This moves the field type validation for a batch of
points up ahead of the WAL again.
During Recover, we forgot to propagate the disabled flag to the
keyIDMap options like we do during Open. Since we still do propagate
the singleton `ims` which is initialized lazily, if the first
initialization has a different set of labels, it will cause an
inconsistent usage even if the metrics are disabled.
This commit adds the pkg/lifecycle.Resource to help manage opening,
closing, and leasing out references to some resource. A resource
cannot be closed until all acquired references have been released.
If the debug_ref tag is enabled, all resource acquisitions keep
track of the stack trace that created them and have a finalizer
associated with them to print on stderr if they are leaked. It also
registers a handler on SIGUSR2 to dump all of the currently live
resources.
Having resources tracked in a uniform way with a data type allows us
to do more sophisticated tracking with the debug_ref tag, as well.
For example, we could panic the process if a resource cannot be
closed within a certain time frame, or attempt to figure out the
DAG of resource ownership dynamically.
This commit also fixes many issues around resources, correctness
during error scenarios, reporting of errors, idempotency of
close, tracking of memory for some data structures, resource leaks
in tests, and out of order dependency closes in tests.
It turns out that LastModified and DiskSize are unused, and so it
was easy to change to not care about the WAL.
This hooks up metrics and starts the WAL again.
At the cost of some nil checks, we no longer need an interface, an empty
implementation, or defenses against subtle bugs with nils in non-nil
interfaces.
Also, the tsm1 engine is losing the WAL anyway.
Because the WAL relies on the tsm1.Value type, we move that into its own
tsm1/value package and set up some aliases forwarding them into tsm1. This
also required adding some methods and changing consumers to avoid the
unexported fields. I imagine this step will be useful one day when we make
the write path more efficient with respect to consuming points.
This commit additionally fixes some issues with generation. The iterator.tmpldata
and generation for array_cursor_* were removed accidentally when removing
iterators, making those generated files stale. Restore that and regenerate.
No change in functionality.
In the case of caching TSI bitmaps belonging to immutable .tsi files,
the underlying bitset data can be mmapped. It is possible, though rare,
for this data to be unmapped (e.g., via a TSI compaction) but for the
cached bitmap to be subsequently read. This leads to a segfault.
This only happens when copy-on-write is set to true on the roaring
bitmap, because in that case only the internal pointers are cloned.
This change will reduce the TSI cache performance by around 10%, which I
have deemed to account for only a few microseconds typically.
If a bucket had bytes in it that would be escaped by the models
parser/package, then the index would not be correctly purged of those
series data when the bucket was dropped.
Previously series that were being removed were tracked at the key level.
This means that when removing them from the series file, the series id
first had to be looked up. This can cause lock thrashing when there are
many series ids to look up (such as with a bulk delete), because there
are no bulk methods to do this.
This commit changes how the series file delete is done by extracting
the series ids from the index before we remove the index entries. It's
then possible to delete all those series ids from the series file
without having to look up the ids.
This commit improves the performance of a mass delete on the TSI index
by deleting at the measurement level instead of deleting each series
individually.
I did this with a dumb editor macro, so some comments changed too.
Also rename root package from platform to influxdb.
In interest of minimizing risk, anyone importing the root package has
now aliased it to "platform" so that no changes beyond imports were
necessary in those files.
Lastly, replace the old platform module to local path /dev/null so that
nobody can accidentally reintroduce a platform dependency while
migrating platform code to influxdb.
It exposes an API that will clean up the bodies of many methods and
provide a safe abstraction around iteration that will be able to
handle reads with concurrent deletes.
Benchmarks are flat.
Since the methods inline and dead code is eliminated, it has no runtime
overhead in the benchmarks when disabled.
benchmark recorded faults
BenchmarkIndirectIndex_Entries-8 11
BenchmarkIndirectIndex_ReadEntries-8 11
BenchmarkIndirectIndex_DeleteRangeLast-8 17
BenchmarkIndirectIndex_DeleteRangeFull-8 2218
BenchmarkIndirectIndex_Delete-8 2084
This commit adds support for "prefix keys". Prefix keys differ from
regular tombstone key entries in that the key of the entry should act as
a prefix that matches all series with the same prefix key for the given
time range.
This means only one entry is needed to delete many series.
The tombstone entries now have a maximum length of 16777215 (24 bits),
with the remaining 8 high bits available for setting further options /
meta information about the tombstone entry.
In this case, the top bit is used to indicate that the tombstone entry
is intended to be a prefix. This leaves 7 spare bits for future use.
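A rough sketch of that layout, assuming the length and the option bits
share a single 32-bit word. The names `encodeHeader`, `decodeHeader` and
`prefixFlag` are illustrative and not taken from the tombstone code.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxKeyLen  = 1<<24 - 1 // 16777215, the low 24 bits
	prefixFlag = 1 << 31   // top bit marks a prefix key
)

// encodeHeader packs a key length and the prefix flag into one word.
func encodeHeader(keyLen int, prefix bool) (uint32, error) {
	if keyLen < 0 || keyLen > maxKeyLen {
		return 0, errors.New("key length out of range")
	}
	h := uint32(keyLen)
	if prefix {
		h |= prefixFlag
	}
	return h, nil
}

// decodeHeader is the inverse of encodeHeader. The seven bits between
// the length and the prefix flag are ignored here.
func decodeHeader(h uint32) (keyLen int, prefix bool) {
	return int(h & maxKeyLen), h&prefixFlag != 0
}

func main() {
	h, err := encodeHeader(42, true)
	if err != nil {
		panic(err)
	}
	n, isPrefix := decodeHeader(h)
	fmt.Printf("%#x %d %v\n", h, n, isPrefix) // 0x8000002a 42 true
}
```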
Rather than starting at the first key, do a binary search to the
first key. This changes O(N) when deleting the largest key to O(log N).
benchmark old ns/op new ns/op delta
BenchmarkIndirectIndex_DeleteRangeFull-8 17884166763 738717473 -95.87%
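The idea in isolation, as a sketch over a plain sorted slice of keys
(the real index layout differs; `firstIndex` is a name made up for this
example):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// firstIndex returns the position of the first key >= target in a
// slice sorted by bytes.Compare, or len(keys) if every key is smaller.
func firstIndex(keys [][]byte, target []byte) int {
	return sort.Search(len(keys), func(i int) bool {
		return bytes.Compare(keys[i], target) >= 0
	})
}

func main() {
	keys := [][]byte{[]byte("cpu"), []byte("disk"), []byte("mem")}
	fmt.Println(firstIndex(keys, []byte("mem"))) // 2
}
```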
Report the total number of gets, puts, and deletes at the end of the
test. I've found this kind of output to be a useful sanity check in
similar tests that exercise concurrency involving tasks.
Use a local random source in each goroutine. I unscientifically
eyeballed that to increase total operations by 5-10%.
Also call t.Parallel in a few more tests that involve disk access. This
shaves 1-2 seconds off the full tsi1 test suite on my machine.
This commit adds the `.tss` files generated for TSM statistics to
the `FileObserver` so that package users can be notified when new
stats files are created and removed.
This commit replaces an `os.OpenFile()` call with an `os.Create()`
call which drops `O_EXCL` for `O_TRUNC` since `.tss` files are only
created after `.tsm` files so lingering temporary files are safe to
overwrite.
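The difference between the two calls can be seen with a leftover file
on disk. This is a self-contained sketch; the file name is invented and
`staleFileOutcome` is not part of the package.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// staleFileOutcome leaves a stale temporary file in a scratch
// directory, then tries both ways of creating it again.
func staleFileOutcome() (exclFails, createSucceeds bool, err error) {
	dir, err := os.MkdirTemp("", "tss-example")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "000001.tss.tmp")
	if err := os.WriteFile(path, []byte("stale"), 0o644); err != nil {
		return false, false, err
	}

	// O_EXCL refuses to open a file that already exists.
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_EXCL, 0o666)
	if err == nil {
		f.Close()
	}
	exclFails = os.IsExist(err)

	// os.Create uses O_RDWR|O_CREATE|O_TRUNC: stale contents are dropped.
	f, err = os.Create(path)
	if err != nil {
		return exclFails, false, nil
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		return exclFails, false, err
	}
	return exclFails, info.Size() == 0, nil
}

func main() {
	fmt.Println(staleFileOutcome())
}
```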
The tsdb/tsm1 package was one of the test suites that took the longest
to run in platform with go test -short. The rule of thumb on the Go
project is that short mode should skip any individual test that takes
longer than one second. This change skips two such tests, and it
eliminates a string concatenation loop in two other tests, so that they
report completion in "0.00s" rather than about 0.94s, on my machine.
These cumulative changes take `go test -short ./tsdb/tsm1` from about 14
seconds to about 7 seconds on my machine.
And add a test to cover that.
The data race would look roughly like:
```
WARNING: DATA RACE
Write at 0x00c000024e18 by goroutine 8:
github.com/RoaringBitmap/roaring.(*roaringArray).markAllAsNeedingCopyOnWrite()
/Users/mr/go/pkg/mod/github.com/!roaring!bitmap/roaring@v0.4.16/roaringarray.go:881 +0x6b
github.com/RoaringBitmap/roaring.(*roaringArray).clone()
/Users/mr/go/pkg/mod/github.com/!roaring!bitmap/roaring@v0.4.16/roaringarray.go:266 +0x808
github.com/RoaringBitmap/roaring.(*Bitmap).Clone()
/Users/mr/go/pkg/mod/github.com/!roaring!bitmap/roaring@v0.4.16/roaring.go:385 +0x58
github.com/influxdata/platform/tsdb.(*SeriesIDSet).CloneNoLock()
/Users/mr/go/src/github.com/influxdata/platform/tsdb/series_set.go:229 +0x73
github.com/influxdata/platform/tsdb.(*SeriesIDSet).Clone()
Previous write at 0x00c000024e18 by goroutine 7:
github.com/RoaringBitmap/roaring.(*roaringArray).markAllAsNeedingCopyOnWrite()
/Users/mr/go/pkg/mod/github.com/!roaring!bitmap/roaring@v0.4.16/roaringarray.go:881 +0x6b
github.com/RoaringBitmap/roaring.(*roaringArray).clone()
/Users/mr/go/pkg/mod/github.com/!roaring!bitmap/roaring@v0.4.16/roaringarray.go:266 +0x808
github.com/RoaringBitmap/roaring.(*Bitmap).Clone()
/Users/mr/go/pkg/mod/github.com/!roaring!bitmap/roaring@v0.4.16/roaring.go:385 +0x58
github.com/influxdata/platform/tsdb.(*SeriesIDSet).CloneNoLock()
/Users/mr/go/src/github.com/influxdata/platform/tsdb/series_set.go:229 +0x73
github.com/influxdata/platform/tsdb.(*SeriesIDSet).Clone()
/Users/mr/go/src/github.com/influxdata/platform/tsdb/series_set.go:223 +0x7b
```
We were passing a non-nil tsm1.Log containing a nil *tsm1.WAL, which
would cause a panic when it was used. Instead, always pass a non-nil
WAL.
We change the storage engine code to not pass in a nil WAL, and
additionally add a defensive check to change any nil WALs into a
NopWAL.
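For anyone unfamiliar with this Go pitfall, here is a minimal
reproduction. `Log`, `WAL`, `NopWAL` and `orNop` below are small
stand-ins defined for the example, not the tsm1 types themselves.

```go
package main

import "fmt"

type Log interface {
	Write(p []byte) error
}

type WAL struct{ entries int }

// Write dereferences w, so calling it on a nil *WAL panics.
func (w *WAL) Write(p []byte) error {
	w.entries++
	return nil
}

type NopWAL struct{}

func (NopWAL) Write([]byte) error { return nil }

// orNop is a hypothetical defensive check: it swaps a nil *WAL for a
// no-op implementation before the value is stored in an interface.
func orNop(w *WAL) Log {
	if w == nil {
		return NopWAL{}
	}
	return w
}

func main() {
	var w *WAL // nil pointer
	var l Log = w

	fmt.Println(l == nil)                    // false: the interface holds a typed nil
	fmt.Println(orNop(w).Write([]byte("x"))) // <nil>
}
```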
- Add some documentation.
- Move compaction planner to an option instead of config.
The latter fits with the general theme of having config be things
that can be specified in a toml, and everything else being an
option.
Encode the compressed data at the start of the internal buffer. This ensures
the returned slice maintains the entire capacity and is available for
subsequent use.
When we pool / reuse string buffers, this will help considerably.
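The capacity point can be seen with plain slicing. The two functions
below are stand-ins that only model where the output ends up in the
buffer; no compression happens.

```go
package main

import "fmt"

// encodeAtEnd models an encoder that leaves its n output bytes at the
// tail of buf: the returned slice can never grow past n bytes.
func encodeAtEnd(buf []byte, n int) []byte { return buf[len(buf)-n:] }

// encodeAtStart models an encoder that writes at the front: the
// returned slice still carries the whole buffer's capacity.
func encodeAtStart(buf []byte, n int) []byte { return buf[:n] }

func main() {
	buf := make([]byte, 64)
	fmt.Println(cap(encodeAtEnd(buf, 8)), cap(encodeAtStart(buf, 8))) // 8 64
}
```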
Improvements over previous commit:
```
name old time/op new time/op delta
EncodeStrings/10/batch-8 542ns ± 1% 355ns ± 2% -34.53% (p=0.008 n=5+5)
EncodeStrings/100/batch-8 5.29µs ± 1% 3.58µs ± 2% -32.20% (p=0.008 n=5+5)
EncodeStrings/1000/batch-8 48.6µs ± 0% 36.2µs ± 2% -25.40% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
EncodeStrings/10/batch-8 704B ± 0% 0B -100.00% (p=0.008 n=5+5)
EncodeStrings/100/batch-8 9.47kB ± 0% 0.00kB -100.00% (p=0.008 n=5+5)
EncodeStrings/1000/batch-8 90.1kB ± 0% 0.0kB -100.00% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
EncodeStrings/10/batch-8 0.00 0.00 ~ (all equal)
EncodeStrings/100/batch-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
EncodeStrings/1000/batch-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
```
This commit adds a tsm1 function for encoding a batch of booleans into a
provided buffer.
The following benchmarks compare the performance of the existing
iterator based encoders, and the new batch oriented encoders using
randomly generated sets of booleans.
This commit adds a tsm1 function for encoding a batch of strings into a
provided buffer. The new function also shares the buffer between the
input data and the snappy encoded output, reducing allocations.
The following benchmarks compare the performance of the existing
iterator based encoders, and the new batch oriented encoders using
randomly generated strings.
name old time/op new time/op delta
EncodeStrings/10 2.14µs ± 4% 1.42µs ± 4% -33.56% (p=0.000 n=10+10)
EncodeStrings/100 12.7µs ± 3% 10.9µs ± 2% -14.46% (p=0.000 n=10+10)
EncodeStrings/1000 132µs ± 2% 114µs ± 2% -13.88% (p=0.000 n=10+9)
name old alloc/op new alloc/op delta
EncodeStrings/10 657B ± 0% 704B ± 0% +7.15% (p=0.000 n=10+10)
EncodeStrings/100 6.14kB ± 0% 9.47kB ± 0% +54.14% (p=0.000 n=10+10)
EncodeStrings/1000 61.4kB ± 0% 90.1kB ± 0% +46.66% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
EncodeStrings/10 3.00 ± 0% 0.00 -100.00% (p=0.000 n=10+10)
EncodeStrings/100 3.00 ± 0% 1.00 ± 0% -66.67% (p=0.000 n=10+10)
EncodeStrings/1000 3.00 ± 0% 1.00 ± 0% -66.67% (p=0.000 n=10+10)
This commit adds a tsm1 function for encoding a batch of floats into a
buffer. Further, it replaces the `bitstream` library used in the
existing encoders (and all the current decoders) with inlined bit
expressions within the encoder, significantly reducing the function call
overhead for larger batches.
The following benchmarks compare the performance of the existing
iterator based encoders, and the new batch oriented encoders. They look
at a sequential input slice and a randomly generated input slice.
name old time/op new time/op delta
EncodeFloats/10_seq 1.14µs ± 3% 0.24µs ± 3% -78.94% (p=0.000 n=10+10)
EncodeFloats/10_ran 1.69µs ± 2% 0.21µs ± 3% -87.43% (p=0.000 n=10+10)
EncodeFloats/100_seq 7.07µs ± 1% 1.72µs ± 1% -75.62% (p=0.000 n=7+9)
EncodeFloats/100_ran 15.8µs ± 4% 1.8µs ± 1% -88.60% (p=0.000 n=10+9)
EncodeFloats/1000_seq 50.2µs ± 3% 16.2µs ± 2% -67.66% (p=0.000 n=10+10)
EncodeFloats/1000_ran 174µs ± 2% 16µs ± 2% -90.77% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
EncodeFloats/10_seq 0.00B 0.00B ~ (all equal)
EncodeFloats/10_ran 0.00B 0.00B ~ (all equal)
EncodeFloats/100_seq 0.00B 0.00B ~ (all equal)
EncodeFloats/100_ran 0.00B 0.00B ~ (all equal)
EncodeFloats/1000_seq 0.00B 0.00B ~ (all equal)
EncodeFloats/1000_ran 0.00B 0.00B ~ (all equal)
name old allocs/op new allocs/op delta
EncodeFloats/10_seq 0.00 0.00 ~ (all equal)
EncodeFloats/10_ran 0.00 0.00 ~ (all equal)
EncodeFloats/100_seq 0.00 0.00 ~ (all equal)
EncodeFloats/100_ran 0.00 0.00 ~ (all equal)
EncodeFloats/1000_seq 0.00 0.00 ~ (all equal)
EncodeFloats/1000_ran 0.00 0.00 ~ (all equal)
This keeps file compatibility by just writing out zeros for the
sizes and offsets. Perhaps it's ok to just nuke everything and
remove the data.
It also keeps the hll package because it seems generally useful
even if it's not currently being used.
These are the log messages that get printed immediately when starting
the application for the first time. This fixes the messages to conform
to the logging style guide.