This commit introduces a new API for finding the maximum
timestamp of a series when iterating over the keys in a
set of TSM files.
This API will be used to determine the field type of a single
field key by selecting the series with the maximum timestamp.
It also refactors the common functionality for iterating over TSM keys
into `timeRangeBlockReader`, which is shared between
`TimeRangeIterator` and `TimeRangeMaxTimeIterator`.
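As an illustration of the intended use (the names below are hypothetical, not the actual API), resolving a field key's type by selecting the series with the greatest maximum timestamp might look like this:

```go
package main

import "fmt"

// seriesMeta is a hypothetical summary of one series under a field key:
// the block type of its values and the maximum timestamp observed while
// iterating the keys of a set of TSM files.
type seriesMeta struct {
	blockType byte  // e.g. a TSM block-type constant
	maxTime   int64 // maximum timestamp for the series
}

// fieldType picks the block type of the series with the greatest maximum
// timestamp, resolving the type of a field key when series disagree.
func fieldType(series []seriesMeta) (byte, bool) {
	if len(series) == 0 {
		return 0, false
	}
	best := series[0]
	for _, s := range series[1:] {
		if s.maxTime > best.maxTime {
			best = s
		}
	}
	return best.blockType, true
}

func main() {
	typ, ok := fieldType([]seriesMeta{
		{blockType: 0 /* float */, maxTime: 100},
		{blockType: 1 /* integer */, maxTime: 250}, // most recent wins
	})
	fmt.Println(typ, ok) // 1 true
}
```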
These APIs require a measurement, permitting an additional optimization
that reduces the search space against the TSM index: the search key
prefix is extended from `org+bucket` to `org+bucket,\x00=<measurement>`.
The affected APIs are:
* MeasurementNames
* MeasurementTagKeys
* MeasurementTagValues
* Adds an API to the models package for efficiently parsing the
measurement tag (`\x00`) from a normalized series key
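A minimal sketch of such a parser, assuming a normalized key of the form `name,\x00=<measurement>,tagk=tagv,...` and ignoring tag-value escaping (the function name is hypothetical):

```go
package main

import (
	"bytes"
	"fmt"
)

// measurementTag extracts the value of the measurement tag (\x00) from a
// normalized series key. Because keys are normalized, \x00 sorts first
// and is always the first tag, so a prefix scan is enough.
func measurementTag(key []byte) ([]byte, bool) {
	i := bytes.IndexByte(key, ',')
	if i < 0 || !bytes.HasPrefix(key[i+1:], []byte("\x00=")) {
		return nil, false
	}
	rest := key[i+3:] // skip ",\x00="
	if j := bytes.IndexByte(rest, ','); j >= 0 {
		return rest[:j], true
	}
	return rest, true
}

func main() {
	m, ok := measurementTag([]byte("bkt,\x00=cpu,region=west"))
	fmt.Printf("%q %v\n", m, ok) // "cpu" true
}
```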
The root cause is that the Unsigned data type has no representation
in the valueType function in the cache and falls back to the default
case of 0.
0 is also a sentinel value in the entry#add function that will
result in skipping the value type check.
It is therefore possible for unsigned values followed by values of some
other data type to be stored in the cache.
It is suspected that the write may be rejected before reaching the
cache, and therefore may not occur in practice. Specifically, the
series file stores the data types on a per-series basis and would
reject the write.
This commit turns the value types into explicit constants and
ensures all existing block types are represented. In addition,
it adds a mapping function to convert these to a known Block type,
which will be used by the `MeasurementFields` schema request to
determine the type of a series in the cache.
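A sketch of the shape of the fix; the constant and function names here are hypothetical stand-ins, not the actual identifiers:

```go
package main

import "fmt"

// Explicit value-type constants: the zero value is reserved for
// "unknown" and every supported type, including unsigned, has a
// distinct non-zero representation.
const (
	valueTypeUndefined byte = iota // sentinel: type not yet known
	valueTypeFloat64
	valueTypeInteger
	valueTypeString
	valueTypeBoolean
	valueTypeUnsigned // previously missing, fell through to 0
)

// Hypothetical block-type constants mirroring the TSM block types.
const (
	blockFloat64 byte = iota
	blockInteger
	blockBoolean
	blockString
	blockUnsigned
)

// valueTypeToBlockType maps a cache value type to a TSM block type, so a
// schema request can report the type of a series held only in the cache.
func valueTypeToBlockType(v byte) (byte, error) {
	switch v {
	case valueTypeFloat64:
		return blockFloat64, nil
	case valueTypeInteger:
		return blockInteger, nil
	case valueTypeString:
		return blockString, nil
	case valueTypeBoolean:
		return blockBoolean, nil
	case valueTypeUnsigned:
		return blockUnsigned, nil
	default:
		return 0, fmt.Errorf("unknown value type %d", v)
	}
}

func main() {
	b, err := valueTypeToBlockType(valueTypeUnsigned)
	fmt.Println(b, err) // 4 <nil>
}
```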
* refactor(storage): move type ByTagKey to the only package that uses it
* refactor(tsdb): use types in tsdb/cursors
* refactor(tsdb): remove unused type SeriesIDElems
* refactor(tsdb): inline only use of tsdb.ReadAllSeriesIDIterator
* refactor(tsdb): move series file to its own package
* refactor(storage): remove platform->influxdb aliases
* feat(backup): `influx backup` creates data backup
* feat(backup): initial restore work
* feat(restore): initial restore impl
Adds a restore tool which does offline restore of data and metadata.
* fix(restore): pr cleanup
* fix(restore): fix data dir creation
* fix(restore): pr cleanup
* chore: amend CHANGELOG
* fix: restore to empty dir fails differently
* feat(backup): backup and restore credentials
Saves the credentials file to backups and restores it from backups.
Additionally adds some logging for errors when fetching backup files.
* fix(restore): add missed commit
* fix(restore): pr cleanup
* fix(restore): fix default credentials restore path
* fix(backup): actually copy the credentials file for the backup
* fix: dirs get 0777, files get 0666
* fix: small review feedback
Co-authored-by: tmgordeeva <tanya@influxdata.com>
This commit adds numerous tests for ascending and descending cursors
that generate merged blocks across multiple files, which exceed the
default fixed buffer size used by the array cursors (MaxPointsPerBlock).
Tests cover two scenarios:
1. Each file has one block and the block from the second file is
entirely contained within the first block of the first file.
When merging, the new block contains 1,200 values, which exceeds
MaxPointsPerBlock.
2. Each file has multiple blocks, and the blocks have a mixture of
values which interleave and overwrite.
This commit prevents values from being truncated when multiple blocks
for the same series key are read into an empty buffer.
The current cursor reader code has an optimisation that incorrectly
assumes the incoming array will be limited to 1,000 values (the maximum
block size), but arrays can contain values from multiple matching
blocks.
Fixes #15817
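A sketch of the corrected read pattern, with hypothetical names: the source array must be consumed in buffer-sized chunks rather than assumed to fit in one pass.

```go
package main

import "fmt"

const maxPointsPerBlock = 1000 // fixed buffer size used by the array cursors

// fillBuf copies up to maxPointsPerBlock values from src into buf and
// returns the filled buffer plus the remaining values. The incorrect
// optimisation copied once and dropped the remainder when len(src)
// exceeded the buffer, truncating merged multi-block arrays.
func fillBuf(buf, src []int64) (filled, rest []int64) {
	n := copy(buf[:maxPointsPerBlock], src)
	return buf[:n], src[n:]
}

func main() {
	src := make([]int64, 1200) // e.g. two overlapping blocks merged together
	buf := make([]int64, maxPointsPerBlock)
	filled, rest := fillBuf(buf, src)
	fmt.Println(len(filled), len(rest)) // 1000 200 — nothing is lost
}
```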
This commit addresses several data-races on the `tsm1.Predicate` type
that were causing a live-lock or similar in rare cases during a delete.
Because `tsm1/FileStore.Apply` executes concurrently across TSM files
the state of the delete's predicate was being unsafely mutated.
This commit adds a `Clone` method to the `influxdb.Predicate` type,
which should be used whenever an `influxdb.Predicate` implementation
needs to be used concurrently.
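A minimal sketch of the intended usage pattern, with a stand-in predicate type rather than the real `influxdb.Predicate` interface:

```go
package main

import "sync"

// predicate is a stand-in for an influxdb.Predicate implementation with
// internal scratch state that is unsafe to share between goroutines.
type predicate struct{ scratch []byte }

// Clone returns an independent copy so each goroutine mutates its own state.
func (p *predicate) Clone() *predicate {
	return &predicate{scratch: append([]byte(nil), p.scratch...)}
}

// applyToFiles mirrors the FileStore.Apply pattern: fn runs once per TSM
// file, concurrently, so each invocation gets its own clone.
func applyToFiles(files []string, p *predicate, fn func(file string, p *predicate)) {
	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go func(f string) {
			defer wg.Done()
			fn(f, p.Clone()) // never share the original across goroutines
		}(f)
	}
	wg.Wait()
}

func main() {
	applyToFiles([]string{"a.tsm", "b.tsm"}, &predicate{}, func(string, *predicate) {})
}
```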
* chore: Remove several instances of WithLogger
* chore: unexport Logger fields
* chore: unexport some more Logger fields
* chore: go fmt
chore: fix test
chore: s/logger/log
chore: fix test
chore: revert http.Handler.Handler constructor initialization
* refactor: integrate review feedback, fix all test nop loggers
* refactor: capitalize all log messages
* refactor: rename two logger to log
Fixes #15916.
If a predicate was passed in with multiple key/value matches for the
same tag key, then the value index would be incorrect. This ensures that
each tag key can only be added to the location map once.
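A small sketch of the dedup rule (hypothetical names): registering a tag key a second time must not shift the value indexes of later keys.

```go
package main

import "fmt"

// buildLocs assigns each tag key a single value index. A predicate with
// two matches on the same key must not register the key twice.
func buildLocs(keys []string) map[string]int {
	locs := make(map[string]int)
	for _, k := range keys {
		if _, ok := locs[k]; ok {
			continue // already placed; adding again would corrupt indexes
		}
		locs[k] = len(locs)
	}
	return locs
}

func main() {
	fmt.Println(buildLocs([]string{"host", "host", "region"}))
	// map[host:0 region:1] — "region" gets index 1, not 2
}
```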
Fixes #15859
This commit fixes a defect in the TSI index where a filter using the
negated equality operator would result in no matching series being
returned for series stored within the `IndexFile` portions of the index.
The root cause of this was due to missing legacy-handling code in the
index for this particular iterator.
* fix(storage): add failing test for array cursor iterator stats
* fix(storage): make arrayCursorIterator.Stats() return stats of in-focus cursor
* fix(storage): add failing test to assert arrayCursorIterator.Stats() returns accumulated result
* fix(storage): accumulate stats in arrayCursorIterator.Stats() call across all observed cursors
By default this feature is disabled; the full compaction behaviour does
not change. When this feature is enabled compactions can be limited
across multiple storage engines running in multiple processes.
The mechanism by which this happens is not part of the abstraction added
here.
Previously the TSI partition would panic if a compaction was
started while `Wait()` was waiting. This commit removes the previous
wait group and replaces it with a simple counter. The `Wait()`
function now polls the counter until it reaches zero.
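A minimal sketch of the counter-based approach, assuming atomic operations rather than the actual implementation details:

```go
package main

import (
	"runtime"
	"sync/atomic"
)

// compactions tracks in-flight compactions. Unlike a sync.WaitGroup,
// incrementing while another goroutine is waiting is always safe.
type compactions struct{ n int64 }

func (c *compactions) start() { atomic.AddInt64(&c.n, 1) }
func (c *compactions) done()  { atomic.AddInt64(&c.n, -1) }

// Wait polls until no compactions are running.
func (c *compactions) Wait() {
	for atomic.LoadInt64(&c.n) > 0 {
		runtime.Gosched() // yield; a short sleep would also work
	}
}

func main() {
	var c compactions
	c.start()
	go c.done()
	c.Wait()
}
```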
The cache is essentially a set of maps, where a key in each map is a
series key, and the value is a slice of values associated with that key.
The cache is sharded and series keys are hashed to determine which shard
(map) they live in.
When deleting from the cache we have to check each key to see if it
matches the delete command (predicate and timestamp). If it does then
the entries for that range are removed. As part of this work we check if
the entries are already empty (already removed) and if so we don't check
if the key is valid.
This involved a lot of mutex grabbing, which has now been replaced with
atomic operations.
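A sketch of the atomic emptiness check, using a stand-in entry type rather than the real cache internals:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// entry holds the values for one series key. n mirrors the number of
// stored values and is maintained atomically so readers can test for
// emptiness without taking a mutex.
type entry struct {
	n      int64
	values []int64
}

func (e *entry) add(v int64) {
	e.values = append(e.values, v) // writes are serialized elsewhere
	atomic.AddInt64(&e.n, 1)
}

// empty is the cheap atomic check used while scanning keys during a
// delete: already-empty entries are skipped without grabbing a lock.
func (e *entry) empty() bool { return atomic.LoadInt64(&e.n) == 0 }

func main() {
	e := &entry{}
	fmt.Println(e.empty()) // true — skip predicate/timestamp checks
	e.add(42)
	fmt.Println(e.empty()) // false
}
```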
Benchmarking this commit against the previous commit in this branch
shows a 9% improvement:
name old time/op new time/op delta
Engine_DeletePrefixRange_Cache/exists-24 113ms ± 8% 102ms ±11% -9.40% (p=0.000 n=10+10)
Engine_DeletePrefixRange_Cache/not_exists-24 95.6ms ± 2% 97.1ms ± 4% ~ (p=0.089 n=10+10)
name old alloc/op new alloc/op delta
Engine_DeletePrefixRange_Cache/exists-24 29.6MB ± 1% 25.5MB ± 1% -13.71% (p=0.000 n=10+10)
Engine_DeletePrefixRange_Cache/not_exists-24 24.3MB ± 2% 23.9MB ± 1% -1.48% (p=0.000 n=10+10)
name old allocs/op new allocs/op delta
Engine_DeletePrefixRange_Cache/exists-24 334k ± 0% 305k ± 1% -8.67% (p=0.000 n=8+10)
Engine_DeletePrefixRange_Cache/not_exists-24 302k ± 1% 299k ± 1% -1.25% (p=0.000 n=10+9)
Raw benchmarks on a 24T / 32GB / NVME machine:
goos: linux
goarch: amd64
pkg: github.com/influxdata/influxdb/tsdb/tsm1
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 200 91035525 ns/op 25557809 B/op 305258 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 200 99416796 ns/op 25385052 B/op 303584 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 100 100149484 ns/op 25570062 B/op 305761 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 100 100222516 ns/op 25474372 B/op 303089 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 200 101868258 ns/op 25531572 B/op 304736 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 100 106268683 ns/op 25648213 B/op 306768 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 100 102905477 ns/op 25572314 B/op 305798 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 100 108742857 ns/op 25483068 B/op 304788 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 100 103292149 ns/op 25401388 B/op 303401 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/exists-24 100 107178026 ns/op 25573602 B/op 305821 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 200 95082692 ns/op 23942491 B/op 299116 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 200 96088487 ns/op 23957028 B/op 298545 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 200 94279165 ns/op 23620981 B/op 294536 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 200 94509000 ns/op 23989593 B/op 299453 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 200 98530062 ns/op 23935846 B/op 299237 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 200 98008093 ns/op 23821683 B/op 297875 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 200 97603172 ns/op 23878336 B/op 298350 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 200 96867920 ns/op 23782588 B/op 296236 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 200 99148908 ns/op 23997702 B/op 299277 allocs/op
BenchmarkEngine_DeletePrefixRange_Cache/not_exists-24 100 100866840 ns/op 24019916 B/op 300339 allocs/op
PASS
ok github.com/influxdata/influxdb/tsdb/tsm1 1144.213s
This command performs verification of TSM blocks:
* expected and actual CRC-32 checksums match
* expected and actual min and max timestamps match the decoded data
Fixes the `tsm1.BlockIterator` so that it returns the current
key if there are still additional entries remaining. Previously, the
iterator would check whether the next key matched the current key but
return the key for the next set of entries, causing multiple entries
for the same key not to be merged together during compaction.
Adds export tooling to `influxd inspect export-blocks` so that we
can dump out block data in SQL format for better analysis during
the debugging process.
Adds a total cursor counter and seek location counter to a new
`readMetrics` that is added to each `Engine`. Default labels group
by `engine_id` and `node_id`.
Adds the ability to set the current generation to use when compacting
only the cache. Previously we used the current generation for all
files, but this caused issues; the current generation should only be
used for level 1 compactions.
There exists a possibility for an in-flight read on a TSMReader to read
a stale reference to an mmapped TSM file index, which has become
unmapped.
This commit resolves that issue by simply renaming the file, leaving the
original file handle open and the data mapped. The path is updated so
that if any callers need to refer to the name of the TSM file after it's
renamed, the new name will be reflected.
The orphaned file handle will be closed when the TSM file is closed.
Previously the series file did not include tombstones in the total
count. This commit now includes tombstones in the count as well as
fixes an issue where replayed tombstone records could exist but
their underlying ID did not exist. This caused the count to become
negative and, since the count is a `uint64`, roll over to
`math.MaxUint64`.
StringArrayEncodeAll will panic if the total length of strings
contained in the src slice is > 0xffffffff. This change adds a unit
test to replicate the issue and an associated fix to return an error.
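A sketch of the kind of size guard implied by the fix, with hypothetical names:

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// errTooLarge mirrors the idea of the fix: refuse to encode, rather
// than panic, when the snappy size limit would be exceeded.
var errTooLarge = errors.New("block too large: total string length exceeds 0xffffffff")

// checkStringBlockSize sums the string lengths and returns an error
// once the total exceeds the maximum length snappy can encode.
func checkStringBlockSize(src []string) error {
	var sz uint64
	for _, s := range src {
		sz += uint64(len(s))
		if sz > math.MaxUint32 {
			return errTooLarge
		}
	}
	return nil
}

func main() {
	fmt.Println(checkStringBlockSize([]string{"a", "b"})) // <nil>
}
```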
This also raises an issue that compactions will be unable to make
progress under the following condition:
* multiple string blocks are to be merged to a single block and
* the total length of all strings exceeds the maximum block size that
snappy will encode (0xffffffff)
The observable effect of this is errors in the logs indicating a
compaction failure.
Fixes #13687
This commit teaches the storage schema APIs how to track statistics
and make them available via the returned `cursors.StringIterator`.
Statistics are only tracked when decoding TSM blocks or when scanning
the in-memory cache.
Closes #13541
The TagKeys API will perform a linear scan if there is no predicate;
otherwise, it will use the index to find a list of candidate series
keys.
TagKeys expects the predicate to be transformed such that
`_measurement` and `_field` are remapped to `\x00` and `\xff`
respectively.
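A minimal sketch of the remapping, assuming a simple key-rewrite step (function and constant names are hypothetical):

```go
package main

import "fmt"

const (
	measurementKey = "\x00" // normalized key for _measurement
	fieldKey       = "\xff" // normalized key for _field
)

// remapTagKey rewrites the user-facing tag keys to their normalized
// storage representation before the predicate is evaluated.
func remapTagKey(k string) string {
	switch k {
	case "_measurement":
		return measurementKey
	case "_field":
		return fieldKey
	default:
		return k
	}
}

func main() {
	fmt.Printf("%q %q %q\n",
		remapTagKey("_measurement"), // "\x00"
		remapTagKey("_field"),       // "\xff"
		remapTagKey("region"))       // "region"
}
```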
There is one TODO marked to analyze the predicate for a
`\x00 = '<measurement>'` pattern. If found, the predicate can be
eliminated and fall back to a linear prefix scan by combining the org,
bucket and measurement. This is tracked by issue #13497.
When a tsi1 partition closes, it waits on the wait group for compactions
and then acquires the lock. Unfortunately, a compaction may start in the
mean time, holding on to some resources. Then, close will attempt to
close those resources while holding the lock. That will block until
the compaction has finished, but it also needs to acquire the lock
in order to finish, leading to deadlock.
One cannot just move the wait group wait into the lock because, once
again, the compaction must acquire the lock before finishing. Compaction
can't finish before acquiring the lock because then it might be operating
on an invalid resource.
This change splits the locks into two: one to protect just against
concurrent Open and Close calls, and one to protect all of the other
state. We then just close the partition, acquire the lock, then free
the resources. Starting a compaction requires acquiring a reference
to the partition itself, so that a compaction can't start after the
partition has begun closing.
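A simplified sketch of the two-lock split (types and fields are hypothetical); the point is that Close frees resources without holding the state lock that a finishing compaction needs:

```go
package main

import "sync"

// partition sketches the split: stateMu guards ordinary state, while
// openCloseMu only serializes Open and Close.
type partition struct {
	openCloseMu sync.Mutex // serializes Open/Close only
	stateMu     sync.RWMutex
	open        bool
	resources   []func() // things to free on close
}

func (p *partition) Close() {
	p.openCloseMu.Lock()
	defer p.openCloseMu.Unlock()

	// Mark closed first so no new compaction can begin.
	p.stateMu.Lock()
	p.open = false
	res := p.resources
	p.resources = nil
	p.stateMu.Unlock()

	// Free resources without holding stateMu: an in-flight compaction
	// can still acquire it to finish, so there is no deadlock.
	for _, free := range res {
		free()
	}
}

func main() {
	p := &partition{open: true, resources: []func(){func() {}}}
	p.Close()
}
```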
This change also introduces a cancellation channel into a reference
to a resource that is closed when the resource is being closed, allowing
processes that have acquired a reference to clean up quicker if someone
is trying to close the resource.
The TagValues API will perform a linear scan if there is no predicate;
otherwise, it will use the index to find a list of candidate series
keys.
TagValues expects the predicate to be transformed such that
`_measurement` and `_field` are remapped to `\x00` and `\xff`
respectively.
There is one TODO marked to analyze the predicate for a
`\x00 = '<measurement>'` pattern. If found, the predicate can be
eliminated and fall back to a linear prefix scan by combining the org,
bucket and measurement.
The TimeRangeIterator permits linear or random index scans and
can answer whether the current key has data for the specified time
interval, considering any tombstones.
When there are no tombstones, there are opportunities to skip decoding
blocks entirely: if the queried time interval fully covers a block's
time range as recorded in the TSM index entry, the block is known to
contain matching data without being decoded.
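A sketch of the skip-decode decision, using a hypothetical index-entry type:

```go
package main

import "fmt"

// indexEntry is the per-block time-range summary stored in the TSM index.
type indexEntry struct{ minTime, maxTime int64 }

// hasData reports whether a block is known to contain data in [min, max]
// without decoding it. With no tombstones, if the query interval covers
// the entry's whole time range, every timestamp in the block qualifies;
// only partially overlapping blocks must be decoded and inspected.
func hasData(e indexEntry, min, max int64) (yes, mustDecode bool) {
	if e.maxTime < min || e.minTime > max {
		return false, false // disjoint: no data, no decode
	}
	if min <= e.minTime && e.maxTime <= max {
		return true, false // fully covered: data exists, skip decode
	}
	return false, true // partial overlap: must decode the block
}

func main() {
	fmt.Println(hasData(indexEntry{10, 20}, 0, 100))  // true false
	fmt.Println(hasData(indexEntry{10, 20}, 15, 100)) // false true
}
```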
Add a Contains API which is a peer to the TimestampArray.Contains
function. This is used by the schema APIs to determine if data exists
in the cache for a given key and time interval.
Permits random access of the iterator, correctly maintaining state,
so that Next may be called to iterate from a given key.
This API will be used by the schema APIs when a predicate is specified,
typically requiring random access.
The TimestampArray.Contains(min, max) API performs a binary search to
determine whether timestamps exist for the given time interval.
It also implements Exclude to drop timestamps that have been tombstoned.
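A minimal sketch of a binary-search containment check over a sorted timestamp slice (not the actual implementation):

```go
package main

import (
	"fmt"
	"sort"
)

// contains reports whether the sorted timestamps slice holds any value
// in [min, max], using a binary search rather than a linear scan.
func contains(timestamps []int64, min, max int64) bool {
	// Index of the first timestamp >= min.
	i := sort.Search(len(timestamps), func(i int) bool {
		return timestamps[i] >= min
	})
	return i < len(timestamps) && timestamps[i] <= max
}

func main() {
	ts := []int64{5, 10, 20, 40}
	fmt.Println(contains(ts, 11, 19)) // false — nothing between 11 and 19
	fmt.Println(contains(ts, 11, 25)) // true — 20 falls in range
}
```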
DecodeTimestampArrayBlock decodes only the timestamps of the provided
block.
Removes the `STATS` file generated during TSI compaction as it had
potential for becoming inconsistent with the index data. Instead,
stats are recalculated on start up and on each compaction on a
per-partition basis.
Computing stats for 10M series across 10K measurements takes
approximately 0.171s.
The storer interface isn't necessary if the init/Free logic is
removed, which is unnecessary in a world with only one shard.
Additionally, there were some cases where an init/Free call could
race and cause data loss in the cache. Not doing it at all fixes
all of those races.
This change fixes #10511, which manifests when a shard is considered cold
faster than its cache is snapshotted. Previously the code only looked at
the last modification of compacted tsm1 files. Instead the (restored)
Engine.lastModified() also takes the cache into account.
Ports #10522 to master where engine.go has moved and Engine.LastModified()
was deleted because it was unused.
This commit adds a reason label to the total compaction metric. For
snapshots, the reason will indicate why the cache was snapshotted. For
other compactions, the reason label will be blank.
This commit adds a new Cache option, via the
`tsm1.CacheConfig.SnapshotAgeDuration` field, which controls the maximum
age the cache can reach before it is snapshotted to a TSM file.
The default value for this option is `0`, which means that the cache
will never be snapshotted based only on age. Setting this value to, for
example, 10 seconds, would result in the cache snapshotting every 10
seconds.
Snapshotting the cache more frequently can provide better durability
guarantees in some circumstances, though more, smaller TSM files will
lead to more work needed to compact them down to larger, more dense
files.
When using InfluxDB with a WAL there isn't really a strong reason to
alter `tsm1.CacheConfig.SnapshotAgeDuration` from `0`.
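A sketch of setting the option; the struct here is a stand-in showing only the field under discussion:

```go
package main

import (
	"fmt"
	"time"
)

// cacheConfig stands in for tsm1.CacheConfig.
type cacheConfig struct {
	// SnapshotAgeDuration: 0 disables age-based snapshots; a positive
	// value snapshots the cache whenever it reaches that age.
	SnapshotAgeDuration time.Duration
}

func main() {
	cfg := cacheConfig{SnapshotAgeDuration: 10 * time.Second}
	fmt.Println(cfg.SnapshotAgeDuration) // snapshot roughly every 10s
}
```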
When the WAL was moved up, the validation that happened at the cache
was skipped. This moves the field type validation for a batch of
points up ahead of the WAL again.
During Recover, we forgot to propagate the disabled flag to the
keyIDMap options as we do during Open. Since we still propagate the
singleton `ims`, which is initialized lazily, a first initialization
with a different set of labels will cause inconsistent usage even when
the metrics are disabled.
This commit adds the pkg/lifecycle.Resource to help manage opening,
closing, and leasing out references to some resource. A resource
cannot be closed until all acquired references have been released.
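A simplified model of the idea (not the actual `pkg/lifecycle` API): Close blocks new acquisitions and waits until every outstanding reference is released.

```go
package main

import (
	"errors"
	"sync"
)

// resource models pkg/lifecycle.Resource: references are handed out
// while open, and Close waits for all of them to be released.
type resource struct {
	mu     sync.Mutex
	closed bool
	refs   sync.WaitGroup
}

var errClosed = errors.New("resource closed")

// Acquire hands out a reference; callers must call release when done.
func (r *resource) Acquire() (release func(), err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.closed {
		return nil, errClosed
	}
	r.refs.Add(1)
	return func() { r.refs.Done() }, nil
}

// Close refuses new references, then waits for live ones to drain.
func (r *resource) Close() {
	r.mu.Lock()
	r.closed = true
	r.mu.Unlock()
	r.refs.Wait()
}

func main() {
	var r resource
	release, _ := r.Acquire()
	go release()
	r.Close() // returns only after the reference is released
}
```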
If the debug_ref tag is enabled, all resource acquisitions keep
track of the stack trace that created them and have a finalizer
associated with them to print on stderr if they are leaked. It also
registers a handler on SIGUSR2 to dump all of the currently live
resources.
Having resources tracked in a uniform way with a data type allows us
to do more sophisticated tracking with the debug_ref tag, as well.
For example, we could panic the process if a resource cannot be
closed within a certain time frame, or attempt to figure out the
DAG of resource ownership dynamically.
This commit also fixes many issues around resources, correctness
during error scenarios, reporting of errors, idempotency of
close, tracking of memory for some data structures, resource leaks
in tests, and out of order dependency closes in tests.