This PR adds a configuration option that can be used to inform the
kernel that we intent to page in much of the TSM files.
This madvise value has been problematic in the past when its been set,
so this option defaults to off. It may be useful to some users with slow
disks.
PR #9204 introduced a maximum default concurrent compaction limit of 4.
The idea was to reduce IO utilisation on large systems with many cores,
and high write load. Often on these systems, disks were not scaled
appropriately to to the write volume, and while the write path could
keep up, compactions would saturate disks.
In #9225 work was done to reduce IO saturation by limiting the
compaction throughput. To some extent, both #9204 and #9225 work towards
solving the same problem.
We have recently begun to notice larger clusters to suffer from
situations where compactions are not keeping up because they have been
scaled up, but the limit of 4 has stayed in place. While users can
manually override the setting, it seems more user friendly if we remove
the limit by default, and set it manually in cases where compactions are
causing too much IO on large boxes.
If it's known that the read request only needs to use a single
measurement, then we can avoid the need to get field keys via the query
engine.
However, that means that a new method of getting the field keys for a
measurement would be needed. This commit exposes a method to efficiently
get field key names for a measurement across multiple shards.
name
Array cursors are enabled for storage RPC calls
tsm1:
* Implemented cursors that utilize Array decoders
storage:
* Abstractions to easily switch to Array cursors
* introduced tmpl from Arrow, which allows existing templates to be
reused with additional command-line properties to control output.
* duplicated suite of ReadFloatBlock tests for ReadFloatArrayBlock
* only the float data type is tested as the Read APIs are generated
from a single template.
* separate slices for time and values
* structured to be Arrow ready
* batch decoders fill time and value slices independently that
vastly improves performance (benchmarks linked in PR)
* APIs decode an entire byte slice of encoded data into the provided
`dst` slice
* APIs are stateless and in almost all cases avoid any allocations
* Intended to be used future batch-oriented TSM block decode APIs
* duplicated tests from original iterator-based APIs
This commit swaps out map[uint64]struct{} implementations for roaring
bitmaps, which in turn improves memory usage and read performance.
The bitmap implementation is abstracted such that for low cardinality
sets a simple slice of ids is used, to reduce in-use memory.
When adding many series using offline tooling, it's likely that every
series involves an entry being appended to a LogFile. Typically an entry
is 11 or 12 bytes, but the default bufio.Writer buffer size is only 4K.
This means by default a write of 10,000 new series would involve ~30
buffer flushes.
This commit makes the buffer configurable, and sets the value in
`buildtsi` such that it reflects the number of series being written to
the LogFile.
When running offline tooling, flushing buffers and syncing files on
every write to a `LogFile` is not necessary. Were a hard exit
with data loss to occur, the tooling can simply be run again.
TSI LogFile compactions occasionally race with insert and delete
operations because the index partition FileSet is retained needlessly by
the method that calls Partition.CheckLogFile.
In this change:
- TSI LogFile compaction respects enable/disable compactions
- Partition FileSet.Release before log compaction is triggered
An alternative to the second step is to handle log file compaction in a
new goroutine. Log file compaction errors would be logged and not
returned to the caller.
After this change, `DELETE FROM /regex/` does not deadlock; performance:
- 30s to delete 100 measurements
- 5m30s to delete 1000 measurements
This commit allows users to filter on the `value` field in the
`SHOW TAG VALUES` command:
SHOW TAG VALUES WITH KEY = "mytag" WHERE "value" = 'myvalue'
Previously this command would return all values.
we were asserting to an *os.File in order to call Sync, but in some
cases the file handle has been wrapped, for example with limiting.
instead, assert to minimal interfaces for the functionality we need
and attempt to add some robustness in the code that creates the
writers by using a stronger interface with a Sync method.
fixes#9991
multiple users have attempted to run influxdb in a docker container
with a windows host and a volume mounted from windows. that causes
problems because it apparently uses samba/cifs which does not
support fsync on directories. this patchset will, if it receives an EINVAL
on directory fsync, as is what appears to happen on samba/cifs, then it
will ignore it. this should help.
fixes#9833.
fixes#9630.
When `influx_inspect buildtsi` is used to create a new `tsi1` index, spaces in measurement names are escaped, so measurement "a b" is changed to "a\ b".
This change modifies `models.ParseKeyBytes()` and `models.ParseName()` to unescape measurement names. `models.ParseKeyBytes()` returns unescaped tag keys, so this seems like the natural place to unescape measurement names.
Also followed `scanMeasurement()` to see what other code could be problematic, and this should be everything (the result of one other use of `scanMeasurement()` is later escaped).
Removed `tsdb.MeasurementFromSeriesKey()`. These methods are exported, so checked for side effects in other InfluxData repositories.
This commit restricts the number of TSM1 files that can be opened
concurrently across the entire `tsdb.Store`. There is currently
a limit for the number of shards that can be opened concurrently,
however, this limit does not help when the number of CPU cores
is higher than the number of shards. Because TSM1 files have a 2GB
limit and there is no limit on the number of files per shard,
extremely large shards (1TB+) can load 1,000s of files simultaneously.
This improvement avoids performing a binary search on the index by
first checking the key against the lower and upper bounds. Particularly
useful for multiple, fully-compacted TSM files.
callers can always ensure that the observer set on the engine options
is appropriate for that shard id. this simplifies the api and reduces
the chance of bugs due to mixing up shard ids.
just adds some interface for hooks about when these files come and go.
we do them before the action is taken so that if the hook has an
error, it doesn't have any consistency problems.
The InUse call on TSMFiles is inherently racy in the presence of
Ref calls outside of the file store mutex. In addition, we return
some TSMFiles to callers without them being Ref'd which might allow
them to be closed from underneath. While I believe it is the case
that it would be impossible, as the only thing that gets a handle
externally is compaction, and compaction enforces that only one
handle exists at a time, and thus is only deleted once after the
compaction is done with it, it's not very obvious or enforced.
Instead, always return a TSMFile with a Ref call under the read
lock, and require that no one else calls Ref. That way, it cannot
transition to referenced if the InUse call returns false under the
write lock.
The CreateSnapshot method was racy in a number of ways in the presence
of multiple calls or compactions: it did not take references to the
TSMFiles, and the temporary directory it creates could have been
shared with concurrent CreateSnapshot calls. In addition, the
files slice could have been concurrently mutated during a compaction
as well.
Instead, under the write lock, make a local copy of the state for
the compaction, including Ref calls (write locks are implicitly
read locks). Then, there is no need for a lock at all afterward.
Add some comments to explain these issues at the call sites of InUse,
and document that the Files method that returns the slice unprotected
is only for tests.
- reduce allocations by making leaf a value type with a bool
- make longestPrefix inlineable and have no bounds checks
- delete any code for functions we don't plan to use
- operate on []byte and only copy when necessary
- inline calls to sort.Search to avoid allocations and indirections
- insert directly in the correct location for addEdge
- reduce allocations during copying with a buffer helper
results:
name old time/op new time/op delta
Tree_Insert-8 1.10ms ± 4% 0.73ms ± 4% -33.54% (p=0.000 n=10+10)
Tree_InsertNew-8 3.18ms ± 2% 1.91ms ± 6% -39.90% (p=0.000 n=10+10)
name old speed new speed delta
Tree_Insert-8 9.12MB/s ± 4% 13.72MB/s ± 4% +50.46% (p=0.000 n=10+10)
Tree_InsertNew-8 3.15MB/s ± 2% 5.24MB/s ± 6% +66.42% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
Tree_InsertNew-8 1.62MB ± 0% 1.60MB ± 0% -1.28% (p=0.000 n=10+9)
name old allocs/op new allocs/op delta
Tree_InsertNew-8 35.0k ± 0% 15.0k ± 0% -57.04% (p=0.000 n=10+10)
MB/sec in this case is 1 byte per key inserted, so it's really millions
of keys inserted per second.
This is the start of per-series validation that occurs in the
Engine write path. It uses an in-memory radix tree to reduce
memory usage and is re-built on demand the first time a series
is written.
does some basic sanity checks. it's hard to be more exhaustive without
either taking a crazy amount of time, or being non-deterministic,
but at least this makes sure we barf in some cases.
No appreciable changes in benchmark results. It seems like this
function is less than 4% of cpu time in the write workloads in the
benchmarks at least.
at some point, the Inmem field on the engine options became
required, but the benchmarks weren't updated.
also uses filepath everywhere when manipulating file paths.
* filters allow specific combinations of database, retention policy and
shard groups to be opened. This was added to reduce the start-up time
of the export tool and limit the memory usage.
* Check for errors from binary.Uvarint when reading TSI logs
* also check len(parsed) == len(input)
* wrap binary.Uvarint
* make uvarint() more generally useful/used
This moves the time range to delete to be returned by the predicate
func in DeleteSeriesRangeWithPredicate. It allows for a single delete
to delete different ranges of times per series instead of a single
range of time for all series.
* Fix stream package to allow for renaming the file before writing it to the stream
* updated test to make sure that the final tsm file has more than one block
This commit adds the `max-index-log-file-size` configuration flag so
that users can restrict the maximum size of log files before compaction.
The default limit was also lowered from `5MB` to `1MB`. The original
size was set before we partitioned the index so the change reflects this.
This change makes it so that we simplify the math engine so it doesn't
use a complicated set of nested iterators. That way, we have to change
math in one fewer place.
It also greatly simplifies the query engine as now we can create the
necessary iterators, join them by time, name, and tags, and then use the
cursor interface to read them and use eval to compute the result. It
makes it so the auxiliary iterators and all of their complexity can be
removed.
This also makes use of the new eval functionality that was recently
added to the influxql package.
No math functions have been added, but the scaffolding has been included
so things like trigonometry functions are just a single commit away.
This also introduces a small breaking change. Because of the call
optimization, it is now possible to use the same selector multiple times
as a selector. So if you do this:
SELECT max(value) * 2, max(value) / 2 FROM cpu
This will now return the timestamp of the max value rather than zero
since this query is considered to have only a single selector rather
than multiple separate selectors. If any aspect of the selector is
different, such as different selector functions or different arguments,
it will consider the selectors to be aggregates like the old behavior.
This commit fixes a data race in the WAL, which can occur when writes
and deletes are being executed concurrently. The WAL uses a buffer pool
of `[]byte` when reading the WAL. WAL entries are unmarshaled into these
buffers and passed along to the relevant methods handling the different
types of entry (write, delete etc).
In the case of deletes, the keys that need to be deleted were being
stored for later processing, however these keys were part of the backing
array of initial buffer from the pool. As such, those keys could be
written to at a future time when handling other parts of the WAL.
There was a check in inmem TagSets to see if a series was assigned
to a shard to prevent cursors for non-existent series getting created.
This check was lost during TSI development because inmem Series tracking
was removed and then replaced with bitsets. The bitsets were not
re-incorporated as before. This adds the functionality back using
the bitsets.
This commit improves the startup time when using the `inmem` index by
ensuring that the series are created in the index and series file in
batches of 10000, rather than individually.
Fixes#9486.
Re-open the last wal segment instead of creating a new one. This fixes
an issue where the last modified time of the WAL would change on
restart. It also avoids a lot of IO file churn on restart.
Currently if a buffer from the buffer is too small to satisfy its request then we simply drop it and allocate a new one.
This change puts it back in the pool and then allocates a new one.
When a max series per data limit was in place (or 0), we would create
series one at a time which really affects throughput. This does it
in bulk which is less accurate, but more performant.
The batch of writes is almost always larger than the 4096 default
which leads to more write IOs. Increasing to 32k allows the majority
of writes to be handled in one IO.
This was added for preventing concurrent writes and deletes to the
same series. This is not handled by the bitsets for both tsi and
inmme. The time.Now() calls shows up in profiles and is not needed.
This commit adds initial empty sketches back to the tsi1 index, as well
as ensuring that ephemeral sketches in the index `LogFile` are updated
accordingly.
The commit also adds a test that verifies that the merged sketches at
the store level produce the correct results under writes, deletions and
re-opening of the store.
This commit does not provide working sketches for post-compaction on the
tsi1 index.
Because of a race between the index and series file lookups, empty
keys can be returned for series which are tombstoned after the
series ids are obtained but before the caller looks up the key.
The default of 4096 results in writes to the WAL still requiring muliple
IOs. We had previously bumped this to 1M, but that was too high when
there are many shards. Increasing to around 16k reduces the IOs to
one or two for the workloads tested. We may want to make this
configurable in the future.
The large number of partitions cause big HeapInUse swings at higher
cardinality which can lead to OOMs. Reducing this to 16 lowers
write throughput to some extent at lower cardinalities, keeps memory
more stable over the long run.
If all the series in a measurement were tombstone, MeasurementHasSeries
would return true because the ok var was re-used from a prior check
earlier in the func. This caused it to be true all the time unless
the measurment was actually tombstoned.
The Store.Delete series held an RLock while deleting from each shard.
While deleting, the Engine uses shardSet to see if a series is fully
deleted. The shardSet.ForEach also takes and RLock. If a Lock is
requested between these two calls, a deadlock occurs.
To fix, we don't need to hold an RLock for the duration of the delete
in the store as each Shard handles concurrency itself and we have a
snapshot of the shards we need to access.
Under concurrent writes and deletes of the same series, a nil panic
could occur in bytes.Compare. Instead of setting the seriesKeys to
nil, set them to an 0 length slice which prevents the panic.
This separates out the dropping of a measurement from the series
to avoid frequent checks to see if a measurement still has series.
The series are dropped individually and we keep track of which
measurements are involved and then delete each measurment afterwards.
If the fields.idx was corrupted in someway, it would cause the shard
to fail to load. Deleting the file will allow it to be rebuilt.
This change handles this automatically so it's rebuilt if necessary
without user intervention.
Series should only be removed from the series file when they're no
longer present in any shard. This commit ensures that during a shard
rollover, the series local to the shard are checked against all other
series in the database.
Series that are no longer present in any other shards' bitsets, are then
marked as deleted in the series file.
Now that each shard-local index is maintaining a bitset of series ids,
tracking the series present in the local shard's tsm engine, there is no
need to track shards in the `inmem` index.
This commit removes the methods associated with tracking those
series/shard relationships.
use. However, because the reference counting was implemented via
mutexes, it was possible to double `RLock` the series file mutex. This
allows a `Lock` to arrive in-between the two `RLock`s, (such as when
deleting the database), causing deadlock.
This commit addresses this by ensuring that from within `IndexSet`
methods, when calling other `IndexSet` methods, that they're all
unexported, and that those unexported methods never take a lock on the
series file.
Keeping series file locking in exported `IndexSet` methods only, allows
one to see any future races more easily.
There are two series key formats: the `models` package format, which is
also line-protocol format, and the `tsdb` package format, which is used
by the series file when serialising series keys.
When writing to a series, rather than taking a `models` format key from
the `coordinator` package and then converting it to a `tsdb` package
format, it would be cheaper to keep the key in the `models` format
before hashing it to determine which partition the key lives in.
This commit adds the ability to correctly mark a series as deleted in
the global series file. Whenever a shard engine determines that a series
should be deleted, it checks with each shard's bitset for series that
are to be deleted and are no longer contained in any shard-local
bitsets.
These series are then removed from the series file.
When dropping series, if the series file does not exists we returned
and error. This breaks compatibility with prior versions that would
not return an error if the series do not exists.
Since partitions slice is not protected under a lock, setting it to
nil while closing causes a race if concurrent calls are accessing
the partitions slice elsewhere. Since partitions do handle
concurrency and Open resets the slice, just leave it as it after
closing.
This test could hang due to an existing race that is still not fixed.
The snapshot and level compaction goroutines woule end up waiting on
the wrong channel to be closed so whey would never exit.
This commit adds a bitset into each shard's in-memory index, to be used to
track undeleted series ids. Currently tsi1 support is not implemented.
When new series are added to the shard, the series id is added
to the bitset. When series are deleted from the shard, the series
ids are removed from the bitset.
Becasue each shard shares the same inmem index reference, the bitset
is stored in the `ShardIndex`, which is local to each shard, and then
different references are passed into the shared `Index` object, depending
on which shard is writing the series.
* Live Restore + Enterprise data format compatability
* Extended ImportData to import all DB's if no db name given
* Added a new enterprise data test, and backup command now prints the backup file paths at conclusion
* Added whole-system backup test
* Update to use protobuf in all enterprise data cases
* Update to test to do cross-testing with enterprise version
* incremental enterprise backup format support
The cache defaulted to entry capacity size of 32. This default
is fine for lower cardinalities, but causes big spikes in InUse
heap with higher cardinalities that can OOM the process. Since
the hints had to be removed previously due to increased memory usage,
they are now completely removed. For lower cardinalities, we do
grow the slice, but this has a small performance penalty compared
to the large memory usage/OOMs with larger cardinalities.
The series file compaction previously did not snapshot the max
offset before compacting and would keep compacting until it reached
the end of segment file. This caused more entries than expected into
the RHH map and this map gets exponentially slower as it gets close
to full.
* only call ParseTags when necessary
* remove dependency on inmem.Series in tsdb test package
* Measurement and Series are no longer exported. Their use is restricted
to the inmem package
* improve Measurement and Series types by exporting immutable
fields and removing unnecessary APIs and locks
Reduced startup time from 28s to 17s. Overall improvement including
#9162 reduces startup from 46s to 17s for 1MM series across 14 shards.