* APIs decode an entire byte slice of encoded data into the provided
`dst` slice
* APIs are stateless and in almost all cases avoid any allocations
* Intended to be used by future batch-oriented TSM block decode APIs
* Duplicated tests from the original iterator-based APIs
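A minimal sketch of the calling convention described above, assuming a varint encoding purely for illustration. The names `decodeAll` and `encodeAll` are hypothetical and are not the decoders added by this change; the point is only that the function keeps no state and reuses the caller's `dst`.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// decodeAll is a hypothetical batch decoder: it decodes every
// varint-encoded value in b into dst and returns the filled slice.
// It keeps no state between calls and allocates only when dst is
// too small to hold the result.
func decodeAll(b []byte, dst []int64) ([]int64, error) {
	dst = dst[:0] // reuse the caller's backing array
	for len(b) > 0 {
		v, n := binary.Varint(b)
		if n <= 0 {
			return dst, errors.New("decodeAll: malformed varint")
		}
		dst = append(dst, v)
		b = b[n:]
	}
	return dst, nil
}

// encodeAll only builds input for the example.
func encodeAll(vals []int64) []byte {
	var out []byte
	var tmp [binary.MaxVarintLen64]byte
	for _, v := range vals {
		n := binary.PutVarint(tmp[:], v)
		out = append(out, tmp[:n]...)
	}
	return out
}
```

A caller that keeps one `dst` per goroutine and passes it to every call avoids allocating once the slice has grown to the largest block size.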
we were asserting to an *os.File in order to call Sync, but in some
cases the file handle has been wrapped, for example with limiting.
instead, assert to minimal interfaces for the functionality we need
and attempt to add some robustness in the code that creates the
writers by using a stronger interface with a Sync method.
fixes #9991
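The idea can be sketched as follows. All type and function names here (`syncer`, `syncWriter`, `limitedWriter`, `memFile`) are hypothetical and only illustrate asserting to a small interface instead of `*os.File`.

```go
package main

import "io"

// syncer is the minimal interface needed to flush data to stable storage.
// *os.File satisfies it, and so can any wrapper that forwards Sync.
type syncer interface {
	Sync() error
}

// syncWriter calls Sync on w if w supports it and reports whether it did.
func syncWriter(w io.Writer) (bool, error) {
	if s, ok := w.(syncer); ok {
		return true, s.Sync()
	}
	return false, nil
}

// limitedWriter is a hypothetical wrapper (standing in for a rate limiter).
// Requiring Sync on the wrapped value is the "stronger interface": a writer
// without Sync cannot be wrapped by mistake.
type limitedWriter struct {
	w interface {
		io.Writer
		Sync() error
	}
}

func (l *limitedWriter) Write(p []byte) (int, error) { return l.w.Write(p) }
func (l *limitedWriter) Sync() error                 { return l.w.Sync() }

// memFile is an in-memory stand-in for *os.File used by the example.
type memFile struct {
	data  []byte
	syncs int
}

func (m *memFile) Write(p []byte) (int, error) {
	m.data = append(m.data, p...)
	return len(p), nil
}

func (m *memFile) Sync() error { m.syncs++; return nil }
```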
multiple users have attempted to run influxdb in a docker container
with a windows host and a volume mounted from windows. that causes
problems because it apparently uses samba/cifs which does not
support fsync on directories. this patchset ignores an EINVAL returned
from a directory fsync, which is what appears to happen on samba/cifs.
this should help.
fixes #9833.
fixes #9630.
When `influx_inspect buildtsi` is used to create a new `tsi1` index, spaces in measurement names are escaped, so measurement "a b" is changed to "a\ b".
This change modifies `models.ParseKeyBytes()` and `models.ParseName()` to unescape measurement names. `models.ParseKeyBytes()` returns unescaped tag keys, so this seems like the natural place to unescape measurement names.
Also followed `scanMeasurement()` to see what other code could be problematic, and this should be everything (the result of one other use of `scanMeasurement()` is later escaped).
Removed `tsdb.MeasurementFromSeriesKey()`. These methods are exported, so checked for side effects in other InfluxData repositories.
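For illustration, unescaping a measurement name can be sketched as below. This is not the code in `models`; `unescapeMeasurement` is a hypothetical helper, and it assumes that commas and spaces are the escaped characters in a measurement name.

```go
package main

import "strings"

// measurementUnescaper reverses the escaping of commas and spaces.
var measurementUnescaper = strings.NewReplacer(`\,`, `,`, `\ `, ` `)

// unescapeMeasurement returns name with escaped commas and spaces
// restored. Names without a backslash are returned unchanged.
func unescapeMeasurement(name string) string {
	if !strings.Contains(name, `\`) {
		return name
	}
	return measurementUnescaper.Replace(name)
}
```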
This commit restricts the number of TSM1 files that can be opened
concurrently across the entire `tsdb.Store`. There is currently
a limit for the number of shards that can be opened concurrently,
however, this limit does not help when the number of CPU cores
is higher than the number of shards. Because TSM1 files have a 2GB
limit and there is no limit on the number of files per shard,
extremely large shards (1TB+) can load 1,000s of files simultaneously.
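A store-wide limit of this kind can be sketched with a counting semaphore shared by every shard. `fixedLimiter` and `openAll` are hypothetical names, and the `open` callback stands in for opening a TSM file.

```go
package main

import "sync"

// fixedLimiter is a hypothetical counting semaphore built on a
// buffered channel. One instance would be shared by the whole store.
type fixedLimiter chan struct{}

func newFixedLimiter(n int) fixedLimiter { return make(fixedLimiter, n) }

func (l fixedLimiter) take()    { l <- struct{}{} }
func (l fixedLimiter) release() { <-l }

// openAll calls open for every path with at most cap(lim) calls in
// flight at once. It returns the highest concurrency it observed.
func openAll(paths []string, lim fixedLimiter, open func(path string)) int {
	var (
		mu             sync.Mutex
		inFlight, peak int
		wg             sync.WaitGroup
	)
	for _, p := range paths {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			lim.take()
			defer lim.release()

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			open(p)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return peak
}
```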
This improvement avoids performing a binary search on the index by
first checking the key against the lower and upper bounds. Particularly
useful for multiple, fully-compacted TSM files.
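The check can be sketched as follows, with `keyIndex` as a hypothetical stand-in for a TSM index holding sorted keys.

```go
package main

import (
	"bytes"
	"sort"
)

// keyIndex is a hypothetical stand-in for a TSM index.
type keyIndex struct {
	keys [][]byte // sorted ascending, no duplicates
}

// contains reports whether key is present. Keys outside [first, last]
// are rejected with two comparisons, before the binary search runs.
func (x *keyIndex) contains(key []byte) bool {
	n := len(x.keys)
	if n == 0 ||
		bytes.Compare(key, x.keys[0]) < 0 ||
		bytes.Compare(key, x.keys[n-1]) > 0 {
		return false
	}
	i := sort.Search(n, func(i int) bool {
		return bytes.Compare(x.keys[i], key) >= 0
	})
	return i < n && bytes.Equal(x.keys[i], key)
}
```

With fully-compacted files the key ranges rarely overlap, so most files are ruled out by the bounds check alone.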
callers can always ensure that the observer set on the engine options
is appropriate for that shard id. this simplifies the api and reduces
the chance of bugs due to mixing up shard ids.
just adds some interface for hooks about when these files come and go.
we do them before the action is taken so that if the hook has an
error, it doesn't have any consistency problems.
The InUse call on TSMFiles is inherently racy in the presence of
Ref calls outside of the file store mutex. In addition, we return
some TSMFiles to callers without them being Ref'd which might allow
them to be closed from underneath. While I believe it is the case
that it would be impossible, as the only thing that gets a handle
externally is compaction, and compaction enforces that only one
handle exists at a time, and thus is only deleted once after the
compaction is done with it, it's not very obvious or enforced.
Instead, always return a TSMFile with a Ref call under the read
lock, and require that no one else calls Ref. That way, it cannot
transition to referenced if the InUse call returns false under the
write lock.
The CreateSnapshot method was racy in a number of ways in the presence
of multiple calls or compactions: it did not take references to the
TSMFiles, and the temporary directory it creates could have been
shared with concurrent CreateSnapshot calls. In addition, the
files slice could have been concurrently mutated during a compaction
as well.
Instead, under the write lock, make a local copy of the state for
the compaction, including Ref calls (write locks are implicitly
read locks). Then, there is no need for a lock at all afterward.
Add some comments to explain these issues at the call sites of InUse,
and document that the Files method that returns the slice unprotected
is only for tests.
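The locking discipline described above can be sketched like this. The types and method names (`tsmFile`, `fileStore`, `acquire`, `purge`) are hypothetical simplifications, not the file store's actual API.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// tsmFile is a hypothetical reference-counted file handle.
type tsmFile struct {
	path string
	refs int32
}

func (f *tsmFile) ref()        { atomic.AddInt32(&f.refs, 1) }
func (f *tsmFile) unref()      { atomic.AddInt32(&f.refs, -1) }
func (f *tsmFile) inUse() bool { return atomic.LoadInt32(&f.refs) > 0 }

type fileStore struct {
	mu    sync.RWMutex
	files []*tsmFile
}

// acquire returns a private copy of the file list with every file
// already referenced. ref is only ever called here, under the read lock.
func (s *fileStore) acquire() []*tsmFile {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]*tsmFile, len(s.files))
	copy(out, s.files)
	for _, f := range out {
		f.ref()
	}
	return out
}

// purge drops files that are not in use. It holds the write lock, so no
// acquire can run concurrently and a false inUse result cannot go stale.
func (s *fileStore) purge() (removed int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	kept := make([]*tsmFile, 0, len(s.files))
	for _, f := range s.files {
		if f.inUse() {
			kept = append(kept, f)
		} else {
			removed++
		}
	}
	s.files = kept
	return removed
}
```

Callers release their handles with `unref` when done; a count may fall to zero at any time, but it can only rise while the read lock is held.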
- reduce allocations by making leaf a value type with a bool
- make longestPrefix inlineable and have no bounds checks
- delete any code for functions we don't plan to use
- operate on []byte and only copy when necessary
- inline calls to sort.Search to avoid allocations and indirections
- insert directly in the correct location for addEdge
- reduce allocations during copying with a buffer helper
results:
name              old time/op     new time/op     delta
Tree_Insert-8     1.10ms ± 4%     0.73ms ± 4%     -33.54%  (p=0.000 n=10+10)
Tree_InsertNew-8  3.18ms ± 2%     1.91ms ± 6%     -39.90%  (p=0.000 n=10+10)
name              old speed       new speed       delta
Tree_Insert-8     9.12MB/s ± 4%   13.72MB/s ± 4%  +50.46%  (p=0.000 n=10+10)
Tree_InsertNew-8  3.15MB/s ± 2%   5.24MB/s ± 6%   +66.42%  (p=0.000 n=10+10)
name              old alloc/op    new alloc/op    delta
Tree_InsertNew-8  1.62MB ± 0%     1.60MB ± 0%     -1.28%   (p=0.000 n=10+9)
name              old allocs/op   new allocs/op   delta
Tree_InsertNew-8  35.0k ± 0%      15.0k ± 0%      -57.04%  (p=0.000 n=10+10)
MB/sec in this case is 1 byte per key inserted, so it's really millions
of keys inserted per second.
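As an illustration of the `longestPrefix` item, a common shape for such a function is sketched below. This is an assumption about the technique, not the function from this change. Whether the compiler inlines it or drops the bounds checks depends on the Go version, and can be checked with `-gcflags=-m` and `-gcflags=-d=ssa/check_bce/debug=1`.

```go
package main

// longestPrefix returns the length of the common prefix of a and b.
func longestPrefix(a, b []byte) int {
	if len(a) > len(b) {
		a, b = b, a // make a the shorter slice
	}
	b = b[:len(a)] // both slices now have the same length
	for i := range a {
		if a[i] != b[i] {
			return i
		}
	}
	return len(a)
}
```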
This is the start of per-series validation that occurs in the
Engine write path. It uses an in-memory radix tree to reduce
memory usage and is re-built on demand the first time a series
is written.
This moves the time range to delete to be returned by the predicate
func in DeleteSeriesRangeWithPredicate. It allows for a single delete
to delete different ranges of times per series instead of a single
range of time for all series.
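The shape of such a predicate can be sketched as below. The signature is simplified and hypothetical (the real predicate receives the parsed series name and tags), and the map of timestamps stands in for stored data.

```go
package main

// rangePredicate decides per series whether to delete and, if so,
// which inclusive time range [lo, hi] to remove.
type rangePredicate func(key string) (lo, hi int64, ok bool)

// deleteWithPredicate removes from each series the timestamps that fall
// inside the range returned for that series.
func deleteWithPredicate(series map[string][]int64, pred rangePredicate) {
	for key, ts := range series {
		lo, hi, ok := pred(key)
		if !ok {
			continue // predicate did not select this series
		}
		kept := ts[:0]
		for _, t := range ts {
			if t < lo || t > hi {
				kept = append(kept, t)
			}
		}
		series[key] = kept
	}
}
```

A single call can then trim, say, the first two points of one series and only the last point of another.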
* Fix stream package to allow for renaming the file before writing it to the stream
* updated test to make sure that the final tsm file has more than one block
This commit fixes a data race in the WAL, which can occur when writes
and deletes are being executed concurrently. The WAL uses a buffer pool
of `[]byte` when reading the WAL. WAL entries are unmarshaled into these
buffers and passed along to the relevant methods handling the different
types of entry (write, delete etc).
In the case of deletes, the keys that need to be deleted were being
stored for later processing; however, these keys were part of the backing
array of the initial buffer from the pool. As such, those keys could be
written to at a future time when handling other parts of the WAL.
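The aliasing problem, and one way to avoid it by copying, can be sketched as follows. `retainKeys` is a hypothetical helper; the commit text does not say how the fix was implemented.

```go
package main

// retainKeys copies every key so that the result no longer shares a
// backing array with the pooled buffer the keys were unmarshaled from.
func retainKeys(keys [][]byte) [][]byte {
	out := make([][]byte, len(keys))
	for i, k := range keys {
		out[i] = append([]byte(nil), k...)
	}
	return out
}
```

A sub-slice such as `buf[:3]` changes whenever `buf` is reused for the next entry, while the copy returned by `retainKeys` does not.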
This commit improves the startup time when using the `inmem` index by
ensuring that the series are created in the index and series file in
batches of 10000, rather than individually.
Fixes #9486.
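Batching of this kind can be sketched with a small helper. `forEachBatch` is a hypothetical name; the callback stands in for creating one batch of series in the index and series file.

```go
package main

// forEachBatch calls fn with consecutive sub-slices of keys, each at
// most size elements long, and stops at the first error.
func forEachBatch(keys []string, size int, fn func(batch []string) error) error {
	if size <= 0 {
		size = 1
	}
	for len(keys) > 0 {
		n := size
		if n > len(keys) {
			n = len(keys)
		}
		if err := fn(keys[:n]); err != nil {
			return err
		}
		keys = keys[n:]
	}
	return nil
}
```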
Re-open the last wal segment instead of creating a new one. This fixes
an issue where the last modified time of the WAL would change on
restart. It also avoids a lot of IO file churn on restart.
Currently if a buffer from the pool is too small to satisfy a request then we simply drop it and allocate a new one.
This change puts it back in the pool and then allocates a new one.
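A minimal sketch of the new behaviour, using a hypothetical channel-backed pool (`bytesPool`) rather than the pool type touched by this change:

```go
package main

// bytesPool is a hypothetical fixed-capacity pool of byte slices.
type bytesPool struct {
	ch chan []byte
}

func newBytesPool(n int) *bytesPool {
	return &bytesPool{ch: make(chan []byte, n)}
}

// put returns b to the pool, or drops it if the pool is full.
func (p *bytesPool) put(b []byte) {
	select {
	case p.ch <- b:
	default:
	}
}

// get returns a buffer of length size. A pooled buffer that is too
// small goes back into the pool instead of being discarded, and a
// fresh buffer is allocated for this request.
func (p *bytesPool) get(size int) []byte {
	select {
	case b := <-p.ch:
		if cap(b) >= size {
			return b[:size]
		}
		p.put(b)
	default:
	}
	return make([]byte, size)
}
```

The small buffer stays available for later, smaller requests.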
This commit adds initial empty sketches back to the tsi1 index, as well
as ensuring that ephemeral sketches in the index `LogFile` are updated
accordingly.
The commit also adds a test that verifies that the merged sketches at
the store level produce the correct results under writes, deletions and
re-opening of the store.
This commit does not provide working sketches for post-compaction on the
tsi1 index.
The default of 4096 results in writes to the WAL still requiring multiple
IOs. We had previously bumped this to 1M, but that was too high when
there are many shards. Increasing to around 16k reduces the IOs to
one or two for the workloads tested. We may want to make this
configurable in the future.
The large number of partitions causes big HeapInUse swings at higher
cardinality, which can lead to OOMs. Reducing this to 16 lowers
write throughput to some extent at lower cardinalities, but keeps memory
more stable over the long run.
Under concurrent writes and deletes of the same series, a nil panic
could occur in bytes.Compare. Instead of setting the seriesKeys to
nil, set them to a zero-length slice, which prevents the panic.
This separates out the dropping of a measurement from the series
to avoid frequent checks to see if a measurement still has series.
The series are dropped individually and we keep track of which
measurements are involved and then delete each measurement afterwards.
If the fields.idx was corrupted in some way, it would cause the shard
to fail to load. Deleting the file will allow it to be rebuilt.
This change handles this automatically so it's rebuilt if necessary
without user intervention.
This commit adds the ability to correctly mark a series as deleted in
the global series file. Whenever a shard engine determines that a series
should be deleted, it checks with each shard's bitset for series that
are to be deleted and are no longer contained in any shard-local
bitsets.
These series are then removed from the series file.
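The check can be sketched as follows, with plain maps standing in for the shard-local bitsets. `removableSeries` is a hypothetical name.

```go
package main

// removableSeries returns the candidate series IDs that no shard-local
// set still contains. Only those may be removed from the series file.
func removableSeries(candidates []uint64, shardSets []map[uint64]struct{}) []uint64 {
	var out []uint64
	for _, id := range candidates {
		held := false
		for _, set := range shardSets {
			if _, ok := set[id]; ok {
				held = true
				break
			}
		}
		if !held {
			out = append(out, id)
		}
	}
	return out
}
```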
This test could hang due to an existing race that is still not fixed.
The snapshot and level compaction goroutines would end up waiting on
the wrong channel to be closed so they would never exit.
This commit adds a bitset into each shard's in-memory index, to be used to
track undeleted series ids. Currently tsi1 support is not implemented.
When new series are added to the shard, the series id is added
to the bitset. When series are deleted from the shard, the series
ids are removed from the bitset.
Because each shard shares the same inmem index reference, the bitset
is stored in the `ShardIndex`, which is local to each shard, and then
different references are passed into the shared `Index` object, depending
on which shard is writing the series.
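For illustration, a dense bitset of series IDs can be sketched as below. `seriesIDSet` is a hypothetical minimal type, not the bitset implementation used by the index.

```go
package main

// seriesIDSet is a hypothetical dense bitset keyed by series ID.
// It grows to cover the largest ID added, one bit per ID.
type seriesIDSet struct {
	words []uint64
}

// add marks id as present, growing the set if needed.
func (s *seriesIDSet) add(id uint64) {
	w := id / 64
	for uint64(len(s.words)) <= w {
		s.words = append(s.words, 0)
	}
	s.words[w] |= 1 << (id % 64)
}

// remove clears id; removing an absent id is a no-op.
func (s *seriesIDSet) remove(id uint64) {
	if w := id / 64; w < uint64(len(s.words)) {
		s.words[w] &^= 1 << (id % 64)
	}
}

// contains reports whether id is present.
func (s *seriesIDSet) contains(id uint64) bool {
	w := id / 64
	return w < uint64(len(s.words)) && s.words[w]&(1<<(id%64)) != 0
}
```

Each shard would own one such set, adding IDs as series are written and removing them as series are deleted.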
* Live Restore + Enterprise data format compatibility
* Extended ImportData to import all DB's if no db name given
* Added a new enterprise data test, and backup command now prints the backup file paths at conclusion
* Added whole-system backup test
* Update to use protobuf in all enterprise data cases
* Update test to do cross-testing with enterprise version
* incremental enterprise backup format support
The cache defaulted to entry capacity size of 32. This default
is fine for lower cardinalities, but causes big spikes in InUse
heap with higher cardinalities that can OOM the process. Since
the hints had to be removed previously due to increased memory usage,
they are now completely removed. For lower cardinalities, we do
grow the slice, but this has a small performance penalty compared
to the large memory usage/OOMs with larger cardinalities.
* only call ParseTags when necessary
* remove dependency on inmem.Series in tsdb test package
* Measurement and Series are no longer exported. Their use is restricted
to the inmem package
* improve Measurement and Series types by exporting immutable
fields and removing unnecessary APIs and locks
Reduced startup time from 28s to 17s. Overall improvement including
#9162 reduces startup from 46s to 17s for 1MM series across 14 shards.
This commit ensures that the series file should work appropriately on
32-bit architectures. It does this by reducing the maximum size of a
series file to 512MB on 32-bit systems, which should be fully
addressable.
It further updates tests so that the series file size can be reduced
further when running many tests in parallel on 32-bit architectures.
This limits the disk IO for writing TSM files during compactions
and snapshots. This helps reduce the spiky IO patterns on SSDs and
when compactions run very quickly.
Since possibly v0.9 DELETE SERIES has had the unwanted side effect of
removing series from the index when the last traces of series data are
removed from TSM. This occurred because the inmem index was rebuilt on
startup, and if there was no TSM data for a series then there could be
no series to add to the index.
This commit returns to the original (documented) DROP/DELETE SERIES
behaviour. As such, when issuing DROP SERIES all instances of matching
series will be removed from both the TSM engine and the index. When
issuing DELETE SERIES only TSM data will be removed.
It is up to the operator to remove series from the index.
NB, this commit does not address how to remove series data from the
series file when a shard rolls over.
This changes the approach to adjusting the amount of concurrency
used for snapshotting to be based on the snapshot latency vs
cardinality. The cardinality approach could use too much concurrency
and increase the number of level 1 TSM files too quickly which incurs
more disk IO.
The latency model seems to adjust better to different workloads.