This commit enables copy-on-write on the SeriesIDSets, so that we can
make immutable clones of the underlying bitmaps efficiently. If the
original bitmap is modified, a copy is made first, so the clone is
unaffected.
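A minimal sketch of the idea, assuming the roaring library's copy-on-write support; the wrapper type and method here are illustrative, not the actual implementation:

```go
package tsdb

import (
    "sync"

    "github.com/RoaringBitmap/roaring"
)

// SeriesIDSet is an illustrative wrapper around a roaring bitmap.
type SeriesIDSet struct {
    mu     sync.Mutex
    bitmap *roaring.Bitmap
}

// Clone returns an immutable snapshot of the set. With copy-on-write
// enabled, Clone shares the underlying containers; the first mutation
// of the original copies them, leaving the clone untouched.
func (s *SeriesIDSet) Clone() *SeriesIDSet {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.bitmap.SetCopyOnWrite(true)
    return &SeriesIDSet{bitmap: s.bitmap.Clone()}
}
```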
This commit ensures that cached bitset results at the Index level are
updated whenever new series ids are created that would belong in those
bitsets.
For example, if we have a cached bitset for the tuple {mem, region,
west} and we add the series mem,host=prod,region=west, then we update
the cached bitset for {mem, region, west} with the id of the newly
written series.
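A minimal sketch of the update; the three-level cache shape and the names below are assumptions for illustration:

```go
// seriesIDSet is the minimal capability assumed for a cached bitset.
type seriesIDSet interface{ Add(id uint64) }

// bitsetCache maps measurement -> tag key -> tag value -> bitset.
type bitsetCache map[string]map[string]map[string]seriesIDSet

// addSeriesID adds a newly created series id to every cached bitset
// the series belongs to, e.g. the {mem, region, west} entry for the
// series mem,host=prod,region=west.
func (c bitsetCache) addSeriesID(name string, tags map[string]string, id uint64) {
    for k, v := range tags {
        if set, ok := c[name][k][v]; ok {
            set.Add(id)
        }
    }
}
```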
This commit removes the HLL sketches on each `tsi1.LogFile` and
`tsi1.IndexFile` and instead caches the data at the `tsi1.Index`
level. This reduces the heap size significantly for servers with
many TSI-enabled shards.
FloatBatchDecodeAll behaves the same as the iterator-based float
decoder, returning an empty slice and no error when passed a buffer
with no encoded float values.
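A usage sketch, assuming the signature `FloatBatchDecodeAll(b []byte, dst []float64) ([]float64, error)`:

```go
func decodeFloatBlock(block []byte, vals []float64) ([]float64, error) {
    // Decode the entire encoded block in one call, reusing the
    // capacity of vals. For a block with no encoded float values this
    // returns an empty slice and a nil error, same as the iterator
    // path.
    return FloatBatchDecodeAll(block, vals[:0])
}
```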
Fixes #10270
Since we append to the file itself, once we have read the file in, we
can be done with the mmap'd data.
Ideally we can rework UnmarshalBinary and do away with the mmap
completely. That is future work.
This commit ensures that any orphaned series (series that are to be
removed and are no longer referenced anywhere in the database) are
removed from the `inmem` index when a shard is dropped.
Since all tag sets are materialised to strings before this method
returns, a large number of allocations can be avoided by carefully
reusing buffers and containers.
This commit reduces allocations by about 75%, which can be very
significant for high cardinality workloads.
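A minimal sketch of the reuse pattern (not the actual implementation): one scratch buffer is reused across all tag sets, so the per-series allocations collapse into a single string copy each.

```go
func tagSetKeys(series []map[string]string) []string {
    keys := make([]string, 0, len(series))
    buf := make([]byte, 0, 256) // scratch buffer reused for every series
    for _, tags := range series {
        buf = buf[:0]
        for k, v := range tags {
            buf = append(buf, k...)
            buf = append(buf, '=')
            buf = append(buf, v...)
            buf = append(buf, ',')
        }
        keys = append(keys, string(buf)) // one unavoidable copy per key
    }
    return keys
}
```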
The benchmark results shown below are for a benchmark that asks for all
series keys matching `tag5=value0`.
name                                               old time/op    new time/op    delta
Index_ConcurrentWriteQuery/inmem/queries_100000-8    5.66s ± 4%     5.70s ± 5%     ~     (p=0.739 n=10+10)
Index_ConcurrentWriteQuery/tsi1/queries_100000-8     26.5s ± 8%     26.8s ±12%     ~     (p=0.579 n=10+10)
IndexSet_TagSets/1M_series/inmem-8                  11.9ms ±18%    10.4ms ± 2%  -12.81%  (p=0.000 n=10+10)
IndexSet_TagSets/1M_series/tsi1-8                   23.4ms ± 5%    18.9ms ± 1%  -19.07%  (p=0.000 n=10+9)

name                                               old alloc/op   new alloc/op   delta
Index_ConcurrentWriteQuery/inmem/queries_100000-8   2.50GB ± 0%    2.50GB ± 0%     ~     (p=0.315 n=10+10)
Index_ConcurrentWriteQuery/tsi1/queries_100000-8    32.6GB ± 0%    32.6GB ± 0%     ~     (p=0.247 n=10+10)
IndexSet_TagSets/1M_series/inmem-8                  3.56MB ± 0%    3.56MB ± 0%     ~     (all equal)
IndexSet_TagSets/1M_series/tsi1-8                   12.7MB ± 0%     5.2MB ± 0%  -59.02%  (p=0.000 n=10+10)

name                                               old allocs/op  new allocs/op  delta
Index_ConcurrentWriteQuery/inmem/queries_100000-8    24.0M ± 0%     24.0M ± 0%     ~     (p=0.353 n=10+10)
Index_ConcurrentWriteQuery/tsi1/queries_100000-8     96.6M ± 0%     96.7M ± 0%     ~     (p=0.579 n=10+10)
IndexSet_TagSets/1M_series/inmem-8                     51.0 ± 0%      51.0 ± 0%     ~     (all equal)
IndexSet_TagSets/1M_series/tsi1-8                     80.4k ± 0%     20.4k ± 0%  -74.65%  (p=0.000 n=10+10)
The internals of `newSeriesCursor` returned a struct pointer that was
implicitly converted to the interface type. Unfortunately, Go
represents such a conversion as an interface holding a nil pointer to
the struct rather than as a nil interface, so comparing the returned
cursor to nil yielded false; the caller would conclude the cursor was
non-nil and attempt to use it.
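The gotcha in miniature (a runnable illustration, not the original code):

```go
package main

import "fmt"

type cursor interface{ Next() bool }

type seriesCursor struct{}

func (c *seriesCursor) Next() bool { return false }

// newSeriesCursor returns a nil *seriesCursor, which the return
// statement implicitly converts to a non-nil interface value.
func newSeriesCursor() cursor {
    var sc *seriesCursor
    return sc // interface holds (type=*seriesCursor, value=nil)
}

func main() {
    c := newSeriesCursor()
    fmt.Println(c == nil) // false: the interface carries a type, so it is not nil
}
```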
This PR adds a configuration option that can be used to inform the
kernel that we intend to page in much of the TSM files.
This madvise value has been problematic in the past when it has been
set, so the option defaults to off. It may be useful to some users with
slow disks.
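A sketch of what honouring the option might look like, using `golang.org/x/sys/unix`; the helper name is hypothetical:

```go
import "golang.org/x/sys/unix"

// madviseWillNeed hints to the kernel that we intend to page in much
// of the mmap'd TSM file b. Off by default, since MADV_WILLNEED has
// been problematic in the past.
func madviseWillNeed(b []byte) error {
    return unix.Madvise(b, unix.MADV_WILLNEED)
}
```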
PR #9204 introduced a maximum default concurrent compaction limit of 4.
The idea was to reduce IO utilisation on large systems with many cores,
and high write load. Often on these systems, disks were not scaled
appropriately to the write volume, and while the write path could
keep up, compactions would saturate disks.
In #9225 work was done to reduce IO saturation by limiting the
compaction throughput. To some extent, both #9204 and #9225 work towards
solving the same problem.
We have recently begun to notice larger clusters suffering when
compactions cannot keep up: the clusters have been scaled up, but the
limit of 4 has stayed in place. While users can manually override the
setting, it seems more user friendly to remove the limit by default and
set it manually in cases where compactions cause too much IO on large
boxes.
If it's known that the read request only needs to use a single
measurement, then we can avoid the need to get field keys via the query
engine.
However, that means that a new method of getting the field keys for a
measurement would be needed. This commit exposes a method to efficiently
get field key names for a measurement across multiple shards.
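A sketch of the shape such a method might take; the interface and method names here are assumptions:

```go
import "sort"

// fieldKeyser is the minimal capability we need from a shard.
type fieldKeyser interface {
    FieldKeys(measurement []byte) []string
}

// measurementFieldKeys collects field key names for one measurement
// across several shards, without going through the query engine.
func measurementFieldKeys(shards []fieldKeyser, name []byte) []string {
    set := make(map[string]struct{})
    for _, sh := range shards {
        for _, key := range sh.FieldKeys(name) {
            set[key] = struct{}{}
        }
    }
    keys := make([]string, 0, len(set))
    for k := range set {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    return keys
}
```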
Array cursors are enabled for storage RPC calls
tsm1:
* Implemented cursors that utilize Array decoders
storage:
* Abstractions to easily switch to Array cursors
* introduced tmpl from Arrow, which allows existing templates to be
reused with additional command-line properties to control output.
* duplicated suite of ReadFloatBlock tests for ReadFloatArrayBlock
* only the float data type is tested as the Read APIs are generated
from a single template.
* separate slices for time and values
* structured to be Arrow ready
* batch decoders fill time and value slices independently, which
vastly improves performance (benchmarks linked in PR)
* APIs decode an entire byte slice of encoded data into the provided
`dst` slice
* APIs are stateless and in almost all cases avoid any allocations
* Intended to be used by future batch-oriented TSM block decode APIs (a sketch follows this list)
* duplicated tests from original iterator-based APIs
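A sketch tying the pieces together, with assumed names: time and value live in separate slices, and each batch decoder fills its slice independently.

```go
// FloatArray holds decoded timestamps and values in separate slices,
// structured to be Arrow ready.
type FloatArray struct {
    Timestamps []int64
    Values     []float64
}

// decodeFloatArrayBlock decodes the encoded time bytes tb and value
// bytes vb into a, reusing the capacity of its slices.
func decodeFloatArrayBlock(tb, vb []byte, a *FloatArray) error {
    var err error
    if a.Timestamps, err = TimeBatchDecodeAll(tb, a.Timestamps[:0]); err != nil {
        return err
    }
    a.Values, err = FloatBatchDecodeAll(vb, a.Values[:0])
    return err
}
```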
This commit swaps out map[uint64]struct{} implementations for roaring
bitmaps, which in turn improves memory usage and read performance.
The bitmap implementation is abstracted such that for low cardinality
sets a simple slice of ids is used, to reduce in-use memory.
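A minimal sketch of the abstraction, using `github.com/RoaringBitmap/roaring`; the threshold and names are illustrative:

```go
import "github.com/RoaringBitmap/roaring"

const promoteAt = 32 // illustrative crossover point

type idSet struct {
    small []uint32        // used while the set is low cardinality
    large *roaring.Bitmap // nil until promoted
}

func (s *idSet) add(id uint32) {
    if s.large != nil {
        s.large.Add(id)
        return
    }
    s.small = append(s.small, id)
    if len(s.small) > promoteAt {
        // Promote: move the ids into a roaring bitmap.
        s.large = roaring.BitmapOf(s.small...)
        s.small = nil
    }
}
```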
When adding many series using offline tooling, it's likely that every
series involves an entry being appended to a LogFile. Typically an entry
is 11 or 12 bytes, but the default bufio.Writer buffer size is only 4K.
This means by default a write of 10,000 new series would involve ~30
buffer flushes.
This commit makes the buffer configurable, and sets the value in
`buildtsi` such that it reflects the number of series being written to
the LogFile.
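A sketch of the sizing logic, with hypothetical names:

```go
import (
    "bufio"
    "io"
)

// newLogFileWriter sizes the bufio.Writer from the expected number of
// entries rather than bufio's 4K default, so a bulk load of 10,000
// ~12-byte entries needs one flush instead of ~30.
func newLogFileWriter(w io.Writer, nEntries int) *bufio.Writer {
    const entrySize = 12 // typical encoded entry size
    size := nEntries * entrySize
    if size < 4096 {
        size = 4096
    }
    return bufio.NewWriterSize(w, size)
}
```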
When running offline tooling, flushing buffers and syncing files on
every write to a `LogFile` is not necessary. Were a hard exit
with data loss to occur, the tooling can simply be run again.
TSI LogFile compactions occasionally race with insert and delete
operations because the index partition FileSet is retained needlessly by
the method that calls Partition.CheckLogFile.
In this change:
- TSI LogFile compaction respects enable/disable compactions
- Partition FileSet.Release before log compaction is triggered
An alternative to the second step is to handle log file compaction in a
new goroutine. Log file compaction errors would be logged and not
returned to the caller.
After this change, `DELETE FROM /regex/` does not deadlock; performance:
- 30s to delete 100 measurements
- 5m30s to delete 1000 measurements
This commit allows users to filter on the `value` field in the
`SHOW TAG VALUES` command:
SHOW TAG VALUES WITH KEY = "mytag" WHERE "value" = 'myvalue'
Previously this command would return all values.
we were asserting to an *os.File in order to call Sync, but in some
cases the file handle has been wrapped, for example with limiting.
instead, assert to minimal interfaces for the functionality we need
and attempt to add some robustness in the code that creates the
writers by using a stronger interface with a Sync method.
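a minimal sketch of the approach:

```go
import "io"

// syncer is the smallest interface we actually need from a writer.
type syncer interface {
    Sync() error
}

// syncWriter syncs w if its concrete type supports it, instead of
// asserting to *os.File and failing on wrapped handles.
func syncWriter(w io.Writer) error {
    if s, ok := w.(syncer); ok {
        return s.Sync()
    }
    return nil
}
```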
fixes #9991
multiple users have attempted to run influxdb in a docker container
with a windows host and a volume mounted from windows. that causes
problems because it apparently uses samba/cifs, which does not support
fsync on directories. with this patchset, if a directory fsync returns
EINVAL, as appears to happen on samba/cifs, the error is ignored. this
should help.
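a sketch of the tolerance, with a hypothetical helper name:

```go
import (
    "errors"
    "os"
    "syscall"
)

// syncDir fsyncs a directory, ignoring EINVAL as returned by
// filesystems like samba/cifs that do not support fsync on
// directories.
func syncDir(dir string) error {
    fd, err := os.Open(dir)
    if err != nil {
        return err
    }
    defer fd.Close()
    if err := fd.Sync(); err != nil && !errors.Is(err, syscall.EINVAL) {
        return err
    }
    return nil
}
```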
fixes #9833.
fixes #9630.
When `influx_inspect buildtsi` is used to create a new `tsi1` index, spaces in measurement names are escaped, so measurement "a b" is changed to "a\ b".
This change modifies `models.ParseKeyBytes()` and `models.ParseName()` to unescape measurement names. `models.ParseKeyBytes()` returns unescaped tag keys, so this seems like the natural place to unescape measurement names.
Also followed `scanMeasurement()` to see what other code could be problematic, and this should be everything (the result of one other use of `scanMeasurement()` is later escaped).
Removed `tsdb.MeasurementFromSeriesKey()`. These methods are exported, so other InfluxData repositories were checked for side effects.
This commit restricts the number of TSM1 files that can be opened
concurrently across the entire `tsdb.Store`. There is currently
a limit for the number of shards that can be opened concurrently,
however, this limit does not help when the number of CPU cores
is higher than the number of shards. Because TSM1 files have a 2GB
limit and there is no limit on the number of files per shard,
extremely large shards (1TB+) can load 1,000s of files simultaneously.
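A minimal sketch of a store-wide limit, using a buffered channel as a counting semaphore; the names are hypothetical:

```go
// fileLimiter bounds how many TSM files may be opened concurrently
// across the entire store.
type fileLimiter chan struct{}

func newFileLimiter(n int) fileLimiter { return make(fileLimiter, n) }

func (l fileLimiter) Take()    { l <- struct{}{} }
func (l fileLimiter) Release() { <-l }
```

Each shard takes a slot before opening a file, so even a 1TB+ shard with thousands of files cannot open them all at once.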
This improvement avoids performing a binary search on the index by
first checking the key against the lower and upper bounds. Particularly
useful for multiple, fully-compacted TSM files.
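A sketch of the early-out, with hypothetical names:

```go
import (
    "bytes"
    "sort"
)

// containsKey checks key against the index bounds before paying for
// the binary search; across multiple fully-compacted TSM files most
// lookups fall outside a file's [min, max] and return immediately.
func containsKey(keys [][]byte, key []byte) bool {
    if len(keys) == 0 ||
        bytes.Compare(key, keys[0]) < 0 ||
        bytes.Compare(key, keys[len(keys)-1]) > 0 {
        return false
    }
    i := sort.Search(len(keys), func(i int) bool {
        return bytes.Compare(keys[i], key) >= 0
    })
    return i < len(keys) && bytes.Equal(keys[i], key)
}
```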
callers can always ensure that the observer set on the engine options
is appropriate for that shard id. this simplifies the api and reduces
the chance of bugs due to mixing up shard ids.
just adds an interface for hooks about when these files come and go.
we run them before the action is taken, so that if a hook returns an
error it doesn't cause any consistency problems.
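a sketch of what such an interface might look like; the method names are assumptions:

```go
// FileStoreObserver is notified before file actions are taken, so a
// hook error cannot leave the store inconsistent.
type FileStoreObserver interface {
    // FileFinishing is called before the file is renamed into place.
    FileFinishing(path string) error
    // FileUnlinking is called before the file is removed.
    FileUnlinking(path string) error
}
```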
The InUse call on TSMFiles is inherently racy in the presence of
Ref calls outside of the file store mutex. In addition, we return
some TSMFiles to callers without them being Ref'd, which might allow
them to be closed from underneath. While I believe that is impossible
in practice (the only thing that gets a handle externally is
compaction, which enforces that only one handle exists at a time, so a
file is deleted only once the compaction is done with it), it's not
very obvious or enforced.
Instead, always return a TSMFile with a Ref call under the read
lock, and require that no one else calls Ref. That way, it cannot
transition to referenced if the InUse call returns false under the
write lock.
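A sketch of the invariant; the types and fields here are hypothetical:

```go
import "sync"

type TSMFile interface {
    Ref()
    Unref()
    InUse() bool
}

type FileStore struct {
    mu    sync.RWMutex
    files map[string]TSMFile
}

// acquire returns the file for key with its Ref already taken under
// the read lock. Callers must not call Ref themselves, so InUse can
// never flip to true after returning false under the write lock.
func (s *FileStore) acquire(key string) TSMFile {
    s.mu.RLock()
    defer s.mu.RUnlock()
    f := s.files[key]
    if f != nil {
        f.Ref()
    }
    return f
}
```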
The CreateSnapshot method was racy in a number of ways in the presence
of multiple calls or compactions: it did not take references to the
TSMFiles, and the temporary directory it creates could have been
shared with concurrent CreateSnapshot calls. In addition, the
files slice could have been concurrently mutated during a compaction.
Instead, under the write lock, make a local copy of the state for
the compaction, including Ref calls (write locks are implicitly
read locks). Then, there is no need for a lock at all afterward.
Add some comments to explain these issues at the call sites of InUse,
and document that the Files method that returns the slice unprotected
is only for tests.
- reduce allocations by making leaf a value type with a bool
- make longestPrefix inlineable and have no bounds checks (sketched after this list)
- delete any code for functions we don't plan to use
- operate on []byte and only copy when necessary
- inline calls to sort.Search to avoid allocations and indirections
- insert directly in the correct location for addEdge
- reduce allocations during copying with a buffer helper
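A sketch of what an inlineable, bounds-check-free longestPrefix can look like (not necessarily the exact code):

```go
// longestPrefix returns the length of the common prefix of k1 and k2.
// The re-slice of k2 lets the compiler prove both index expressions
// are in bounds, and the function is small enough to inline.
func longestPrefix(k1, k2 []byte) int {
    if len(k1) > len(k2) {
        k1, k2 = k2, k1
    }
    k2 = k2[:len(k1)]
    for i := range k1 {
        if k1[i] != k2[i] {
            return i
        }
    }
    return len(k1)
}
```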
results:
name              old time/op    new time/op    delta
Tree_Insert-8       1.10ms ± 4%    0.73ms ± 4%  -33.54%  (p=0.000 n=10+10)
Tree_InsertNew-8    3.18ms ± 2%    1.91ms ± 6%  -39.90%  (p=0.000 n=10+10)

name              old speed      new speed      delta
Tree_Insert-8     9.12MB/s ± 4%  13.72MB/s ± 4%  +50.46%  (p=0.000 n=10+10)
Tree_InsertNew-8  3.15MB/s ± 2%   5.24MB/s ± 6%  +66.42%  (p=0.000 n=10+10)

name              old alloc/op   new alloc/op   delta
Tree_InsertNew-8    1.62MB ± 0%    1.60MB ± 0%   -1.28%  (p=0.000 n=10+9)

name              old allocs/op  new allocs/op  delta
Tree_InsertNew-8     35.0k ± 0%     15.0k ± 0%  -57.04%  (p=0.000 n=10+10)
MB/sec in this case is 1 byte per key inserted, so it's really millions
of keys inserted per second.
This is the start of per-series validation that occurs in the
Engine write path. It uses an in-memory radix tree to reduce
memory usage and is re-built on demand the first time a series
is written.
does some basic sanity checks. it's hard to be more exhaustive without
either taking a crazy amount of time, or being non-deterministic,
but at least this makes sure we barf in some cases.
No appreciable changes in benchmark results. This function seems to
account for less than 4% of CPU time in the benchmark write workloads,
at least.
at some point, the Inmem field on the engine options became
required, but the benchmarks weren't updated.
also uses filepath everywhere when manipulating file paths.
* filters allow specific combinations of database, retention policy and
shard groups to be opened. This was added to reduce the start-up time
of the export tool and limit the memory usage.
* Check for errors from binary.Uvarint when reading TSI logs
* also check len(parsed) == len(input)
* wrap binary.Uvarint
* make uvarint() more generally useful/used (a sketch follows this list)
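A sketch of such a wrapper, following `binary.Uvarint`'s documented return values:

```go
import (
    "encoding/binary"
    "errors"
)

// uvarint wraps binary.Uvarint, converting its sentinel return values
// (n == 0: buffer too small; n < 0: value overflows uint64) into
// errors the TSI log reader can check.
func uvarint(b []byte) (uint64, int, error) {
    x, n := binary.Uvarint(b)
    if n == 0 {
        return 0, 0, errors.New("uvarint: buffer too small")
    }
    if n < 0 {
        return 0, 0, errors.New("uvarint: value overflows uint64")
    }
    return x, n, nil
}
```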
This moves the time range to delete to be returned by the predicate
func in DeleteSeriesRangeWithPredicate. It allows for a single delete
to delete different ranges of times per series instead of a single
range of time for all series.
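A sketch of the predicate's new shape, assumed from the description; `models` is the InfluxDB models package:

```go
import "github.com/influxdata/influxdb/models"

// DeletePredicate is illustrative: the predicate returns the time
// range to delete for this series and whether to delete at all, so a
// single call can delete a different [min, max] per series.
type DeletePredicate func(name []byte, tags models.Tags) (min, max int64, del bool)
```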
* Fix stream package to allow for renaming the file before writing it to the stream
* updated test to make sure that the final tsm file has more than one block
This commit adds the `max-index-log-file-size` configuration flag so
that users can restrict the maximum size of log files before compaction.
The default limit was also lowered from `5MB` to `1MB`. The original
size was set before we partitioned the index, so the change reflects this.
This change simplifies the math engine so it doesn't use a complicated
set of nested iterators. That way, we have to change math in one fewer
place.
It also greatly simplifies the query engine as now we can create the
necessary iterators, join them by time, name, and tags, and then use the
cursor interface to read them and use eval to compute the result. It
makes it so the auxiliary iterators and all of their complexity can be
removed.
This also makes use of the new eval functionality that was recently
added to the influxql package.
No math functions have been added, but the scaffolding has been included
so things like trigonometry functions are just a single commit away.
This also introduces a small breaking change. Because of the call
optimization, using the same selector multiple times now still counts
as a single selector. So if you do this:
SELECT max(value) * 2, max(value) / 2 FROM cpu
This will now return the timestamp of the max value rather than zero,
since the query is considered to have only a single selector rather
than multiple separate selectors. If any aspect of the selector
differs, such as the selector function or its arguments, the selectors
will be considered aggregates, as in the old behavior.
This commit fixes a data race in the WAL, which can occur when writes
and deletes are being executed concurrently. The WAL uses a buffer pool
of `[]byte` when reading the WAL. WAL entries are unmarshaled into these
buffers and passed along to the relevant methods handling the different
types of entry (write, delete etc).
In the case of deletes, the keys that need to be deleted were stored
for later processing; however, these keys were part of the backing
array of the initial buffer from the pool. As such, those keys could be
overwritten at a future time when other parts of the WAL were handled.
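A sketch of the fix: copy the keys out of the pooled buffer before the buffer can be reused.

```go
func copyKeys(pooled [][]byte) [][]byte {
    keys := make([][]byte, 0, len(pooled))
    for _, k := range pooled {
        kc := make([]byte, len(k))
        copy(kc, k) // kc no longer aliases the pool's backing array
        keys = append(keys, kc)
    }
    return keys
}
```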
There was a check in inmem TagSets to see if a series was assigned
to a shard to prevent cursors for non-existent series getting created.
This check was lost during TSI development because inmem Series
tracking was removed and then replaced with bitsets, but the check was
never re-incorporated on top of the bitsets. This adds the
functionality back using the bitsets.
This commit improves the startup time when using the `inmem` index by
ensuring that the series are created in the index and series file in
batches of 10000, rather than individually.
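A sketch of the batching loop; the index method name and signature are assumptions:

```go
import "github.com/influxdata/influxdb/models"

// seriesCreator is the minimal capability assumed for this sketch.
type seriesCreator interface {
    CreateSeriesListIfNotExists(keys, names [][]byte, tags []models.Tags) error
}

func createSeriesInBatches(idx seriesCreator, keys, names [][]byte, tags []models.Tags) error {
    const batchSize = 10000
    for i := 0; i < len(keys); i += batchSize {
        j := i + batchSize
        if j > len(keys) {
            j = len(keys)
        }
        // One index/series-file operation per batch of 10000.
        if err := idx.CreateSeriesListIfNotExists(keys[i:j], names[i:j], tags[i:j]); err != nil {
            return err
        }
    }
    return nil
}
```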
Fixes #9486.
Re-open the last wal segment instead of creating a new one. This fixes
an issue where the last modified time of the WAL would change on
restart. It also avoids a lot of IO file churn on restart.
Previously, if a buffer from the pool was too small to satisfy a request, we would simply drop it and allocate a new one.
This change puts the too-small buffer back in the pool and then allocates a new one.
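A sketch of the change; the pool shape is hypothetical:

```go
type bufPool struct{ ch chan []byte }

// get returns a buffer of at least size bytes. A pooled buffer that is
// too small goes back into the pool instead of being dropped.
func (p *bufPool) get(size int) []byte {
    select {
    case b := <-p.ch:
        if cap(b) < size {
            p.put(b) // keep it for future, smaller requests
            return make([]byte, size)
        }
        return b[:size]
    default:
        return make([]byte, size)
    }
}

func (p *bufPool) put(b []byte) {
    select {
    case p.ch <- b:
    default: // pool full; drop
    }
}
```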
When a max series per database limit was in place (or 0), we would
create series one at a time, which really affects throughput. This does
it in bulk, which is less accurate but more performant.
The batch of writes is almost always larger than the 4096 default
which leads to more write IOs. Increasing to 32k allows the majority
of writes to be handled in one IO.
This was added to prevent concurrent writes and deletes to the
same series. This is now handled by the bitsets for both tsi and
inmem. The time.Now() calls show up in profiles and are not needed.
This commit adds initial empty sketches back to the tsi1 index, as well
as ensuring that ephemeral sketches in the index `LogFile` are updated
accordingly.
The commit also adds a test that verifies that the merged sketches at
the store level produce the correct results under writes, deletions and
re-opening of the store.
This commit does not provide working sketches for post-compaction on the
tsi1 index.
Because of a race between the index and series file lookups, empty
keys can be returned for series which are tombstoned after the
series ids are obtained but before the caller looks up the key.
The default of 4096 results in writes to the WAL still requiring multiple
IOs. We had previously bumped this to 1M, but that was too high when
there are many shards. Increasing to around 16k reduces the IOs to
one or two for the workloads tested. We may want to make this
configurable in the future.
The large number of partitions causes big HeapInUse swings at higher
cardinality, which can lead to OOMs. Reducing this to 16 lowers
write throughput to some extent at lower cardinalities, but keeps
memory more stable over the long run.
If all the series in a measurement were tombstoned, MeasurementHasSeries
would return true because the ok var was re-used from a prior check
earlier in the func. This caused it to be true all the time unless
the measurement was actually tombstoned.
Store.DeleteSeries held an RLock while deleting from each shard.
While deleting, the Engine uses shardSet to see if a series is fully
deleted. The shardSet.ForEach also takes an RLock. If a Lock is
requested between these two calls, a deadlock occurs.
To fix, we don't need to hold an RLock for the duration of the delete
in the store, as each Shard handles concurrency itself and we have a
snapshot of the shards we need to access.
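A sketch of the fix; the types and method names here are hypothetical:

```go
import "sync"

// shard is the minimal capability assumed for this sketch.
type shard interface {
    DeleteSeriesRange(min, max int64) error
}

type Store struct {
    mu     sync.RWMutex
    shards map[uint64]shard
}

func (s *Store) deleteSeries(min, max int64) error {
    // Take a snapshot of the shards under a short read lock...
    s.mu.RLock()
    shards := make([]shard, 0, len(s.shards))
    for _, sh := range s.shards {
        shards = append(shards, sh)
    }
    s.mu.RUnlock()

    // ...then delete with no store lock held. Each shard handles its
    // own concurrency, so the engine's shardSet RLock can no longer
    // deadlock against a pending write lock on the store.
    for _, sh := range shards {
        if err := sh.DeleteSeriesRange(min, max); err != nil {
            return err
        }
    }
    return nil
}
```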