When the compaction planner runs, if it cannot acquire
a lock on the files it plans to compact, it returns a
nil list of compaction groups. This, in turn, sets the
engine statistics for compaction queues to zero,
which is incorrect. Instead, use the length of the pending
files that would have been returned.
closes https://github.com/influxdata/influxdb/issues/22138
This fix ensures that memory-mapped files are not released
before pointers into them are copied into heap memory.
MeasurementNamesByExpr() and MeasurementNamesByPredicate() can
cause panics by copying memory from mmapped files that have been
released. The functions they call use iterators to files which
are closed (releasing the mmapped files) before the memory is
safely copied to the heap.
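A minimal sketch of the pattern behind the fix (the helper name is illustrative): slices handed out by index iterators can point into mmapped files, so they are cloned onto the heap before the iterator, and with it the mapping, is closed:

```go
// cloneToHeap copies a byte slice that may point into an mmapped file
// onto the Go heap, so it remains valid after the file is unmapped.
func cloneToHeap(b []byte) []byte {
	out := make([]byte, len(b))
	copy(out, b)
	return out
}
```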
closes https://github.com/influxdata/influxdb/issues/22000
Compaction logging will generate intermediate information on the
volume of data written and the output files created, as well as
improve some of the anti-entropy messages related to compaction.
This also applies to `influx_tools compact`.
Closes https://github.com/influxdata/influxdb/issues/21704
tsm1.DigestWithOptions closes its network connection
twice. This may cause broken-pipe errors in concurrent
invocations of the same procedure by closing a reused
I/O descriptor. This fix also captures errors from TSM
file closures, which were previously ignored.
Closes https://github.com/influxdata/influxdb/issues/21656
Under heavy write load that creates new fields and measurements,
the rewrite of the fields.idx file is a bottleneck. This
enhancement combines multiple writes into a single one and
shares any error return value with all of the combined
invocations, as sketched below. MeasurementFieldSet and the new
MeasurementFieldSetWriter must both now be explicitly
closed.
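A minimal, hypothetical sketch of the coalescing idea; none of these names are the actual MeasurementFieldSetWriter API. Concurrent callers queue a request and block on a result channel, and a single writer goroutine performs one rewrite on behalf of the whole batch, sharing its error with every caller:

```go
type writeRequest struct {
	done chan error // shared result for every coalesced caller
}

// writer drains all queued requests, performs one file rewrite, and
// fans the single error out to every caller it served.
func writer(queue chan writeRequest, rewrite func() error) {
	for req := range queue {
		batch := []writeRequest{req}
		for more := true; more; {
			select {
			case r := <-queue:
				batch = append(batch, r)
			default:
				more = false
			}
		}
		err := rewrite() // one rewrite covers the whole batch
		for _, r := range batch {
			r.done <- err
		}
	}
}
```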
Closes #21577
tsdb.Engine.IsIdle and tsdb.Engine.Digest now return a reason string for why the engine & shard are not idle.
Callers can then use this string for logging, if desired. The returned reason does not allocate memory, so the
caller may want to add the shard ID and path for more information in the log. This is intended to be used in
calls from the anti-entropy service in Enterprise.
(cherry picked from commit bf45841359)
fixes https://github.com/influxdata/influxdb/issues/21448
* fix: backport tsdb fix for window pushdowns
From https://github.com/influxdata/influxdb/pull/19855
* fix(storage): cursor requests are [start, stop] instead of [start, stop)
The cursors were previously [start, stop) to be consistent with how flux
requests data, but the underlying storage file store was [start, stop]
because that's how influxql read data. This reverts the cursor
behavior so that it is now [start, stop] everywhere, and the
conversion from [start, stop) to [start, stop] is performed when
making the cursor request to get the next cursor.
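A minimal sketch of the boundary conversion, assuming int64 nanosecond timestamps; the exclusive end of a Flux-style [start, stop) request is decremented to form the closed [start, stop] range the cursor layer now uses everywhere:

```go
// toClosed converts a half-open [start, stop) request into the closed
// [start, stop] range used by the underlying file store.
func toClosed(start, stop int64) (first, last int64) {
	return start, stop - 1 // stop-1 is the last included timestamp
}
```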
cherry-pick from #21318
Co-authored-by: Sam Arnold <sarnold@influxdata.com>
(cherry picked from commit 7766672797)
* chore: fix formatting
Co-authored-by: Jonathan A. Sternberg <jonathan@influxdata.com>
The anti-entropy service will loop trying to copy an empty shard to a
data node missing that shard. This fix is one of two changes that
correctly create an empty shard on a new node. This fix will set the
LastModified date of an empty shard directory to the modification time
of that directory, instead of to the Unix epoch.
Fixes: https://github.com/influxdata/influxdb/issues/21273
This is a backport of #14262 to the 1.x storage engine.
This also ports the table tests that existed with the pre-beta version of the
storage engine to the one that is now used in the production version.
A few of the tests are skipped. These are portions of the storage engine
that have not been ported over. They should be unskipped when that
functionality is ported over.
Co-authored-by: Jonathan A. Sternberg <jonathan@influxdata.com>
* feat: add cursors and readers for window aggregates
* fix: backport fix + tests for race condition in flux tag cache
* test: port 2.x test for array_cursor
This fixes multi-measurement queries that go through the storage service
so that they correctly pick up all series that match the filter. Previously,
negative queries such as `!=`, `!~`, and predicates attempting to match
empty tags did not work correctly with the storage service when multiple
measurements or `OR` conditions were included.
This was because these predicates would be categorized as "multiple
measurements" and then it would attempt to use the field keys iterator
to find the fields for each measurement. The meta queries for these did
not correctly account for negative equality operators or empty tags when
finding appropriate measurements and those could not be changed because
it would cause a breaking change to influxql too.
This modifies the storage service to use new methods that correctly
account for the above situations rather than the field keys iterator.
Some queries that appeared to be single measurement queries also get
considered as multiple measurement queries. Any query with an `OR`
condition will be considered a multiple measurement query.
This bug did not apply to single measurement queries where one
measurement was selected and all of the logical operators were `AND`
values. This is because it used a different code path that correctly
handled these situations.
Backport of #19566.
(cherry picked from commit ceead88bd5)
Co-authored-by: Jonathan A. Sternberg <jonathan@influxdata.com>
* fix: Change from RewriteExpr to PartitionExpr
Also remove some dead code
* feat: WITH KEY implementation
* feat: query rewriting for WITH KEY in SHOW TAG KEYS
Meta queries (SHOW TAG VALUES, SHOW TAG KEYS, SHOW SERIES CARDINALITY, etc.) do not respect
the QueryTimeout config parameter. Meta queries should check the query context when possible
to allow cancellation and timeout. These checks will not be as frequent as in
regular queries, which use iterators, because meta queries return data in batches.
Add a context.Context to
(*Store).MeasurementNames()
(*Store).MeasurementsCardinality()
(*Store).SeriesCardinality()
(*Store).TagValues()
(*Store).TagKeys()
(*Store).SeriesSketches()
(*Store).MeasurementsSketches()
which is checked for timeout or cancellation
to limit the time spent in meta queries.
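A minimal sketch of the per-batch check involved (the helper is illustrative):

```go
import "context"

// checkCtx polls the query context between batches so a QueryTimeout
// or cancellation is honored without per-value overhead.
func checkCtx(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded or context.Canceled
	default:
		return nil
	}
}
```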
https://github.com/influxdata/influxdb/issues/20736
* feat(query): hyper log log counting in query engine
In addition to helping with normal queries, this can improve the 'SHOW CARDINALITY'
meta-queries:
```
time influx -database mydb -execute 'select count_hll(sum_hll(_seriesKey)) from big'
name: big
time count_hll
---- ---------
0    200767781
influx -database mydb -execute  0.06s user 0.12s system 0% cpu 8:49.99 total
```
Extending the context instead of fixing the API breaks type safety.
For tracking the number of points / values written, it is much clearer
to pass an explicit tracker.
When using queries like `select count(_seriesKey) from bigmeasurement`, we
should iterate over the tsi structures to serve the query instead of loading
all the series into memory up front.
Closes #20543
Frequent writes to fields.idx cause lock contention, and fields.idx is
recreated when a field or measurement is added in a
WritePointsWithContext() call.
This change eliminates locking during the actual file rewrite, limiting it to
the times when the MeasurementFieldSet is actually being read or written
in memory and when the new file is being renamed.
Tests verify correct behavior by checking that the fields.idx
file matches the in-memory copy after heavily parallel measurement addition.
Fixes https://github.com/influxdata/influxdb/issues/20500
When a SELECT INTO query generates an illegal value that cannot be inserted,
like +/- Inf, it should return an error, rather than failing silently.
This adds a boolean parameter to the [data] section of influxdb.conf:
* strict-error-handling
When false (the default), the old behavior is preserved. When true,
unsupported values will return an error from SELECT INTO queries.
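For illustration, enabling the new parameter in influxdb.conf would look like this (assuming the standard TOML syntax of that file):

```toml
[data]
  # When true, SELECT INTO queries return an error on unsupported
  # values such as +/- Inf instead of failing silently.
  strict-error-handling = true
```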
Fixes https://github.com/influxdata/influxdb/issues/20426
Loop with backoff in (*Engine).CreateSnapshot() to retry
(*Engine).WriteSnapshot() up to 3 times if
ErrSnapshotInProgress is returned. Then continue
on no error, or on ErrSnapshotInProgress if skipCacheOk is
true.
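A minimal sketch of the retry loop; the helper name, backoff delays, and stand-in error value are illustrative, not the exact engine code:

```go
import (
	"errors"
	"time"
)

// errSnapshotInProgress stands in for the engine's error value.
var errSnapshotInProgress = errors.New("snapshot in progress")

// writeSnapshotWithRetry retries the snapshot up to 3 times, backing
// off between attempts. If skipCacheOk is true, a final
// errSnapshotInProgress is tolerated so the caller can proceed.
func writeSnapshotWithRetry(write func() error, skipCacheOk bool) error {
	backoff := 100 * time.Millisecond // illustrative starting delay
	var err error
	for i := 0; i < 3; i++ {
		if err = write(); !errors.Is(err, errSnapshotInProgress) {
			return err // nil, or an unrelated error
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	if skipCacheOk {
		return nil // proceed without a cache snapshot
	}
	return err
}
```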
https://github.com/influxdata/plutonium/issues/3227
(cherry picked from commit dfa6aa8cea)
Test the skipCacheOk flag to tsdb.Shard.CreateSnapshot() and
tsdb.Engine.CreateSnapshot()
A value of true allows the backup to proceed even if a cache
snapshot cannot be taken.
https://github.com/influxdata/plutonium/issues/3227
This fix adds a skipCacheOk flag to
tsdb.Store.CreateShardSnapshot() and tsdb.Shard.CreateSnapshot()
to pass to tsdb.Engine.CreateSnapshot()
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path
This flag is set to false in tsm1.Engine.Export()
https://github.com/influxdata/plutonium/issues/3227
When an InfluxDB database is very busy writing new points, the backup
process can fail because it cannot write a new snapshot.
The error is: `operation timed out with error: create snapshot: snapshot in progress`.
This happens because InfluxDB snapshots the cache almost continuously
due to the high number of points being ingested.
The fix for this was https://github.com/influxdata/influxdb/pull/16627
but it was for OSS only, and was not in the code path for backups
in clusters.
This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot().
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path
and in tsdb.Shard.CreateSnapshot(), the cluster backup code path.
This flag is set to false in tsm1.Engine.Export()
https://github.com/influxdata/plutonium/issues/3227
This feature allows compaction to be disabled on a per-shard basis by
creating a file named do_not_compact in a shard's directory. When
disabled, a message is logged every 15 minutes with the reason for
compaction being disabled (the existence of the file). This makes it easy to
know if compaction has been disabled for any shards by searching the log
for "compaction disabled" or running "find path/to/data -type f -name
do_not_compact".
* Remove redundant type in slice/array declarations.
* Call t.Fatal() from test-functions, not non-test go-routines.
* Remove unnecessary empty value operator from ranges.
* Call defer .Close() methods only after checking for error on Open().
This patch protects an internal map for concurrent use.
The (*LogFile).Writes() method calls
(*LogFile).createMeasurementIfNotExists() which writes to a shared map.
(*LogFile).Writes() acquires a read-lock which leaves
createMeasurementIfNotExists() open to concurrent writes to its shared
map.
This commit adds the ExecEntries method to the *LogFile type so that we
can properly lock calls to (*LogFile).appendEntry() using defer.
(*LogFile).ExecEntries() mostly replaces the body of
(*LogFile).Writes() and incurs another function call, since ExecEntries()
can't be inlined. Below is the output of a build with "-m -m -m" gcflags:
./log_file.go:1076:6: cannot inline (*LogFile).ExecEntries: unhandled op DEFER
The performance impact of the additional function call should be
negligible and is outweighed by the safety and simplicity of using
defer.
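A minimal sketch of the pattern, with assumed lowercase names standing in for the real types; holding the write lock for the whole batch via defer keeps the append calls serialized on every exit path:

```go
import "sync"

type logEntry struct{ /* ... */ }

type logFile struct {
	mu      sync.RWMutex
	entries []logEntry
}

// execEntries appends a batch of entries under one write lock; defer
// guarantees the unlock on every exit path, at the cost of preventing
// inlining.
func (f *logFile) execEntries(entries []logEntry) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, e := range entries {
		f.appendEntry(e)
	}
}

func (f *logFile) appendEntry(e logEntry) { f.entries = append(f.entries, e) }
```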
The original version of verifyVersion() reads into a byte slice,
manually ensures its byte order, then converts it to a type comparable
with Version and MagicNumber.
This patch hides those details by calling binary.Read() and reading
values into properly typed variables.
This adds a bit of overhead but this code isn't in the hot-path and this
patch greatly simplifies the code.
verifyVersion() originally accepted an io.ReadSeeker. It is only called
in one place, and that caller immediately calls Seek() after
verifyVersion(); therefore it is probably safe to call Seek() BEFORE
verifyVersion().
The benefit is that verifyVersion() is easier to test since we can pass
it a bytes.Buffer.
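A minimal sketch of the simplified approach; the constants are placeholders standing in for the package's real MagicNumber and Version, and any io.Reader (e.g., a bytes.Buffer in tests) will do:

```go
import (
	"encoding/binary"
	"fmt"
	"io"
)

// Placeholder constants standing in for the package's MagicNumber and
// Version values.
const (
	magicNumber uint32 = 0x16D116D1
	version     byte   = 1
)

// verifyVersion reads typed values with binary.Read instead of slicing
// bytes by hand and fixing the byte order manually.
func verifyVersion(r io.Reader) error {
	var m uint32
	if err := binary.Read(r, binary.BigEndian, &m); err != nil {
		return fmt.Errorf("read magic number: %v", err)
	}
	if m != magicNumber {
		return fmt.Errorf("invalid magic number %#08x", m)
	}
	var v byte
	if err := binary.Read(r, binary.BigEndian, &v); err != nil {
		return fmt.Errorf("read version: %v", err)
	}
	if v != version {
		return fmt.Errorf("unsupported version %d", v)
	}
	return nil
}
```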
This patch adds a test for verifyVersion() as well as a benchmark.
```
benchmark                  old ns/op  new ns/op  delta
BenchmarkVerifyVersion-8   73.5       123        +67.35%
```
Finally, this commit moves verifyVersion() from writer.go to reader.go
which is where it is actually used.
* feat(engine/tsm1): Add WritePointsWithContext()
Add WritePointsWithContext() and make WritePoints() a thin wrapper for
it.
The purpose is to add statistics context values that we'll use to
propagate the number of fields and points written back up the call
chain.
* feat(tsdb): Add WriteToShardWithContext()
When applied, this patch adds WriteToShardWithContext() and wraps it
with WriteToShard() to preserve the API.
The purpose of this addition is to propagate a context.Context value
to Shard.WritePointsWithContext().
* feat(tsdb/shard): Add WritePointsWithContext()
The purpose of adding WritePointsWithContext() is to propagate context
values down to engine code and to propagate statistics via context
values back up to callers.
This patch also adds values written statistics to the shard.
* feat(http): Gather values written stats
WritePointsWithContext() was added to propagate context values down to
the engine and communicate stats to the caller.
* refactor: Change MetricKey to ContextKey
This patch gives the type we're using for context keys a better name.
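A minimal, hypothetical sketch of the typed-key pattern this rename supports; the key and function names are illustrative, not the actual patch:

```go
import "context"

// contextKey is an unexported type, so keys defined here can never
// collide with context keys from other packages.
type contextKey string

const pointsWrittenKey contextKey = "points-written"

// withPointsTracker attaches an *int64 the engine increments as it writes.
func withPointsTracker(ctx context.Context, n *int64) context.Context {
	return context.WithValue(ctx, pointsWrittenKey, n)
}

// pointsTracker retrieves the tracker, if one was attached.
func pointsTracker(ctx context.Context) (*int64, bool) {
	n, ok := ctx.Value(pointsWrittenKey).(*int64)
	return n, ok
}
```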
When applied, this patch will:
* log snapshot directory removal errors
Prior to this patch, errors when removing temporary snapshot
directories happened silently.
This patch ensures that errors are logged when os.RemoveAll() fails.
* refactor tsm1: Declare error value in condition
Saves a line of code and limits the scope of an error value.
* refactor tsm1: Add MakeSnapshotLinks()
This commit adds (*FileStore).MakeSnapshotLinks(). The code in this
function was originally part of CreateSnapshot().
That code was hoisted out and into MakeSnapshotLinks() because there
are two points of failure that require cleanup -- we have to delete a
temporary directory on failure.
Placing the code in one function allows us to check its returned error
value and perform cleanup in only one place.
In short, we hoisted code out of CreateSnapshot() to simplify error
handling.
On error, we remove any directories we created.
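A rough sketch of the consolidated error handling, with assumed names rather than the actual FileStore code; all link creation happens inside one call so a single failure check can remove the temporary directory:

```go
import "os"

// makeLinksOrCleanup runs the hoisted link-creation step and removes
// the temporary snapshot directory if any part of it fails.
func makeLinksOrCleanup(tmpDir string, makeLinks func(dir string) error) error {
	if err := makeLinks(tmpDir); err != nil {
		os.RemoveAll(tmpDir) // single cleanup site for both failure points
		return err
	}
	return nil
}
```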
This commit changes the SeriesIDSet merge/union/intersect functions
to attach the underlying iterators as closers so that files can be
retained until the data is no longer in use. The roaring operations
can leave containers pointing at mmap data in the resulting bitmap
so we have to track underlying file usage until the data is finished
with.
This commit changes `DefaultSeriesIDSetCacheSize` to zero so that the
tag value cache is disabled by default. There is a rare known bug where
the cache can cause a segfault that crashes the process. The cache
is being disabled instead of removed as some users may still need the
cache for performance reasons.
This commit quiets staticcheck's warnings about "unnecessary use of
fmt.Sprintf" and "unnecessary use of fmt.Sprint".
Prior to this commit we were wrapping simple constant strings without
any formatting verbs with fmt.Sprintf().
* fix(tsdb): address staticcheck ST1006
This patch addresses staticcheck warning "receiver name should not be an
underscore, omit the name if it is unused (ST1006)" for 6 methods.
Before this commit, the to and from variables were being re-declared in
a block in such a way that the values were not being used.
This patch uses regular assignment so that the values are visible
outside of the block where they're set.
Closes: #18128
* fix: verify precision parameter in write requests
This change updates the HTTP endpoints that service v1 and v2 writes to
verify the values passed in the precision parameter.
* fix(tsm1): Fix temp directory search bug
The original code's intention is to scan a directory for the
subdirectory whose name has the highest value when converted to an integer.
So directories may be in the form:
0.tmp
1.tmp
2.tmp
30.tmp
...
100.tmp
The loop should scan the directory, strip the basename and extension
from each file name to leave just a number, then keep the highest number
it finds.
Before this patch, a bug caused the code to store the highest value
only when there was an error converting the numeric value into an
integer.
This patch primarily fixes that logic.
In addition, this patch will save an indent level by inverting logic in
two places:
Instead of checking whether a file is a directory and has a suffix of ".tmp",
it is probably better to test whether a file is NOT a directory OR does NOT
have an extension of ".tmp", and continue if so.
Also, instead of testing if len(ss) == 2, we can test if len(ss) != 2 and
continue if so.
Both of these save an indent level and keep our "happy path" to the
left.
Finally, this patch uses string concatenation instead of calling
fmt.Sprintf() to add periods to the "tmp" and "tsm" extensions.
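A minimal sketch of the corrected scan, with the early-continue style and string concatenation the patch describes; the function name is illustrative:

```go
import (
	"os"
	"strconv"
	"strings"
)

// maxTmpGeneration returns the highest N found among N.tmp directories.
func maxTmpGeneration(dir string) (int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	maxN := 0
	ext := "." + "tmp" // string concatenation instead of fmt.Sprintf
	for _, fi := range entries {
		if !fi.IsDir() || !strings.HasSuffix(fi.Name(), ext) {
			continue // keep the happy path to the left
		}
		ss := strings.Split(fi.Name(), ".")
		if len(ss) != 2 {
			continue
		}
		if n, err := strconv.Atoi(ss[0]); err == nil && n > maxN {
			maxN = n // store the highest value only on successful conversion
		}
	}
	return maxN, nil
}
```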
Co-authored-by: David Norton <dgnorton@gmail.com>
fixes #17440
While encoding or decoding corrupt data, the current behaviour is to `panic`.
This commit replaces the `panic` with an `error` to be propagated up to the calling iterator.
To avoid overwriting other errors, iterators now wrap a `TSMErrors` value which contains ALL the
encountered errors. TSMErrors itself implements `Error()`; the returned string contains all the
error messages, separated by a "," delimiter.
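A minimal sketch of what such a multi-error type can look like; the real TSMErrors definition may differ:

```go
import "strings"

// TSMErrors collects every error hit while iterating so that none is
// overwritten by a later one.
type TSMErrors []error

// Error joins all collected messages with a "," delimiter.
func (e TSMErrors) Error() string {
	msgs := make([]string, 0, len(e))
	for _, err := range e {
		msgs = append(msgs, err.Error())
	}
	return strings.Join(msgs, ",")
}
```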
We were sometimes seeing segfaults in Roaring bitmaps under very
high load with networked drives. This may reduce the risk of a segfault by
forcing marshalling to copy the data.
* fix: access tsi active log file with READ lock
The activeLogFile pointer may be altered by another goroutine, so the READ
lock is needed.
* Merge pull request #16384 from foobar/tsi-partition-lock
fix: access tsi active log file with READ lock
Co-authored-by: Tristan Su <sooqing@gmail.com>
Co-authored-by: David Norton <dgnorton@gmail.com>
When an InfluxDB database is very busy writing new points, the backup
process can fail because it cannot write a new snapshot.
The error is: `operation timed out with error: create snapshot: snapshot in progress`.
This happens because InfluxDB snapshots the cache almost continuously
due to the high number of points being ingested.
This PR skips snapshots if the `snapshotter` does not become available
after three attempts when a backup is requested.
The backup won't contain the data in the cache or WAL.
Signed-off-by: Gianluca Arbezzano <gianarb92@gmail.com>
Prior to this change, new series would be added to the series file
before checking the series cardinality limit. If the limit was exceeded,
the write was rejected even though the series had already been added to
the series file.
This commit prevents multiple blocks for the same series key from having
their values truncated when they are being read into an empty buffer.
The current cursor reader code has an optimisation that incorrectly
assumes the incoming array will be limited to 1,000 values (the maximum
block size), but arrays can contain values from multiple matching
blocks.
* fix(storage): skip TSM files with block read errors
When we find a bad TSM file during compaction, propagate the error up and move
the bad file aside. The engine will disregard the file so the next compaction
will not hit the same error.
This change adds a lock around digest creation so that it is safe for
concurrent calls. Prior to this change, calls from multiple goroutines
resulted in "Digest aborted, problem renaming tmp digest" errors.
Fixes #15859
This commit fixes a defect in the TSI index where a filter using the
negated equality operator would result in no matching series being
returned for series stored within the `IndexFile` portions of the index.
The root cause of this was due to missing legacy-handling code in the
index for this particular iterator.
This upgrades the flux version to v0.50.2.
The secret service, which is used for alerts, is not included. The
`to()` function is also still not included.
Fixes #10052
This commit fixes an issue where field keys would reappear in results
when querying previously dropped measurements.
The issue manifests itself when duplicates of a new series are inserted
into the `inmem` index. In this case, a map that tracks the number of
series belonging to a measurement was incorrectly incremented once for
each duplication of the series. Then, when it came time to drop the
measurement, the index assumed there were several series belonging to
the measurement left in the index (because the counter was higher than
it should be). The result of that was that the `fields.idx` file (which
stores a mapping between measurements and field keys) was not truncated
and rebuilt. This left old field keys in that file, which were then
returned in subsequent queries over all field keys.
The flux in influxdb has been upgraded to use v0.33.2. A lot of
interfaces for the storage engine were changed during this, so code had
to change to accommodate the new interfaces and remove the old ones.
Included in this commit is a patch file for the changes that were made.
A patch was generated for the following packages:
* `flux/stdlib/influxdata/influxdb`
* `storage/reads`
* `tsdb/cursors`
These are the three packages that are in common with version 2 of the
database and the first of these packages contains the specific
implementations that are used for version 1.
It is very possible that the next time we upgrade this, the patch will
not apply cleanly just like it wouldn't have applied cleanly to this
update. The patch is mostly meant to document exactly what changed
during the copy over to help ensure we don't forget things when adapting
the interfaces.
Add a patch file to hopefully make this easier in the future
StringArrayEncodeAll will panic if the total length of strings
contained in the src slice is > 0xffffffff. This change adds a unit
test to replicate the issue and an associated fix to return an error.
This also raises an issue that compactions will be unable to make
progress under the following condition:
* multiple string blocks are to be merged to a single block and
* the total length of all strings exceeds the maximum block size that
snappy will encode (0xffffffff)
The observable effect of this is errors in the logs indicating a
compaction failure.
Fixes #13687
This integrates the influxdb 1.x series to the latest version of Flux
and updates the code to use it. It also removes the dependency on
platform and copies the necessary code from storage into the 1.x series
so the dependency is unneeded.
The flux functions specific to 1.x have been moved to the same structure
that flux changed to with having a `stdlib` directory instead of a
`functions` directory. It also adds a `databases()` function that
returns the databases from the meta client.
Previously it was possible to set IDs on a `nil` entry which would
in turn cause a panic. If this panic was recovered by the server
then it would result in a mutex in the `inmem` index staying locked
indefinitely.
We're not allowed to access the s.epochs map without holding the
mutex against shard creation and deletion, so create a copy of
all of the epoch trackers we will need while we hold the mutex.
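A minimal sketch of the copy-under-lock pattern described here, with assumed types:

```go
import "sync"

type epochTracker struct{ /* ... */ }

type store struct {
	mu     sync.RWMutex
	epochs map[uint64]*epochTracker // guarded by mu
}

// epochsForShards snapshots the trackers we need while holding the
// mutex, so later use does not race with shard creation or deletion.
func (s *store) epochsForShards(ids []uint64) map[uint64]*epochTracker {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[uint64]*epochTracker, len(ids))
	for _, id := range ids {
		if t, ok := s.epochs[id]; ok {
			out[id] = t
		}
	}
	return out
}
```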
Scanner objects and iterators often need a ValuerEval. This
object is created, often with a function call, and has at
least one interface in it, so it allocates storage. Then it's
dropped again right away. The only part of it that might be
subject to change is usually a map. While the map's contents
change over time, the actual map doesn't change for the
lifetime of the object.
So, in both iterators and scanners, stash the ValuerEval
and continue reusing it. On a query returning a fair number
of data points, this produces a small (<5% in practice)
improvement in observed performance, visible as a significant
reduction in time spent in runtime (mallocgc, newobject,
etcetera).
The performance improvement isn't big, but it's reasonably
easy to evaluate it and establish that it's a safe change
to make.
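A minimal sketch of the reuse pattern using the influxql package's ValuerEval; the scanner shape here is an illustrative assumption, not the actual iterator code:

```go
import "github.com/influxdata/influxql"

// scanner stashes one ValuerEval for its lifetime instead of building
// a fresh one (and its interface allocations) per data point.
type scanner struct {
	fields influxql.MapValuer // contents change; the map itself does not
	valuer influxql.ValuerEval
}

func newScanner() *scanner {
	s := &scanner{fields: make(influxql.MapValuer)}
	s.valuer = influxql.ValuerEval{Valuer: s.fields}
	return s
}

// eval updates the reused map in place and evaluates expr with the
// stashed ValuerEval.
func (s *scanner) eval(expr influxql.Expr, name string, v interface{}) interface{} {
	s.fields[name] = v
	return s.valuer.Eval(expr)
}
```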
Signed-off-by: seebs <seebs@seebs.net>
In the case of caching TSI bitmaps belonging to immutable .tsi files,
the underlying bitset data can be mmapped. It is possible, though rare,
for this data to be unmapped (e.g., via a TSI compaction) but for the
cached bitmap to be subsequently read. This leads to a segfault.
This only happens when copy-on-write is set to true on the roaring
bitmap, because in that case only the internal pointers are cloned.
This change will reduce TSI cache performance by around 10%, which I
have deemed acceptable since it typically amounts to only a few microseconds.
This commit adds a config option to the tsdb Config allowing the size of
the bitset cached in the TSI index to be specified.
Setting the cache size to 0 will disable the cache.
This commit limits the number of files that can be compacted in
a single group when forcing a full compaction or when a shard
becomes cold. This is to prevent too many files being compacted
at the same time.