* fix: prevent retention service from hanging (#25055)
Fix issue that can cause the retention service to hang waiting on a
`Shard.Close` call. When this occurs, no other shards will be deleted
by the retention service. This is usually noticed as an increase in
disk usage because old shards are not cleaned up.
The fix adds to new methods to `Store`, `SetShardNewReadersBlocked`
and `InUse`. `InUse` can be used to poll if a shard has active readers,
which the retention service uses to skip over in-use shards to prevent
the service from hanging. `SetShardNewReadersBlocked` determines if
new read access may be granted to a shard. This is required to prevent
race conditions around the use of `InUse` and the deletion of shards.
If the retention service skips over a shard because it is in-use, the
shard will be checked again the next time the retention service is run.
It can be deleted on subsequent checks if it is no longer in-use. If
the shards is stuck in-use, the retention service will not be able to
delete the shards, which can be observed in the logs for manual
intervention. Other shards can still be deleted by the retention service
even if a shard is stuck with readers.
This is a port of ad68ec8 from master-1.x to main-2.x.
closes: #25076
(cherry picked from commit b4bd607eef)
* feat(tsm1/wal): encapsulate expiring WAL files in FileDisposer
This changeset introduces an interface extension point named
FileDisposer to control what to do with WAL files when they are no
longer needed. Currently, the only implementation is to delete the file
which is the existing behavior.
* chore: accumulate errors
Since we're here, capture the previously ignored fs errors and pass up a
combined error (which the only callers log out).
* fix(tsi1/partition/test): fix data race in test code
TestPartition_Compact_Write_Fail test was not locking the partition
before changing the value of MaxLogFileSize. This PR exports the mutex
of the partition to allow the test to access it and lock. Alternatives
require more changes such as a Setter method if we need to hide the
mutex.
* fixes#24042, for #24040
* chore: complete renaming of mutex in file and fix flux test
The flux test is another failing test because it was using a relative
time range.
HTTP 5XX errors were being returned incorrectly from
BoltDB errors that were actually bad requests, e.g.,
names that were too long for buckets, users, and
organizations. Map BoltDB errors to correct Influx
errors and return 4XX errors where appropriate. Also
add op codes to more errors
Some series files which are smaller than the standard
sizes cause SIGBUS in influx_inspect and influxd, because
entry iteration walks onto mapped memory not backed by the
the file. Avoid walking off the end of the file while
iterating series entries in oddly sized files.
closes https://github.com/influxdata/influxdb/issues/24508
Co-authored-by: Geoffrey Wossum <gwossum@influxdata.com>
(cherry picked from commit 969abf3da2)
closes https://github.com/influxdata/influxdb/issues/24511
* test: use `T.TempDir` to create temporary test directory
This commit replaces `os.MkdirTemp` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `os.MkdirTemp`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestSendWrite on Windows
=== FAIL: replications/internal TestSendWrite (0.29s)
logger.go:130: 2022-06-23T13:00:54.290Z DEBUG Created new durable queue for replication stream {"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestSendWrite1627281409\\001\\replicationq\\0000000000000001"}
logger.go:130: 2022-06-23T13:00:54.457Z ERROR Error in replication stream {"replication_id": "0000000000000001", "error": "remote timeout", "retries": 1}
testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestSendWrite1627281409\001\replicationq\0000000000000001\1: The process cannot access the file because it is being used by another process.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestStore_BadShard on Windows
=== FAIL: tsdb TestStore_BadShard (0.09s)
logger.go:130: 2022-06-23T12:18:21.827Z INFO Using data dir {"service": "store", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestStore_BadShard1363295568\\001"}
logger.go:130: 2022-06-23T12:18:21.827Z INFO Compaction settings {"service": "store", "max_concurrent_compactions": 2, "throughput_bytes_per_second": 50331648, "throughput_bytes_per_second_burst": 50331648}
logger.go:130: 2022-06-23T12:18:21.828Z INFO Open store (start) {"service": "store", "op_name": "tsdb_open", "op_event": "start"}
logger.go:130: 2022-06-23T12:18:21.828Z INFO Open store (end) {"service": "store", "op_name": "tsdb_open", "op_event": "end", "op_elapsed": "77.3µs"}
testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestStore_BadShard1363295568\002\data\db0\rp0\1\index\0\L0-00000001.tsl: The process cannot access the file because it is being used by another process.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestPartition_PrependLogFile_Write_Fail and TestPartition_Compact_Write_Fail on Windows
=== FAIL: tsdb/index/tsi1 TestPartition_PrependLogFile_Write_Fail/write_MANIFEST (0.06s)
testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestPartition_PrependLogFile_Write_Failwrite_MANIFEST656030081\002\0\L0-00000003.tsl: The process cannot access the file because it is being used by another process.
--- FAIL: TestPartition_PrependLogFile_Write_Fail/write_MANIFEST (0.06s)
=== FAIL: tsdb/index/tsi1 TestPartition_Compact_Write_Fail/write_MANIFEST (0.08s)
testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestPartition_Compact_Write_Failwrite_MANIFEST3398667527\002\0\L0-00000003.tsl: The process cannot access the file because it is being used by another process.
--- FAIL: TestPartition_Compact_Write_Fail/write_MANIFEST (0.08s)
We must close the open file descriptor otherwise the temporary file
cannot be cleaned up on Windows.
Fixes: 619eb1cae6 ("fix: restore in-memory Manifest on write error")
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestReplicationStartMissingQueue on Windows
=== FAIL: TestReplicationStartMissingQueue (1.60s)
logger.go:130: 2023-03-17T10:42:07.269Z DEBUG Created new durable queue for replication stream {"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestReplicationStartMissingQueue76668607\\001\\replicationq\\0000000000000001"}
logger.go:130: 2023-03-17T10:42:07.305Z INFO Opened replication stream {"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestReplicationStartMissingQueue76668607\\001\\replicationq\\0000000000000001"}
testing.go:1206: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestReplicationStartMissingQueue76668607\001\replicationq\0000000000000001\1: The process cannot access the file because it is being used by another process.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: update TestWAL_DiskSize
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestWAL_DiskSize on Windows
=== FAIL: tsdb/engine/tsm1 TestWAL_DiskSize (2.65s)
testing.go:1206: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestWAL_DiskSize2736073801\001\_00006.wal: The process cannot access the file because it is being used by another process.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
---------
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* fix: improve delete speed when a measurement is part of the predicate
* test: add test for deleting measurement by predicate
* chore: improve error messaging and capturing
* chore: set goland to use the right formatting style
Clamp the value of Store.MeasurementsCardinality so that it can not be less
than 0. This primarily shows up as a negative numMeasurements value in
/debug/vars under some circumstances.
refs #23285
(cherry picked from commit 160cf678d5)
* fix: remove nats for scraper processing
Scrapers now use go channels instead of NATS and interprocess communication.
This should fix#23085 .
Additionally, found and fixed#23106 .
* chore: fix formatting
* chore: fix static check and go.mod
* test: fix some flaky tests
* fix: mark NATS arguments as deprecated
Fixes a race condition in the restore code path that could cause shard data restores to fail. When the race condition occurs, `Error while freeing cold shard resources` appears in the log files.
This is port of PR 22796 from master-1.x to master. Attempts at creating a test case for master failed, so the fix has ported without a corresponding unit test.
fixes#22957