* fix(tsi1/partition/test): fix data races in test code (#57)
* fix(tsi1/partition/test): fix data races in test code
This PR is like influxdata/influxdb#24613 but solves it with a setter
method for MaxLogFileSize which allows unexporting that value and
MaxLogFileAge. There are actually two places locks were needed in test
code. The behavior of production code is unchanged.
(cherry picked from commit f0235c4daf4b97769db932f7346c1d3aecf57f8f)
* feat: modify error handling to be more idiomatic
closes https://github.com/influxdata/influxdb/issues/24042
* fix: errors.Join() filters nil errors
closes https://github.com/influxdata/influxdb/issues/25341
---------
Co-authored-by: Phil Bracikowski <13472206+philjb@users.noreply.github.com>
(cherry picked from commit 5c9e45f033)
* fix(tsi1/partition/test): fix data race in test code
The TestPartition_Compact_Write_Fail test was not locking the partition
before changing the value of MaxLogFileSize. This PR exports the
partition's mutex so that the test can access and lock it. Alternatives
that keep the mutex hidden, such as a setter method, require more
changes.
* fixes #24042, for #24040
* chore: complete renaming of mutex in file and fix flux test
The flux test was also failing because it used a relative time range.
* test: use `T.TempDir` to create temporary test directory
This commit replaces `os.MkdirTemp` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, a temporary directory created using `os.MkdirTemp`
had to be removed manually by calling `os.RemoveAll`, which was omitted
in some tests. The error handling boilerplate, e.g.
    defer func() {
        if err := os.RemoveAll(dir); err != nil {
            t.Fatal(err)
        }
    }()
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestSendWrite on Windows
=== FAIL: replications/internal TestSendWrite (0.29s)
logger.go:130: 2022-06-23T13:00:54.290Z DEBUG Created new durable queue for replication stream {"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestSendWrite1627281409\\001\\replicationq\\0000000000000001"}
logger.go:130: 2022-06-23T13:00:54.457Z ERROR Error in replication stream {"replication_id": "0000000000000001", "error": "remote timeout", "retries": 1}
testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestSendWrite1627281409\001\replicationq\0000000000000001\1: The process cannot access the file because it is being used by another process.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestStore_BadShard on Windows
=== FAIL: tsdb TestStore_BadShard (0.09s)
logger.go:130: 2022-06-23T12:18:21.827Z INFO Using data dir {"service": "store", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestStore_BadShard1363295568\\001"}
logger.go:130: 2022-06-23T12:18:21.827Z INFO Compaction settings {"service": "store", "max_concurrent_compactions": 2, "throughput_bytes_per_second": 50331648, "throughput_bytes_per_second_burst": 50331648}
logger.go:130: 2022-06-23T12:18:21.828Z INFO Open store (start) {"service": "store", "op_name": "tsdb_open", "op_event": "start"}
logger.go:130: 2022-06-23T12:18:21.828Z INFO Open store (end) {"service": "store", "op_name": "tsdb_open", "op_event": "end", "op_elapsed": "77.3µs"}
testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestStore_BadShard1363295568\002\data\db0\rp0\1\index\0\L0-00000001.tsl: The process cannot access the file because it is being used by another process.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestPartition_PrependLogFile_Write_Fail and TestPartition_Compact_Write_Fail on Windows
=== FAIL: tsdb/index/tsi1 TestPartition_PrependLogFile_Write_Fail/write_MANIFEST (0.06s)
testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestPartition_PrependLogFile_Write_Failwrite_MANIFEST656030081\002\0\L0-00000003.tsl: The process cannot access the file because it is being used by another process.
--- FAIL: TestPartition_PrependLogFile_Write_Fail/write_MANIFEST (0.06s)
=== FAIL: tsdb/index/tsi1 TestPartition_Compact_Write_Fail/write_MANIFEST (0.08s)
testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestPartition_Compact_Write_Failwrite_MANIFEST3398667527\002\0\L0-00000003.tsl: The process cannot access the file because it is being used by another process.
--- FAIL: TestPartition_Compact_Write_Fail/write_MANIFEST (0.08s)
We must close the open file descriptor otherwise the temporary file
cannot be cleaned up on Windows.
Fixes: 619eb1cae6 ("fix: restore in-memory Manifest on write error")
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestReplicationStartMissingQueue on Windows
=== FAIL: TestReplicationStartMissingQueue (1.60s)
logger.go:130: 2023-03-17T10:42:07.269Z DEBUG Created new durable queue for replication stream {"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestReplicationStartMissingQueue76668607\\001\\replicationq\\0000000000000001"}
logger.go:130: 2023-03-17T10:42:07.305Z INFO Opened replication stream {"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestReplicationStartMissingQueue76668607\\001\\replicationq\\0000000000000001"}
testing.go:1206: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestReplicationStartMissingQueue76668607\001\replicationq\0000000000000001\1: The process cannot access the file because it is being used by another process.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: update TestWAL_DiskSize
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestWAL_DiskSize on Windows
=== FAIL: tsdb/engine/tsm1 TestWAL_DiskSize (2.65s)
testing.go:1206: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestWAL_DiskSize2736073801\001\_00006.wal: The process cannot access the file because it is being used by another process.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
---------
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* TSI index should compact old or too-large log files
* Old tsl files should be compacted without new writes
* Add extra logging when disk size test fails
Co-authored-by: Sam Arnold <sarnold@influxdata.com>
This commit ensures that cached bitset results at the Index level are
updated whenever new series ids are created that would belong in those
bitsets.
For example, if we have a cached bitset for the tuple {mem, region,
west}, and we add the series mem,host=prod,region=west then we would
update the cached bitset for {mem, region, west} with the series id of
the newly written series.
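The update rule in that example can be sketched roughly as follows. This is an assumption-laden illustration: `seriesIDCache` and `tagKey` are hypothetical names, a plain Go map stands in for the real bitset type, and the sketch is not safe for concurrent use:

```go
package main

// tagKey identifies one cached result: {measurement, tag key, tag value}.
type tagKey struct{ name, key, value string }

// seriesIDCache is a hypothetical cache of series ID sets. A map is used
// here in place of a real bitset purely to keep the sketch short.
type seriesIDCache struct {
	sets map[tagKey]map[uint64]struct{}
}

func newSeriesIDCache() *seriesIDCache {
	return &seriesIDCache{sets: make(map[tagKey]map[uint64]struct{})}
}

// put stores a cached result for the tuple k.
func (c *seriesIDCache) put(k tagKey, ids ...uint64) {
	set := make(map[uint64]struct{}, len(ids))
	for _, id := range ids {
		set[id] = struct{}{}
	}
	c.sets[k] = set
}

// addSeries adds a newly created series ID to every cached set whose
// tuple matches one of the series' tags. Tuples that are not already
// cached are left alone; they are not created here.
func (c *seriesIDCache) addSeries(name string, tags map[string]string, id uint64) {
	for k, v := range tags {
		if set, ok := c.sets[tagKey{name: name, key: k, value: v}]; ok {
			set[id] = struct{}{}
		}
	}
}

// contains reports whether id is in the cached set for k.
func (c *seriesIDCache) contains(k tagKey, id uint64) bool {
	_, ok := c.sets[k][id]
	return ok
}
```

In the example from the text, writing `mem,host=prod,region=west` would add the new ID to the cached `{mem, region, west}` set and leave the uncached `{mem, host, prod}` tuple untouched.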
This commit removes the HLL sketches on each `tsi1.LogFile` and
`tsi1.IndexFile` and instead caches the data at the `tsi1.Index`
level. This reduces the heap size significantly for servers with
many TSI-enabled shards.
When adding many series using offline tooling, it's likely that every
series involves an entry being appended to a LogFile. Typically an entry
is 11 or 12 bytes, but the default bufio.Writer buffer size is only 4K.
This means by default a write of 10,000 new series would involve ~30
buffer flushes.
This commit makes the buffer configurable, and sets the value in
`buildtsi` such that it reflects the number of series being written to
the LogFile.
When running offline tooling, flushing buffers and syncing files on
every write to a `LogFile` is not necessary. Were a hard exit
with data loss to occur, the tooling can simply be run again.
TSI LogFile compactions occasionally race with insert and delete
operations because the index partition FileSet is retained needlessly by
the method that calls Partition.CheckLogFile.
In this change:
- TSI LogFile compaction respects enable/disable compactions
- The Partition FileSet is released before log compaction is triggered
An alternative to the second step is to handle log file compaction in a
new goroutine. Log file compaction errors would be logged and not
returned to the caller.
After this change, `DELETE FROM /regex/` does not deadlock; performance:
- 30s to delete 100 measurements
- 5m30s to delete 1000 measurements
This commit adds the `max-index-log-file-size` configuration flag so
that users can restrict the maximum size of log files before compaction.
The default limit was also lowered from `5MB` to `1MB`. The original
size was set before we partitioned the index so the change reflects this.
This commit adds initial empty sketches back to the tsi1 index, as well
as ensuring that ephemeral sketches in the index `LogFile` are updated
accordingly.
The commit also adds a test that verifies that the merged sketches at
the store level produce the correct results under writes, deletions and
re-opening of the store.
This commit does not provide working sketches for post-compaction on the
tsi1 index.
This separates out the dropping of a measurement from the series
to avoid frequent checks to see if a measurement still has series.
The series are dropped individually and we keep track of which
measurements are involved and then delete each measurement afterwards.