Commit Graph

58 Commits (master)

Author SHA1 Message Date
Eng Zer Jun 903d30d658
test: use `T.TempDir` to create temporary test directory (#23258)
* test: use `T.TempDir` to create temporary test directory

This commit replaces `os.MkdirTemp` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.

Prior to this commit, a temporary directory created using `os.MkdirTemp`
needed to be removed manually by calling `os.RemoveAll`, which was omitted
in some tests. The error-handling boilerplate, e.g.
	defer func() {
		if err := os.RemoveAll(dir); err != nil {
			t.Fatal(err)
		}
	}()
is also tedious, but `t.TempDir` handles this for us nicely.

Reference: https://pkg.go.dev/testing#T.TempDir
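
For illustration (not part of this diff), a minimal sketch of the replacement pattern; the test name and file contents are hypothetical:

	package tsm1_test

	import (
		"os"
		"path/filepath"
		"testing"
	)

	func TestWithTempDir(t *testing.T) {
		// t.TempDir creates a unique directory and registers its removal
		// with the test's cleanup, so no defer/os.RemoveAll is needed.
		dir := t.TempDir()

		path := filepath.Join(dir, "data.txt")
		if err := os.WriteFile(path, []byte("hello"), 0o644); err != nil {
			t.Fatal(err)
		}
	}
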
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestSendWrite on Windows

=== FAIL: replications/internal TestSendWrite (0.29s)
    logger.go:130: 2022-06-23T13:00:54.290Z	DEBUG	Created new durable queue for replication stream	{"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestSendWrite1627281409\\001\\replicationq\\0000000000000001"}
    logger.go:130: 2022-06-23T13:00:54.457Z	ERROR	Error in replication stream	{"replication_id": "0000000000000001", "error": "remote timeout", "retries": 1}
    testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestSendWrite1627281409\001\replicationq\0000000000000001\1: The process cannot access the file because it is being used by another process.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestStore_BadShard on Windows

=== FAIL: tsdb TestStore_BadShard (0.09s)
    logger.go:130: 2022-06-23T12:18:21.827Z	INFO	Using data dir	{"service": "store", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestStore_BadShard1363295568\\001"}
    logger.go:130: 2022-06-23T12:18:21.827Z	INFO	Compaction settings	{"service": "store", "max_concurrent_compactions": 2, "throughput_bytes_per_second": 50331648, "throughput_bytes_per_second_burst": 50331648}
    logger.go:130: 2022-06-23T12:18:21.828Z	INFO	Open store (start)	{"service": "store", "op_name": "tsdb_open", "op_event": "start"}
    logger.go:130: 2022-06-23T12:18:21.828Z	INFO	Open store (end)	{"service": "store", "op_name": "tsdb_open", "op_event": "end", "op_elapsed": "77.3µs"}
    testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestStore_BadShard1363295568\002\data\db0\rp0\1\index\0\L0-00000001.tsl: The process cannot access the file because it is being used by another process.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestPartition_PrependLogFile_Write_Fail and TestPartition_Compact_Write_Fail on Windows

=== FAIL: tsdb/index/tsi1 TestPartition_PrependLogFile_Write_Fail/write_MANIFEST (0.06s)
    testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestPartition_PrependLogFile_Write_Failwrite_MANIFEST656030081\002\0\L0-00000003.tsl: The process cannot access the file because it is being used by another process.
    --- FAIL: TestPartition_PrependLogFile_Write_Fail/write_MANIFEST (0.06s)

=== FAIL: tsdb/index/tsi1 TestPartition_Compact_Write_Fail/write_MANIFEST (0.08s)
    testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestPartition_Compact_Write_Failwrite_MANIFEST3398667527\002\0\L0-00000003.tsl: The process cannot access the file because it is being used by another process.
    --- FAIL: TestPartition_Compact_Write_Fail/write_MANIFEST (0.08s)

We must close the open file descriptor; otherwise, the temporary file
cannot be cleaned up on Windows.
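
A sketch of the shape of the fix (hypothetical test name, assuming the usual os, path/filepath, and testing imports); `t.Cleanup` functions run last-in first-out, so this Close runs before the directory removal registered by `t.TempDir`:

	func TestManifestWrite(t *testing.T) {
		f, err := os.Create(filepath.Join(t.TempDir(), "MANIFEST"))
		if err != nil {
			t.Fatal(err)
		}
		// On Windows an open handle blocks deletion, so close the file
		// before the TempDir cleanup removes the directory.
		t.Cleanup(func() { f.Close() })
	}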

Fixes: 619eb1cae6 ("fix: restore in-memory Manifest on write error")
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestReplicationStartMissingQueue on Windows

=== FAIL: TestReplicationStartMissingQueue (1.60s)
    logger.go:130: 2023-03-17T10:42:07.269Z	DEBUG	Created new durable queue for replication stream	{"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestReplicationStartMissingQueue76668607\\001\\replicationq\\0000000000000001"}
    logger.go:130: 2023-03-17T10:42:07.305Z	INFO	Opened replication stream	{"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestReplicationStartMissingQueue76668607\\001\\replicationq\\0000000000000001"}
    testing.go:1206: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestReplicationStartMissingQueue76668607\001\replicationq\0000000000000001\1: The process cannot access the file because it is being used by another process.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: update TestWAL_DiskSize

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestWAL_DiskSize on Windows

=== FAIL: tsdb/engine/tsm1 TestWAL_DiskSize (2.65s)
    testing.go:1206: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestWAL_DiskSize2736073801\001\_00006.wal: The process cannot access the file because it is being used by another process.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

---------

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2023-03-21 16:22:11 -04:00
Sam Arnold b970e359dc
feat: remaining storage metrics from OSS engine (#22938)
* fix: simplify disk size tracking

* refactor: EngineTags in tsdb package

* fix: fewer compaction buckets and dead code removal

* feat: shard metrics

* chore: formatting

* feat: tsdb store metrics

* feat: retention check metrics

* chore: fix go vet

* fix: review comments
2021-12-02 09:01:46 -05:00
Sam Arnold feb459c785
feat: metrics for cache subsystem (#22915)
* fix: drop complicated cache metrics and document remaining

* feat: metrics for cache
2021-11-23 10:11:22 -05:00
Dane Strandboge ca992e9fff
chore: use io/os over ioutil (#22656) 2021-10-12 16:55:07 -05:00
Daniel Moran 7169df3b51
refactor(tsm1): delete unused Write method on cache (#20890) 2021-03-09 09:09:20 -05:00
Sam Arnold 1068d1de6f
refactor: Remove unused function add and unused variable keysHint (#20803) 2021-02-25 08:31:00 -05:00
Daniel Moran 15b9531273
fix: correct various typos (#19987)
Co-authored-by: kumakichi <xyesan@gmail.com>
2020-11-11 13:54:21 -05:00
Stuart Carnie a24edb2b1c
chore: Skip tests on circleci
This is derived from 2fd8264 and 4f850b5, which skip tests on AppVeyor
2020-08-31 12:14:27 -07:00
Stuart Carnie dee8977d2c
chore: move v2/v1/tsdb → v2/tsdb 2020-08-26 10:46:47 -07:00
Mark Rushakoff f2898d1992 Wipe out workspace in preparation for v2 merge
"Knock knock."

"Who's there?"

"InfluxDB Veet."

...
2019-01-11 10:38:50 -08:00
Jacob Marble 0dc5393441 tsm/cache: Remove unused function parameter 2018-06-13 15:22:37 -07:00
Edd Robinson 90903fa6ed Remove unused code/cleanup engine package 2018-01-20 13:56:45 +00:00
Jason Wilder 92f86b1b8f Fix large memory spikes in cache
The cache defaulted to an entry capacity of 32.  This default
is fine for lower cardinalities, but causes big spikes in InUse
heap with higher cardinalities that can OOM the process.  Since
the hints had to be removed previously due to increased memory usage,
they are now completely removed.  For lower cardinalities, we do
grow the slice, but this has a small performance penalty compared
to the large memory usage/OOMs with larger cardinalities.
2018-01-10 07:56:46 -07:00
Jason Wilder e62f6d7cdf Fix Cache.DeleteRange not deleting all data
This fixes a regression in the Cache introduced in ca40c1ad3c where
not all the values in the cache entry would be removed.  Previously,
calling Exclude did not require the values to be sorted.  The change
in ca40c1ad3c relies on the values being sorted, so it was possible for
it to find the wrong indexes when calling FindRange and leave some
data that should be deleted.

Fixes #9161
2017-11-28 10:39:21 -07:00
Jason Wilder a5afaf7499 Fix cache mem size not including key size 2017-10-03 10:48:14 -06:00
Jason Wilder ddeba2c86b Split large snapshots and write concurrently 2017-09-19 15:27:25 -06:00
Jason Wilder a93a5e9bdf Include the size of the key in the cache size 2017-09-11 15:29:26 -06:00
Jason Wilder b9b648e2a0 Dynamically allocate cache store
The cache store can be memory intensive with many shards.  This
lazily allocates it when needed and frees it when the cache is
empty and cold.
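
A minimal sketch of the idea, with hypothetical types (the real store is more involved; assumes the sync package):

	type cache struct {
		mu    sync.Mutex
		store map[string][]byte
	}

	func (c *cache) write(key string, p []byte) {
		c.mu.Lock()
		defer c.mu.Unlock()
		if c.store == nil {
			c.store = make(map[string][]byte) // allocate lazily on first write
		}
		c.store[key] = append(c.store[key], p...)
	}

	func (c *cache) free() {
		c.mu.Lock()
		defer c.mu.Unlock()
		c.store = nil // release the store once the cache is empty and cold
	}
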
2017-09-07 16:35:08 -06:00
Joe LeGasse a95647b720 cleanup: remove poor usage of ',ok' with maps
There are several places in the code where comma-ok map retrieval was
being used poorly. Some were benign, like checking existence before
issuing an unconditional delete with no cleanup. Others were potentially
far more serious: assuming that if 'ok' was true, then the resulting
pointer retrieved from the map would be non-nil. `nil` is a perfectly
valid value to store in a map of pointers, and the comma-ok syntax is
meant for when membership is distinct from having a non-zero value.
There were only one or two cases that I saw it being used correctly for
maps of pointers.
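
For example (a small illustration, not code from this change; assumes fmt is imported and the code sits inside a function):

	m := map[string]*int{"k": nil}

	// ok reports membership only; the stored pointer may still be nil.
	if v, ok := m["k"]; ok {
		fmt.Println(v == nil) // prints: true
	}

	// The safe pattern for maps of pointers checks the value as well.
	if v, ok := m["k"]; ok && v != nil {
		fmt.Println(*v)
	}
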
2017-08-30 09:49:31 -04:00
Jason Wilder 778000435a Convert all keys from string to []byte in TSM engine
This switches all the interfaces that take a string series key to
take a []byte.  This eliminates many small allocations where we
convert between the two repeatedly.  Eventually, this change should
propagate further up the stack.
2017-07-28 11:00:50 -06:00
Jason Wilder e102fcca9c Use buffer writer for wal segments 2017-05-10 11:42:32 -06:00
Jason Wilder b4ea523910 Include snapshot size in the total cache size
This was causing a shard to appear idle when in fact a snapshot compaction
was running.  If the timing was right, compactions would be disabled and
the snapshot compaction would be aborted.
2017-05-03 16:31:58 -06:00
Jason Wilder 4f850b5cff Skip TestCache_Deduplicate_Concurrent on Windows 2017-04-04 08:48:55 -06:00
Jason Wilder 61f80db1b9 Skip cardinality dups on circle race test 2017-03-22 15:20:38 -06:00
Edd Robinson 73ed864e1d Add cache tests 2017-01-12 16:27:16 +00:00
Jason Wilder ae838ef323 Simplify Cache.Snapshot
This simplifies the cache.Snapshot func to swap the hot cache to
the snapshot cache instead of copying and appending entries.  This
reduces the amount of time the cache is write-locked, which should
reduce cache contention for the read-only code paths.
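
A sketch of the swap, with hypothetical field names:

	// Swap the hot store into the snapshot rather than copying entries,
	// keeping the write-lock window as short as possible.
	func (c *cache) snapshot() map[string][]byte {
		c.mu.Lock()
		defer c.mu.Unlock()
		snap := c.store
		c.store = make(map[string][]byte) // fresh, empty hot cache
		return snap
	}
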
2017-01-11 11:12:02 -07:00
Jason Wilder e91e45d71c Fix panic in cache benchmark 2016-12-19 14:17:01 -07:00
Edd Robinson d2923c7bf9 Add hints as to how to pre-allocate entry values
Currently, whenever a snapshot occurs the Cache is reset and so many
allocations are repeated, as the same type of data is re-added to
the Cache.

This commit allows the stores to keep track of the number of values
within an entry, and use that size as a hint when the same entry needs
to be recreated after a snapshot.

To avoid hints persisting over a long period of time, they are deleted
after every snapshot and rebuilt using only the most recent entries.
2016-12-14 18:23:36 +00:00
Edd Robinson f2b5c7f5be Reduce contention when adding entries 2016-12-14 18:23:36 +00:00
Edd Robinson 66edb32182 Sharded Cache using a hash ring 2016-12-14 18:23:36 +00:00
Jason Wilder 0b6f5441b9 Add config option to messages when limits exceeded
When a limit is exceeded, we return errors and sometimes log (if appropriate)
that a limit was exceeded.  The messages don't always provide an indication
as to where or how they are configured.

Instead, return the config option (easily searchable) as well as the limit
currently set and the value that exceeded it when possible.
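
An illustrative sketch of the message shape (option name, section, and wording are hypothetical):

	if n > limit {
		return fmt.Errorf(
			"max-values-per-tag limit exceeded (%d/%d): adjust max-values-per-tag in the [data] config section",
			n, limit)
	}
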
2016-10-28 14:54:45 -06:00
Jason Wilder 873189e0c2 Fix panic: interface conversion: tsm1.Value is *tsm1.FloatValue, not *tsm1.StringValue
If concurrent writes to the same shard occur, it's possible for different types to
be added to the cache for the same series.  The measurementFields map on the
shard would normally prevent this from occurring, but the way it is updated is racy
in this scenario.  When this occurs, the snapshot compaction panics because it can't
encode different types in the same series.

To prevent this, we have the cache return an error if a different type is added to
existing values in the cache.
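
A sketch of the guard, with hypothetical types (assumes fmt and reflect):

	type Value interface{} // stand-in for the tsm1 value types

	type entry struct{ values []Value }

	// add rejects a value whose concrete type differs from what the series
	// already holds, instead of panicking later during snapshot encoding.
	func (e *entry) add(v Value) error {
		if len(e.values) > 0 && reflect.TypeOf(e.values[0]) != reflect.TypeOf(v) {
			return fmt.Errorf("cannot write %T to series containing %T", v, e.values[0])
		}
		e.values = append(e.values, v)
		return nil
	}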

Fixes #7498
2016-10-28 12:15:50 -06:00
Steven Hartland 3f16197243 Improve tsm1 cache performance
Reduce the cache lock contention by widening the cache lock scope in WriteMulti. While this sounds counterintuitive, the locking was previously:
* 1 x Read Lock to read the size
* 1 x Read Lock per values
* 1 x Write Lock per values on race
* 1 x Write Lock to update the size

We now have:
* 1 x Write Lock

This also reduces contention on the entries' Values lock, as we now hold the global cache lock.

Move the calculation of the added size to before taking the lock, as it takes time and doesn't need the lock.

This also fixes a race in WriteMulti due to the lock not being held across the entire operation, which could cause the cache size to have an invalid value if Snapshot ran in between the addition of the values and the size update.

Fix the cache benchmarks, which were benchmarking the creation of the cache rather than its operation, and add a parallel test for a more real-world scenario; this could still be improved.

Add a fast path in newEntryValues for the new-entry case, which avoids taking the values lock and all the other calculations.

Drop the lock before performing the sort in Cache.Keys().
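
A sketch of the widened-lock shape, with hypothetical types:

	func (c *cache) writeMulti(batch map[string][]byte) {
		var added int
		for _, p := range batch {
			added += len(p) // size computed before taking the lock
		}

		c.mu.Lock() // one write lock for the whole batch
		for k, p := range batch {
			c.store[k] = append(c.store[k], p...)
		}
		c.size += added // size update covered by the same lock
		c.mu.Unlock()
	}
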
2016-10-25 15:24:51 -06:00
Jason Wilder 838a29cca8 Fix race in cache
If cache.Deduplicate is called while writes are in-flight on the cache, a data race
could occur.

WARNING: DATA RACE
Write by goroutine 15:
  runtime.mapassign1()
      /usr/local/go/src/runtime/hashmap.go:429 +0x0
  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).entry()
      /Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache.go:482 +0x27e
  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).WriteMulti()
      /Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache.go:207 +0x3b2
  github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent.func1()
      /Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:421 +0x73

Previous read by goroutine 16:
  runtime.mapiterinit()
      /usr/local/go/src/runtime/hashmap.go:607 +0x0
  github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).Deduplicate()
      /Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache.go:272 +0x7c
  github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent.func2()
      /Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:429 +0x69

Goroutine 15 (running) created at:
  github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent()
      /Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:423 +0x3f2
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:473 +0xdc

Goroutine 16 (finished) created at:
  github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent()
      /Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:431 +0x43b
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:473 +0xdc
2016-06-06 15:45:01 -06:00
Jason Wilder bc76048371 Fix panic in cache.DeleteRange
Deleting keys that did not exist in the cache could cause a panic
because the entry returned would be nil and was not checked.
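
The fix amounts to a nil guard, sketched here with hypothetical types (store mapping keys to entries with a filter method):

	func (c *cache) deleteRange(key string, min, max int64) {
		e := c.store[key] // nil when the key was never written
		if e == nil {
			return // guard: deleting an absent key must not panic
		}
		e.filter(min, max)
	}
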
2016-06-06 14:48:53 -06:00
Jason Wilder aefd2ad08b Add DeleteSeries and DeleteSeriesRange 2016-04-27 13:09:53 -06:00
Jason Wilder 0de21ade40 Add delete range of values support to WAL and cache loader 2016-04-27 13:09:53 -06:00
Jason Wilder 4d71d2b01f Add support for deleting cache values using time range 2016-04-27 13:09:52 -06:00
Jason Wilder 000459e350 Fix deadlock when running backup
A deadlock occurs under write load if a backup is run in between the
time when a snapshot compaction has snapshotted the cache and successfully
written it to disk.  The issue is that the second snapshot call will block
on the commit lock while it is holding the engine write lock.  This causes
all writes to block, and also prevents the currently running snapshot
compaction from completing, because it needs to acquire a read-lock.

This PR removes the commit lock and just returns an error if a snapshot is
in progress, allowing any locks being held to be released.  The caller can determine
whether to retry or give up.
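
A sketch of the fail-fast shape (hypothetical fields; assumes the errors package):

	func (c *cache) Snapshot() (map[string][]byte, error) {
		c.mu.Lock()
		defer c.mu.Unlock()
		if c.snapshotting {
			// Fail fast instead of blocking while other locks are held;
			// the caller decides whether to retry or give up.
			return nil, errors.New("snapshot in progress")
		}
		c.snapshotting = true
		snap := c.store
		c.store = make(map[string][]byte)
		return snap, nil
	}
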
2016-03-14 12:36:48 -06:00
Jason Wilder 8d70d65a82 Convert time.Time to int64 2016-02-25 15:15:01 -07:00
Jon Seymour 7eabae68de tsm: cache: add a test for the write sequence {6,1,snapshot,7,2}
Consider the write sequence: 6,1,snapshot,7,2.

The hot cache gets deduplicated, so it is 2,7.

Now consider the test `1 >= 2`: this is false, so needSort is not set to true.

The problem is the implicit assumption that the snapshot is always sorted
by the time that merged() runs, but this may not be true.
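
The boundary comparison, illustrated with the timestamps above (assumes fmt):

	snapshot := []int64{6, 1} // snapshotted before deduplication: unsorted
	hot := []int64{2, 7}      // hot cache after deduplication: sorted

	// merged() compared only the boundary timestamps:
	needSort := snapshot[len(snapshot)-1] >= hot[0] // 1 >= 2 is false
	fmt.Println(needSort) // false, even though the snapshot is unsorted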

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-26 07:43:50 +11:00
Jon Seymour eb7eec078d tsm: cache: introduce commit lock to Cache
Currently two compactors can execute Engine.WriteSnapshot at once.

This isn't thread safe since both threads want to make modifications to
Cache.snapshot at the same time.

This commit introduces a lock which is acquired during Snapshot() and
released during ClearSnapshot(), ensuring that at most one thread
executes within Engine.WriteSnapshot() at once.

To ensure that we always release this lock, but only release the
snapshot resources on a successful commit, we modify ClearSnapshot() to
accept a boolean indicating whether the write was successful, and we
guarantee to call this function whenever Snapshot() has been called.
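
A sketch of the protocol around the ClearSnapshot(success bool) signature, with hypothetical fields (commit being a sync.Mutex):

	func (c *Cache) Snapshot() map[string][]byte {
		c.commit.Lock() // acquired here, held until ClearSnapshot
		return c.snapshot
	}

	func (c *Cache) ClearSnapshot(success bool) {
		defer c.commit.Unlock() // always released
		if success {
			c.snapshot = nil // drop resources only on a successful commit
		}
	}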

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-25 12:10:37 +11:00
Jason Wilder 017c24c98e Simplify cache snapshotting
The Cache had support for taking multiple snapshots to support writing
multiple snapshots to TSM files concurrently if that happened to be
a bottleneck.  In practice, this is never a bottleneck and we only
run one snapshotting goroutine continuously per shard, which has worked
well for all workloads.

The multiple-snapshot support introduces some unhandled failure scenarios
where WAL segments could be removed without writing them to TSM files.  If
a snapshot compaction fails to write due to transient disk errors, subsequent
snapshots will continue, but the failed one will not be retried.  When the
subsequent ones succeed, all closed WAL segments are removed, causing data
loss.

This change simplifies the snapshotting capability to ensure that there is only
ever one snapshot.  If one fails, the next snapshot will update the existing
snapshot and retry all of the old and new data.

Fixes #5686
2016-02-23 09:38:51 -07:00
Mark Rushakoff 191de2670c Fix non-compiling test 2016-02-22 13:49:11 -08:00
Mark Rushakoff fc5c8597ab Merge pull request #5758 from influxdata/mr-disk-stats
Track cache, WAL, filestore stats within tsm1 engine
2016-02-22 13:01:55 -08:00
Jason Wilder aa2e878019 Fix cache not deduplicating points in some cases
The cache had some incorrect logic for determining when a series needed
to be deduplicated.  The logic was checking for unsorted points and
not considering duplicate points.  This would manifest itself as many
duplicate points being returned from the cache, and after a
snapshot compaction run the points would disappear, because snapshot
compaction always deduplicates and sorts the points.

Added a test that reproduces the issue.
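
A sketch of the corrected condition (hypothetical helper):

	// needsDedup reports whether an entry's timestamps require deduplication:
	// out-of-order and repeated timestamps both qualify.
	func needsDedup(times []int64) bool {
		for i := 1; i < len(times); i++ {
			if times[i] <= times[i-1] { // <= also catches duplicates; < missed them
				return true
			}
		}
		return false
	}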

Fixes #5719
2016-02-22 13:24:42 -07:00
Mark Rushakoff e76967efb6 Add stats to tsm1.Cache 2016-02-19 16:37:34 -08:00
INADA Naoki 898babf616 add float bench 2016-02-03 03:12:16 +09:00
Jason Wilder 631ecc23de Fix growing destination buffer during WAL entry encoding
The check to see whether the destination buffer for encoding and decoding a WAL
entry needed to grow was broken, and would cause a panic if there were large
batches that overflowed the buffer size.
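
The shape of the fix, as a hypothetical helper:

	// ensureLen returns a destination slice with room for the encoded
	// entry, growing it when a large batch would otherwise overflow it.
	func ensureLen(dst []byte, n int) []byte {
		if cap(dst) < n {
			return make([]byte, n)
		}
		return dst[:n]
	}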

Fixes #5075
2015-12-10 11:46:40 -08:00
Paul Dix 1bee7d1512 Update TSM, remove old version, add config
* remove rolloverTSMFileSize constant that is no longer used
* remove the maxGenerationFileCount since it is no longer a limitation that's necessary with the new compaction scheme. We no longer read WAL segments as part of the compaction so memory is only used as we read in each individual key
* remove minFileCount and switch to a user configurable variable
* remove the mutex from WALSegmentWriter. There's never more than one open in the WAL at one time and it's not exported through any function so the lock on the WAL should be used. This simplified keeping track of the last write time and removed a bunch of unnecessary locks.
* update WALSegmentWriter.Write to take the compressed bytes so that encoding and compression can occur before the call to write (while we don't hold the WAL lock)
* remove a bunch of unnecessary locking in WAL.writeToLog
* Add check for TSM file magic number and version
* Remove old tsm, log, and unused cursor code
* Remove references to tsm1dev everywhere except in the inspector
* Clean up config options for compaction and snapshotting
* Remove old TSM configuration options
* Update the config.sample.toml with TSM options
* Update WAL compact to force if it has been cold for writes for a configurable period of time (1h by default)
2015-12-06 18:50:39 -05:00