47 Commits (398660438f36d715b9370ed8da37428121b013c8)

Author SHA1 Message Date
Eng Zer Jun 903d30d658
test: use `T.TempDir` to create temporary test directory (#23258)
* test: use `T.TempDir` to create temporary test directory

This commit replaces `os.MkdirTemp` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.

Prior to this commit, a temporary directory created using `os.MkdirTemp`
had to be removed manually by calling `os.RemoveAll`, which was omitted
in some tests. The error handling boilerplate, e.g.
	defer func() {
		if err := os.RemoveAll(dir); err != nil {
			t.Fatal(err)
		}
	}()
is also tedious, but `t.TempDir` handles this for us nicely.

Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestSendWrite on Windows

=== FAIL: replications/internal TestSendWrite (0.29s)
    logger.go:130: 2022-06-23T13:00:54.290Z	DEBUG	Created new durable queue for replication stream	{"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestSendWrite1627281409\\001\\replicationq\\0000000000000001"}
    logger.go:130: 2022-06-23T13:00:54.457Z	ERROR	Error in replication stream	{"replication_id": "0000000000000001", "error": "remote timeout", "retries": 1}
    testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestSendWrite1627281409\001\replicationq\0000000000000001\1: The process cannot access the file because it is being used by another process.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestStore_BadShard on Windows

=== FAIL: tsdb TestStore_BadShard (0.09s)
    logger.go:130: 2022-06-23T12:18:21.827Z	INFO	Using data dir	{"service": "store", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestStore_BadShard1363295568\\001"}
    logger.go:130: 2022-06-23T12:18:21.827Z	INFO	Compaction settings	{"service": "store", "max_concurrent_compactions": 2, "throughput_bytes_per_second": 50331648, "throughput_bytes_per_second_burst": 50331648}
    logger.go:130: 2022-06-23T12:18:21.828Z	INFO	Open store (start)	{"service": "store", "op_name": "tsdb_open", "op_event": "start"}
    logger.go:130: 2022-06-23T12:18:21.828Z	INFO	Open store (end)	{"service": "store", "op_name": "tsdb_open", "op_event": "end", "op_elapsed": "77.3µs"}
    testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestStore_BadShard1363295568\002\data\db0\rp0\1\index\0\L0-00000001.tsl: The process cannot access the file because it is being used by another process.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestPartition_PrependLogFile_Write_Fail and TestPartition_Compact_Write_Fail on Windows

=== FAIL: tsdb/index/tsi1 TestPartition_PrependLogFile_Write_Fail/write_MANIFEST (0.06s)
    testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestPartition_PrependLogFile_Write_Failwrite_MANIFEST656030081\002\0\L0-00000003.tsl: The process cannot access the file because it is being used by another process.
    --- FAIL: TestPartition_PrependLogFile_Write_Fail/write_MANIFEST (0.06s)

=== FAIL: tsdb/index/tsi1 TestPartition_Compact_Write_Fail/write_MANIFEST (0.08s)
    testing.go:1090: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestPartition_Compact_Write_Failwrite_MANIFEST3398667527\002\0\L0-00000003.tsl: The process cannot access the file because it is being used by another process.
    --- FAIL: TestPartition_Compact_Write_Fail/write_MANIFEST (0.08s)

We must close the open file descriptor; otherwise the temporary file
cannot be cleaned up on Windows.

Fixes: 619eb1cae6 ("fix: restore in-memory Manifest on write error")
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestReplicationStartMissingQueue on Windows

=== FAIL: TestReplicationStartMissingQueue (1.60s)
    logger.go:130: 2023-03-17T10:42:07.269Z	DEBUG	Created new durable queue for replication stream	{"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestReplicationStartMissingQueue76668607\\001\\replicationq\\0000000000000001"}
    logger.go:130: 2023-03-17T10:42:07.305Z	INFO	Opened replication stream	{"id": "0000000000000001", "path": "C:\\Users\\circleci\\AppData\\Local\\Temp\\TestReplicationStartMissingQueue76668607\\001\\replicationq\\0000000000000001"}
    testing.go:1206: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestReplicationStartMissingQueue76668607\001\replicationq\0000000000000001\1: The process cannot access the file because it is being used by another process.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: update TestWAL_DiskSize

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

* test: fix failing TestWAL_DiskSize on Windows

=== FAIL: tsdb/engine/tsm1 TestWAL_DiskSize (2.65s)
    testing.go:1206: TempDir RemoveAll cleanup: remove C:\Users\circleci\AppData\Local\Temp\TestWAL_DiskSize2736073801\001\_00006.wal: The process cannot access the file because it is being used by another process.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>

---------

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2023-03-21 16:22:11 -04:00
Jeffrey Smith II b819edf095
fix: rename replication fields for better clarity (#24126)
* fix: rename replication fields for better clarity

* fix: don't rename, only add new field
2023-03-09 13:11:43 -05:00
Jeffrey Smith II 77fd64a975
fix: handle replication missing queue (#24123)
* fix: replications should startup after backup/restore

* chore: refactor

* test: improve logging and handle test better
2023-03-09 13:10:53 -05:00
suitableZebraCaller ec7fdd3a58
fix: Show Replication Queue size and Replication TCP Errors (#23960)
* feat: Show remaining replication queue size

* fix: Show non-http related error messages

* fix: Show non-http related error messages with backoff

* fix: Updates for replication tests

* chore: formatting

* chore: formatting

* chore: formatting

* chore: formatting

* chore: lowercase json field

---------

Co-authored-by: Geoffrey <suitableZebraCaller@users.noreply.github.com>
Co-authored-by: Jeffrey Smith II <jeffreyssmith2nd@gmail.com>
2023-02-02 09:47:45 -05:00
Jeffrey Smith II f026d7bdaf
fix: Fixes migrating when a remote already exists (#23912)
* fix: handle migrating with already defined remotes

* test: add test to verify migrating already defined remotes

* fix: properly handle Up
2022-11-17 14:23:10 -05:00
Ole Kristian (Zee) 666cabb1f4
fix: fix wrong max age transformation from seconds (#23684)
* fix: fix wrong max age transformation from seconds

* refactor: clarify max age intent

* refactor: remove unnecessary duration
2022-11-16 16:18:43 -05:00
Dane Strandboge 6fc66acb0a
fix: do not require remoteOrgID in remote config/creation request (#23838)
2022-11-01 09:47:45 -05:00
Dane Strandboge 55b7d29e4f
fix: sql scan error on remote bucket id when replicating to 1.x (#23826)
2022-10-19 14:51:48 -05:00
Jeffrey Smith II 6f50e70960
feat: replicate based on bucket name rather than id (#23638)
* feat: add the ability to replicate based on bucket name rather than bucket id.

- This adds compatibility with 1.x replication targets

* fix: improve error checking and add tests

* fix: add additional constraint to replications table

* fix: use OR not AND for constraint

* feat: delete invalid replications on downgrade

* fix: should be less than 2.4

* test: add test around down migration and cleanup migration code

* fix: use nil instead of platform.ID(1) for better consistency

* fix: fix tests

* fix: fix tests
2022-08-18 14:21:59 -04:00
Jeffrey Smith II 090f681737
feat: Add remotes and replications to telemetry (#23456)
* feat: start work on remotes/replications phone home data

* feat: add remotes/replications phone home data (no tests)

* refactor: use erroring binary conversions

* style: gofmt

* refactor: improve some error handling

* style: cleanup

* feat: add tests

* refactor: just list remotes/replications rather than decrement

* chore: linting fix

Co-authored-by: DStrand1 <dstrandboge@influxdata.com>
2022-06-16 14:48:06 -04:00
Dane Strandboge 9e556864a3
fix: replications remote write failure can deadlock remote writer (#23458)
2022-06-16 11:57:24 -05:00
Jeffrey Smith II 692b0d5153
feat: add instance-id flag for identifying edge nodes (#23447)
* feat: add instance-id flag for identifying edge nodes

* refactor: rename tag to _instance_id
2022-06-16 12:18:11 -04:00
Dane Strandboge 9e20f9f3dc
feat: add signifier to replication user agent (#23370)
2022-05-31 11:50:53 -05:00
Dane Strandboge 82d1123e78
build: upgrade to Go 1.18.1 (#23252)
2022-04-13 15:24:27 -05:00
Dane Strandboge 359fcc46b5
feat: add maximum age to replication queues (#23206)
Co-authored-by: Sam Arnold <sarnold@influxdata.com>
2022-03-25 13:06:05 -05:00
Sam Arnold 7c0ec4dd2c
fix: replications replicates flux to() writes (#23188)
Fixes a few issues:
* flux needs to write to the replication service, instead of the engine directly.
* the replication service incorrectly had value receiver methods; I think this
was just an accident. Pointer receivers make things easier to reason about. Also,
with value receivers, flux was not picking up the replication config properly.
* The flux to() function previously did not receive the org properly for internal
writes. Previously this was not necessary as the write path only needs the bucket
ID at this level (after authentication). But now we need the org id to look up
replications properly.

Closes #23183
2022-03-14 12:17:58 -04:00
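
The receiver issue mentioned in #23188 above can be illustrated in isolation. The sketch below is not the replication service itself; `replicationConfig`, `valueService`, and `pointerService` are hypothetical types that only show why a change made through a value receiver is not visible to the caller.

```go
package main

import "fmt"

// replicationConfig and both service types are hypothetical.
type replicationConfig struct{ enabled bool }

type valueService struct{ cfg replicationConfig }

// Enable has a value receiver, so it changes a copy of the service.
func (s valueService) Enable() { s.cfg.enabled = true }

type pointerService struct{ cfg replicationConfig }

// Enable has a pointer receiver, so the caller's service is updated.
func (s *pointerService) Enable() { s.cfg.enabled = true }

// describe reports whether a config ended up enabled.
func describe(name string, cfg replicationConfig) string {
	return fmt.Sprintf("%s: enabled=%t", name, cfg.enabled)
}

func main() {
	v := valueService{}
	v.Enable()
	fmt.Println(describe("value receiver", v.cfg)) // value receiver: enabled=false

	p := &pointerService{}
	p.Enable()
	fmt.Println(describe("pointer receiver", p.cfg)) // pointer receiver: enabled=true
}
```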
Sam Arnold e20b5e99a6
fix: remove nats for scraper processing (#23107)
* fix: remove nats for scraper processing

Scrapers now use go channels instead of NATS and interprocess communication.
This should fix #23085.

Additionally, found and fixed #23106.

* chore: fix formatting

* chore: fix static check and go.mod

* test: fix some flaky tests

* fix: mark NATS arguments as deprecated
2022-02-10 11:23:18 -05:00
William Baker c1d384de19
test: fix flaky enqueue test (#23035)
2022-01-10 08:04:59 -08:00
mcfarlm3 60234964d0
refactor: replications local write optimization (#22993)
* refactor: eliminate sqlite query in case of no configured replications

* refactor: updated write-related tests to reflect tracking of orgID and localBucket by the queue manager

* refactor: removed redundant trackedReplications field

* refactor: corrected slice init in GetReplications and added TestGetReplications

* refactor: eliminated tracked package and moved TrackedReplication struct to influxdb package via replication.go

* chore: ran make fmt

* fix: added closeRq function back in to address flaky tests

* refactor: small changes to queue manager test based on code review
2021-12-15 12:32:46 -08:00
William Baker 5a919b69d7
feat: enable remotes and replication streams feature (#22990)
2021-12-13 16:01:50 -06:00
William Baker 0e5b14fa5e
chore: increase replications batch size limits (#22983)
2021-12-13 11:02:38 -06:00
William Baker a7a5233432
feat: advance queue scanner periodically instead of every remote write (#22981)
2021-12-13 10:09:36 -06:00
William Baker e3ff434f81
test: fix flaky replications tests (#22973)
* fix: fix test and run 20 times

* fix: unfix and run test 20 times

* test: wait for rq run fn to return in tests
2021-12-08 14:48:25 -06:00
William Baker e5cbd279ee
fix: advance replications queue after successful remote writes (#22967)
* fix: advance replications queue after successful remote writes to prevent data duplication on errors

* fix: loop on sendwrite

* chore: remove flaky test

* chore: add TODO about future optimization
2021-12-08 12:52:46 -06:00
William Baker 6096ee2ad4
feat: replications metrics include failure to enqueue (#22962)
* feat: replications metrics include failure to enqueue
2021-12-02 14:42:55 -06:00
mcfarlm3 28bcd416b2
feat: batch replications remote writes to avoid payload limit errors (#22914)
* feat: batch replications remote writes appropriately to avoid payload limit errors

* chore: ran make fmt

* chore: fixed staticcheck failure

* refactor: removed batching code from queue manager

* refactor: batch writes before gzip compression

* fix: add in missing bracket after merge

* fix: removed duplicate lines of code from WritePoints function

* feat: add batching functionality for remote writes

* refactor: removed batch index variable
2021-12-02 12:04:10 -08:00
William Baker e4e16335f5
fix: replications remote writes do not block server shutdown (#22958)
* fix: replications remote writes do not block server shutdown

* fix: don't leak goroutine
2021-12-02 12:04:52 -06:00
William Baker 3460f1cc52
feat: replication remote writes do not block local writes (#22956)
* feat: replication remote writes do not block local writes
2021-12-01 15:37:10 -06:00
William Baker f05d0136f1
feat: metrics collection for replications remote writes (#22952)
* feat: metrics collection for replications remote writes

* fix: don't update metrics with 204 error code on successful writes
2021-12-01 12:41:24 -06:00
William Baker 9873ccd657
feat: remote write function for replications (#22942)
* feat: remote write function for replications

* chore: implement UpdateResponseInfo store method

* chore: only set gzip heading for non-empty requests

* fix: address review feedback
2021-11-30 15:33:42 -06:00
William Baker f47d514225
refactor: move replications store functionality to separate package (#22923)
* refactor: move replications store functionality to separate package

* fix: make opening all repls on startup work right
2021-11-24 11:45:19 -06:00
William Baker 3a81166812
feat: added metrics collection for replications (#22906)
* feat: added metrics collection for replications

* fix: fixed panic when restarting

* fix: fix panic pt2

* chore: self-review fixes

* chore: simplify test
2021-11-22 11:40:03 -06:00
Dane Strandboge 6ee472725f
refactor: use remote write func in NewDurableQueueManager (#22888)
2021-11-19 11:31:10 -06:00
William Baker ad52815e19
feat: add field for dropping data resulting in non-retryable errors to individual replications (#22885)
* feat: add field for dropping data resulting in non-retryable errors to individual replications
2021-11-16 13:41:54 -07:00
Dane Strandboge 40d9587ece
feat: add replications queue scanner (#22873)
Co-authored-by: mcfarlm3 <58636946+mcfarlm3@users.noreply.github.com>
2021-11-16 10:30:52 -06:00
Daniel Moran 6b56af3c3f
feat: mirror writes to registered replications (#22833)
2021-11-10 08:25:47 -05:00
mcfarlm3 cd0243d2b4
feat: added replications queue management to launcher tasks (#22820)
* feat: added replications queue management to launcher tasks

* refactor: separated sql logic into replications service rather than durable queue manager

* refactor: extended replications feature flag to launcher code and minor change to startup function param

* chore: added unit test coverage for replications server startup queue management

* refactor: made error messages reusable and factored out unnecessary string from queue management tests

* refactor: changed queue management error names to pass linter check
2021-11-09 11:32:07 -08:00
Daniel Moran 1aac92c5ee
refactor: remove replications.current_queue_size_bytes from sqlite (#22832)
Maintaining the current queue size in a SQL column would require
updating the DB on every queue operation. Avoid that contention by
instead looking up the current size on the in-memory durable queue
struct, which is already tracked & updated as data enters & leaves
the queue.
2021-11-05 14:35:12 -04:00
William Baker f7573f43a7
feat: sql migrator can do down migrations (#22806)
* feat: sql down migrations

* refactor: different name for up migrations

* chore: update migrations ref in svc tests

* build: add lint step to verify sql migration names match
2021-11-01 14:30:18 -06:00
mcfarlm3 8825cd5d50
feat: replication apis durable queue management (#22719)
* feat: added durable queue management to replications service

* refactor: improved mapping of replication streams to durable queues

* refactor: modified replication stream durable queues to use user-specified engine path

* chore: generated test mocks for replications DurableQueueManager

* chore: add test coverage for replications durable queue manager

* refactor: made changes based on code review, added mutex to durableQueueManager, improved error logging

* chore: ran make fmt

* refactor: further improvements to error logging
2021-10-26 12:14:29 -07:00
Daniel Moran 58139c47b2
feat: add auth to remotes & replications APIs (#22744)
2021-10-26 11:32:35 -04:00
Daniel Moran 7c19225bed
feat: implement replication validation (#22581)
2021-10-05 14:34:38 -04:00
Daniel Moran 153a89dba0
feat: deleting a bucket also deletes all associated replications (#22424)
2021-09-09 15:22:36 -04:00
Daniel Moran 1fa0ccf24a
refactor: move interfaces for remotes & replication services out of root package (#22417)
2021-09-07 16:21:29 -04:00
Daniel Moran 12c8fd28d2
feat: implement metadata management for replications (#22302)
2021-09-01 12:01:41 -04:00
Daniel Moran b37ad79e20
feat: add logging and metrics middlewares to replications API (#22291)
2021-08-24 14:56:56 -04:00
Daniel Moran 641c02f9a8
feat: add APIs for management of replication streams (#22287)
2021-08-24 14:19:03 -04:00