influxdb

Commit Graph

Author	SHA1	Message	Date
Dom Dwyer	a85dcd745b	refactor(catalog): expose deleted_at on Namespace Add the new catalog column to the Namespace representation/model.	2023-02-10 14:15:01 +01:00
dependabot[bot]	6327e3d9c0	chore(deps): Bump serde_json from 1.0.92 to 1.0.93 (#6918 ) Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.92 to 1.0.93. - [Release notes](https://github.com/serde-rs/json/releases) - [Commits](https://github.com/serde-rs/json/compare/v1.0.92...v1.0.93) --- updated-dependencies: - dependency-name: serde_json dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2023-02-09 10:39:33 +00:00
dependabot[bot]	0ecde75af5	chore(deps): Bump object_store from 0.5.3 to 0.5.4 (#6900 ) Bumps [object_store](https://github.com/apache/arrow-rs) from 0.5.3 to 0.5.4. - [Release notes](https://github.com/apache/arrow-rs/releases) - [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG-old.md) - [Commits](https://github.com/apache/arrow-rs/compare/object_store_0.5.3...object_store_0.5.4) --- updated-dependencies: - dependency-name: object_store dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2023-02-08 09:40:11 +00:00
Dom Dwyer	bf6ab7fd88	fix: error message typo columns -> tables for table limit error.	2023-02-06 18:03:26 +01:00
Dom Dwyer	3f2eb54bce	test(router): catalog service limit errors Assert the service limit error messages from the catalog.	2023-02-06 17:55:09 +01:00
Dom Dwyer	3881e11734	test(router): service limit error messages Assert the user-facing service limit error messages.	2023-02-06 17:43:37 +01:00
Dom Dwyer	114bafe9a1	perf(router): cached table limit enforcement Use the namespace schema cache in the router to enforce the per-namespace table limit (service protection limit), adding O(1) overhead to the existing column limit evaluation logic. Prior to this commit, each request that would breach the table limit would be (potentially partially) applied to the catalog and return an error. Every subsequent request creating a new table continued to cause a catalog query, unnecessarily adding load proportional to request counts. After this commit, catalog requests are sent when the router instance can determine (to the best of it's ability, see below) that the request will not cause the namespace to exceed the table limit. Because this uses cached schemas, the actual state set of tables may have changed - this will cause inconsistent enforcement and spurious errors in the same way it currently does for the column limit. For more details (and to track a resolution) see: https://github.com/influxdata/influxdb_iox/issues/5957	2023-02-06 17:43:26 +01:00
Dom Dwyer	dfa4ab2585	perf(router): fast-path column limit for new table When validating column limits for new tables, skip the column set generation and union operations against the empty existing column set.	2023-02-06 17:33:56 +01:00
Dom Dwyer	a633964f2b	feat(catalog): return max table limit in schema The maximum number of tables is part of the Namespace, which is already loaded in its entirety. This commit copies the value into the NamespaceSchema, making it available for the router to utilise.	2023-02-06 17:33:55 +01:00
dependabot[bot]	6f4e287a3a	chore(deps): Bump serde_json from 1.0.91 to 1.0.92 (#6860 ) Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.91 to 1.0.92. - [Release notes](https://github.com/serde-rs/json/releases) - [Commits](https://github.com/serde-rs/json/compare/v1.0.91...v1.0.92) --- updated-dependencies: - dependency-name: serde_json dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2023-02-06 08:27:41 +00:00
Carol (Nichols \|\| Goulding)	30fea67701	fix: Move variables within format strings. Thanks clippy! Changes made automatically using `cargo clippy --fix`.	2023-02-03 13:06:17 -05:00
Dom Dwyer	7f363b55df	test(router): e2e namespace retention coverage Assert the correct handling of 0 and negative retention periods when interacting with the namespace create & update gRPC handlers.	2023-02-01 11:49:53 +01:00
dependabot[bot]	d0e6b16450	chore(deps): Bump bytes from 1.3.0 to 1.4.0 Bumps [bytes](https://github.com/tokio-rs/bytes) from 1.3.0 to 1.4.0. - [Release notes](https://github.com/tokio-rs/bytes/releases) - [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md) - [Commits](https://github.com/tokio-rs/bytes/compare/v1.3.0...v1.4.0) --- updated-dependencies: - dependency-name: bytes dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2023-02-01 00:30:56 +00:00
dependabot[bot]	875b6a3e99	chore(deps): Bump futures from 0.3.25 to 0.3.26 (#6766 ) Bumps [futures](https://github.com/rust-lang/futures-rs) from 0.3.25 to 0.3.26. - [Release notes](https://github.com/rust-lang/futures-rs/releases) - [Changelog](https://github.com/rust-lang/futures-rs/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-lang/futures-rs/compare/0.3.25...0.3.26) --- updated-dependencies: - dependency-name: futures dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2023-01-31 11:33:50 +00:00
Dom Dwyer	0ddef54b09	fix(router): envoy network error translation Envoy will connect to an endpoint on demand, and return an application-level error if it fails with a gRPC status code of "Unavailable". It also embeds a metadata entry of {"server": "envoy"} - this commit uses the two signals (error status code + metadata entry) to drive an immediate reconnection when observed, assuming the connection is bad.	2023-01-30 12:15:38 +01:00
Dom Dwyer	353b1ad575	feat: configurable RPC write request timeout Allows the user to configure the timeout used for a single RPC write request, and changes the default to a more sensible value (30 -> 3 seconds).	2023-01-27 14:53:48 +01:00
Dom	5757674d5e	Merge branch 'main' into dom/service-limit-metric-labels	2023-01-27 10:04:56 +00:00
Dom Dwyer	8140313775	test: drive catalog in ns rejection test Use the actual catalog resolver, not the mock to assert the correct behaviour with a populated catalog.	2023-01-26 17:55:39 +01:00
Dom Dwyer	0aa5469ac6	test(e2e): explicit namespace creation Adds an end-to-end test of the router's gRPC NamespaceService covering creation and reading of new namespaces.	2023-01-26 17:32:12 +01:00
Dom Dwyer	7eaa8f59b0	fix: explicit namespace creation w/ existing ns Prior to this commit, namespaces that had been created on one router could not be used on another router until the latter was restarted. Effectively, newly created namespaces couldn't be used. After this commit, the catalog is also checked when a cache miss occurs, ensuring the router discovers new, not-yet-cached namespaces.	2023-01-26 17:32:12 +01:00
Dom Dwyer	105e354299	refactor: clean up namespace errors The namespace error was poorly refactored and duplicated the prefix string. The "rejected" case is now also tested.	2023-01-26 17:32:11 +01:00
Dom Dwyer	1a7679bcee	refactor: expose underlying gRPC implementations Changes the gRPC delegate to return the underlying service (type erased) implementations instead of the RPC service wrappers.	2023-01-26 17:32:11 +01:00
Dom Dwyer	ac8fa293cb	refactor(test): TestContext::write_lp() helper Adds a helper method to construct the HTTP write request.	2023-01-26 17:32:10 +01:00
Dom Dwyer	6f1869f9dc	test(router): initialise gRPC delegate in e2e Initialise the "rpc mode" gRPC handlers in the router e2e TestContext.	2023-01-26 17:32:10 +01:00
Dom Dwyer	3efc42baac	refactor(test): dedicated e2e TestContext module Moves the router's TestContext to its own file/module.	2023-01-26 17:32:10 +01:00
Dom Dwyer	c66f4a3d92	fix(router): restore NamespaceService This was removed in the RPC variant of the router - no idea why, we definitely should have it!	2023-01-26 15:10:22 +01:00
Dom Dwyer	b6018e1c39	feat(metrics): separate service limit counters Service limits are enforced on two values: * Number of tables in a namespace * Number of columns in a table This commit labels the existing service limit hit metric with the type of limit reached, and adds this information to the log lines emitted.	2023-01-26 14:48:33 +01:00
Dom	f0d7ee59c3	Merge branch 'main' into dom/circuit-fuzz	2023-01-25 12:42:43 +00:00
Dom Dwyer	6eb1773ec0	perf(router): faster balancer node recovery Ensure a "probe" node is always returned as the first candidate, driving it to recovery faster. This also includes a fix for the balancer metrics that would report probe candidate nodes as healthy nodes.	2023-01-25 13:18:24 +01:00
Dom Dwyer	f5d4171be0	test: CircuitBreaker recovery property fuzz test Adds a multi-threaded fuzz test that ensures a circuit breaker can always transition to the healthy state, regardless of the sequence of events prior.	2023-01-25 11:53:57 +01:00
Luke Bond	caea42665b	Merge branch 'main' into dom/rpc-endpoint-metrics	2023-01-25 10:44:18 +11:00
Dom	442e8a8b79	Merge branch 'main' into dom/ingester-rediscovery	2023-01-24 19:13:02 +00:00
Dom Dwyer	411f4bd08b	fix(router): force rediscovery of nodes Similar to https://github.com/influxdata/influxdb_iox/pull/6509, this forces a constant re-querying of the DNS address of an ingester to drive rediscovery. Unlike the above PR, this only reconnects when there are errors observed. This still isn't ideal - something is wrong with the discovery itself - this just papers over it.	2023-01-24 20:11:53 +01:00
Dom Dwyer	9132343dac	feat(metrics): export RPC upstream health state Adds a metric with a per-ingester label recording the current health state of the upstream ingester from the perspective of the router instance. Also logs periodically when one or more ingesters are offline.	2023-01-24 19:27:15 +01:00
Dom Dwyer	f26b54beec	refactor(router): set sensible RPC timeouts Copies these over from the client_util package.	2023-01-24 19:22:27 +01:00
Dom Dwyer	87b553fe9d	feat: WARN logs w/ endpoint for unhealthy upstream Changes the DEBUG log event to a WARN now that it includes the endpoint to which the event applies.	2023-01-24 19:19:31 +01:00
Dom Dwyer	085de40127	feat: lazy-connect to ingester gRPC endpoints Lazily establish connections in the background, instead of using tonic's connect_lazy(). connect_lazy() causes error handling to take a different path in tonic compared to "normal" connections, and this stops reconnections from being performed when the endpoint goes away (likely a bug). It also means the first few write requests won't have to wait while the connection is dialed, which brings down the P99 as a nice side-effect.	2023-01-24 16:44:55 +01:00
Dom Dwyer	8215f4126e	test: router balancer recovery Ensure a recovering node is yielded from the balancer.	2023-01-24 15:30:01 +01:00
Dom Dwyer	c6d6c50fbf	perf(router): circuit break ingester connections Adds on-path health checking / recording using the CircuitBreaker construct, stopping requests to unhealthy upstreams (minus the probe requests) until they recover. This removes the horrible gRPC balancer hack I added to get us deployed ASAP, and should eliminate latency spikes and elevated error responses observed during deployments as a result.	2023-01-24 15:30:01 +01:00
Dom Dwyer	107006c801	revert: influxdata/dom/rpc-balancer This reverts commit `a3805dbccf`, reversing changes made to `bcb1232c5d`.	2023-01-24 14:47:05 +01:00
Dom Dwyer	b32662ebf2	test: router balancer recovery Ensure a recovering node is yielded from the balancer.	2023-01-24 13:38:36 +01:00
Dom Dwyer	7596dc0826	perf(router): circuit break ingester connections Adds on-path health checking / recording using the CircuitBreaker construct, stopping requests to unhealthy upstreams (minus the probe requests) until they recover. This removes the horrible gRPC balancer hack I added to get us deployed ASAP, and should eliminate latency spikes and elevated error responses observed during deployments as a result.	2023-01-24 12:38:27 +01:00
Dom Dwyer	c3a2ac3a0d	refactor: prevent div by 0 Preserve the error ratio calculation but prevent a div by 0 by ensuring the divisor is always at least 1.	2023-01-24 12:09:00 +01:00
Dom Dwyer	c4b04a16c5	refactor: rename last_probe instant last_probe was "the instant at which the last set of probes started being sent" in my head, but Carol saw it as "first_probe - the time at which probes started being sent". Hopefully probe_window_started_at is less ambiguous.	2023-01-24 12:08:10 +01:00
Dom Dwyer	2f3fb48091	docs: document error count floor Describe the floor on the number of errors that must be observed before the circuit breaker will consider switching to the unhealthy state.	2023-01-24 12:08:09 +01:00
Carol (Nichols \|\| Goulding)	caf8dc9032	fix: Rename incorrect usage of 'close' to 'unhealthy' in test helper	2023-01-23 16:08:00 -05:00
Carol (Nichols \|\| Goulding)	081b4f15da	docs: Clarify my understanding of the circuit breaker based on chat with Dom	2023-01-23 16:07:02 -05:00
Dom Dwyer	67b73d90dd	feat: low-overhead circuit breaker Implements a "circuit breaker", a construct that tracks the error & success of requests to a remote node, and uses this information to allow or deny further requests. This circuit breaker stops sending requests to the remote when the error count exceeds 80% of requests in a 5 second window. Once this happens, up to 10 "probe" requests per second are allowed, and when they succeed, normal operation resumes (though concurrent requests may still be completing during the probe regime and are counted towards the probe results). In the happy path, this circuit breaker is very cheap (lock free; WFPO) to evaluate and record request results in, minimising the throughput penalty. Once the breaker enters an unhealthy state (hopefully a rare occurrence) it uses a mutex to manage the probe state (with a higher overhead) for simplicity; it's definitely possible to optimise this away if high latencies are observed during upstream outages when the circuit breaker is open/unhealthy.	2023-01-23 13:55:12 +01:00
Dom Dwyer	6ef68513d9	fix: gRPC balancer shutdown panic The gRPC node discovery hack spawns a task that outlives the gRPC balancer - once the balancer stops, the task should stop too (and not panic sending on the closed channel).	2023-01-11 16:42:39 +01:00
Dom Dwyer	9ab86fa154	fix(router2): drive ingester node (re)-discovery The tonic / tower load-balance implementation discards failed nodes, even when using a static list - this causes nodes that fail once to never be retried. This doesn't happen for the last node for some reason, and leads to all the load from one router hitting a single ingester instead of load balancing across all ingesters. This commit adds a hack to constantly tell the load balancer to probe all nodes, hopefully causing them to re-discover previously failed nodes. I don't have the time to do this properly :(	2023-01-05 14:06:29 +01:00

1 2 3 4 5

216 Commits (d7904060854c457a2027d35dcd55598060738158)