Commit Graph

189 Commits (7b69c84ceb5894e0a51f35ec6a8bb79a27c93533)

Author SHA1 Message Date
Dom f0d7ee59c3
Merge branch 'main' into dom/circuit-fuzz 2023-01-25 12:42:43 +00:00
Dom Dwyer 6eb1773ec0
perf(router): faster balancer node recovery
Ensure a "probe" node is always returned as the first candidate, driving
it to recovery faster.

This also includes a fix for the balancer metrics that would report
probe candidate nodes as healthy nodes.
2023-01-25 13:18:24 +01:00
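Illustration for 6eb1773ec0 (not code from the commit): a minimal Rust sketch of ordering balancer candidates so that a node in the probe state is tried first, and of counting only truly healthy nodes for metrics. The `Health` enum, `candidates` and `healthy_count` are hypothetical names. This sketch places every probe node ahead of the healthy ones, whereas the real balancer may surface only a single probe node.

```rust
/// Hypothetical per-node health state as seen by the balancer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Probe,
    Unhealthy,
}

/// Return candidate node names with probe nodes first, then healthy
/// nodes. Unhealthy nodes are skipped entirely.
pub fn candidates<'a>(nodes: &'a [(&'a str, Health)]) -> Vec<&'a str> {
    let probes = nodes.iter().filter(|(_, h)| *h == Health::Probe);
    let healthy = nodes.iter().filter(|(_, h)| *h == Health::Healthy);
    probes.chain(healthy).map(|(name, _)| *name).collect()
}

/// Count only healthy nodes, so that probe candidates are not
/// reported as healthy in metrics.
pub fn healthy_count(nodes: &[(&str, Health)]) -> usize {
    nodes.iter().filter(|(_, h)| *h == Health::Healthy).count()
}
```

Putting the probe node first means it receives traffic as soon as a request arrives, which is what drives it back to the healthy state quickly.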
Dom Dwyer f5d4171be0
test: CircuitBreaker recovery property fuzz test
Adds a multi-threaded fuzz test that ensures a circuit breaker can
always transition to the healthy state, regardless of the sequence of
events prior.
2023-01-25 11:53:57 +01:00
Luke Bond caea42665b
Merge branch 'main' into dom/rpc-endpoint-metrics 2023-01-25 10:44:18 +11:00
Dom 442e8a8b79
Merge branch 'main' into dom/ingester-rediscovery 2023-01-24 19:13:02 +00:00
Dom Dwyer 411f4bd08b
fix(router): force rediscovery of nodes
Similar to https://github.com/influxdata/influxdb_iox/pull/6509, this
forces a constant re-querying of the DNS address of an ingester to drive
rediscovery.

Unlike the above PR, this only reconnects when there are errors
observed. This still isn't ideal - something is wrong with the discovery
itself - this just papers over it.
2023-01-24 20:11:53 +01:00
Dom Dwyer 9132343dac
feat(metrics): export RPC upstream health state
Adds a metric with a per-ingester label recording the current health
state of the upstream ingester from the perspective of the router
instance.

Also logs periodically when one or more ingesters are offline.
2023-01-24 19:27:15 +01:00
Dom Dwyer f26b54beec
refactor(router): set sensible RPC timeouts
Copies these over from the client_util package.
2023-01-24 19:22:27 +01:00
Dom Dwyer 87b553fe9d
feat: WARN logs w/ endpoint for unhealthy upstream
Changes the DEBUG log event to a WARN now that it includes the endpoint
to which the event applies.
2023-01-24 19:19:31 +01:00
Dom Dwyer 085de40127
feat: lazy-connect to ingester gRPC endpoints
Lazily establish connections in the background, instead of using tonic's
connect_lazy().

connect_lazy() causes error handling to take a different path in tonic
compared to "normal" connections, and this stops reconnections from
being performed when the endpoint goes away (likely a bug).

It also means the first few write requests won't have to wait while the
connection is dialed, which brings down the P99 as a nice side-effect.
2023-01-24 16:44:55 +01:00
Dom Dwyer 8215f4126e
test: router balancer recovery
Ensure a recovering node is yielded from the balancer.
2023-01-24 15:30:01 +01:00
Dom Dwyer c6d6c50fbf
perf(router): circuit break ingester connections
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.

This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
2023-01-24 15:30:01 +01:00
Dom Dwyer 107006c801
revert: influxdata/dom/rpc-balancer
This reverts commit a3805dbccf, reversing
changes made to bcb1232c5d.
2023-01-24 14:47:05 +01:00
Dom Dwyer b32662ebf2
test: router balancer recovery
Ensure a recovering node is yielded from the balancer.
2023-01-24 13:38:36 +01:00
Dom Dwyer 7596dc0826
perf(router): circuit break ingester connections
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.

This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
2023-01-24 12:38:27 +01:00
Dom Dwyer c3a2ac3a0d
refactor: prevent div by 0
Preserve the error ratio calculation but prevent a div by 0 by ensuring
the divisor is always at least 1.
2023-01-24 12:09:00 +01:00
Dom Dwyer c4b04a16c5
refactor: rename last_probe instant
last_probe was "the instant at which the last set of probes started
being sent" in my head, but Carol saw it as "first_probe - the time at
which probes started being sent".

Hopefully probe_window_started_at is less ambiguous.
2023-01-24 12:08:10 +01:00
Dom Dwyer 2f3fb48091
docs: document error count floor
Describe the floor on the number of errors that must be observed before
the circuit breaker will consider switching to the unhealthy state.
2023-01-24 12:08:09 +01:00
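Illustration for c3a2ac3a0d and 2f3fb48091 (not code from either commit): a minimal Rust sketch of an error ratio whose divisor is clamped to at least 1, combined with a floor on the error count. The function names and the `MIN_ERRORS` value are assumptions; the 80% threshold is the figure given in 67b73d90dd.

```rust
/// Error ratio above which the upstream is considered unhealthy
/// (the 80% figure from the circuit breaker commit).
const MAX_ERROR_RATIO: f64 = 0.8;

/// Hypothetical floor: the actual value is not stated in the log.
const MIN_ERRORS: u64 = 5;

/// Fraction of observed requests that failed. The divisor is clamped
/// to at least 1, so an empty window yields 0.0 instead of NaN.
pub fn error_ratio(errors: u64, successes: u64) -> f64 {
    let total = errors.saturating_add(successes).max(1);
    errors as f64 / total as f64
}

/// Healthy unless the error floor has been reached AND the error
/// ratio exceeds the threshold.
pub fn is_healthy(errors: u64, successes: u64) -> bool {
    if errors < MIN_ERRORS {
        return true;
    }
    error_ratio(errors, successes) <= MAX_ERROR_RATIO
}
```

The floor keeps a handful of failures on a nearly idle connection (for example 1 error out of 1 request) from tripping the breaker.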
Carol (Nichols || Goulding) caf8dc9032
fix: Rename incorrect usage of 'close' to 'unhealthy' in test helper 2023-01-23 16:08:00 -05:00
Carol (Nichols || Goulding) 081b4f15da
docs: Clarify my understanding of the circuit breaker based on chat with Dom 2023-01-23 16:07:02 -05:00
Dom Dwyer 67b73d90dd
feat: low-overhead circuit breaker
Implements a "circuit breaker", a construct that tracks the error &
success of requests to a remote node, and uses this information to allow
or deny further requests.

This circuit breaker stops sending requests to the remote when the error
count exceeds 80% of requests in a 5 second window. Once this happens,
up to 10 "probe" requests per second are allowed, and when they succeed,
normal operation resumes (though concurrent requests may still be
completing during the probe regime and are counted towards the probe
results).

In the happy path, this circuit breaker is very cheap (lock free; WFPO)
to evaluate and record request results in, minimising the throughput
penalty. Once the breaker enters an unhealthy state (hopefully a rare
occurrence) it uses a mutex to manage the probe state (with a higher
overhead) for simplicity; it's definitely possible to optimise this away
if high latencies are observed during upstream outages when the circuit
breaker is open/unhealthy.
2023-01-23 13:55:12 +01:00
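Illustration for 67b73d90dd (not code from the commit): a minimal Rust sketch of the mutex-guarded probe state used while the breaker is unhealthy, allowing up to 10 probe requests per one-second window. `ProbeBudget`, `ProbeState` and `try_probe` are hypothetical names, and passing `now` explicitly is a choice made here to keep the sketch testable.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Probe limits taken from the commit message: 10 probes per second.
const PROBES_PER_WINDOW: u32 = 10;
const PROBE_WINDOW: Duration = Duration::from_secs(1);

struct ProbeState {
    /// The instant at which the current probe window started.
    probe_window_started_at: Instant,
    /// Number of probes handed out in the current window.
    probes_started: u32,
}

/// Hypothetical probe rate limiter, only consulted when unhealthy.
pub struct ProbeBudget(Mutex<ProbeState>);

impl ProbeBudget {
    pub fn new(now: Instant) -> Self {
        Self(Mutex::new(ProbeState {
            probe_window_started_at: now,
            probes_started: 0,
        }))
    }

    /// Returns true if the caller may send a probe request at `now`.
    pub fn try_probe(&self, now: Instant) -> bool {
        let mut state = self.0.lock().unwrap();
        let elapsed = now.saturating_duration_since(state.probe_window_started_at);
        if elapsed >= PROBE_WINDOW {
            // Start a fresh window with a full probe allowance.
            state.probe_window_started_at = now;
            state.probes_started = 0;
        }
        if state.probes_started < PROBES_PER_WINDOW {
            state.probes_started += 1;
            true
        } else {
            false
        }
    }
}
```

A mutex is acceptable here because this path is only taken during an outage; the healthy path described in the commit stays lock-free.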
Dom Dwyer 6ef68513d9
fix: gRPC balancer shutdown panic
The gRPC node discovery hack spawns a task that outlives the gRPC
balancer - once the balancer stops, the task should stop too (and not
panic sending on the closed channel).
2023-01-11 16:42:39 +01:00
Dom Dwyer 9ab86fa154
fix(router2): drive ingester node (re)-discovery
The tonic / tower load-balance implementation discards failed nodes,
even when using a static list - this causes nodes that fail once to
never be retried.

This doesn't happen for the last node for some reason, and leads to all
the load from one router hitting a single ingester instead of load
balancing across all ingesters.

This commit adds a hack to constantly tell the load balancer to probe
all nodes, hopefully causing them to re-discover previously failed
nodes. I don't have the time to do this properly :(
2023-01-05 14:06:29 +01:00
Dom Dwyer a5a26f5efb
fix(router2): lazily connect to ingesters
Allow the routers to start up without requiring full availability of all
downstream ingesters. Previously a single unavailable ingester prevented
the routers from starting up.

This has downsides:

  * Lazily initialising a connection will cause the first writes to have
    higher latency as the connection is established.
  * The routers MAY come up in a state that will never work (i.e. bad
    ingester addresses)
  * Using the opaque gRPC load balancing mechanism restricts the
    visibility into which nodes are up/down (hindering useful log
    messages) and prevents us from implementing more advanced circuit
    breaking / probing logic / load-balancing strategies.

This change is a quick fix - it leaves the round-robin handler in place,
load-balancing over a single tonic Channel, which internally
load-balances. This will need cleaning up.
2023-01-05 11:25:35 +01:00
dependabot[bot] 8478d41bcb
chore(deps): Bump paste from 1.0.10 to 1.0.11 (#6430)
Bumps [paste](https://github.com/dtolnay/paste) from 1.0.10 to 1.0.11.
- [Release notes](https://github.com/dtolnay/paste/releases)
- [Commits](https://github.com/dtolnay/paste/compare/1.0.10...1.0.11)

---
updated-dependencies:
- dependency-name: paste
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-19 10:31:05 +00:00
dependabot[bot] 7f2aa8b10c
chore(deps): Bump serde_json from 1.0.89 to 1.0.91
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.89 to 1.0.91.
- [Release notes](https://github.com/serde-rs/json/releases)
- [Commits](https://github.com/serde-rs/json/compare/v1.0.89...v1.0.91)

---
updated-dependencies:
- dependency-name: serde_json
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-12-19 01:44:18 +00:00
dependabot[bot] e108a8b6c9
chore(deps): Bump paste from 1.0.9 to 1.0.10 (#6384)
Bumps [paste](https://github.com/dtolnay/paste) from 1.0.9 to 1.0.10.
- [Release notes](https://github.com/dtolnay/paste/releases)
- [Commits](https://github.com/dtolnay/paste/compare/1.0.9...1.0.10)

---
updated-dependencies:
- dependency-name: paste
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-13 06:03:05 +00:00
Luke Bond 551bb0ef6a
feat: allow enabling/disabling ns autocreation in router (#6346)
* feat: allow enabling/disabling ns autocreation in router

* fix: missed an import for something behind router2 compile flag
2022-12-07 16:12:00 +00:00
dependabot[bot] 1d38d400f0
chore(deps): Bump object_store from 0.5.1 to 0.5.2 (#6339)
* chore(deps): Bump object_store from 0.5.1 to 0.5.2

Bumps [object_store](https://github.com/apache/arrow-rs) from 0.5.1 to 0.5.2.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/object_store_0.5.1...object_store_0.5.2)

---
updated-dependencies:
- dependency-name: object_store
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-06 07:53:54 +00:00
Carol (Nichols || Goulding) a51848b361
fix: Use client_util GrpcConnection instead of tonic Channel (#6320)
* fix: Use client_util GrpcConnection instead of tonic Channel

* refactor: include server addr in error

Co-authored-by: Dom <dom@itsallbroken.com>
2022-12-02 15:57:42 +00:00
Carol (Nichols || Goulding) c008219692
feat: Add a feature flag to switch to the router RPC write path (#6247)
* feat: Add a feature flag to switch to the router RPC write path

Fixes #6242.

* refactor: Remove a weird arc clone/rename that's not needed

I'm sure this was needed at some point, but it doesn't make much sense.
I wasn't going to change this, but I'm now trying to minimize the
differences between this function and the write path init function, so
make this one better too.

* fix: Add the namespace autocreation to the RPC write path too

The topic/query pool don't really apply to this case, but use them
anyway to be able to use the existing catalog methods.

Also add a bunch of comments pointing out where the RPC write path
initializer and the old router's initializer are the same and where
they're different, so that perhaps it'll be easier to keep them in sync
while they both exist.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-12-01 11:05:39 +00:00
Luke Bond d07658282c
feat: add router config parameter for retention (#6278)
* chore: remove unused/moved ns_autocreation dml handler

* feat(router): expose new ns retention as config

* fix: forgot to set default value for router retention arg

* chore: make new namespace retention param an option
2022-11-30 13:14:39 +00:00
dependabot[bot] caa595a6fc
chore(deps): Bump serde_json from 1.0.88 to 1.0.89 (#6203)
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.88 to 1.0.89.
- [Release notes](https://github.com/serde-rs/json/releases)
- [Commits](https://github.com/serde-rs/json/compare/v1.0.88...v1.0.89)

---
updated-dependencies:
- dependency-name: serde_json
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-22 09:28:31 +00:00
dependabot[bot] 04c00bbb62
chore(deps): Bump bytes from 1.2.1 to 1.3.0 (#6199)
Bumps [bytes](https://github.com/tokio-rs/bytes) from 1.2.1 to 1.3.0.
- [Release notes](https://github.com/tokio-rs/bytes/releases)
- [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tokio-rs/bytes/commits)

---
updated-dependencies:
- dependency-name: bytes
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-22 08:23:24 +00:00
dependabot[bot] 52c50c16e1
chore(deps): Bump serde_json from 1.0.87 to 1.0.88
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.87 to 1.0.88.
- [Release notes](https://github.com/serde-rs/json/releases)
- [Commits](https://github.com/serde-rs/json/compare/v1.0.87...v1.0.88)

---
updated-dependencies:
- dependency-name: serde_json
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-11-21 01:52:18 +00:00
Dom Dwyer af78f0d5db
refactor: remove names from DML init
Fixes conflicts introduced by #6170.
2022-11-18 17:16:33 +01:00
Dom Dwyer 72939f8bf0
feat(router): handler for direct write to ingester
This commit adds the (unused) RpcWrite implementation of the DmlHandler
trait that implements pushing a write over gRPC to a single, arbitrary
ingester. Requests are round-robin'ed across all available ingesters.

This DmlHandler implementation can be swapped out with the
ShardedWriteBuffer to change how writes are propagated to the ingester.
2022-11-18 17:08:20 +01:00
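Illustration for 72939f8bf0 (not code from the commit): a minimal Rust sketch of round-robin selection across a fixed set of ingester endpoints. `RoundRobin` and its methods are hypothetical names, and the endpoint type is left generic because the real client type is not shown in the log.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin selector over a fixed, non-empty list.
pub struct RoundRobin<T> {
    endpoints: Vec<T>,
    cursor: AtomicUsize,
}

impl<T> RoundRobin<T> {
    pub fn new(endpoints: Vec<T>) -> Self {
        assert!(!endpoints.is_empty(), "at least one endpoint is required");
        Self {
            endpoints,
            cursor: AtomicUsize::new(0),
        }
    }

    /// Return the next endpoint. `fetch_add` wraps on overflow, so the
    /// counter never panics; the modulo keeps the index in range.
    pub fn next(&self) -> &T {
        let i = self.cursor.fetch_add(1, Ordering::Relaxed);
        &self.endpoints[i % self.endpoints.len()]
    }
}
```

A single atomic counter lets concurrent writers pick endpoints without taking a lock; `Ordering::Relaxed` suffices because only the counter value itself matters.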
Carol (Nichols || Goulding) 02c3083192
fix: Remove table names from Dml operations 2022-11-18 10:40:38 -05:00
Carol (Nichols || Goulding) a225b81e59
docs: Clarify and make consistent schema validation type comments 2022-11-18 10:39:27 -05:00
Nga Tran 49a9565240
feat: gRPC that creates namespace (#6103)
* feat: create namespace API call in router

Co-authored-by: Nga Tran <nga-tran@live.com>

* chore: treat retention as ns except in CLI

* fix: overflow in nanosecond calc

* fix: retention test after changing it from hours to ns

* chore: comment clarification in cli; better response type for error in ns API

* fix: correct some rebase mistakes

* chore: merge namespace create & create_with_retention; renamed ns create test helper fn & const

* fix: ns autocreation test was wrong after rebase

* fix: mem catalog has default 1hr retention, accidentally removed in rebase

* chore: remove mem catalogs default 1hr retention; make it settable in sets & router

Co-authored-by: Luke Bond <luke.n.bond@gmail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-18 13:02:12 +00:00
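Illustration for the "overflow in nanosecond calc" fix in 49a9565240 (not code from the commit): a minimal Rust sketch of converting a retention period in hours to nanoseconds without overflowing. The log does not show the original calculation, so the function name, the `u32` input type and the use of `checked_mul` are all assumptions.

```rust
/// Nanoseconds in one hour.
const NANOS_PER_HOUR: i64 = 3_600 * 1_000_000_000;

/// Convert a retention period in hours to nanoseconds, returning
/// `None` instead of overflowing when the result does not fit in i64.
pub fn retention_hours_to_ns(hours: u32) -> Option<i64> {
    i64::from(hours).checked_mul(NANOS_PER_HOUR)
}
```

Returning `Option` forces the caller to decide how to report an out-of-range retention value rather than wrapping silently or panicking.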
Nga Tran 6f7b1e2e26
feat: reject writes that are outside the retention period (#6148)
* feat: reject writes that are outside the retention period

* feat: add retention validator into handler stack

* chore: Apply suggestions from code review

Co-authored-by: Dom <dom@itsallbroken.com>

* refactor: address review comments

* test: unit tests for retention validation

* chore: address review comments

* test: more unit tests and integration tests

* refactor: make time inside retention period for emphemeral_mode test

* fix: 2 hours

Co-authored-by: Dom <dom@itsallbroken.com>
2022-11-17 20:55:58 +00:00
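Illustration for 6f7b1e2e26 (not code from the commit): a minimal Rust sketch of rejecting a write that contains a timestamp older than the namespace retention period. `validate_retention` and `OutsideRetention` are hypothetical names, and treating `None` as "no retention limit" is an assumption.

```rust
/// Hypothetical error describing the first offending timestamp.
#[derive(Debug, PartialEq, Eq)]
pub struct OutsideRetention {
    pub timestamp_ns: i64,
    pub min_allowed_ns: i64,
}

/// Accept the write only if every timestamp is at or after
/// `now_ns - retention_period_ns`. A period of `None` disables the check.
pub fn validate_retention(
    now_ns: i64,
    retention_period_ns: Option<i64>,
    timestamps_ns: &[i64],
) -> Result<(), OutsideRetention> {
    let period = match retention_period_ns {
        Some(p) => p,
        None => return Ok(()),
    };
    // Saturate so an extreme period cannot overflow the subtraction.
    let min_allowed_ns = now_ns.saturating_sub(period);
    match timestamps_ns.iter().copied().find(|&t| t < min_allowed_ns) {
        Some(timestamp_ns) => Err(OutsideRetention {
            timestamp_ns,
            min_allowed_ns,
        }),
        None => Ok(()),
    }
}
```

Placing a check like this in the handler stack rejects stale data before it reaches the ingester.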
Dom cd33f25d8a
Merge branch 'main' into dom/correct-comment 2022-11-16 15:42:47 +00:00
Luke Bond 9365d933f1
chore: router namespace api (#6151)
* chore: move ns api from querier to router

* chore: add explanatory comment in querier about moved namespace API

* fix: add namespace service to router

* fix: querier returns unimplemented error for ns retention, not panic

* chore: reuse namespace -> proto in router ns api

* chore: grpc namespace - consume ns to avoid clone

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-16 15:25:49 +00:00
Dom Dwyer 8c38911e8c
docs: remove redundant comment
This comment remains by mistake - table_ids is now used.
2022-11-16 14:40:53 +01:00
Carol (Nichols || Goulding) 3943faf998
fix: Remove namespace from DmlWrite and DmlDelete constructors 2022-11-14 16:46:04 -05:00
Carol (Nichols || Goulding) f78195f7c7
fix: Remove namespace name field from DmlWrite and DmlDelete
But leave the argument in their constructors for now.

Not all numbers in tests can be 42, Dom.
2022-11-14 16:46:04 -05:00
dependabot[bot] a969754819
chore(deps): Bump chrono from 0.4.22 to 0.4.23 (#6129)
* chore(deps): Bump chrono from 0.4.22 to 0.4.23

Bumps [chrono](https://github.com/chronotope/chrono) from 0.4.22 to 0.4.23.
- [Release notes](https://github.com/chronotope/chrono/releases)
- [Changelog](https://github.com/chronotope/chrono/blob/main/CHANGELOG.md)
- [Commits](https://github.com/chronotope/chrono/compare/v0.4.22...v0.4.23)

---
updated-dependencies:
- dependency-name: chrono
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* refactor: chrono future compat

Integer->timestamp conversions should not silently panic.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-11-14 13:34:09 +00:00
kodiakhq[bot] 05d7d1495e
Merge branch 'main' into dependabot/cargo/hashbrown-0.13.1 2022-11-11 21:26:40 +00:00
Carol (Nichols || Goulding) d965004e52
fix: Rename DmlError::DatabaseNotFound to NamespaceNotFound 2022-11-11 15:46:05 -05:00
Carol (Nichols || Goulding) bdff4e8848
fix: Consistently use 'namespace' instead of 'database' in comments and other internal text 2022-11-11 15:46:04 -05:00