Ensure a "probe" node is always returned as the first candidate, driving
it to recovery faster.
This also includes a fix for the balancer metrics that would report
probe candidate nodes as healthy nodes.
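A minimal sketch of the probe-first ordering idea (`Upstream`, `needs_probe()` and `candidates()` are hypothetical names, not the actual balancer types):

```rust
/// Hypothetical stand-in for a balancer candidate.
struct Upstream {
    healthy: bool,
    probe_due: bool,
}

impl Upstream {
    fn needs_probe(&self) -> bool {
        !self.healthy && self.probe_due
    }
}

/// Order candidates so anything needing a probe is yielded first, giving an
/// unhealthy node the earliest possible chance to recover.
fn candidates(mut upstreams: Vec<Upstream>) -> Vec<Upstream> {
    // sort_by_key() is stable and `false` sorts before `true`, so probe
    // candidates (needs_probe() == true, key == false) come first.
    upstreams.sort_by_key(|u| !u.needs_probe());
    upstreams
}
```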
Similar to https://github.com/influxdata/influxdb_iox/pull/6509, this
forces a constant re-querying of the DNS address of an ingester to drive
rediscovery.
Unlike the above PR, this one only reconnects when errors are observed. This
still isn't ideal - something is wrong with the discovery itself and this
just papers over it.
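As an illustration only, an error-triggered re-resolution could look like the sketch below, using tokio's `lookup_host`; the function name is hypothetical and how the refreshed addresses feed back into the balancer is elided:

```rust
use std::net::SocketAddr;
use tokio::net::lookup_host;

/// Re-resolve the ingester's DNS name after an error is observed; the
/// refreshed addresses are then used to rebuild the connection.
async fn reresolve(host_port: &str) -> std::io::Result<Vec<SocketAddr>> {
    // A fresh lookup avoids a stale view of the upstream's address, e.g.
    // after the pod behind the DNS name has been replaced.
    Ok(lookup_host(host_port).await?.collect())
}
```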
Adds a metric with a per-ingester label recording the current health
state of the upstream ingester from the perspective of the router
instance.
Also logs periodically when one or more ingesters are offline.
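A rough illustration of the periodic offline log, using plain `std`/`tokio` primitives and `tracing`; the real code records per-ingester health through the metric registry, and all names here are hypothetical:

```rust
use std::{
    collections::HashMap,
    sync::{
        atomic::{AtomicBool, Ordering},
        Arc,
    },
    time::Duration,
};

/// Hypothetical per-ingester health map; the real code also exposes this as
/// a labelled health metric rather than only logging it.
type HealthMap = Arc<HashMap<String, AtomicBool>>;

/// Periodically log any ingesters currently considered offline.
async fn log_offline_ingesters(health: HealthMap) {
    let mut ticker = tokio::time::interval(Duration::from_secs(60));
    loop {
        ticker.tick().await;
        let offline: Vec<&str> = health
            .iter()
            .filter(|(_, up)| !up.load(Ordering::Relaxed))
            .map(|(addr, _)| addr.as_str())
            .collect();
        if !offline.is_empty() {
            tracing::warn!(?offline, "ingesters currently offline");
        }
    }
}
```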
Lazily establish connections in the background, instead of using tonic's
connect_lazy().
connect_lazy() causes error handling to take a different path in tonic
compared to "normal" eagerly established connections, which stops
reconnections from being attempted when the endpoint goes away (likely a bug).
It also means the first few write requests won't have to wait while the
connection is dialed, which brings down the P99 as a nice side-effect.
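A sketch of establishing the connection in a background task instead of using connect_lazy(); the retry policy and the watch-channel hand-off are illustrative only, not the actual implementation:

```rust
use std::time::Duration;
use tokio::sync::watch;
use tonic::transport::{Channel, Endpoint};

/// Dial the endpoint in a background task, publishing the Channel once it is
/// up. Callers wait on the watch receiver instead of paying the dial cost on
/// their first write, and failures go through the normal (non-lazy) path.
fn connect_in_background(endpoint: Endpoint) -> watch::Receiver<Option<Channel>> {
    let (tx, rx) = watch::channel(None);
    tokio::spawn(async move {
        loop {
            match endpoint.connect().await {
                Ok(channel) => {
                    let _ = tx.send(Some(channel));
                    return;
                }
                Err(e) => {
                    // Connection refused, DNS failure, etc: retry shortly.
                    tracing::warn!(error = %e, "failed to connect, retrying");
                    tokio::time::sleep(Duration::from_secs(1)).await;
                }
            }
        }
    });
    rx
}
```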
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.
This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
last_probe was "the instant at which the last set of probes started
being sent" in my head, but Carol saw it as "first_probe - the time at
which probes started being sent".
Hopefully probe_window_started_at is less ambiguous.
Implements a "circuit breaker", a construct that tracks the errors &
successes of requests to a remote node, and uses this information to allow
or deny further requests.
This circuit breaker stops sending requests to the remote when the error
count exceeds 80% of requests in a 5 second window. Once this happens,
up to 10 "probe" requests per second are allowed, and when they succeed,
normal operation resumes (though concurrent requests may still be
completing during the probe regime and are counted towards the probe
results).
In the happy path, this circuit breaker is very cheap (lock free; wait-free,
population-oblivious) to evaluate and record request results in, minimising the throughput
penalty. Once the breaker enters an unhealthy state (hopefully a rare
occurrence) it uses a mutex to manage the probe state (with a higher
overhead) for simplicity; it's definitely possible to optimise this away
if high latencies are observed during upstream outages when the circuit
breaker is open/unhealthy.
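A heavily simplified sketch of that policy - the rolling 5 second window, the reset on recovery, and the in-flight request accounting are elided, and while the thresholds mirror the description above, the names and code are illustrative only:

```rust
use std::{
    sync::{
        atomic::{AtomicU64, Ordering},
        Mutex,
    },
    time::{Duration, Instant},
};

/// Open the breaker when more than 80% of observed requests have failed
/// (the real implementation evaluates this over a 5 second window).
const ERROR_RATIO: f64 = 0.8;
/// While open, allow up to this many probe requests per second.
const PROBES_PER_SECOND: u64 = 10;

struct CircuitBreaker {
    ok: AtomicU64,
    err: AtomicU64,
    // Only touched while the breaker is open, so a mutex is acceptable;
    // the happy path never takes this lock.
    probe: Mutex<ProbeWindow>,
}

struct ProbeWindow {
    window_started_at: Instant,
    probes_sent: u64,
}

impl CircuitBreaker {
    /// Record the outcome of a request (cheap, lock free).
    fn observe(&self, ok: bool) {
        let counter = if ok { &self.ok } else { &self.err };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    fn is_open(&self) -> bool {
        let ok = self.ok.load(Ordering::Relaxed) as f64;
        let err = self.err.load(Ordering::Relaxed) as f64;
        err > 0.0 && err / (ok + err) > ERROR_RATIO
    }

    /// Returns true if this request may proceed: always when healthy, or
    /// within the probe budget of the current one-second window when open.
    fn should_allow(&self) -> bool {
        if !self.is_open() {
            return true;
        }
        let mut probe = self.probe.lock().unwrap();
        if probe.window_started_at.elapsed() >= Duration::from_secs(1) {
            probe.window_started_at = Instant::now();
            probe.probes_sent = 0;
        }
        if probe.probes_sent < PROBES_PER_SECOND {
            probe.probes_sent += 1;
            true
        } else {
            false
        }
    }
}
```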
The gRPC node discovery hack spawns a task that outlives the gRPC
balancer - once the balancer stops, the task should stop too (and not
panic sending on the closed channel).
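The fix amounts to treating a failed send as the shutdown signal; a small sketch of the pattern with hypothetical names and types:

```rust
use std::time::Duration;
use tokio::sync::mpsc;

/// Hypothetical discovery task: exits cleanly once the balancer drops the
/// receiving half of the channel, instead of panicking on a failed send.
async fn discovery_task(tx: mpsc::Sender<Vec<String>>, endpoints: Vec<String>) {
    let mut ticker = tokio::time::interval(Duration::from_secs(5));
    loop {
        ticker.tick().await;
        if tx.send(endpoints.clone()).await.is_err() {
            // The balancer is gone; the task has outlived its purpose.
            return;
        }
    }
}
```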
The tonic / tower load-balance implementation discards failed nodes,
even when using a static list - this causes nodes that fail once to
never be retried.
This doesn't happen for the last node for some reason, and leads to all
the load from one router hitting a single ingester instead of load
balancing across all ingesters.
This commit adds a hack to constantly tell the load balancer to probe all
nodes, hopefully causing it to re-discover previously failed nodes. I don't
have the time to do this properly :(
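What such a hack might look like with tonic's `balance_channel`: a sketch of the idea only, not the actual change, and it assumes that re-inserting an endpoint under an existing key refreshes it in the balancer:

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint};
use tower::discover::Change;

/// Build a load-balanced Channel and spawn a task that periodically
/// re-inserts every known endpoint, nudging the balancer into re-dialling
/// nodes it previously discarded.
fn balanced_channel(addrs: Vec<String>) -> Channel {
    let (channel, tx) = Channel::balance_channel(addrs.len().max(1));

    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_secs(5));
        loop {
            ticker.tick().await;
            for addr in &addrs {
                let Ok(endpoint) = Endpoint::from_shared(addr.clone()) else {
                    continue; // skip malformed addresses
                };
                if tx.send(Change::Insert(addr.clone(), endpoint)).await.is_err() {
                    return; // the balancer has been dropped
                }
            }
        }
    });

    channel
}
```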
Allow the routers to start up without requiring full availability of all
downstream ingesters. Previously a single unavailable ingester prevented
the routers from starting up.
This has downsides:
* Lazily initialising a connection will cause the first writes to have
higher latency as the connection is established.
* The routers MAY come up in a state that will never work (e.g. bad
ingester addresses)
* Using the opaque gRPC load balancing mechanism restricts the
visibility into which nodes are up/down (hindering useful log
messages) and prevents us from implementing more advanced circuit
breaking / probing logic / load-balancing strategies.
This change is a quick fix - it leaves the round-robin handler in place,
load-balancing over a single tonic Channel, which internally
load-balances. This will need cleaning up.
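As a rough sketch of that quick fix (names are placeholders), a single tonic Channel balanced over all ingester endpoints might be built like this:

```rust
use tonic::transport::{Channel, Endpoint};

/// Build a single Channel that internally load-balances across all ingester
/// addresses; the round-robin DML handler then wraps this one Channel.
fn ingester_channel(addrs: &[String]) -> Result<Channel, tonic::transport::Error> {
    let endpoints = addrs
        .iter()
        .map(|a| Endpoint::from_shared(a.clone()))
        .collect::<Result<Vec<_>, _>>()?;
    Ok(Channel::balance_list(endpoints.into_iter()))
}
```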
* feat: Add a feature flag to switch to the router RPC write path
Fixes #6242.
* refactor: Remove a weird arc clone/rename that's not needed
I'm sure this was needed at some point, but it doesn't make much sense.
I wasn't going to change this, but I'm now trying to minimize the
differences between this function and the write path init function, so
make this one better too.
* fix: Add the namespace autocreation to the RPC write path too
The topic/query pool don't really apply to this case, but use them anyway to
be able to reuse the existing catalog methods.
Also add a bunch of comments pointing out where the RPC write path
initializer and the old router's initializer are the same and where
they're different, so that perhaps it'll be easier to keep them in sync
while they both exist.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: remove unused/moved ns_autocreation dml handler
* feat(router): expose new ns retention as config
* fix: forgot to set default value for router retention arg
* chore: make new namespace retention param an option
This commit adds the (currently unused) RpcWrite implementation of the
DmlHandler trait, which pushes a write over gRPC to a single, arbitrary
ingester. Requests are round-robined across all available ingesters.
This DmlHandler implementation can be swapped out with the
ShardedWriteBuffer to change how writes are propagated to the ingester.
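A sketch of the round-robin selection only; the client type and names are placeholders, while the actual RpcWrite handler wraps gRPC write service clients:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Round-robins requests across a fixed set of upstream write clients.
struct RoundRobin<C> {
    clients: Vec<C>,
    next: AtomicUsize,
}

impl<C> RoundRobin<C> {
    fn pick(&self) -> &C {
        // Relaxed is sufficient: only a roughly even spread is needed, not
        // a strict ordering between concurrent callers.
        let i = self.next.fetch_add(1, Ordering::Relaxed);
        &self.clients[i % self.clients.len()]
    }
}
```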
* feat: create namespace API call in router
Co-authored-by: Nga Tran <nga-tran@live.com>
* chore: treat retention as ns except in CLI
* fix: overflow in nanosecond calc
* fix: retention test after changing it from hours to ns
* chore: comment clarification in cli; better response type for error in ns API
* fix: correct some rebase mistakes
* chore: merge namespace create & create_with_retention; renamed ns create test helper fn & const
* fix: ns autocreation test was wrong after rebase
* fix: mem catalog has default 1hr retention, accidentally removed in rebase
* chore: remove mem catalog's default 1hr retention; make it settable in sets & router
Co-authored-by: Luke Bond <luke.n.bond@gmail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: reject writes that are outside the retention period
* feat: add retention validator into handler stack
* chore: Apply suggestions from code review
Co-authored-by: Dom <dom@itsallbroken.com>
* refactor: address review comments
* test: unit tests for retention validation
* chore: address review comments
* test: more unit tests and integration tests
* refactor: make time inside retention period for emphemeral_mode test
* fix: 2 hours
Co-authored-by: Dom <dom@itsallbroken.com>
* chore: move ns api from querier to router
* chore: add explanatory comment in querier about moved namespace API
* fix: add namespace service to router
* fix: querier returns unimplemented error for ns retention, not panic
* chore: reuse namespace -> proto in router ns api
* chore: grpc namespace - consume ns to avoid clone
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>