Use the namespace schema cache in the router to enforce the
per-namespace table limit (service protection limit), adding O(1)
overhead to the existing column limit evaluation logic.
Prior to this commit, each request that would breach the table limit
was (potentially partially) applied to the catalog before an error was
returned. Every subsequent request creating a new table continued to
trigger a catalog query, unnecessarily adding load proportional to the
request count.
After this commit, catalog requests are sent only when the router
instance can determine (to the best of its ability, see below) that the
request will not cause the namespace to exceed the table limit.
Because this uses cached schemas, the actual set of tables may have
changed - this will cause inconsistent enforcement and spurious
errors in the same way it currently does for the column limit. For more
details (and to track a resolution) see:
https://github.com/influxdata/influxdb_iox/issues/5957
The maximum number of tables is part of the Namespace, which is already
loaded in its entirety. This commit copies the value into the
NamespaceSchema, making it available for the router to utilise.
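As a rough sketch of the check (type and field names here are
illustrative, not the actual influxdb_iox definitions), the router can
reject a write against its cached view before touching the catalog:

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the cached namespace schema; `max_tables` is the
/// value copied across from the catalog's Namespace when the schema loads.
struct NamespaceSchema {
    max_tables: usize,
    tables: BTreeMap<String, ()>,
}

/// Reject the request before any catalog call if the cached view already
/// shows the new tables would breach the per-namespace limit.
fn check_table_limit(schema: &NamespaceSchema, new_tables: &[&str]) -> Result<(), String> {
    // Only tables not already present in the cached schema count as new.
    let net_new = new_tables
        .iter()
        .filter(|&t| !schema.tables.contains_key(t))
        .count();

    if schema.tables.len() + net_new > schema.max_tables {
        return Err(format!(
            "table limit exceeded: {} existing + {} new > {} allowed",
            schema.tables.len(),
            net_new,
            schema.max_tables
        ));
    }
    Ok(())
}
```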
Envoy will connect to an endpoint on demand, and return an
application-level error if it fails with a gRPC status code of
"Unavailable".
It also embeds a metadata entry of {"server": "envoy"} - this commit
uses the two signals (error status code + metadata entry) to drive an
immediate reconnection when observed, assuming the connection is bad.
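A minimal sketch of that check against a tonic Status (the surrounding
reconnect plumbing is elided; the predicate mirrors the two signals
described above):

```rust
use tonic::{Code, Status};

/// Returns true when a response looks like Envoy reporting a failed
/// on-demand connection: an Unavailable status carrying a `server: envoy`
/// metadata entry. The caller treats the connection as bad and redials.
fn should_reconnect(status: &Status) -> bool {
    status.code() == Code::Unavailable
        && status
            .metadata()
            .get("server")
            .and_then(|v| v.to_str().ok())
            == Some("envoy")
}
```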
Prior to this commit, namespaces that had been created on one router
could not be used on another router until the latter was restarted.
Effectively, newly created namespaces couldn't be used.
After this commit, the catalog is also checked when a cache miss occurs,
ensuring the router discovers new, not-yet-cached namespaces.
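Sketched as a get-or-load path (the names and the catalog trait below
are assumptions for illustration, not the real router types):

```rust
use std::{collections::HashMap, sync::Arc};

struct NamespaceSchema; // placeholder for the cached schema type

/// A hypothetical, minimal catalog lookup interface.
trait Catalog {
    fn get_namespace(&self, name: &str) -> Option<NamespaceSchema>;
}

struct NamespaceCache {
    schemas: HashMap<String, Arc<NamespaceSchema>>,
}

impl NamespaceCache {
    /// Fast path returns the cached schema; on a miss the catalog is queried
    /// so namespaces created via another router become usable immediately.
    fn get_or_load(&mut self, name: &str, catalog: &dyn Catalog) -> Option<Arc<NamespaceSchema>> {
        if let Some(schema) = self.schemas.get(name) {
            return Some(Arc::clone(schema));
        }
        let schema = Arc::new(catalog.get_namespace(name)?);
        self.schemas.insert(name.to_string(), Arc::clone(&schema));
        Some(schema)
    }
}
```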
Service limits are enforced on two values:
* Number of tables in a namespace
* Number of columns in a table
This commit labels the existing service limit hit metric with the type
of limit reached, and adds this information to the log lines emitted.
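For illustration only (the real code uses the existing metric and
logging machinery), the limit kind is carried as a label/field rather
than as separate metrics:

```rust
/// The two limit kinds used as the metric label and log field.
#[derive(Debug, Clone, Copy)]
enum ServiceLimit {
    Table,
    Column,
}

/// Hypothetical recording helper: one counter, labelled by limit kind.
fn record_limit_hit(namespace: &str, limit: ServiceLimit) {
    // e.g. increments the "service limit hit" counter with limit="table"
    // or limit="column", and logs the same context.
    eprintln!("service limit hit; namespace={namespace} limit={limit:?}");
}
```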
Ensure a "probe" node is always returned as the first candidate, driving
it to recovery faster.
This also includes a fix for the balancer metrics that would report
probe candidate nodes as healthy nodes.
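A sketch of the ordering (types are illustrative): probe candidates are
yielded before healthy nodes, and are no longer counted as healthy when
reporting metrics.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Health {
    Healthy,
    Probing,
    Unhealthy,
}

/// Order candidates so any node being probed is tried first, letting it
/// prove recovery (and return to the healthy pool) as quickly as possible.
fn order_candidates(nodes: &mut [(String, Health)]) {
    nodes.sort_by_key(|(_, h)| match h {
        Health::Probing => 0,
        Health::Healthy => 1,
        Health::Unhealthy => 2,
    });
}

/// Only genuinely healthy nodes count towards the "healthy" metric; probe
/// candidates were previously (incorrectly) included here.
fn healthy_count(nodes: &[(String, Health)]) -> usize {
    nodes.iter().filter(|(_, h)| *h == Health::Healthy).count()
}
```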
Similar to https://github.com/influxdata/influxdb_iox/pull/6509, this
forces a constant re-querying of the DNS address of an ingester to drive
rediscovery.
Unlike the above PR, this only reconnects when errors are observed.
This still isn't ideal - something is wrong with the discovery itself,
and this just papers over it.
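The relevant property is that a fresh DNS lookup happens on each
(re)connection attempt rather than reusing the address resolved at
startup; a sketch using the standard library resolver, names assumed:

```rust
use std::net::{SocketAddr, ToSocketAddrs};

/// Re-resolve the ingester hostname; called when an error is observed so a
/// replacement pod behind the same DNS name is picked up on reconnect.
fn resolve_ingester(host: &str, port: u16) -> std::io::Result<Vec<SocketAddr>> {
    (host, port).to_socket_addrs().map(|addrs| addrs.collect())
}
```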
Adds a metric with a per-ingester label recording the current health
state of the upstream ingester from the perspective of the router
instance.
Also logs periodically when one or more ingesters are offline.
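A rough shape of the reporting (the real code uses the project's metric
registry and a background task; everything named here is illustrative):

```rust
use std::collections::HashMap;

/// Per-ingester health as seen by this router instance; exported as a
/// labelled gauge and summarised in a periodic warning when anything is down.
fn report_health(health: &HashMap<String, bool>) {
    let offline: Vec<&str> = health
        .iter()
        .filter(|(_, healthy)| !**healthy)
        .map(|(addr, _)| addr.as_str())
        .collect();

    if !offline.is_empty() {
        eprintln!("upstream ingesters offline: {offline:?}");
    }
}
```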
Establish connections in the background, instead of using tonic's
connect_lazy().
connect_lazy() causes error handling to take a different path in tonic
compared to "normal" connections, and this stops reconnections from
being performed when the endpoint goes away (likely a bug).
It also means the first few write requests won't have to wait while the
connection is dialed, which brings down the P99 as a nice side-effect.
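Roughly (error handling and retry/backoff elided; assumes the tonic and
tokio crates already in use), the dial happens in a spawned task rather
than lazily on the first request:

```rust
use tonic::transport::{Channel, Endpoint};

/// Dial the upstream in the background so the first writes don't pay the
/// connection-establishment latency, and so failures flow through the normal
/// (non-connect_lazy) error handling. A sketch; retries/backoff are elided.
fn connect_in_background(addr: &'static str) -> tokio::task::JoinHandle<Option<Channel>> {
    tokio::spawn(async move {
        match Endpoint::from_static(addr).connect().await {
            Ok(channel) => Some(channel),
            Err(e) => {
                eprintln!("failed to connect to {addr}: {e}");
                None
            }
        }
    })
}
```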
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.
This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
last_probe was "the instant at which the last set of probes started
being sent" in my head, but Carol saw it as "first_probe - the time at
which probes started being sent".
Hopefully probe_window_started_at is less ambiguous.
Implements a "circuit breaker", a construct that tracks the error &
success of requests to a remote node, and uses this information to allow
or deny further requests.
This circuit breaker stops sending requests to the remote when the error
count exceeds 80% of requests in a 5 second window. Once this happens,
up to 10 "probe" requests per second are allowed, and when they succeed,
normal operation resumes (though concurrent requests may still be
completing during the probe regime and are counted towards the probe
results).
In the happy path, this circuit breaker is very cheap (lock free; WFPO,
i.e. wait-free population-oblivious) to evaluate and record request
results in, minimising the throughput
penalty. Once the breaker enters an unhealthy state (hopefully a rare
occurrence) it uses a mutex to manage the probe state (with a higher
overhead) for simplicity; it's definitely possible to optimise this away
if high latencies are observed during upstream outages when the circuit
breaker is open/unhealthy.
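A simplified model of the decision logic described above (the real
implementation is lock-free on the happy path and more careful about
windowing; this sketch only illustrates the thresholds):

```rust
use std::time::{Duration, Instant};

const WINDOW: Duration = Duration::from_secs(5);
const ERROR_THRESHOLD: f64 = 0.8;
const MAX_PROBES_PER_SECOND: u32 = 10;

struct CircuitBreaker {
    window_started_at: Instant,
    requests: u64,
    errors: u64,
    open: bool, // true once the error threshold has been exceeded
    probe_window_started_at: Instant,
    probes_this_second: u32,
}

impl CircuitBreaker {
    /// Should this request be sent to the upstream?
    fn should_send(&mut self, now: Instant) -> bool {
        if !self.open {
            return true; // healthy: everything passes through
        }
        // Unhealthy: allow up to MAX_PROBES_PER_SECOND probe requests.
        if now.duration_since(self.probe_window_started_at) >= Duration::from_secs(1) {
            self.probe_window_started_at = now;
            self.probes_this_second = 0;
        }
        if self.probes_this_second < MAX_PROBES_PER_SECOND {
            self.probes_this_second += 1;
            true
        } else {
            false
        }
    }

    /// Record a request outcome; in-flight requests completing during the
    /// probe regime are counted towards the probe results too.
    fn record(&mut self, now: Instant, ok: bool) {
        if now.duration_since(self.window_started_at) >= WINDOW {
            self.window_started_at = now;
            self.requests = 0;
            self.errors = 0;
        }
        self.requests += 1;
        if !ok {
            self.errors += 1;
        }

        if self.open {
            if ok {
                self.open = false; // a successful probe restores normal operation
            }
        } else if self.errors as f64 > self.requests as f64 * ERROR_THRESHOLD {
            self.open = true;
        }
    }
}
```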
The gRPC node discovery hack spawns a task that outlives the gRPC
balancer - once the balancer stops, the task should stop too (and not
panic sending on the closed channel).
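The fix amounts to the spawned task treating a closed channel as its
shutdown signal (illustrative types; the real channel carries discovery
events):

```rust
use std::time::Duration;
use tokio::sync::mpsc;

/// Background task feeding the balancer; when the balancer (the receiving
/// half) is dropped, the send fails and the task exits instead of panicking.
async fn discovery_nudger(tx: mpsc::Sender<()>) {
    loop {
        if tx.send(()).await.is_err() {
            break; // balancer is gone; stop quietly
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```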
The tonic / tower load-balance implementation discards failed nodes,
even when using a static list - this causes nodes that fail once to
never be retried.
This doesn't happen for the last node for some reason, and leads to all
the load from one router hitting a single ingester instead of load
balancing across all ingesters.
This commit adds a hack to constantly tell the load balancer to probe
all nodes, hopefully causing them to re-discover previously failed
nodes. I don't have the time to do this properly :(
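Sketched with tower's discovery Change events (the key/service types and
the interval are placeholders): the task periodically re-announces every
endpoint so previously discarded nodes get dialled again.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tower::discover::Change;

struct EndpointStub; // stands in for the real per-ingester service/endpoint

/// Periodically re-insert every known endpoint into the balancer's discovery
/// stream, resurrecting nodes the balancer dropped after a failure.
async fn nag_balancer(keys: Vec<usize>, tx: mpsc::Sender<Change<usize, EndpointStub>>) {
    loop {
        for &key in &keys {
            if tx.send(Change::Insert(key, EndpointStub)).await.is_err() {
                return; // balancer dropped; see the task-lifetime fix above
            }
        }
        tokio::time::sleep(Duration::from_secs(5)).await;
    }
}
```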