Similar to https://github.com/influxdata/influxdb_iox/pull/6509, this
forces a constant re-querying of the DNS address of an ingester to drive
rediscovery.
Unlike the above PR, this reconnects only when errors are observed. This still isn't ideal - something is wrong with the discovery mechanism itself, and this merely papers over it.
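As a rough illustration (not the actual router code), the reconnect-on-error path amounts to re-dialling the configured DNS name so address resolution happens again; the helper and endpoint string below are hypothetical:

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint};

/// Re-dial the ingester by its DNS name, forcing a fresh address
/// resolution at connect time. Called only after an error is observed.
async fn reconnect(addr: &str) -> Result<Channel, tonic::transport::Error> {
    Endpoint::from_shared(addr.to_string())?
        .connect_timeout(Duration::from_secs(5))
        .connect() // resolves DNS now, not at first request
        .await
}
```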
Adds a metric with a per-ingester label recording the current health
state of the upstream ingester from the perspective of the router
instance.
Also logs periodically when one or more ingesters are offline.
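A prometheus-flavoured sketch of the per-ingester health metric (the real code uses the iox metric crate; the metric and label names here are illustrative, not the shipped ones):

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

// One time series per upstream ingester, from this router's perspective.
fn register_health_gauge(registry: &Registry) -> prometheus::Result<IntGaugeVec> {
    let gauge = IntGaugeVec::new(
        Opts::new(
            "ingester_healthy",
            "1 if the upstream ingester is considered healthy, 0 if not",
        ),
        &["ingester"],
    )?;
    registry.register(Box::new(gauge.clone()))?;
    Ok(gauge)
}

// On each health transition:
//   gauge.with_label_values(&["ingester-0:8082"]).set(0); // offline
```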
* refactor: Move `flightsql` code into its own module
* fix: get schema from LogicalPlan
* refactor: use arrow_flight::sql::Any instead of prost_types::any
* fix: cleanup docs and avoid as_ref
* fix: Use Bytes
* fix: use Any::pack
* fix: doclink
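For illustration, the Any::pack change above roughly boils down to the following sketch (field handling is simplified via Default, and the function name is made up):

```rust
use arrow_flight::sql::{Any, CommandStatementQuery};
use prost::Message;

// Wrap a FlightSQL command using arrow_flight::sql::Any::pack instead of
// hand-assembling a prost_types::Any with a type URL.
fn pack_query(query: String) -> Result<bytes::Bytes, Box<dyn std::error::Error>> {
    let cmd = CommandStatementQuery { query, ..Default::default() };
    let any = Any::pack(&cmd)?; // fills in the type URL for us
    Ok(any.encode_to_vec().into()) // ready for use as a ticket payload
}
```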
It seems that prod was hanging last night. This is hard to debug, and in general we should protect the compactor against hanging on malformed partitions that take forever. This mirrors the querier, which already has a timeout for every query. Let's see if this shows anything in prod (and if not, it's still a desired safety net).
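The safety net is essentially a timeout around each per-partition job, along these lines (the function names and the 30 minute budget are stand-ins, not the shipped values):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Bound the time spent compacting a single partition so one malformed
// partition cannot hang the whole compactor.
async fn compact_with_timeout(partition_id: i64) -> Result<(), String> {
    match timeout(Duration::from_secs(30 * 60), compact_partition(partition_id)).await {
        Ok(res) => res,
        Err(_elapsed) => Err(format!("partition {partition_id} timed out")),
    }
}

async fn compact_partition(_id: i64) -> Result<(), String> {
    // ... the actual compaction job ...
    Ok(())
}
```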
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Lazily establish connections in the background, instead of using tonic's
connect_lazy().
connect_lazy() causes error handling to take a different path in tonic
compared to "normal" connections, and this stops reconnections from
being performed when the endpoint goes away (likely a bug).
It also means the first few write requests won't have to wait while the
connection is dialed, which brings down the P99 as a nice side-effect.
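A sketch of the idea, assuming tonic's Endpoint/Channel types; the retry cadence and the watch-channel hand-off are illustrative choices, not the real implementation:

```rust
use std::time::Duration;
use tokio::sync::watch;
use tonic::transport::{Channel, Endpoint};

// Eagerly dial in the background instead of Endpoint::connect_lazy(),
// publishing the channel once the connection is established.
fn connect_in_background(endpoint: Endpoint) -> watch::Receiver<Option<Channel>> {
    let (tx, rx) = watch::channel(None);
    tokio::spawn(async move {
        loop {
            match endpoint.connect().await {
                Ok(channel) => {
                    let _ = tx.send(Some(channel));
                    return;
                }
                // Keep retrying; the endpoint may not be up yet.
                Err(_) => tokio::time::sleep(Duration::from_secs(1)).await,
            }
        }
    });
    rx
}
```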
With the upcoming divide-and-conquer approach, we may have multiple commits per partition, since a partition can be divided into multiple compaction jobs. For metrics (and logs), however, it is important to track the overall process, so we shall also monitor the number of completed partitions.
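Sketched with a prometheus-style counter (the real code uses the iox metric crate, and the metric name is made up): increment once per completed partition rather than once per commit/job.

```rust
use prometheus::{IntCounter, Opts};

// Counts whole partitions completed, independent of how many compaction
// jobs each partition was divided into.
fn completed_partitions_counter() -> prometheus::Result<IntCounter> {
    IntCounter::with_opts(Opts::new(
        "compactor_partitions_completed",
        "Number of partitions whose compaction jobs have all finished",
    ))
}
```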
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.
This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
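The on-path gating looks roughly like the following (a reduced sketch: a single health flag stands in for the real CircuitBreaker, whose state machine is sketched in more detail further below):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the per-upstream health view the router keeps.
struct Health(AtomicBool);

impl Health {
    fn is_healthy(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}

async fn write_if_healthy(h: &Health) -> Result<(), &'static str> {
    if !h.is_healthy() {
        // Skip unhealthy upstreams entirely (probe requests excepted).
        return Err("upstream unhealthy");
    }
    // ... issue the gRPC write and record its outcome ...
    Ok(())
}
```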
Record latency histograms for DmlSink::apply() calls, configuring
ingester2 to report the overall write path latency, and separately the
buffer apply latency.
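In spirit, the instrumentation is a decorator that times the inner call; record_ms below is a stand-in for the histogram recorder:

```rust
use std::future::Future;
use std::time::Instant;

/// Time an inner apply() future and record its duration.
async fn timed_apply<T, E>(
    inner: impl Future<Output = Result<T, E>>,
    record_ms: impl Fn(u128),
) -> Result<T, E> {
    let started = Instant::now();
    let res = inner.await;
    record_ms(started.elapsed().as_millis());
    res
}
```

Wrapping just the buffer's apply future yields the buffer apply latency; wrapping the whole sink chain yields the overall write path latency.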
Adds a metric to track the distribution of duration spent actively persisting a batch of partition data (compacting, generating parquet, uploading, DB entries, etc) and another tracking the duration of time an entry spent in the persist queue.
Together these provide a measurement of the latency of persist requests,
and as they contain event counters, they also provide the throughput and
number of outstanding jobs.
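A sketch of the two measurements (names illustrative): stamp the entry when it is enqueued, then record the queue wait and the active persist duration separately.

```rust
use std::time::Instant;

struct PersistJob {
    enqueued_at: Instant, // stamped when the job enters the queue
}

async fn run_persist(job: PersistJob) {
    // Duration the entry sat in the persist queue.
    let queued_for = job.enqueued_at.elapsed();

    let started = Instant::now();
    // ... compact, generate parquet, upload, write DB entries ...
    let active_for = started.elapsed();

    // Both histograms also count events, giving throughput and a view of
    // outstanding jobs (enqueued minus completed).
    println!("queued={queued_for:?} active={active_for:?}");
}
```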
last_probe was "the instant at which the last set of probes started being sent" in my head, but Carol read it as "first_probe" - the time at which probes first started being sent.
Hopefully probe_window_started_at is less ambiguous.
* refactor: planner as a component
Now everything except for the core algorithm structure is a component.
This also means that the driver no longer needs the whole config
structure.
* docs: explain V1
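The component shape for the planner looks roughly like this (trait and type names are assumptions, not the real ones); the driver then holds a narrow trait object instead of the whole config structure:

```rust
use std::sync::Arc;

struct ParquetFile; // stand-ins for the real catalog types
struct CompactionJob;

/// Plan the files of one partition into executable compaction jobs.
trait Planner: Send + Sync {
    fn plan(&self, files: Vec<ParquetFile>) -> Vec<CompactionJob>;
}

struct Driver {
    planner: Arc<dyn Planner>, // injected component, swappable in tests
}
```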
Implements a "circuit breaker", a construct that tracks the error &
success of requests to a remote node, and uses this information to allow
or deny further requests.
This circuit breaker stops sending requests to the remote when the error
count exceeds 80% of requests in a 5 second window. Once this happens,
up to 10 "probe" requests per second are allowed, and when they succeed,
normal operation resumes (though concurrent requests may still be
completing during the probe regime and are counted towards the probe
results).
In the happy path, this circuit breaker is very cheap (lock-free, and wait-free population-oblivious - WFPO) to evaluate and record request results in, minimising the throughput penalty. Once the breaker enters an unhealthy state (hopefully a rare occurrence) it uses a mutex to manage the probe state (with a higher overhead) for simplicity; it's definitely possible to optimise this away if high latencies are observed during upstream outages while the circuit breaker is open/unhealthy.
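The state machine described above, sketched below. This version takes a mutex on every call for brevity (the real happy path is lock-free); the thresholds match the description: 80% errors over a 5 second window opens the breaker, up to 10 probes per second are allowed while open, and a probe success closes it again.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

struct CircuitBreaker {
    state: Mutex<State>,
}

struct State {
    window_started_at: Instant,
    ok: u64,
    err: u64,
    open: bool,
    probe_window_started_at: Instant,
    probes_sent: u32,
}

impl CircuitBreaker {
    const WINDOW: Duration = Duration::from_secs(5);
    const MAX_PROBES_PER_SEC: u32 = 10;

    fn new() -> Self {
        let now = Instant::now();
        Self {
            state: Mutex::new(State {
                window_started_at: now,
                ok: 0,
                err: 0,
                open: false,
                probe_window_started_at: now,
                probes_sent: 0,
            }),
        }
    }

    /// Should this request be sent to the upstream?
    fn should_send(&self) -> bool {
        let mut s = self.state.lock().unwrap();
        if !s.open {
            return true;
        }
        // Open: allow a bounded number of probe requests per second.
        if s.probe_window_started_at.elapsed() >= Duration::from_secs(1) {
            s.probe_window_started_at = Instant::now();
            s.probes_sent = 0;
        }
        if s.probes_sent < Self::MAX_PROBES_PER_SEC {
            s.probes_sent += 1;
            true
        } else {
            false
        }
    }

    /// Record the outcome of a request (probe or otherwise).
    fn observe(&self, success: bool) {
        let mut s = self.state.lock().unwrap();
        if success { s.ok += 1; } else { s.err += 1; }

        if s.open {
            // Any success while open (probe or straggler) closes the breaker.
            if success {
                s.open = false;
                s.ok = 0;
                s.err = 0;
                s.window_started_at = Instant::now();
            }
            return;
        }

        // Closed: evaluate the error rate once the window has elapsed.
        if s.window_started_at.elapsed() >= Self::WINDOW {
            let total = s.ok + s.err;
            if total > 0 && s.err * 100 >= total * 80 {
                s.open = true;
                s.probe_window_started_at = Instant::now();
                s.probes_sent = 0;
            }
            s.ok = 0;
            s.err = 0;
            s.window_started_at = Instant::now();
        }
    }
}
```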