Commit Graph

10616 Commits (105e3542991aef5f1654fe260f9fa53a32f622f2)

Author SHA1 Message Date
Dom 442e8a8b79
Merge branch 'main' into dom/ingester-rediscovery
2023-01-24 19:13:02 +00:00
Dom Dwyer 411f4bd08b
fix(router): force rediscovery of nodes
Similar to https://github.com/influxdata/influxdb_iox/pull/6509, this
forces a constant re-querying of the DNS address of an ingester to drive
rediscovery.

Unlike the above PR, this only reconnects when there are errors
observed. This still isn't ideal - something is wrong with the discovery
itself - this just papers over it.
2023-01-24 20:11:53 +01:00
kodiakhq[bot] 20ac3608ab
Merge pull request #6687 from influxdata/dom/timeouts
refactor(router): set sensible RPC timeouts
2023-01-24 18:32:09 +00:00
Dom Dwyer 9132343dac
feat(metrics): export RPC upstream health state
Adds a metric with a per-ingester label recording the current health
state of the upstream ingester from the perspective of the router
instance.

Also logs periodically when one or more ingesters are offline.
2023-01-24 19:27:15 +01:00
kodiakhq[bot] 77b6b234d5
Merge branch 'main' into dom/timeouts
2023-01-24 18:24:42 +00:00
Andrew Lamb c3bc61f10e
refactor: Move `flightsql` code into its own module, add docs and tests (#6640)
* refactor: Move `flightsql` code into its own module

* fix: get schema from LogicalPlan

* refactor: use arrow_flight::sql::Any instead of prost_types::any

* fix: cleanup docs and avoid as_ref

* fix: Use Bytes

* fix: use Any::pack

* fix: doclink
2023-01-24 18:24:32 +00:00
Dom Dwyer f26b54beec
refactor(router): set sensible RPC timeouts
Copies these over from the client_util package.
2023-01-24 19:22:27 +01:00
Dom Dwyer 87b553fe9d
feat: WARN logs w/ endpoint for unhealthy upstream
Changes the DEBUG log event to a WARN now that it includes the endpoint
to which the event applies.
2023-01-24 19:19:31 +01:00
Marco Neumann 4521516147
feat: add per-partition timeout (#6686)
It seems that prod was hanging last night. This is pretty hard to debug
and in general we should protect the compactor against hanging /
malformed partitions that take forever. This is similar to the fact that
the querier also has a timeout for every query. Let's see if this shows
anything in prod (and if not it's still a desired safety net).

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-24 16:53:47 +00:00
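
The per-partition timeout described in the commit above can be illustrated with a minimal, std-only sketch. This is not the compactor's code (which is presumably async); `run_with_timeout` is a hypothetical helper that runs a job on a worker thread and gives up waiting after a deadline.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: run `job` on a worker thread and wait at most
/// `timeout` for its result. Returns `None` if the deadline passes first
/// (or if the job panics and never sends a result).
fn run_with_timeout<T, F>(timeout: Duration, job: F) -> Option<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up; ignore the send error.
        let _ = tx.send(job());
    });
    // Block until a result arrives or the timeout elapses.
    rx.recv_timeout(timeout).ok()
}
```

Note that in this sketch the worker thread keeps running after the caller gives up; it only stops the caller from hanging. An async implementation can instead drop the timed-out future.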
kodiakhq[bot] 6a2e0ae5cc
Merge pull request #6685 from influxdata/dom/rpc-balancer-2
perf(router): rpc balancer & circuit breaking take 2
2023-01-24 16:13:07 +00:00
Dom b0e5e860cb
Merge branch 'main' into dom/rpc-balancer-2
2023-01-24 16:04:32 +00:00
Dom Dwyer 085de40127
feat: lazy-connect to ingester gRPC endpoints
Lazily establish connections in the background, instead of using tonic's
connect_lazy().

connect_lazy() causes error handling to take a different path in tonic
compared to "normal" connections, and this stops reconnections from
being performed when the endpoint goes away (likely a bug).

It also means the first few write requests won't have to wait while the
connection is dialed, which brings down the P99 as a nice side-effect.
2023-01-24 16:44:55 +01:00
Marco Neumann 1c87d9667f
refactor: record partition completion (both Ok and Err) (#6680)
With the upcoming divide-and-conquer approach, we may have multiple
commits per partition since we can divide it into multiple compaction
jobs. For metrics (and logs) however it is important to track the
overall process, so we shall also monitor the number of completed
partitions.
2023-01-24 15:06:15 +00:00
kodiakhq[bot] dcc1eb9a21
Merge pull request #6679 from influxdata/dom/ingester-metrics
feat(ingester2): metrics
2023-01-24 14:57:12 +00:00
Dom Dwyer b775288c92
refactor: fix duration metric units in description
It's seconds, not nanoseconds.
2023-01-24 15:49:16 +01:00
Dom Dwyer 8215f4126e
test: router balancer recovery
Ensure a recovering node is yielded from the balancer.
2023-01-24 15:30:01 +01:00
Dom Dwyer c6d6c50fbf
perf(router): circuit break ingester connections
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.

This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
2023-01-24 15:30:01 +01:00
Dom Dwyer d198756a29
feat(metrics): instrument DmlSink::apply()
Record latency histograms for DmlSink::apply() calls, configuring
ingester2 to report the overall write path latency, and separately the
buffer apply latency.
2023-01-24 15:07:17 +01:00
Dom Dwyer 28d575d90f
feat(tracing): emit spans for write path
Emit tracing spans for each component of the write path in ingester2.
2023-01-24 15:07:16 +01:00
Dom Dwyer c9a1c7435b
feat(metrics): instrumented query execution
Instrument the query path in ingester2, capturing the query latency +
counts, broken down by success/error.
2023-01-24 15:07:16 +01:00
Dom Dwyer 3541243fcb
feat(metrics): persist duration histograms
Adds a metric tracking the distribution of time spent actively
persisting a batch of partition data (compacting, generating parquet,
uploading, DB entries, etc.) and another tracking how long an entry
spent in the persist queue.

Together these provide a measurement of the latency of persist requests,
and as they contain event counters, they also provide the throughput and
number of outstanding jobs.
2023-01-24 15:05:56 +01:00
Dom Dwyer 0637540aad
feat(metrics): cumulative persist job count
Tracks the cumulative number of persist jobs enqueued on a single
ingester (the total amount, so including now-completed jobs).
2023-01-24 15:05:56 +01:00
kodiakhq[bot] c63790740b
Merge pull request #6677 from influxdata/dom/revert-rpc-balancer
revert: influxdata/dom/rpc-balancer
2023-01-24 14:03:48 +00:00
Dom 71630e2efd
Merge branch 'main' into dom/revert-rpc-balancer
2023-01-24 13:56:21 +00:00
Marco Neumann 32df24e057
feat: compactor2 error classification (#6676)
* feat: add error kinds

* refactor: sink proper error type

* fix: ignore object store errors

See <https://github.com/influxdata/idpe/issues/16984>.

* feat: log error kind

* feat: per-kind error metric

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-24 13:50:19 +00:00
Dom Dwyer 107006c801
revert: influxdata/dom/rpc-balancer
This reverts commit a3805dbccf, reversing
changes made to bcb1232c5d.
2023-01-24 14:47:05 +01:00
Dom a3805dbccf
Merge pull request #6675 from influxdata/dom/rpc-balancer
perf(router): circuit break ingester connections
2023-01-24 12:48:11 +00:00
Dom Dwyer b32662ebf2
test: router balancer recovery
Ensure a recovering node is yielded from the balancer.
2023-01-24 13:38:36 +01:00
Dom Dwyer 7596dc0826
perf(router): circuit break ingester connections
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.

This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
2023-01-24 12:38:27 +01:00
Marco Neumann bcb1232c5d
refactor: integrate "skipped" handling into the partition filter framework (#6673)
* refactor: pass partition ID to partition filter

* feat: add logging partition filter wrapper

* refactor: make partition filter async

* refactor: integrate "skipped" handling into the partition filter framework

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-24 11:34:06 +00:00
kodiakhq[bot] affbcc10e1
Merge pull request #6662 from influxdata/dom/circuit-breaker
feat: low-overhead circuit breaker
2023-01-24 11:19:52 +00:00
kodiakhq[bot] 1e0a52eeb6
Merge branch 'main' into dom/circuit-breaker
2023-01-24 11:12:58 +00:00
Dom Dwyer c3a2ac3a0d
refactor: prevent div by 0
Preserve the error ratio calculation but prevent a div by 0 by ensuring
the divisor is always at least 1.
2023-01-24 12:09:00 +01:00
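
As a rough illustration of the guard described in the commit above (not the actual change), the divisor can be clamped to at least 1 before computing the ratio. The function name and signature here are hypothetical.

```rust
/// Hypothetical helper: the fraction of observed requests that failed.
fn error_ratio(errors: u64, successes: u64) -> f64 {
    let total = errors.saturating_add(successes);
    // Clamp the divisor to at least 1 so that "no observations" yields
    // 0.0 instead of NaN from a 0/0 division.
    errors as f64 / total.max(1) as f64
}
```

With no requests observed the ratio is 0.0, and for any non-zero total the result is unchanged by the clamp.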
Dom Dwyer c4b04a16c5
refactor: rename last_probe instant
last_probe was "the instant at which the last set of probes started
being sent" in my head, but Carol saw it as "first_probe - the time at
which probes started being sent".

Hopefully probe_window_started_at is less ambiguous.
2023-01-24 12:08:10 +01:00
Dom Dwyer 2f3fb48091
docs: document error count floor
Describe the floor on the number of errors that must be observed before
the circuit breaker will consider switching to the unhealthy state.
2023-01-24 12:08:09 +01:00
dependabot[bot] 0e304efc28
chore(deps): Bump toml from 0.5.11 to 0.6.0 (#6670)
Bumps [toml](https://github.com/toml-rs/toml) from 0.5.11 to 0.6.0.
- [Release notes](https://github.com/toml-rs/toml/releases)
- [Commits](https://github.com/toml-rs/toml/compare/toml-v0.5.11...toml-v0.6.0)

---
updated-dependencies:
- dependency-name: toml
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-24 10:08:20 +00:00
dependabot[bot] 681d4d940f
chore(deps): Bump clap from 4.1.1 to 4.1.3 (#6669)
Bumps [clap](https://github.com/clap-rs/clap) from 4.1.1 to 4.1.3.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](https://github.com/clap-rs/clap/compare/clap_complete-v4.1.1...v4.1.3)

---
updated-dependencies:
- dependency-name: clap
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-24 09:53:06 +00:00
Luke Bond e3fc873b2e
feat: enable object store metrics on ingester2 (#6672)
Signed-off-by: Luke Bond <luke.n.bond@gmail.com>
2023-01-24 01:59:58 +00:00
Andrew Lamb 1b882e0062
fix: `error arrow/ipc: could not read message schema: EOF` (#6668)
* chore: Test for schema from query

* fix: Send schema even for no RecordBatches

* fix: docs
2023-01-23 22:23:34 +00:00
Carol (Nichols || Goulding) caf8dc9032
fix: Rename incorrect usage of 'close' to 'unhealthy' in test helper
2023-01-23 16:08:00 -05:00
Carol (Nichols || Goulding) 081b4f15da
docs: Clarify my understanding of the circuit breaker based on chat with Dom
2023-01-23 16:07:02 -05:00
Nga Tran 06d4a5fe4e
refactor: ignore partitions in table skipped compactions (#6666)
* refactor: ignore partitions in table skipped compactions

* refactor: continue ignoring partitions in skipped compaction

* test: skip partition
2023-01-23 19:53:05 +00:00
Marco Neumann e2cfe809d2
refactor: planner as a component (#6665)
* refactor: planner as a component

Now everything except for the core algorithm structure is a component.
This also means that the driver no longer needs the whole config
structure.

* docs: explain V1
2023-01-23 16:02:01 +00:00
Marco Neumann c9821720ab
test: ensure Arrow/DataFusion panics don't crash compactor (#6664)
Closes #6644.
2023-01-23 15:30:16 +00:00
Marco Neumann cb02262b9d
refactor: extract "exec DF plan" and "store stream to file" components (#6663)
* refactor: extract `PartitionInfo`

* refactor: extract DF exec component

* feat: add some error conversions

* refactor: make fn public

* refactor: extract file sink component

* fix: clippy
2023-01-23 14:40:35 +00:00
Dom Dwyer 67b73d90dd
feat: low-overhead circuit breaker
Implements a "circuit breaker", a construct that tracks the error &
success of requests to a remote node, and uses this information to allow
or deny further requests.

This circuit breaker stops sending requests to the remote when the error
count exceeds 80% of requests in a 5 second window. Once this happens,
up to 10 "probe" requests per second are allowed, and when they succeed,
normal operation resumes (though concurrent requests may still be
completing during the probe regime and are counted towards the probe
results).

In the happy path, this circuit breaker is very cheap (lock free; WFPO)
to evaluate and record request results in, minimising the throughput
penalty. Once the breaker enters an unhealthy state (hopefully a rare
occurrence) it uses a mutex to manage the probe state (with a higher
overhead) for simplicity; it's definitely possible to optimise this away
if high latencies are observed during upstream outages when the circuit
breaker is open/unhealthy.
2023-01-23 13:55:12 +01:00
Andrew Lamb 9a61f36a53
chore: Update datafusion again (#6656)
* chore: Update datafusion pin

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2023-01-23 12:13:07 +00:00
Andrew Lamb b09691dc6b
chore: Upgrade datafusion (again, I know) (#6639)
* chore: Update datafusion

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-23 11:24:22 +00:00
Dom acab3d95f3
Merge pull request #6661 from influxdata/dependabot/cargo/toml-0.5.11
chore(deps): Bump toml from 0.5.10 to 0.5.11
2023-01-23 11:09:56 +00:00
dependabot[bot] d1379e9747
chore(deps): Bump toml from 0.5.10 to 0.5.11
Bumps [toml](https://github.com/toml-rs/toml) from 0.5.10 to 0.5.11.
- [Release notes](https://github.com/toml-rs/toml/releases)
- [Commits](https://github.com/toml-rs/toml/compare/toml-v0.5.10...toml-v0.5.11)

---
updated-dependencies:
- dependency-name: toml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-01-23 08:43:52 +00:00