Similar to https://github.com/influxdata/influxdb_iox/pull/6509, this
forces a constant re-querying of the DNS address of an ingester to drive
rediscovery.
Unlike the above PR, this reconnects only when errors are observed. This still isn't ideal - something is wrong with the discovery mechanism itself, and this merely papers over it.
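As a rough illustration (not the actual router code), the reconnect-on-error path amounts to re-dialling the configured DNS name so address resolution happens again; the helper and endpoint string below are hypothetical:

```rust
use std::time::Duration;
use tonic::transport::{Channel, Endpoint};

/// Re-dial the ingester by its DNS name, forcing a fresh address
/// resolution at connect time. Called only after an error is observed.
async fn reconnect(addr: &str) -> Result<Channel, tonic::transport::Error> {
    Endpoint::from_shared(addr.to_string())?
        .connect_timeout(Duration::from_secs(5))
        .connect() // resolves DNS now, not at first request
        .await
}
```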
Adds a metric with a per-ingester label recording the current health
state of the upstream ingester from the perspective of the router
instance.
Also logs periodically when one or more ingesters are offline.
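A prometheus-flavoured sketch of the per-ingester health metric (the real code uses the iox metric crate; the metric and label names here are illustrative, not the shipped ones):

```rust
use prometheus::{IntGaugeVec, Opts, Registry};

// One time series per upstream ingester, from this router's perspective.
fn register_health_gauge(registry: &Registry) -> prometheus::Result<IntGaugeVec> {
    let gauge = IntGaugeVec::new(
        Opts::new(
            "ingester_healthy",
            "1 if the upstream ingester is considered healthy, 0 if not",
        ),
        &["ingester"],
    )?;
    registry.register(Box::new(gauge.clone()))?;
    Ok(gauge)
}

// On each health transition:
//   gauge.with_label_values(&["ingester-0:8082"]).set(0); // offline
```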
* refactor: Move `flightsql` code into its own module
* fix: get schema from LogicalPlan
* refactor: use arrow_flight::sql::Any instead of prost_types::any
* fix: cleanup docs and avoid as_ref
* fix: Use Bytes
* fix: use Any::pack
* fix: doclink
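For illustration, the Any::pack change above roughly boils down to the following sketch (field handling is simplified via Default, and the function name is made up):

```rust
use arrow_flight::sql::{Any, CommandStatementQuery};
use prost::Message;

// Wrap a FlightSQL command using arrow_flight::sql::Any::pack instead of
// hand-assembling a prost_types::Any with a type URL.
fn pack_query(query: String) -> Result<bytes::Bytes, Box<dyn std::error::Error>> {
    let cmd = CommandStatementQuery { query, ..Default::default() };
    let any = Any::pack(&cmd)?; // fills in the type URL for us
    Ok(any.encode_to_vec().into()) // ready for use as a ticket payload
}
```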
It seems that prod was hanging last night. This is hard to debug, and in general we should protect the compactor against hanging on malformed partitions that take forever. This mirrors the querier, which already has a timeout for every query. Let's see if this shows anything in prod (and if not, it's still a desired safety net).
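The safety net is essentially a timeout around each per-partition job, along these lines (the function names and the 30 minute budget are stand-ins, not the shipped values):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Bound the time spent compacting a single partition so one malformed
// partition cannot hang the whole compactor.
async fn compact_with_timeout(partition_id: i64) -> Result<(), String> {
    match timeout(Duration::from_secs(30 * 60), compact_partition(partition_id)).await {
        Ok(res) => res,
        Err(_elapsed) => Err(format!("partition {partition_id} timed out")),
    }
}

async fn compact_partition(_id: i64) -> Result<(), String> {
    // ... the actual compaction job ...
    Ok(())
}
```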
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Lazily establish connections in the background, instead of using tonic's
connect_lazy().
connect_lazy() causes error handling to take a different path in tonic
compared to "normal" connections, and this stops reconnections from
being performed when the endpoint goes away (likely a bug).
It also means the first few write requests won't have to wait while the
connection is dialed, which brings down the P99 as a nice side-effect.
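A sketch of the idea, assuming tonic's Endpoint/Channel types; the retry cadence and the watch-channel hand-off are illustrative choices, not the real implementation:

```rust
use std::time::Duration;
use tokio::sync::watch;
use tonic::transport::{Channel, Endpoint};

// Eagerly dial in the background instead of Endpoint::connect_lazy(),
// publishing the channel once the connection is established.
fn connect_in_background(endpoint: Endpoint) -> watch::Receiver<Option<Channel>> {
    let (tx, rx) = watch::channel(None);
    tokio::spawn(async move {
        loop {
            match endpoint.connect().await {
                Ok(channel) => {
                    let _ = tx.send(Some(channel));
                    return;
                }
                // Keep retrying; the endpoint may not be up yet.
                Err(_) => tokio::time::sleep(Duration::from_secs(1)).await,
            }
        }
    });
    rx
}
```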
With the upcoming divide-and-conquer approach, we may have multiple commits per partition, since a partition can be divided into multiple compaction jobs. For metrics (and logs), however, it is important to track the overall process, so we shall also monitor the number of completed partitions.
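Sketched with a prometheus-style counter (the real code uses the iox metric crate, and the metric name is made up): increment once per completed partition rather than once per commit/job.

```rust
use prometheus::{IntCounter, Opts};

// Counts whole partitions completed, independent of how many compaction
// jobs each partition was divided into.
fn completed_partitions_counter() -> prometheus::Result<IntCounter> {
    IntCounter::with_opts(Opts::new(
        "compactor_partitions_completed",
        "Number of partitions whose compaction jobs have all finished",
    ))
}
```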
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.
This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
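The on-path gating looks roughly like the following (a reduced sketch: a single health flag stands in for the real CircuitBreaker, whose state machine is sketched in more detail further below):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the per-upstream health view the router keeps.
struct Health(AtomicBool);

impl Health {
    fn is_healthy(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}

async fn write_if_healthy(h: &Health) -> Result<(), &'static str> {
    if !h.is_healthy() {
        // Skip unhealthy upstreams entirely (probe requests excepted).
        return Err("upstream unhealthy");
    }
    // ... issue the gRPC write and record its outcome ...
    Ok(())
}
```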
Record latency histograms for DmlSink::apply() calls, configuring
ingester2 to report the overall write path latency, and separately the
buffer apply latency.
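In spirit, the instrumentation is a decorator that times the inner call; record_ms below is a stand-in for the histogram recorder:

```rust
use std::future::Future;
use std::time::Instant;

/// Time an inner apply() future and record its duration.
async fn timed_apply<T, E>(
    inner: impl Future<Output = Result<T, E>>,
    record_ms: impl Fn(u128),
) -> Result<T, E> {
    let started = Instant::now();
    let res = inner.await;
    record_ms(started.elapsed().as_millis());
    res
}
```

Wrapping just the buffer's apply future yields the buffer apply latency; wrapping the whole sink chain yields the overall write path latency.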
Adds a metric to track the distribution of duration spent actively persisting a batch of partition data (compacting, generating parquet, uploading, DB entries, etc) and another tracking the duration of time an entry spent in the persist queue.
Together these provide a measurement of the latency of persist requests,
and as they contain event counters, they also provide the throughput and
number of outstanding jobs.
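A sketch of the two measurements (names illustrative): stamp the entry when it is enqueued, then record the queue wait and the active persist duration separately.

```rust
use std::time::Instant;

struct PersistJob {
    enqueued_at: Instant, // stamped when the job enters the queue
}

async fn run_persist(job: PersistJob) {
    // Duration the entry sat in the persist queue.
    let queued_for = job.enqueued_at.elapsed();

    let started = Instant::now();
    // ... compact, generate parquet, upload, write DB entries ...
    let active_for = started.elapsed();

    // Both histograms also count events, giving throughput and a view of
    // outstanding jobs (enqueued minus completed).
    println!("queued={queued_for:?} active={active_for:?}");
}
```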
last_probe was "the instant at which the last set of probes started being sent" in my head, but Carol read it as "first_probe" - the time at which probes first started being sent.
Hopefully probe_window_started_at is less ambiguous.
* refactor: planner as a component
Now everything except for the core algorithm structure is a component.
This also means that the driver no longer needs the whole config
structure.
* docs: explain V1
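The component shape for the planner looks roughly like this (trait and type names are assumptions, not the real ones); the driver then holds a narrow trait object instead of the whole config structure:

```rust
use std::sync::Arc;

struct ParquetFile; // stand-ins for the real catalog types
struct CompactionJob;

/// Plan the files of one partition into executable compaction jobs.
trait Planner: Send + Sync {
    fn plan(&self, files: Vec<ParquetFile>) -> Vec<CompactionJob>;
}

struct Driver {
    planner: Arc<dyn Planner>, // injected component, swappable in tests
}
```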
Implements a "circuit breaker", a construct that tracks the error &
success of requests to a remote node, and uses this information to allow
or deny further requests.
This circuit breaker stops sending requests to the remote when the error
count exceeds 80% of requests in a 5 second window. Once this happens,
up to 10 "probe" requests per second are allowed, and when they succeed,
normal operation resumes (though concurrent requests may still be
completing during the probe regime and are counted towards the probe
results).
In the happy path, this circuit breaker is very cheap (lock-free, and wait-free population-oblivious - WFPO) to evaluate and record request results in, minimising the throughput penalty. Once the breaker enters an unhealthy state (hopefully a rare occurrence) it uses a mutex to manage the probe state (with a higher overhead) for simplicity; it's definitely possible to optimise this away if high latencies are observed during upstream outages while the circuit breaker is open/unhealthy.
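The state machine described above, sketched below. This version takes a mutex on every call for brevity (the real happy path is lock-free); the thresholds match the description: 80% errors over a 5 second window opens the breaker, up to 10 probes per second are allowed while open, and a probe success closes it again.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

struct CircuitBreaker {
    state: Mutex<State>,
}

struct State {
    window_started_at: Instant,
    ok: u64,
    err: u64,
    open: bool,
    probe_window_started_at: Instant,
    probes_sent: u32,
}

impl CircuitBreaker {
    const WINDOW: Duration = Duration::from_secs(5);
    const MAX_PROBES_PER_SEC: u32 = 10;

    fn new() -> Self {
        let now = Instant::now();
        Self {
            state: Mutex::new(State {
                window_started_at: now,
                ok: 0,
                err: 0,
                open: false,
                probe_window_started_at: now,
                probes_sent: 0,
            }),
        }
    }

    /// Should this request be sent to the upstream?
    fn should_send(&self) -> bool {
        let mut s = self.state.lock().unwrap();
        if !s.open {
            return true;
        }
        // Open: allow a bounded number of probe requests per second.
        if s.probe_window_started_at.elapsed() >= Duration::from_secs(1) {
            s.probe_window_started_at = Instant::now();
            s.probes_sent = 0;
        }
        if s.probes_sent < Self::MAX_PROBES_PER_SEC {
            s.probes_sent += 1;
            true
        } else {
            false
        }
    }

    /// Record the outcome of a request (probe or otherwise).
    fn observe(&self, success: bool) {
        let mut s = self.state.lock().unwrap();
        if success { s.ok += 1; } else { s.err += 1; }

        if s.open {
            // Any success while open (probe or straggler) closes the breaker.
            if success {
                s.open = false;
                s.ok = 0;
                s.err = 0;
                s.window_started_at = Instant::now();
            }
            return;
        }

        // Closed: evaluate the error rate once the window has elapsed.
        if s.window_started_at.elapsed() >= Self::WINDOW {
            let total = s.ok + s.err;
            if total > 0 && s.err * 100 >= total * 80 {
                s.open = true;
                s.probe_window_started_at = Instant::now();
                s.probes_sent = 0;
            }
            s.ok = 0;
            s.err = 0;
            s.window_started_at = Instant::now();
        }
    }
}
```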