Dom Dwyer
c9a1c7435b
feat(metrics): instrumented query execution
...
Instrument the query path in ingester2, capturing the query latency +
counts, broken down by success/error.
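
As an illustration of this kind of instrumentation (a minimal sketch, not the ingester2 code; `QueryMetrics` and `instrument` are hypothetical names), a query call can be wrapped so that its latency and outcome are recorded separately for success and error:

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-outcome counters and latency totals.
#[derive(Debug, Default)]
struct QueryMetrics {
    ok_count: u64,
    ok_total: Duration,
    err_count: u64,
    err_total: Duration,
}

impl QueryMetrics {
    /// Runs `query`, timing it and recording the result against the
    /// success or error series. The result is passed through unchanged.
    fn instrument<T, E>(&mut self, query: impl FnOnce() -> Result<T, E>) -> Result<T, E> {
        let started = Instant::now();
        let result = query();
        let elapsed = started.elapsed();
        match &result {
            Ok(_) => {
                self.ok_count += 1;
                self.ok_total += elapsed;
            }
            Err(_) => {
                self.err_count += 1;
                self.err_total += elapsed;
            }
        }
        result
    }
}
```

Wrapping the call rather than editing the query code keeps the measurement in one place and ensures the error path is timed as well.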
2023-01-24 15:07:16 +01:00
Dom Dwyer
3541243fcb
feat(metrics): persist duration histograms
...
Adds metrics to track the distribution of time spent actively
persisting a batch of partition data (compacting, generating parquet,
uploading, DB entries, etc) and another tracking the duration of time an
entry spent in the persist queue.
Together these provide a measurement of the latency of persist requests,
and as they contain event counters, they also provide the throughput and
number of outstanding jobs.
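
A rough sketch of how two duration histograms can yield a job count. It assumes the queue histogram is recorded when a job leaves the queue and the active histogram when the job completes; `DurationHistogram` and `in_flight_jobs` are hypothetical names, not the IOx metric types:

```rust
use std::time::Duration;

/// A minimal fixed-bucket duration histogram (hypothetical).
struct DurationHistogram {
    /// Inclusive upper bound of each bucket, in ascending order.
    bounds: Vec<Duration>,
    /// One counter per bound, plus a final overflow bucket.
    counts: Vec<u64>,
    /// Sum of all recorded durations.
    total: Duration,
}

impl DurationHistogram {
    fn new(bounds: Vec<Duration>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts, total: Duration::ZERO }
    }

    /// Records one observation in the first bucket whose bound covers it.
    fn record(&mut self, d: Duration) {
        let idx = self
            .bounds
            .iter()
            .position(|b| d <= *b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
        self.total += d;
    }

    /// The event counter: how many observations were recorded in total.
    fn sample_count(&self) -> u64 {
        self.counts.iter().sum()
    }
}

/// Jobs that have left the queue but have not yet completed, under the
/// recording assumption stated above.
fn in_flight_jobs(queue: &DurationHistogram, active: &DurationHistogram) -> u64 {
    queue.sample_count().saturating_sub(active.sample_count())
}
```

The rate of change of `sample_count()` on the active histogram gives the persist throughput under the same assumption.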
2023-01-24 15:05:56 +01:00
Dom Dwyer
0637540aad
feat(metrics): cumulative persist job count
...
Tracks the cumulative number of persist jobs enqueued on a single
ingester (the total amount, so including now-completed jobs).
2023-01-24 15:05:56 +01:00
kodiakhq[bot]
c63790740b
Merge pull request #6677 from influxdata/dom/revert-rpc-balancer
...
revert: influxdata/dom/rpc-balancer
2023-01-24 14:03:48 +00:00
Dom
71630e2efd
Merge branch 'main' into dom/revert-rpc-balancer
2023-01-24 13:56:21 +00:00
Marco Neumann
32df24e057
feat: compactor2 error classification ( #6676 )
...
* feat: add error kinds
* refactor: sink proper error type
* fix: ignore object store errors
See <https://github.com/influxdata/idpe/issues/16984>.
* feat: log error kind
* feat: per-kind error metric
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-24 13:50:19 +00:00
Dom Dwyer
107006c801
revert: influxdata/dom/rpc-balancer
...
This reverts commit a3805dbccf, reversing changes made to bcb1232c5d.
2023-01-24 14:47:05 +01:00
Dom
a3805dbccf
Merge pull request #6675 from influxdata/dom/rpc-balancer
...
perf(router): circuit break ingester connections
2023-01-24 12:48:11 +00:00
Dom Dwyer
b32662ebf2
test: router balancer recovery
...
Ensure a recovering node is yielded from the balancer.
2023-01-24 13:38:36 +01:00
Dom Dwyer
7596dc0826
perf(router): circuit break ingester connections
...
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.
This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
2023-01-24 12:38:27 +01:00
Marco Neumann
bcb1232c5d
refactor: integrate "skipped" handling into the partition filter framework ( #6673 )
...
* refactor: pass partition ID to partition filter
* feat: add logging partition filter wrapper
* refactor: make partition filter async
* refactor: integrate "skipped" handling into the partition filter framework
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-24 11:34:06 +00:00
kodiakhq[bot]
affbcc10e1
Merge pull request #6662 from influxdata/dom/circuit-breaker
...
feat: low-overhead circuit breaker
2023-01-24 11:19:52 +00:00
kodiakhq[bot]
1e0a52eeb6
Merge branch 'main' into dom/circuit-breaker
2023-01-24 11:12:58 +00:00
Dom Dwyer
c3a2ac3a0d
refactor: prevent div by 0
...
Preserve the error ratio calculation but prevent a div by 0 by ensuring
the divisor is always at least 1.
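
A minimal sketch of the guard described here. The function name and the use of `u64` counters are assumptions; the commit does not say whether the original calculation uses integer or floating-point division:

```rust
/// Returns the fraction of observed requests that failed, in [0.0, 1.0].
fn error_ratio(ok: u64, err: u64) -> f64 {
    // Clamp the divisor to at least 1 so that an empty window (no
    // requests observed yet) yields 0.0 instead of dividing by zero.
    let total = ok.saturating_add(err).max(1);
    err as f64 / total as f64
}
```

Clamping the divisor leaves the ratio unchanged whenever at least one request has been observed, so no separate branch is needed for the empty case.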
2023-01-24 12:09:00 +01:00
Dom Dwyer
c4b04a16c5
refactor: rename last_probe instant
...
last_probe was "the instant at which the last set of probes started
being sent" in my head, but Carol saw it as "first_probe - the time at
which probes started being sent".
Hopefully probe_window_started_at is less ambiguous.
2023-01-24 12:08:10 +01:00
Dom Dwyer
2f3fb48091
docs: document error count floor
...
Describe the floor on the number of errors that must be observed before
the circuit breaker will consider switching to the unhealthy state.
2023-01-24 12:08:09 +01:00
dependabot[bot]
0e304efc28
chore(deps): Bump toml from 0.5.11 to 0.6.0 ( #6670 )
...
Bumps [toml](https://github.com/toml-rs/toml ) from 0.5.11 to 0.6.0.
- [Release notes](https://github.com/toml-rs/toml/releases )
- [Commits](https://github.com/toml-rs/toml/compare/toml-v0.5.11...toml-v0.6.0 )
---
updated-dependencies:
- dependency-name: toml
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-24 10:08:20 +00:00
dependabot[bot]
681d4d940f
chore(deps): Bump clap from 4.1.1 to 4.1.3 ( #6669 )
...
Bumps [clap](https://github.com/clap-rs/clap ) from 4.1.1 to 4.1.3.
- [Release notes](https://github.com/clap-rs/clap/releases )
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md )
- [Commits](https://github.com/clap-rs/clap/compare/clap_complete-v4.1.1...v4.1.3 )
---
updated-dependencies:
- dependency-name: clap
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-24 09:53:06 +00:00
Luke Bond
e3fc873b2e
feat: enable object store metrics on ingester2 ( #6672 )
...
Signed-off-by: Luke Bond <luke.n.bond@gmail.com>
2023-01-24 01:59:58 +00:00
Andrew Lamb
1b882e0062
fix: `error arrow/ipc: could not read message schema: EOF` ( #6668 )
...
* chore: Test for schema from query
* fix: Send schema even for no RecordBatches
* fix: docs
2023-01-23 22:23:34 +00:00
Carol (Nichols || Goulding)
caf8dc9032
fix: Rename incorrect usage of 'close' to 'unhealthy' in test helper
2023-01-23 16:08:00 -05:00
Carol (Nichols || Goulding)
081b4f15da
docs: Clarify my understanding of the circuit breaker based on chat with Dom
2023-01-23 16:07:02 -05:00
Nga Tran
06d4a5fe4e
refactor: ignore partitions in table skipped compactions ( #6666 )
...
* refactor: ignore partitions in table skipped compactions
* refactor: continue ignoring partitions in skipped compaction
* test: skip partition
2023-01-23 19:53:05 +00:00
Marco Neumann
e2cfe809d2
refactor: planner as a component ( #6665 )
...
* refactor: planner as a component
Now everything except for the core algorithm structure is a component.
This also means that the driver no longer needs the whole config
structure.
* docs: explain V1
2023-01-23 16:02:01 +00:00
Marco Neumann
c9821720ab
test: ensure Arrow/DataFusion panics don't crash compactor ( #6664 )
...
Closes #6644.
2023-01-23 15:30:16 +00:00
Marco Neumann
cb02262b9d
refactor: extract "exec DF plan" and "store stream to file" components ( #6663 )
...
* refactor: extract `PartitionInfo`
* refactor: extract DF exec component
* feat: add some error conversions
* refactor: make fn public
* refactor: extract file sink component
* fix: clippy
2023-01-23 14:40:35 +00:00
Dom Dwyer
67b73d90dd
feat: low-overhead circuit breaker
...
Implements a "circuit breaker", a construct that tracks the error &
success of requests to a remote node, and uses this information to allow
or deny further requests.
This circuit breaker stops sending requests to the remote when the error
count exceeds 80% of requests in a 5 second window. Once this happens,
up to 10 "probe" requests per second are allowed, and when they succeed,
normal operation resumes (though concurrent requests may still be
completing during the probe regime and are counted towards the probe
results).
In the happy path, this circuit breaker is very cheap (lock free; WFPO)
to evaluate and record request results in, minimising the throughput
penalty. Once the breaker enters an unhealthy state (hopefully a rare
occurrence) it uses a mutex to manage the probe state (with a higher
overhead) for simplicity; it's definitely possible to optimise this away
if high latencies are observed during upstream outages when the circuit
breaker is open/unhealthy.
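
The behaviour described above can be modelled roughly as follows. This is a single-threaded sketch for illustration only: it does not reproduce the lock-free happy path or the mutex-guarded probe state. The 80% ratio, 5 second window and 10 probes per second come from the text above; `MIN_ERRORS`, all type and method names, and the rule that a single successful result restores health are assumptions.

```rust
use std::time::{Duration, Instant};

// Thresholds quoted in the commit message.
const MAX_ERROR_RATIO: f64 = 0.8;
const ERROR_WINDOW: Duration = Duration::from_secs(5);
const PROBES_PER_SECOND: u32 = 10;
// Hypothetical error floor; the real value is not stated here.
const MIN_ERRORS: u64 = 5;

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Healthy,
    Unhealthy { probe_window_started_at: Instant, probes_sent: u32 },
}

#[derive(Debug)]
struct CircuitBreaker {
    state: State,
    window_started_at: Instant,
    ok: u64,
    err: u64,
}

impl CircuitBreaker {
    fn new(now: Instant) -> Self {
        Self { state: State::Healthy, window_started_at: now, ok: 0, err: 0 }
    }

    fn is_healthy(&self) -> bool {
        self.state == State::Healthy
    }

    /// Returns true if the caller may send a request at `now`.
    fn should_allow(&mut self, now: Instant) -> bool {
        match &mut self.state {
            State::Healthy => true,
            State::Unhealthy { probe_window_started_at, probes_sent } => {
                // Start a fresh probe budget once per second.
                if now.duration_since(*probe_window_started_at) >= Duration::from_secs(1) {
                    *probe_window_started_at = now;
                    *probes_sent = 0;
                }
                let allow = *probes_sent < PROBES_PER_SECOND;
                if allow {
                    *probes_sent += 1;
                }
                allow
            }
        }
    }

    /// Records the outcome of a request that completed at `now`.
    fn observe(&mut self, now: Instant, success: bool) {
        if now.duration_since(self.window_started_at) >= ERROR_WINDOW {
            self.reset_counts(now);
        }
        if success { self.ok += 1 } else { self.err += 1 }

        if self.is_healthy() {
            let total = (self.ok + self.err).max(1);
            let ratio = self.err as f64 / total as f64;
            if self.err >= MIN_ERRORS && ratio > MAX_ERROR_RATIO {
                self.state = State::Unhealthy { probe_window_started_at: now, probes_sent: 0 };
            }
        } else if success {
            // Assumption: one successful result is enough to recover.
            self.state = State::Healthy;
            self.reset_counts(now);
        }
    }

    fn reset_counts(&mut self, now: Instant) {
        self.window_started_at = now;
        self.ok = 0;
        self.err = 0;
    }
}
```

Passing `now` explicitly keeps the sketch deterministic to test; a caller would check `should_allow` before each request and report the outcome through `observe` afterwards.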
2023-01-23 13:55:12 +01:00
Andrew Lamb
9a61f36a53
chore: Update datafusion again ( #6656 )
...
* chore: Update datafusion pin
* chore: Run cargo hakari tasks
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2023-01-23 12:13:07 +00:00
Andrew Lamb
b09691dc6b
chore: Upgrade datafusion (again, I know) ( #6639 )
...
* chore: Update datafusion
* chore: Run cargo hakari tasks
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-23 11:24:22 +00:00
Dom
acab3d95f3
Merge pull request #6661 from influxdata/dependabot/cargo/toml-0.5.11
...
chore(deps): Bump toml from 0.5.10 to 0.5.11
2023-01-23 11:09:56 +00:00
dependabot[bot]
d1379e9747
chore(deps): Bump toml from 0.5.10 to 0.5.11
...
Bumps [toml](https://github.com/toml-rs/toml ) from 0.5.10 to 0.5.11.
- [Release notes](https://github.com/toml-rs/toml/releases )
- [Commits](https://github.com/toml-rs/toml/compare/toml-v0.5.10...toml-v0.5.11 )
---
updated-dependencies:
- dependency-name: toml
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
2023-01-23 08:43:52 +00:00
dependabot[bot]
0114e7ee50
chore(deps): Bump async-trait from 0.1.61 to 0.1.63 ( #6660 )
...
Bumps [async-trait](https://github.com/dtolnay/async-trait ) from 0.1.61 to 0.1.63.
- [Release notes](https://github.com/dtolnay/async-trait/releases )
- [Commits](https://github.com/dtolnay/async-trait/compare/0.1.61...0.1.63 )
---
updated-dependencies:
- dependency-name: async-trait
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-23 08:41:27 +00:00
Nga Tran
411b3db928
fix: Get shard id from a constant (topic, shard_index) to avoid error of shard_id FK violation ( #6658 )
...
* fix: make shard_id FK always 1
* fix: use const shard_index to read its ID
* refactor: read shard_id during compactor initiation
2023-01-22 16:49:06 +00:00
Nga Tran
840923abab
refactor: execute compaction plan ( #6654 )
...
* chore: address review comment of previous PR
* refactor: execute compact plan
* refactor: we will now compact all L0 and L1 files of a partition and split them as needed
* chore: comments
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-20 22:34:50 +00:00
Christopher M. Wolff
6f39ae342e
feat: create a GapFillExec type ( #6641 )
...
* refactor: make gap fill rule avoid aliasing
* feat: create a GapFillExec type
* refactor: remove unneeded sort node from GapFill rule
* chore: code review feedback
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-20 17:44:00 +00:00
Marco Neumann
4f1beba482
feat: filter out L2 files from compaction ( #6653 )
...
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-20 15:44:13 +00:00
kodiakhq[bot]
ec7c7634e4
Merge pull request #6626 from influxdata/cn/test-old-with-new
...
test: Add old ingester and old parquet states into the query_tests2 framework
2023-01-20 15:27:43 +00:00
kodiakhq[bot]
24ca1e6f8c
Merge branch 'main' into cn/test-old-with-new
2023-01-20 15:20:40 +00:00
Marco Neumann
111e582d71
feat: improve compactor2 metrics and logging ( #6652 )
...
Closes #6647.
We can always create tickets for concrete issues/wishes or create
on-demand PRs.
2023-01-20 15:08:00 +00:00
Nga Tran
8aeded32d6
refactor: reorganize compact_files core function ( #6636 )
...
* refactor: reorganize compact_files core function
* chore: some more in-progress structure
* refactor: further reorganization for compact_files
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-20 13:41:23 +00:00
Andrew Lamb
d808c57cdc
chore: Remove `iox_arrow_flight` ( #6621 )
...
* chore: Remove iox_arrow_flight
* fix: hack around tonic status errors
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-20 11:16:19 +00:00
dependabot[bot]
8ac93a58a6
chore(deps): Bump reqwest from 0.11.13 to 0.11.14 ( #6643 )
...
* chore(deps): Bump reqwest from 0.11.13 to 0.11.14
Bumps [reqwest](https://github.com/seanmonstar/reqwest ) from 0.11.13 to 0.11.14.
- [Release notes](https://github.com/seanmonstar/reqwest/releases )
- [Changelog](https://github.com/seanmonstar/reqwest/blob/master/CHANGELOG.md )
- [Commits](https://github.com/seanmonstar/reqwest/compare/v0.11.13...v0.11.14 )
---
updated-dependencies:
- dependency-name: reqwest
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
* chore: Run cargo hakari tasks
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-20 07:21:58 +00:00
dependabot[bot]
33045489f9
chore(deps): Bump rustyline from 10.1.0 to 10.1.1 ( #6642 )
...
* chore(deps): Bump rustyline from 10.1.0 to 10.1.1
Bumps [rustyline](https://github.com/kkawakam/rustyline ) from 10.1.0 to 10.1.1.
- [Release notes](https://github.com/kkawakam/rustyline/releases )
- [Changelog](https://github.com/kkawakam/rustyline/blob/master/History.md )
- [Commits](https://github.com/kkawakam/rustyline/compare/v10.1.0...v10.1.1 )
---
updated-dependencies:
- dependency-name: rustyline
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
* chore: Run cargo hakari tasks
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2023-01-20 07:13:12 +00:00
Carol (Nichols || Goulding)
6afd782b3f
fix: Move query_tests2 into influxdb_iox/tests so that the code rebuilds
2023-01-19 16:44:31 -05:00
Carol (Nichols || Goulding)
fb6774e40e
fix: Rather than overloading ChunkStage, make a separate IoxArchitecture enum
2023-01-19 16:44:30 -05:00
Carol (Nichols || Goulding)
8783623a19
docs: This method doesn't block until the data is persisted
2023-01-19 16:44:30 -05:00
Carol (Nichols || Goulding)
af203f7a6d
docs: Explain why the tests set the number of query threads
2023-01-19 16:44:30 -05:00
Carol (Nichols || Goulding)
bc67ca37a9
fix: Make sure tests using the Kafka architecture WaitForReadable
2023-01-19 16:44:30 -05:00
Carol (Nichols || Goulding)
59914906b6
fix: Only reset persist everything flag if data has been persisted
2023-01-19 16:44:30 -05:00
Carol (Nichols || Goulding)
3dbaeedca6
feat: Try implementing the persist api in a different way
2023-01-19 16:44:30 -05:00