10652 Commits (11233e3b3b0e9c96ec88920844a4a80cbccfe2b4)

Author SHA1 Message Date
Andrew Lamb c100737a81 chore: Do not send dictionary encoded data to clients 2023-01-26 06:35:15 -05:00
Nga Tran b8a80869d4
feat: introduce a new way of max_sequence_number for ingester, compactor and querier (#6692)
* feat: introduce a new way of max_sequence_number for ingester, compactor and querier

* chore: cleanup

* feat: new column max_l0_created_at to order files for deduplication

* chore: cleanup

* chore: debug info for changing cpu.parquet

* fix: update test parquet file

Co-authored-by: Marco Neumann <marco@crepererum.net>
2023-01-26 10:52:47 +00:00
Marco Neumann ed694d3be4
feat: introduce scratchpad store for compactor (#6706)
* feat: introduce scratchpad store for compactor

Use an intermediate in-memory store (can be a disk later if we want) to
stage all inputs and outputs of the compaction. The reasons are:

- **fewer IO ops:** DataFusion's streaming IO requires slightly more
  IO requests (at least 2 per file) due to the way it is optimized to
  read as little as possible. It first reads the metadata and then
  decides which content to fetch. In the compaction case this is (esp.
  w/o delete predicates) EVERYTHING. So in contrast to the querier,
  there is no advantage to this approach. On the contrary, this easily
  adds 100ms latency to every single input file.
- **less traffic:** For divide&conquer partitions (i.e. when we need to
  run multiple compaction steps to deal with them) it is kinda pointless
  to upload an intermediate result just to download it again. The
  scratchpad avoids that.
- **higher throughput:** We want to limit the number of concurrent
  DataFusion jobs because we don't wanna blow up the whole process by
  having too much in-flight arrow data at the same time. However while
  we perform the actual computation, we were waiting for object store
  IO. This was limiting our throughput substantially.
- **shadow mode:** De-coupling the stores in this way makes it easier to
  implement #6645.

Note that we assume here that the input parquet files are WAY SMALLER
than the uncompressed Arrow data during compaction itself.

Closes #6650.

* fix: panic on shutdown

* refactor: remove shadow scratchpad (for now)

* refactor: make scratchpad safe to use
2023-01-26 10:03:08 +00:00
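The staging idea in the scratchpad commit above (#6706) can be illustrated with a minimal, std-only sketch. It is not the compactor's actual code: `ObjectStore`, `Scratchpad`, and all method names here are hypothetical stand-ins, and the "remote" store is just a map that counts requests so the saving is visible.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a remote object store; counts requests.
#[derive(Default)]
struct ObjectStore {
    objects: HashMap<String, Vec<u8>>,
    gets: usize,
    puts: usize,
}

impl ObjectStore {
    fn get(&mut self, path: &str) -> Option<Vec<u8>> {
        self.gets += 1;
        self.objects.get(path).cloned()
    }

    fn put(&mut self, path: &str, data: Vec<u8>) {
        self.puts += 1;
        self.objects.insert(path.to_string(), data);
    }
}

/// Hypothetical in-memory scratchpad that stages inputs and outputs.
#[derive(Default)]
struct Scratchpad {
    staged: HashMap<String, Vec<u8>>,
}

impl Scratchpad {
    /// Fetch each input once, as a whole file, unless it is already staged.
    fn load_to_scratchpad(&mut self, remote: &mut ObjectStore, paths: &[&str]) {
        for &p in paths {
            if !self.staged.contains_key(p) {
                if let Some(data) = remote.get(p) {
                    self.staged.insert(p.to_string(), data);
                }
            }
        }
    }

    /// Reads are served from memory and never touch the remote store.
    fn read(&self, path: &str) -> Option<&[u8]> {
        self.staged.get(path).map(|v| v.as_slice())
    }

    /// Intermediate and final results are written to memory first.
    fn write(&mut self, path: &str, data: Vec<u8>) {
        self.staged.insert(path.to_string(), data);
    }

    /// Upload only the results that should become visible remotely.
    fn make_public(&self, remote: &mut ObjectStore, paths: &[&str]) {
        for &p in paths {
            if let Some(data) = self.staged.get(p) {
                remote.put(p, data.clone());
            }
        }
    }
}
```

In this sketch an intermediate result written with `write` never reaches the remote store unless it is passed to `make_public`, which mirrors the "less traffic" point for multi-step compactions.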
Andrew Lamb 7853a19953
feat: JDBC integration tests with FlightSQL (#6693)
* feat: basic JDBC integration test

* fix: do not run test without env set

* docs: add maven link

* refactor: clean up java with switch statement
2023-01-25 22:21:18 +00:00
Andrew Lamb 2db8443a64
refactor: split flightsql crate into smaller modules (#6703)
* refactor: split flightsql crate into smaller modules

* refactor: automatically derive from Impl
2023-01-25 21:12:48 +00:00
Carol (Nichols || Goulding) 57b5b639d6
test: Port all field columns query_tests to end-to-end tests (#6707)
* test: Port a test that's not actually supported through the full gRPC API

* test: Port remaining field column/measurement fields tests

* test: Remove unsupported measurement predicate and clarify purposes of tests

Andrew confirmed that the only way to invoke a Measurement Fields
request is with a measurement/table name specified: <0249b5018e/generated_types/protos/influxdata/platform/storage/service.proto (L43)>

so testing with a `_measurement` predicate is not valid.

I thought this test would become redundant with some other tests, but
they're actually still different enough; I took this opportunity to
better highlight the differences in the test names.

* refactor: Move all measurement fields tests to their own file

* test: Remove field columns tests that are now covered in end-to-end measurement fields tests
2023-01-25 19:49:29 +00:00
kodiakhq[bot] 0249b5018e
Merge pull request #6655 from influxdata/cn/one-test
test: Start of porting InfluxRpc query_tests
2023-01-25 15:56:44 +00:00
kodiakhq[bot] 98c60f9dc5
Merge branch 'main' into cn/one-test 2023-01-25 15:49:51 +00:00
Dom 7c7d737d0e
Merge pull request #6702 from influxdata/dom/persist-enqueue-durations
refactor: appropriate queue wait histogram buckets
2023-01-25 15:49:14 +00:00
Carol (Nichols || Goulding) f803c31e84
fix: Limit tests in CI to 8 threads to not use up Postgres connections
This is only needed until we switch over to ingester2 completely.

Old ingester tests need to be run on non-shared servers because I'm
unable to implement persistence per-namespace. Rather than spending time
figuring that out, limit the parallelization to limit the Postgres
connections that CI uses at one time.
2023-01-25 10:37:05 -05:00
Carol (Nichols || Goulding) 4658510102
fix: For Ingester2, persist a particular namespace on demand and share MiniClusters
This should hopefully keep CI from running out of Postgres
connections 😬

The old architecture will still need to be non-shared and persist
everything.
2023-01-25 10:36:56 -05:00
Dom Dwyer df87ca3f17
refactor: appropriate queue wait histogram buckets
Changes the bucket values for the queue wait duration metric to be more
appropriately scaled.
2023-01-25 16:31:49 +01:00
Carol (Nichols || Goulding) f310e01b1a
test: Start of porting InfluxRpc query_tests
Make a new trait, `InfluxRpcTest`, that types can implement to define
how to run a test on a specific Storage gRPC API. `InfluxRpcTest` takes
care of iterating through the two architectures, running the setups, and
creating the custom test step.

Implementers of the trait can define aspects of the tests that differ
per run, to make the parameters of the test clearer and highlight what
different tests are testing.
2023-01-25 10:27:42 -05:00
Dom 8ee6c1ec68
Merge pull request #6701 from influxdata/dom/persist-config
feat: export persist config metrics
2023-01-25 15:25:33 +00:00
Dom dd445de275
Merge branch 'main' into dom/persist-config 2023-01-25 14:56:48 +00:00
Marco Neumann 7306ea9424
feat: divide&conquer framework (#6697)
Allows compactor2 to run a fixed-point loop (until all work is done) and
in every iteration it can run multiple jobs.

The jobs are currently organized by "branches". This is because our
upcoming OOM handling may split a branch further if it doesn't complete.

Also note that the current config resembles the state prior to this PR.
So the FP-loop will only iterate ONCE and then run out of L0 files. A
more advanced setup can be built using the framework though.
2023-01-25 14:45:20 +00:00
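The fixed-point loop with per-round branches from the divide&conquer commit above (#6697) might look roughly like the following sketch. The `Branch` type, the numeric file IDs, and both limits are assumptions made for illustration; the real framework works on partition files and runs actual compaction jobs where the placeholder sits.

```rust
/// Hypothetical unit of work: one branch covers a group of files.
#[derive(Debug, Clone, PartialEq)]
struct Branch {
    files: Vec<u32>,
}

/// Divide the files picked for this round into branches of bounded size.
fn divide(files: &[u32], branch_size: usize) -> Vec<Branch> {
    files
        .chunks(branch_size)
        .map(|chunk| Branch { files: chunk.to_vec() })
        .collect()
}

/// Loop until no pending files remain. Returns (rounds, jobs run).
fn run_fixed_point(
    mut pending: Vec<u32>,
    files_per_round: usize,
    branch_size: usize,
) -> (usize, usize) {
    assert!(files_per_round > 0 && branch_size > 0, "limits must be non-zero");
    let mut rounds = 0;
    let mut jobs = 0;
    while !pending.is_empty() {
        // Pick the work for this round.
        let take = files_per_round.min(pending.len());
        let this_round: Vec<u32> = pending.drain(..take).collect();
        for branch in divide(&this_round, branch_size) {
            // Placeholder for the real compaction job of this branch.
            debug_assert!(!branch.files.is_empty());
            jobs += 1;
        }
        rounds += 1;
    }
    (rounds, jobs)
}
```

When `files_per_round` covers every pending file, the loop iterates exactly once, which matches the commit's remark about the current config. A fuller version would presumably feed job outputs back into `pending`; this sketch leaves that out.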
Dom Dwyer 7b69c84ceb
feat: export persist config metrics
Export the configured maximum persist parallelism, and the maximum queue
depth, so they can be used to compute % saturation in alerts /
dashboards.
2023-01-25 14:57:09 +01:00
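The "% saturation" mentioned in the persist config metrics commit above is a ratio of an observed value to its configured maximum. A minimal sketch of that arithmetic, with a hypothetical function name and assuming both values are available as plain integers:

```rust
/// Percentage of a configured limit currently in use.
/// Returns `None` when the configured maximum is zero, since the
/// ratio is undefined in that case.
fn saturation_percent(current: u64, configured_max: u64) -> Option<f64> {
    if configured_max == 0 {
        return None;
    }
    Some(current as f64 / configured_max as f64 * 100.0)
}
```

For example, 3 in-flight persist jobs against a configured maximum parallelism of 4 gives 75%. In practice this division would more likely be written in the alerting or dashboard query language than in Rust.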
Dom c928eddaab
Merge pull request #6698 from influxdata/dom/circuit-fuzz
test: CircuitBreaker recovery property fuzz test
2023-01-25 12:49:38 +00:00
Dom f0d7ee59c3
Merge branch 'main' into dom/circuit-fuzz 2023-01-25 12:42:43 +00:00
Dom e6876db431
Merge pull request #6700 from influxdata/dom/probe-at-most-one
perf(router): faster balancer node recovery
2023-01-25 12:42:34 +00:00
Dom b34bb46833
Merge branch 'main' into dom/circuit-fuzz 2023-01-25 12:29:46 +00:00
Dom eb67a1fa3f
Merge branch 'main' into dom/probe-at-most-one 2023-01-25 12:23:26 +00:00
Dom Dwyer 6eb1773ec0
perf(router): faster balancer node recovery
Ensure a "probe" node is always returned as the first candidate, driving
it to recovery faster.

This also includes a fix for the balancer metrics that would report
probe candidate nodes as healthy nodes.
2023-01-25 13:18:24 +01:00
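The candidate ordering described in the balancer commit above could be sketched as follows. The `Health` and `Node` types are hypothetical, and limiting the result to a single probe node is an assumption taken from the branch name `dom/probe-at-most-one`, not from the router's code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Healthy,
    /// Recovering: should receive a trial request to confirm recovery.
    Probing,
    Unhealthy,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Node {
    name: &'static str,
    health: Health,
}

/// Candidates in the order they should be tried: at most one probe
/// node first, then all healthy nodes. Unhealthy nodes are skipped.
fn candidates(nodes: &[Node]) -> Vec<&Node> {
    let probe = nodes.iter().find(|n| n.health == Health::Probing);
    let healthy = nodes.iter().filter(|n| n.health == Health::Healthy);
    probe.into_iter().chain(healthy).collect()
}

/// Count for a "healthy nodes" metric: probe candidates are excluded.
fn healthy_count(nodes: &[Node]) -> usize {
    nodes
        .iter()
        .filter(|n| n.health == Health::Healthy)
        .count()
}
```

Putting the probe first means the next request exercises the recovering node straight away, and counting only `Healthy` nodes keeps probe candidates out of the healthy total, which is the metrics fix the commit mentions.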
Andrew Lamb 0c55a0f257
feat: Implement basic prepared statement support in IOx (#6667)
* feat: allow override of flightsql namespace

* feat: Implement DoAction endpoint

* refactor: Remove try_unpack

* fix: remove unused code / more clone
2023-01-25 12:00:43 +00:00
Andrew Lamb 6caf31acf3
chore: Move garbage collection configuration into clap_blocks (#6678)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-25 11:31:48 +00:00
dependabot[bot] f72a999fb3
chore(deps): Bump clap from 4.1.3 to 4.1.4 (#6694)
Bumps [clap](https://github.com/clap-rs/clap) from 4.1.3 to 4.1.4.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](https://github.com/clap-rs/clap/compare/v4.1.3...v4.1.4)

---
updated-dependencies:
- dependency-name: clap
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-25 11:03:41 +00:00
Dom 40c7c8b2e2
Merge branch 'main' into dom/circuit-fuzz 2023-01-25 10:57:19 +00:00
Andrew Lamb 509c80bc55
docs: document how the garbage collector works (#6682)
* docs: document how the garbage collector works

* fix: Updates

* docs: Update docs/garbage_collector.md

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-25 10:54:43 +00:00
Dom Dwyer f5d4171be0
test: CircuitBreaker recovery property fuzz test
Adds a multi-threaded fuzz test that ensures a circuit breaker can
always transition to the healthy state, regardless of the sequence of
events prior.
2023-01-25 11:53:57 +01:00
Marco Neumann 40e6a1a437
feat: job semaphore (#6696)
* refactor: avoid too-many-arguments

* refactor: extract `fetch_partition_info`

* feat: job semaphore
2023-01-25 10:35:07 +00:00
Dom 75fc4ba17f
Merge pull request #6695 from influxdata/dependabot/cargo/ahash-0.8.3
chore(deps): Bump ahash from 0.8.2 to 0.8.3
2023-01-25 09:28:04 +00:00
dependabot[bot] cae3071776
chore(deps): Bump ahash from 0.8.2 to 0.8.3
Bumps [ahash](https://github.com/tkaitchuck/ahash) from 0.8.2 to 0.8.3.
- [Release notes](https://github.com/tkaitchuck/ahash/releases)
- [Commits](https://github.com/tkaitchuck/ahash/compare/v0.8.2...v0.8.3)

---
updated-dependencies:
- dependency-name: ahash
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-01-25 01:08:55 +00:00
kodiakhq[bot] 33e29e890a
Merge pull request #6688 from influxdata/dom/rpc-endpoint-metrics
feat(metrics): router upstream RPC endpoint metrics
2023-01-24 23:51:38 +00:00
Luke Bond caea42665b
Merge branch 'main' into dom/rpc-endpoint-metrics 2023-01-25 10:44:18 +11:00
Christopher M. Wolff 9a942ceff5
refactor: propagate gapfill stride to exec (#6690)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-24 20:49:29 +00:00
Dom 39dd455297
Merge pull request #6689 from influxdata/dom/ingester-rediscovery
fix(router): force rediscovery of nodes
2023-01-24 19:21:17 +00:00
Dom 442e8a8b79
Merge branch 'main' into dom/ingester-rediscovery 2023-01-24 19:13:02 +00:00
Dom Dwyer 411f4bd08b
fix(router): force rediscovery of nodes
Similar to https://github.com/influxdata/influxdb_iox/pull/6509, this
forces a constant re-querying of the DNS address of an ingester to drive
rediscovery.

Unlike the above PR, this only reconnects when there are errors
observed. This still isn't ideal - something is wrong with the discovery
itself - this just papers over it.
2023-01-24 20:11:53 +01:00
kodiakhq[bot] 20ac3608ab
Merge pull request #6687 from influxdata/dom/timeouts
refactor(router): set sensible RPC timeouts
2023-01-24 18:32:09 +00:00
Dom Dwyer 9132343dac
feat(metrics): export RPC upstream health state
Adds a metric with a per-ingester label recording the current health
state of the upstream ingester from the perspective of the router
instance.

Also logs periodically when one or more ingesters are offline.
2023-01-24 19:27:15 +01:00
kodiakhq[bot] 77b6b234d5
Merge branch 'main' into dom/timeouts 2023-01-24 18:24:42 +00:00
Andrew Lamb c3bc61f10e
refactor: Move `flightsql` code into its own module, add docs and tests (#6640)
* refactor: Move `flightsql` code into its own module

* fix: get schema from LogicalPlan

* refactor: use arrow_flight::sql::Any instead of prost_types::any

* fix: cleanup docs and avoid as_ref

* fix: Use Bytes

* fix: use Any::pack

* fix: doclink
2023-01-24 18:24:32 +00:00
Dom Dwyer f26b54beec
refactor(router): set sensible RPC timeouts
Copies these over from the client_util package.
2023-01-24 19:22:27 +01:00
Dom Dwyer 87b553fe9d
feat: WARN logs w/ endpoint for unhealthy upstream
Changes the DEBUG log event to a WARN now that it includes the endpoint
to which the event applies.
2023-01-24 19:19:31 +01:00
Marco Neumann 4521516147
feat: add per-partition timeout (#6686)
It seems that prod was hanging last night. This is pretty hard to debug
and in general we should protect the compactor against hanging /
malformed partitions that take forever. This is similar to the fact that
the querier also has a timeout for every query. Let's see if this shows
anything in prod (and if not it's still a desired safety net).

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-24 16:53:47 +00:00
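The safety net described in the per-partition timeout commit above (#6686) can be illustrated with a std-only sketch. The compactor itself is async, so this thread-and-channel version is a swapped-in technique chosen to keep the example self-contained; `PartitionOutcome` and `run_with_timeout` are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum PartitionOutcome<T> {
    Done(T),
    TimedOut,
}

/// Run `job` on a worker thread and stop waiting after `timeout`.
fn run_with_timeout<T, F>(job: F, timeout: Duration) -> PartitionOutcome<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Sending fails if the caller already gave up; ignore that.
        let _ = tx.send(job());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => PartitionOutcome::Done(value),
        // Covers both an elapsed timeout and a worker that panicked
        // before sending (channel disconnected).
        Err(_) => PartitionOutcome::TimedOut,
    }
}
```

One difference from an async timeout: here the worker thread keeps running after the caller gives up, whereas dropping a timed-out future cancels the work. The sketch only shows how a hanging partition stops blocking the caller.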
kodiakhq[bot] 6a2e0ae5cc
Merge pull request #6685 from influxdata/dom/rpc-balancer-2
perf(router): rpc balancer & circuit breaking take 2
2023-01-24 16:13:07 +00:00
Dom b0e5e860cb
Merge branch 'main' into dom/rpc-balancer-2 2023-01-24 16:04:32 +00:00
Dom Dwyer 085de40127
feat: lazy-connect to ingester gRPC endpoints
Lazily establish connections in the background, instead of using tonic's
connect_lazy().

connect_lazy() causes error handling to take a different path in tonic
compared to "normal" connections, and this stops reconnections from
being performed when the endpoint goes away (likely a bug).

It also means the first few write requests won't have to wait while the
connection is dialed, which brings down the P99 as a nice side-effect.
2023-01-24 16:44:55 +01:00
Marco Neumann 1c87d9667f
refactor: record partition completion (both Ok and Err) (#6680)
With the upcoming divide-and-conquer approach, we may have multiple
commits per partition since we can divide it into multiple compaction
jobs. For metrics (and logs) however it is important to track the
overall process, so we shall also monitor the number of completed
partitions.
2023-01-24 15:06:15 +00:00
kodiakhq[bot] dcc1eb9a21
Merge pull request #6679 from influxdata/dom/ingester-metrics
feat(ingester2): metrics
2023-01-24 14:57:12 +00:00