Commit Graph

216 Commits (d7904060854c457a2027d35dcd55598060738158)

Author SHA1 Message Date
Dom Dwyer a85dcd745b
refactor(catalog): expose deleted_at on Namespace
Add the new catalog column to the Namespace representation/model.
2023-02-10 14:15:01 +01:00
dependabot[bot] 6327e3d9c0
chore(deps): Bump serde_json from 1.0.92 to 1.0.93 (#6918)
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.92 to 1.0.93.
- [Release notes](https://github.com/serde-rs/json/releases)
- [Commits](https://github.com/serde-rs/json/compare/v1.0.92...v1.0.93)

---
updated-dependencies:
- dependency-name: serde_json
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-02-09 10:39:33 +00:00
dependabot[bot] 0ecde75af5
chore(deps): Bump object_store from 0.5.3 to 0.5.4 (#6900)
Bumps [object_store](https://github.com/apache/arrow-rs) from 0.5.3 to 0.5.4.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/object_store_0.5.3...object_store_0.5.4)

---
updated-dependencies:
- dependency-name: object_store
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-02-08 09:40:11 +00:00
Dom Dwyer bf6ab7fd88
fix: error message typo
columns -> tables for table limit error.
2023-02-06 18:03:26 +01:00
Dom Dwyer 3f2eb54bce
test(router): catalog service limit errors
Assert the service limit error messages from the catalog.
2023-02-06 17:55:09 +01:00
Dom Dwyer 3881e11734
test(router): service limit error messages
Assert the user-facing service limit error messages.
2023-02-06 17:43:37 +01:00
Dom Dwyer 114bafe9a1
perf(router): cached table limit enforcement
Use the namespace schema cache in the router to enforce the
per-namespace table limit (service protection limit), adding O(1)
overhead to the existing column limit evaluation logic.

Prior to this commit, each request that would breach the table limit
would be (potentially partially) applied to the catalog and return an
error. Every subsequent request creating a new table continued to cause
a catalog query, unnecessarily adding load proportional to request
counts.

After this commit, catalog requests are sent when the router instance
can determine (to the best of its ability, see below) that the request
will not cause the namespace to exceed the table limit.

Because this uses cached schemas, the actual set of tables may
have changed - this will cause inconsistent enforcement and spurious
errors in the same way it currently does for the column limit. For more
details (and to track a resolution) see:

    https://github.com/influxdata/influxdb_iox/issues/5957
2023-02-06 17:43:26 +01:00
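The cached table limit check described in 114bafe9a1 can be illustrated with a minimal sketch. Everything named here (`CachedNamespaceSchema`, `TableLimitError`, `check_table_limit`) is hypothetical and not taken from the router's code, and the sketch makes no attempt to match the O(1) overhead the commit mentions. It only shows the decision: count the tables a write would add on top of the (possibly stale) cached set, and reject before any catalog request is made.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a cached namespace schema.
pub struct CachedNamespaceSchema {
    /// Table names known to the cache (may lag behind the catalog).
    pub tables: BTreeSet<String>,
    /// Maximum number of tables allowed in the namespace.
    pub max_tables: usize,
}

/// Hypothetical error returned when a write would breach the limit.
#[derive(Debug, PartialEq, Eq)]
pub struct TableLimitError {
    pub would_have: usize,
    pub limit: usize,
}

/// Returns `Ok(())` if, according to the cache, the write would not
/// push the namespace past its table limit.
pub fn check_table_limit<'a>(
    schema: &CachedNamespaceSchema,
    tables_in_write: impl IntoIterator<Item = &'a str>,
) -> Result<(), TableLimitError> {
    // Distinct tables in the write that the cache has not seen yet.
    let new_tables: BTreeSet<&str> = tables_in_write
        .into_iter()
        .filter(|t| !schema.tables.contains(*t))
        .collect();

    let would_have = schema.tables.len() + new_tables.len();
    if would_have > schema.max_tables {
        return Err(TableLimitError {
            would_have,
            limit: schema.max_tables,
        });
    }
    Ok(())
}
```

Because the decision is made from cached state, a stale cache can both wrongly accept and wrongly reject a write, which is the inconsistency the commit message points to.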
Dom Dwyer dfa4ab2585
perf(router): fast-path column limit for new table
When validating column limits for new tables, skip the column set
generation and union operations against the empty existing column set.
2023-02-06 17:33:56 +01:00
Dom Dwyer a633964f2b
feat(catalog): return max table limit in schema
The maximum number of tables is part of the Namespace, which is already
loaded in its entirety. This commit copies the value into the
NamespaceSchema, making it available for the router to utilise.
2023-02-06 17:33:55 +01:00
dependabot[bot] 6f4e287a3a
chore(deps): Bump serde_json from 1.0.91 to 1.0.92 (#6860)
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.91 to 1.0.92.
- [Release notes](https://github.com/serde-rs/json/releases)
- [Commits](https://github.com/serde-rs/json/compare/v1.0.91...v1.0.92)

---
updated-dependencies:
- dependency-name: serde_json
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-02-06 08:27:41 +00:00
Carol (Nichols || Goulding) 30fea67701
fix: Move variables within format strings. Thanks clippy!
Changes made automatically using `cargo clippy --fix`.
2023-02-03 13:06:17 -05:00
Dom Dwyer 7f363b55df
test(router): e2e namespace retention coverage
Assert the correct handling of 0 and negative retention periods when
interacting with the namespace create & update gRPC handlers.
2023-02-01 11:49:53 +01:00
dependabot[bot] d0e6b16450
chore(deps): Bump bytes from 1.3.0 to 1.4.0
Bumps [bytes](https://github.com/tokio-rs/bytes) from 1.3.0 to 1.4.0.
- [Release notes](https://github.com/tokio-rs/bytes/releases)
- [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md)
- [Commits](https://github.com/tokio-rs/bytes/compare/v1.3.0...v1.4.0)

---
updated-dependencies:
- dependency-name: bytes
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-02-01 00:30:56 +00:00
dependabot[bot] 875b6a3e99
chore(deps): Bump futures from 0.3.25 to 0.3.26 (#6766)
Bumps [futures](https://github.com/rust-lang/futures-rs) from 0.3.25 to 0.3.26.
- [Release notes](https://github.com/rust-lang/futures-rs/releases)
- [Changelog](https://github.com/rust-lang/futures-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/futures-rs/compare/0.3.25...0.3.26)

---
updated-dependencies:
- dependency-name: futures
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-01-31 11:33:50 +00:00
Dom Dwyer 0ddef54b09
fix(router): envoy network error translation
Envoy will connect to an endpoint on demand, and return an
application-level error if it fails with a gRPC status code of
"Unavailable".

It also embeds a metadata entry of {"server": "envoy"} - this commit
uses the two signals (error status code + metadata entry) to drive an
immediate reconnection when observed, assuming the connection is bad.
2023-01-30 12:15:38 +01:00
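The two-signal check described in 0ddef54b09 could look roughly like the following. The `StatusCode` and `RpcError` types are hypothetical stand-ins written for this sketch (the router presumably inspects its gRPC library's own status type, which is not shown in this log). Only the "Unavailable" code and the {"server": "envoy"} metadata entry come from the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical subset of gRPC status codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatusCode {
    Ok,
    Unavailable,
    Internal,
}

/// Hypothetical application-level RPC error with response metadata.
pub struct RpcError {
    pub code: StatusCode,
    pub metadata: HashMap<String, String>,
}

/// Returns true when the error looks like Envoy failing to reach the
/// upstream: an "Unavailable" status plus a `server: envoy` entry.
pub fn should_reconnect(err: &RpcError) -> bool {
    let from_envoy = err.metadata.get("server").map(String::as_str) == Some("envoy");
    err.code == StatusCode::Unavailable && from_envoy
}
```

Requiring both signals keeps an "Unavailable" status returned by the upstream itself from triggering a reconnection in this sketch.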
Dom Dwyer 353b1ad575
feat: configurable RPC write request timeout
Allows the user to configure the timeout used for a single RPC write
request, and changes the default to a more sensible value (30 -> 3
seconds).
2023-01-27 14:53:48 +01:00
Dom 5757674d5e
Merge branch 'main' into dom/service-limit-metric-labels
2023-01-27 10:04:56 +00:00
Dom Dwyer 8140313775
test: drive catalog in ns rejection test
Use the actual catalog resolver, not the mock, to assert the correct
behaviour with a populated catalog.
2023-01-26 17:55:39 +01:00
Dom Dwyer 0aa5469ac6
test(e2e): explicit namespace creation
Adds an end-to-end test of the router's gRPC NamespaceService covering
creation and reading of new namespaces.
2023-01-26 17:32:12 +01:00
Dom Dwyer 7eaa8f59b0
fix: explicit namespace creation w/ existing ns
Prior to this commit, namespaces that had been created on one router
could not be used on another router until the latter was restarted.
Effectively, newly created namespaces couldn't be used.

After this commit, the catalog is also checked when a cache miss occurs,
ensuring the router discovers new, not-yet-cached namespaces.
2023-01-26 17:32:12 +01:00
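The fix in 7eaa8f59b0 amounts to a read-through lookup: consult the cache first and fall back to the catalog on a miss. A minimal sketch, assuming a hypothetical `Catalog` trait and `NamespaceResolver` type (neither name is taken from the router's code), with a `HashMap` standing in for the catalog:

```rust
use std::collections::HashMap;

/// Hypothetical namespace identifier.
pub type NamespaceId = i64;

/// Hypothetical catalog lookup interface.
pub trait Catalog {
    fn lookup(&self, name: &str) -> Option<NamespaceId>;
}

/// In-memory catalog used purely for illustration.
impl Catalog for HashMap<String, NamespaceId> {
    fn lookup(&self, name: &str) -> Option<NamespaceId> {
        self.get(name).copied()
    }
}

/// Resolves namespace names, caching results locally.
pub struct NamespaceResolver<C> {
    cache: HashMap<String, NamespaceId>,
    catalog: C,
}

impl<C: Catalog> NamespaceResolver<C> {
    pub fn new(catalog: C) -> Self {
        Self { cache: HashMap::new(), catalog }
    }

    pub fn resolve(&mut self, name: &str) -> Option<NamespaceId> {
        // Fast path: the namespace is already cached.
        if let Some(id) = self.cache.get(name) {
            return Some(*id);
        }
        // Cache miss: the namespace may have been created elsewhere
        // (for example via another router), so ask the catalog.
        let id = self.catalog.lookup(name)?;
        self.cache.insert(name.to_string(), id);
        Some(id)
    }
}
```

In this sketch a namespace missing from both the cache and the catalog is not cached, so a later creation is still discovered on the next lookup.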
Dom Dwyer 105e354299
refactor: clean up namespace errors
The namespace error was poorly refactored and duplicated the prefix
string. The "rejected" case is now also tested.
2023-01-26 17:32:11 +01:00
Dom Dwyer 1a7679bcee
refactor: expose underlying gRPC implementations
Changes the gRPC delegate to return the underlying service (type erased)
implementations instead of the RPC service wrappers.
2023-01-26 17:32:11 +01:00
Dom Dwyer ac8fa293cb
refactor(test): TestContext::write_lp() helper
Adds a helper method to construct the HTTP write request.
2023-01-26 17:32:10 +01:00
Dom Dwyer 6f1869f9dc
test(router): initialise gRPC delegate in e2e
Initialise the "rpc mode" gRPC handlers in the router e2e TestContext.
2023-01-26 17:32:10 +01:00
Dom Dwyer 3efc42baac
refactor(test): dedicated e2e TestContext module
Moves the router's TestContext to its own file/module.
2023-01-26 17:32:10 +01:00
Dom Dwyer c66f4a3d92
fix(router): restore NamespaceService
This was removed in the RPC variant of the router - no idea why, we
definitely should have it!
2023-01-26 15:10:22 +01:00
Dom Dwyer b6018e1c39
feat(metrics): separate service limit counters
Service limits are enforced on two values:

    * Number of tables in a namespace
    * Number of columns in a table

This commit labels the existing service limit hit metric with the type
of limit reached, and adds this information to the log lines emitted.
2023-01-26 14:48:33 +01:00
Dom f0d7ee59c3
Merge branch 'main' into dom/circuit-fuzz
2023-01-25 12:42:43 +00:00
Dom Dwyer 6eb1773ec0
perf(router): faster balancer node recovery
Ensure a "probe" node is always returned as the first candidate, driving
it to recovery faster.

This also includes a fix for the balancer metrics that would report
probe candidate nodes as healthy nodes.
2023-01-25 13:18:24 +01:00
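The candidate ordering described in 6eb1773ec0 can be sketched as follows. The `Health` enum and `candidates` function are hypothetical, and limiting the result to a single probe node is an assumption of this sketch; the commit only states that a probe node is returned as the first candidate.

```rust
/// Hypothetical health states for an upstream node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    /// Unhealthy, but currently allowed to receive probe requests.
    Probing,
    Unhealthy,
}

/// Returns candidate node names: one probe node first (if any), then
/// all healthy nodes. Unhealthy nodes are never yielded.
pub fn candidates<'a>(nodes: &'a [(&'a str, Health)]) -> Vec<&'a str> {
    let probe = nodes.iter().filter(|n| n.1 == Health::Probing).take(1);
    let healthy = nodes.iter().filter(|n| n.1 == Health::Healthy);
    probe.chain(healthy).map(|n| n.0).collect()
}
```

Putting the probe node first means a request is actually sent to it, so its recovery is observed sooner than if it were tried only after the healthy nodes.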
Dom Dwyer f5d4171be0
test: CircuitBreaker recovery property fuzz test
Adds a multi-threaded fuzz test that ensures a circuit breaker can
always transition to the healthy state, regardless of the sequence of
events prior.
2023-01-25 11:53:57 +01:00
Luke Bond caea42665b
Merge branch 'main' into dom/rpc-endpoint-metrics
2023-01-25 10:44:18 +11:00
Dom 442e8a8b79
Merge branch 'main' into dom/ingester-rediscovery
2023-01-24 19:13:02 +00:00
Dom Dwyer 411f4bd08b
fix(router): force rediscovery of nodes
Similar to https://github.com/influxdata/influxdb_iox/pull/6509, this
forces a constant re-querying of the DNS address of an ingester to drive
rediscovery.

Unlike the above PR, this only reconnects when there are errors
observed. This still isn't ideal - something is wrong with the discovery
itself - this just papers over it.
2023-01-24 20:11:53 +01:00
Dom Dwyer 9132343dac
feat(metrics): export RPC upstream health state
Adds a metric with a per-ingester label recording the current health
state of the upstream ingester from the perspective of the router
instance.

Also logs periodically when one or more ingesters are offline.
2023-01-24 19:27:15 +01:00
Dom Dwyer f26b54beec
refactor(router): set sensible RPC timeouts
Copies these over from the client_util package.
2023-01-24 19:22:27 +01:00
Dom Dwyer 87b553fe9d
feat: WARN logs w/ endpoint for unhealthy upstream
Changes the DEBUG log event to a WARN now that it includes the endpoint
to which the event applies.
2023-01-24 19:19:31 +01:00
Dom Dwyer 085de40127
feat: lazy-connect to ingester gRPC endpoints
Lazily establish connections in the background, instead of using tonic's
connect_lazy().

connect_lazy() causes error handling to take a different path in tonic
compared to "normal" connections, and this stops reconnections from
being performed when the endpoint goes away (likely a bug).

It also means the first few write requests won't have to wait while the
connection is dialed, which brings down the P99 as a nice side-effect.
2023-01-24 16:44:55 +01:00
Dom Dwyer 8215f4126e
test: router balancer recovery
Ensure a recovering node is yielded from the balancer.
2023-01-24 15:30:01 +01:00
Dom Dwyer c6d6c50fbf
perf(router): circuit break ingester connections
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.

This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
2023-01-24 15:30:01 +01:00
Dom Dwyer 107006c801
revert: influxdata/dom/rpc-balancer
This reverts commit a3805dbccf, reversing
changes made to bcb1232c5d.
2023-01-24 14:47:05 +01:00
Dom Dwyer b32662ebf2
test: router balancer recovery
Ensure a recovering node is yielded from the balancer.
2023-01-24 13:38:36 +01:00
Dom Dwyer 7596dc0826
perf(router): circuit break ingester connections
Adds on-path health checking / recording using the CircuitBreaker
construct, stopping requests to unhealthy upstreams (minus the probe
requests) until they recover.

This removes the horrible gRPC balancer hack I added to get us deployed
ASAP, and should eliminate latency spikes and elevated error responses
observed during deployments as a result.
2023-01-24 12:38:27 +01:00
Dom Dwyer c3a2ac3a0d
refactor: prevent div by 0
Preserve the error ratio calculation but prevent a div by 0 by ensuring
the divisor is always at least 1.
2023-01-24 12:09:00 +01:00
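The guard described in c3a2ac3a0d can be illustrated with a small function. The name `error_ratio` and its signature are hypothetical; the sketch only shows the idea of clamping the divisor to at least 1 so that an empty window gives a ratio of 0 instead of a division by zero.

```rust
/// Fraction of observed requests that failed, in the range 0.0..=1.0.
pub fn error_ratio(errors: u64, successes: u64) -> f64 {
    let total = errors.saturating_add(successes);
    // With no observations `total` is 0; clamping the divisor to 1
    // gives 0 / 1 = 0.0 and leaves every other ratio unchanged.
    errors as f64 / total.max(1) as f64
}
```

For any non-empty window the clamp has no effect, so the ratio calculation itself is preserved.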
Dom Dwyer c4b04a16c5
refactor: rename last_probe instant
last_probe was "the instant at which the last set of probes started
being sent" in my head, but Carol saw it as "first_probe - the time at
which probes started being sent".

Hopefully probe_window_started_at is less ambiguous.
2023-01-24 12:08:10 +01:00
Dom Dwyer 2f3fb48091
docs: document error count floor
Describe the floor on the number of errors that must be observed before
the circuit breaker will consider switching to the unhealthy state.
2023-01-24 12:08:09 +01:00
Carol (Nichols || Goulding) caf8dc9032
fix: Rename incorrect usage of 'close' to 'unhealthy' in test helper
2023-01-23 16:08:00 -05:00
Carol (Nichols || Goulding) 081b4f15da
docs: Clarify my understanding of the circuit breaker based on chat with Dom
2023-01-23 16:07:02 -05:00
Dom Dwyer 67b73d90dd
feat: low-overhead circuit breaker
Implements a "circuit breaker", a construct that tracks the error &
success of requests to a remote node, and uses this information to allow
or deny further requests.

This circuit breaker stops sending requests to the remote when the error
count exceeds 80% of requests in a 5 second window. Once this happens,
up to 10 "probe" requests per second are allowed, and when they succeed,
normal operation resumes (though concurrent requests may still be
completing during the probe regime and are counted towards the probe
results).

In the happy path, this circuit breaker is very cheap (lock free; WFPO)
to evaluate and record request results in, minimising the throughput
penalty. Once the breaker enters an unhealthy state (hopefully a rare
occurrence) it uses a mutex to manage the probe state (with a higher
overhead) for simplicity; it's definitely possible to optimise this away
if high latencies are observed during upstream outages when the circuit
breaker is open/unhealthy.
2023-01-23 13:55:12 +01:00
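The happy-path bookkeeping described in 67b73d90dd can be sketched with two atomic counters. This is a heavily simplified illustration, not the implementation from the commit: `RequestCounters` and its methods are hypothetical, the 5 second window is represented only by an explicit `reset_window()` call, and the probe regime (up to 10 probes per second, managed behind a mutex) is omitted entirely. Only the 80% error threshold comes from the commit message.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical request-outcome counters for one upstream, covering
/// the current observation window.
#[derive(Debug, Default)]
pub struct RequestCounters {
    ok: AtomicU64,
    err: AtomicU64,
}

impl RequestCounters {
    /// Record the outcome of one request without taking a lock.
    pub fn observe(&self, success: bool) {
        let counter = if success { &self.ok } else { &self.err };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Healthy unless errors exceed 80% of requests in the window.
    pub fn is_healthy(&self) -> bool {
        // The two loads are not one atomic snapshot; this sketch
        // accepts a slightly inconsistent view of the counters.
        let ok = self.ok.load(Ordering::Relaxed);
        let err = self.err.load(Ordering::Relaxed);
        let total = ok + err;
        // err / total > 0.8 is equivalent to err * 5 > total * 4,
        // which avoids both floating point and division by zero.
        err * 5 <= total * 4
    }

    /// Start a new observation window.
    pub fn reset_window(&self) {
        self.ok.store(0, Ordering::Relaxed);
        self.err.store(0, Ordering::Relaxed);
    }
}
```

With 8 errors out of 10 requests the ratio equals, but does not exceed, 80%, so `is_healthy()` still returns true; one more error tips it over.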
Dom Dwyer 6ef68513d9
fix: gRPC balancer shutdown panic
The gRPC node discovery hack spawns a task that outlives the gRPC
balancer - once the balancer stops, the task should stop too (and not
panic sending on the closed channel).
2023-01-11 16:42:39 +01:00
Dom Dwyer 9ab86fa154
fix(router2): drive ingester node (re)-discovery
The tonic / tower load-balance implementation discards failed nodes,
even when using a static list - this causes nodes that fail once to
never be retried.

This doesn't happen for the last node for some reason, and leads to all
the load from one router hitting a single ingester instead of load
balancing across all ingesters.

This commit adds a hack to constantly tell the load balancer to probe
all nodes, hopefully causing them to re-discover previously failed
nodes. I don't have the time to do this properly :(
2023-01-05 14:06:29 +01:00