When an upstream ingester goes offline, the "circuit breaker" detects
it as unhealthy and prevents further requests from being sent to it.
Periodically, a small number of requests ("probe requests") are allowed
through to check for recovery.
If a write request is selected as a "probe request", it SHOULD be sent -
only a limited number of writes are selected as probes, and enough of
them must succeed to drive recovery. If no probes are ever sent (or
none ever succeed), the upstream will never be marked as healthy.
Additionally, the RPC handler applies an optimisation: if the number of
ingesters selected to service a write is fewer than the number needed
to reach the desired replication factor, no requests are sent and an
error is returned immediately, preventing unnecessary system load from
writes that could never succeed.
This optimisation conflicts with the probe request requirement when a
replication factor of >= 2 is specified:
* All ingesters are offline
* Write comes in
* UpstreamSnapshot is populated with a probe request for 1 ingester
only - no other healthy candidate ingesters exist.
* Optimisation applied: 1 probe candidate < 2 needed for replication
This results in the probe request never being sent, which in turn
means the recovered upstream is never marked as healthy and never
receives further requests.
This fix changes the optimisation, applying it only when there are no
probes in the candidate ingester list - the write will always fail, but
it will drive detection of recovered ingesters and maintain liveness of
the system.
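A minimal sketch of the changed fast path, assuming hypothetical names
for the handler's error type and candidate counts:

```rust
#[derive(Debug)]
enum RpcError {
    NotEnoughReplicas,
}

/// Hypothetical helper: decide whether to reject a write without
/// sending any upstream requests.
fn early_reject(
    healthy_candidates: usize,
    probe_candidates: usize,
    replication_factor: usize,
) -> Result<(), RpcError> {
    let total = healthy_candidates + probe_candidates;

    // Only short-circuit when no probes would be suppressed by doing so.
    if total < replication_factor && probe_candidates == 0 {
        return Err(RpcError::NotEnoughReplicas);
    }

    // Otherwise send the requests: a write that cannot reach the
    // replication factor still fails, but any selected probes are
    // delivered, driving recovery detection.
    Ok(())
}
```

In the scenario above, the single probe candidate now causes the
requests to be sent (and the write to fail upstream), rather than the
write being rejected before the probe ever leaves the router.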
Prior to this commit, NamespaceCache was implemented only for
Arc<MemoryNamespaceCache>, rather than for the cache type itself.
In the vast majority of cases, this Arc wrapper is completely
unnecessary - it adds both runtime overhead, and code/type complexity.
This commit impls NamespaceCache for any Arc-wrapped NamespaceCache, and
removes all unnecessary Arc wrapping of the MemoryNamespaceCache.
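A minimal sketch of the blanket impl, assuming a simplified trait shape
(the real trait is richer than this):

```rust
use std::sync::Arc;

trait NamespaceCache {
    fn get_schema(&self, namespace: &str) -> Option<String>;
}

// Any Arc-wrapped NamespaceCache is itself a NamespaceCache,
// delegating to the inner value - callers needing shared ownership can
// still wrap, but a bare MemoryNamespaceCache now satisfies the trait
// directly.
impl<T> NamespaceCache for Arc<T>
where
    T: NamespaceCache,
{
    fn get_schema(&self, namespace: &str) -> Option<String> {
        (**self).get_schema(namespace)
    }
}
```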
* chore(deps): Bump chrono from 0.4.26 to 0.4.27
Bumps [chrono](https://github.com/chronotope/chrono) from 0.4.26 to 0.4.27.
- [Release notes](https://github.com/chronotope/chrono/releases)
- [Changelog](https://github.com/chronotope/chrono/blob/main/CHANGELOG.md)
- [Commits](https://github.com/chronotope/chrono/compare/v0.4.26...v0.4.27)
---
updated-dependencies:
- dependency-name: chrono
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
* chore: Run cargo hakari tasks
* fix: Update deprecated chrono methods to their now-recommended versions
`chrono::DateTime::<Tz>::from_utc` has been deprecated and it's now
recommended to use `chrono::DateTime::from_naive_utc_and_offset`
instead.
<https://github.com/chronotope/chrono/pull/1175>
Note that the `Timestamp` type in `influxdb_influxql_parser` is an alias
for `chrono::DateTime<FixedOffset>`.
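A minimal before/after for a `DateTime<FixedOffset>` such as that
`Timestamp` alias:

```rust
use chrono::{DateTime, FixedOffset, NaiveDate};

fn timestamp() -> DateTime<FixedOffset> {
    let naive = NaiveDate::from_ymd_opt(2023, 8, 28)
        .expect("valid date")
        .and_hms_opt(12, 0, 0)
        .expect("valid time");
    let offset = FixedOffset::east_opt(0).expect("valid offset");

    // Deprecated as of chrono 0.4.27:
    //   DateTime::<FixedOffset>::from_utc(naive, offset)
    // Recommended replacement:
    DateTime::from_naive_utc_and_offset(naive, offset)
}
```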
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Adds a "mst" (merkle search tree) submodule in anti_entropy, and moves
all the MST code into it.
This makes space for a gossip-based sync primitive to live here too.
Separate the management of the Merkle Search Tree state into an actor
to manage concurrent access.
This moves hashing and tree updates off the hot request path and into
an asynchronous background process, practically eliminating the
overhead of maintaining the MST structure.
This decoupling will allow convergence runs between peers to proceed
without causing contention on the lock in the request hot path.
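A minimal sketch of the actor shape, assuming hypothetical names (the
tree itself is elided):

```rust
use tokio::sync::mpsc;

/// Operations sent from the request path to the actor.
enum MstOp {
    /// Record new content for a cache key.
    Upsert { key: String, content: Vec<u8> },
}

/// Cheap, cloneable handle held by request handlers: recording an
/// update is a non-blocking channel send, not a tree mutation under a
/// lock.
#[derive(Clone)]
struct MstHandle {
    tx: mpsc::Sender<MstOp>,
}

/// The actor exclusively owns the MST and applies updates serially in
/// a background task, off the request hot path.
struct MstActor {
    rx: mpsc::Receiver<MstOp>,
    // The MerkleSearchTree state lives here, with no shared lock.
}

impl MstActor {
    async fn run(mut self) {
        while let Some(MstOp::Upsert { key, content }) = self.rx.recv().await {
            // Hash the content and update the tree here.
            let _ = (key, content);
        }
    }
}
```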
Don't fail compilation / test runs because of an unreachable pub item
or a missing Debug impl - just emit a compiler warning. This lets
compilation complete locally, but such code isn't accepted in PRs, as
CI runs with "deny warnings".
Adds an integration test asserting the derived MST content hashes
accurately track updates to an underlying cache entry merge
implementation.
This ensures the merge implementation and the content hashes do not
fall out of sync.
Adds a (currently unused) NamespaceCache decorator that observes the
post-merge content of the cache to maintain a content hash.
This makes use of a Merkle Search Tree
(https://inria.hal.science/hal-02303490) to track the CRDT content of
the cache in a deterministic structure that allows for synchronisation
between peers (itself a CRDT).
The hash of two routers' NamespaceCache will be equal iff their cache
contents are equal - this can be used to (very cheaply) identify
out-of-sync routers, and trigger convergence. The MST structure used
here provides functionality to compare two compact MST representations,
and identify subsets of the cache that are out-of-sync, allowing for
cheap convergence.
Note this content hash only covers the tables, their set of columns, and
those column schemas - this reflects the fact that only these values may
currently be converged by gossip. Future work will enable full
convergence of all fields.
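A minimal sketch of the decorator shape, assuming simplified types
(string schemas and a plain map stand in for the real CRDT content and
the MST):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

trait NamespaceCache {
    /// Merges `schema` into the cached entry, returning the merged
    /// result.
    fn put_schema(&self, namespace: String, schema: String) -> String;
}

/// Observes the *post-merge* content of every put, so the tracked
/// structure always hashes the merged (CRDT) state of the cache.
struct ContentHashDecorator<T> {
    inner: T,
    // Stand-in for the Merkle Search Tree: namespace -> content hash.
    observed: Mutex<BTreeMap<String, u64>>,
}

impl<T: NamespaceCache> NamespaceCache for ContentHashDecorator<T> {
    fn put_schema(&self, namespace: String, schema: String) -> String {
        // Let the inner cache perform the merge first...
        let merged = self.inner.put_schema(namespace.clone(), schema);

        // ...then fold the merged content into the observed structure,
        // so equal cache contents always yield equal hashes.
        let mut hasher = DefaultHasher::new();
        merged.hash(&mut hasher);
        self.observed
            .lock()
            .unwrap()
            .insert(namespace, hasher.finish());

        merged
    }
}
```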
Now that there's a Topic, there's no need for a giant "all message
types" enum.
As part of this shift, the gossip_message::GossipMessage type used for
schema gossiping sounds overly generic. This commit renames it to
schema_message::SchemaMessage and updates the code accordingly.
This is a backwards-compatible change (and if anything goes wrong, the
"old" routers simply log a warning if a message is unreadable).
Adds "topic" support, allowing a node to subscribe to one or more types
of application payloads independently.
A gossip node is optionally initialised with a set of topics (defaulting
to "all topics") and this set of topic interests is propagated
throughout the cluster via the usual PEX mechanism, alongside the
existing connection & identity information.
When broadcasting an application payload, the sender only transmits it
to nodes that have registered an interest in this payload type. This
avoids wasting network bandwidth and CPU on uninterested nodes, and
allows multiple, distinct payload types to be propagated independently,
keeping the subsystems that rely on gossip decoupled from each other
(no giant, brittle payload enum type).
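A minimal sketch of the interest-based send filter, assuming
hypothetical types:

```rust
use std::collections::{HashMap, HashSet};

type Topic = u64;
type PeerId = u64;

struct PeerList {
    /// Topic interests learned for each peer via PEX.
    interests: HashMap<PeerId, HashSet<Topic>>,
}

impl PeerList {
    /// Yields only the peers that registered an interest in `topic`;
    /// everyone else never sees the payload on the wire.
    fn broadcast_targets(&self, topic: Topic) -> impl Iterator<Item = PeerId> + '_ {
        self.interests
            .iter()
            .filter(move |(_, topics)| topics.contains(&topic))
            .map(|(peer, _)| *peer)
    }
}
```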
Configuring the `ERROR_WINDOW` of the router's on-path health check
did not provide a consistent improvement for low-write-volume clusters.
Now that the `NUM_PROBES` parameter is configurable, `ERROR_WINDOW` can
be un-exposed to simplify the configuration options and clean up
boilerplate.
This commit allows schema gossiping to be enabled on router nodes.
Enabling gossiping allows any schema changes made on router A to be sent
to the N-1 other routers, populating their internal caches in
anticipation of handling a similar request.
By populating their cache, they avoid incurring a catalog lookup to
populate their local state upon a cache miss, therefore reducing request
latency, and reducing catalog load.
Enabling gossip on the routers automatically enables schema gossiping -
enabling gossip remains optional, and off by default.
This commit adds the SchemaChangeObserver, the delegate that is handed
a schema diff and is responsible for computing the gossip message and
handing it off to the gossip system.
This sits between the cache layer and the gossip layer, converting
schema things into gossip things.
This isn't connected up, so no messages will be sent.
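A minimal sketch of the delegate's shape, assuming hypothetical types
for the diff and the gossip handle:

```rust
/// Hypothetical diff emitted by the cache layer after a merge.
struct SchemaDiff {
    namespace: String,
    new_columns: Vec<String>,
}

/// Hypothetical handle into the gossip subsystem.
struct GossipHandle;

impl GossipHandle {
    fn broadcast(&self, _payload: Vec<u8>) {
        // Hand the serialised payload to the gossip layer.
    }
}

struct SchemaChangeObserver {
    gossip: GossipHandle,
}

impl SchemaChangeObserver {
    /// Called by the cache layer with the diff produced by a merge;
    /// converts it into a gossip message and hands it off.
    fn observe(&self, diff: SchemaDiff) {
        let payload =
            format!("{}:{}", diff.namespace, diff.new_columns.join(",")).into_bytes();
        self.gossip.broadcast(payload);
    }
}
```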
This commit adds the NamespaceSchemaGossip type, a decorator of
[`NamespaceCache`] implementations utilising peer gossiping to provide
best-effort convergence of the local cache state.
This decorator will sit in the NamespaceCache stack, allowing it to
receive incoming schema gossip messages, and update the local cache
through the regular NamespaceCache abstraction methods.
This currently implements the message handlers only - no messages are
sent yet!
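A minimal sketch of the receive path, assuming hypothetical message
and trait shapes:

```rust
/// Hypothetical application-level gossip messages.
enum SchemaMessage {
    TableCreated {
        namespace: String,
        table: String,
    },
    ColumnsAdded {
        namespace: String,
        table: String,
        columns: Vec<String>,
    },
}

/// Hypothetical subset of the NamespaceCache merge methods.
trait MergeCache {
    fn merge_table(&self, namespace: &str, table: &str);
    fn merge_columns(&self, namespace: &str, table: &str, columns: &[String]);
}

/// Incoming gossip is applied through the same merge path as local
/// writes, so gossip-driven convergence needs no special-case logic.
fn handle_gossip<C: MergeCache>(cache: &C, msg: SchemaMessage) {
    match msg {
        SchemaMessage::TableCreated { namespace, table } => {
            cache.merge_table(&namespace, &table);
        }
        SchemaMessage::ColumnsAdded {
            namespace,
            table,
            columns,
        } => {
            cache.merge_columns(&namespace, &table, &columns);
        }
    }
}
```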
This benchmark covers two axes of performance for calls to the
namespace cache's `put_schema()` stack: the cost of adding varying
numbers of new columns to an existing table in the namespace, and the
cost of adding new tables (each with their own set of columns) to an
existing namespace.
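A sketch of the benchmark shape using criterion, with the hypothetical
setup helpers elided:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_put_schema(c: &mut Criterion) {
    let mut group = c.benchmark_group("put_schema");

    // Axis 1: adding N new columns to an existing table.
    for n_columns in [1, 10, 100] {
        group.bench_function(format!("new_columns/{n_columns}"), |b| {
            b.iter(|| {
                // cache.put_schema(existing_table_with(n_columns))
            });
        });
    }

    // Axis 2: adding N new tables (each with its own columns) to an
    // existing namespace.
    for n_tables in [1, 10, 100] {
        group.bench_function(format!("new_tables/{n_tables}"), |b| {
            b.iter(|| {
                // cache.put_schema(namespace_with_new_tables(n_tables))
            });
        });
    }

    group.finish();
}

criterion_group!(benches, bench_put_schema);
criterion_main!(benches);
```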
Adds the supporting types required to integrate the generic gossip crate
into a schema-specific broadcast primitive.
This commit implements the two "halves":
* GossipMessageDispatcher: async processing of incoming gossip msgs
* Handle: the send-side handle for async sending of gossip msgs
These types are responsible for converting between the
application-level / protobuf types and the serialised bytes sent over
the gossip primitive.
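A minimal sketch of the two halves, assuming hypothetical shapes (the
real types are backed by protobuf):

```rust
use tokio::sync::mpsc;

/// Stand-in for the protobuf-derived application message.
struct SchemaMessage(Vec<u8>);

/// Send side: accepts typed messages, serialises them, and hands the
/// bytes off asynchronously for gossiping.
struct Handle {
    tx: mpsc::Sender<Vec<u8>>,
}

impl Handle {
    async fn send(&self, msg: SchemaMessage) {
        let bytes = msg.0; // encode to the wire format here
        let _ = self.tx.send(bytes).await;
    }
}

/// Receive side: decodes incoming gossip bytes back into typed
/// messages and dispatches them to the application handler.
struct GossipMessageDispatcher {
    rx: mpsc::Receiver<Vec<u8>>,
}

impl GossipMessageDispatcher {
    async fn run(mut self) {
        while let Some(bytes) = self.rx.recv().await {
            let msg = SchemaMessage(bytes); // decode from the wire here
            let _ = msg; // hand to the application-level handler
        }
    }
}
```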
This allows routers to be configured with the number of probe requests
that must be collected before the health checker's circuit transitions
a downstream to the healthy/unhealthy state.
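A minimal sketch of the resulting configuration, with hypothetical
field names:

```rust
/// Thresholds for the router's upstream circuit breaker.
struct CircuitBreakerConfig {
    /// Probe requests that must succeed before an unhealthy upstream
    /// is marked healthy again.
    probes_to_mark_healthy: usize,
    /// Probe failures that may be observed before a healthy upstream
    /// is marked unhealthy.
    probes_to_mark_unhealthy: usize,
}
```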
Allows the router to optionally enable and start the gossip subsystem
(disabled by default).
No code uses the gossip system, so no application-level messages are
exchanged, but this allows the gossip subsystem to run and exchange
control frames / perform discovery / etc.