Changes the UpstreamSnapshot to be suitable for concurrent use. This
type contains the core logic to enable a caller to uphold the
responsibility of ensuring replicated writes land on distinct ingesters
in the presence of concurrent replication.

The clients within the snapshot are returned to at most one concurrent
caller at a time, by tracking the state of each client as an FSM:

       ┌────────────────┐
    ┌─▶│   Available    │
    │  └────────────────┘
    │           │
  drop       next()
    │           │
    │           ▼
    │  ┌────────────────┐
    └──│    Yielded     │
       └────────────────┘
                │
             remove
                │
                ▼
       ┌────────────────┐
       │      Used      │
       └────────────────┘

Once a client has been yielded it will not be yielded again until it is
dropped (transitioning the FSM from "yielded" to "available" again,
returning it to the candidate pool of clients) or removed (transitioning
to "used", permanently preventing it from being yielded to another
caller).
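
The state machine above can be sketched roughly as follows. This is a
minimal illustration only, not the actual UpstreamSnapshot code: the
names Snapshot, SlotState and Yielded are hypothetical, and a single
std Mutex over the per-client states is an assumption made to keep the
example short.

```rust
use std::sync::{Arc, Mutex};

/// State of a single client slot, mirroring the FSM described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SlotState {
    Available,
    Yielded,
    Used,
}

/// Hypothetical snapshot: a set of clients plus one FSM state per client.
struct Snapshot<C> {
    clients: Vec<C>,
    states: Mutex<Vec<SlotState>>,
}

/// Handle to a yielded client. Dropping it returns the client to the
/// candidate pool, unless `remove` was called first.
struct Yielded<C> {
    snapshot: Arc<Snapshot<C>>,
    idx: usize,
}

impl<C> Snapshot<C> {
    fn new(clients: Vec<C>) -> Arc<Self> {
        let states = Mutex::new(vec![SlotState::Available; clients.len()]);
        Arc::new(Self { clients, states })
    }

    /// Available -> Yielded for the first available client, if any.
    fn next(self: &Arc<Self>) -> Option<Yielded<C>> {
        let mut states = self.states.lock().unwrap();
        let idx = states.iter().position(|s| *s == SlotState::Available)?;
        states[idx] = SlotState::Yielded;
        Some(Yielded { snapshot: Arc::clone(self), idx })
    }

    /// Current state of the client at `idx` (for inspection only).
    fn state(&self, idx: usize) -> SlotState {
        self.states.lock().unwrap()[idx]
    }
}

impl<C> Yielded<C> {
    fn client(&self) -> &C {
        &self.snapshot.clients[self.idx]
    }

    /// Yielded -> Used: the client is never yielded again.
    fn remove(self) {
        self.snapshot.states.lock().unwrap()[self.idx] = SlotState::Used;
        // `self` is dropped here; Drop leaves a Used slot untouched.
    }
}

impl<C> Drop for Yielded<C> {
    /// Yielded -> Available, unless the slot was already marked Used.
    fn drop(&mut self) {
        let mut states = self.snapshot.states.lock().unwrap();
        if states[self.idx] == SlotState::Yielded {
            states[self.idx] = SlotState::Available;
        }
    }
}
```

In this sketch, a caller that fails a write simply drops its handle so
another caller can retry that client, while a caller that succeeds
calls remove() so the same client cannot receive a second copy.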

Changes the UpstreamSnapshot to return owned clients, instead of
references to those clients.
This will allow the snapshot to have a 'static lifetime, suitable for
use across tasks.
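
As a rough illustration of why owned clients matter, the sketch below
hands an owned client to a spawned thread. Everything in it is an
assumption made for the example: Client, OwnedSnapshot and
write_on_thread are hypothetical names, Arc is only one possible way
to make a client owned, and std threads stand in for async tasks.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical client type, standing in for an ingester RPC client.
struct Client {
    name: String,
}

/// Hypothetical snapshot that hands out owned (cloned `Arc`) clients.
struct OwnedSnapshot {
    clients: Vec<Arc<Client>>,
}

impl OwnedSnapshot {
    /// Returns an owned handle rather than `&Client`, so the returned
    /// value carries no borrow of `self`.
    fn get(&self, idx: usize) -> Option<Arc<Client>> {
        self.clients.get(idx).cloned()
    }
}

/// Moves an owned client into a spawned thread. `thread::spawn` requires
/// its closure to be `'static`, which a borrowed `&Client` tied to a
/// local snapshot would not satisfy.
fn write_on_thread(client: Arc<Client>) -> String {
    thread::spawn(move || format!("wrote to {}", client.name))
        .join()
        .expect("writer thread panicked")
}
```

Because the handle is owned, it stays valid even if the value it was
obtained from goes out of scope before the spawned work finishes.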

Because the number of candidate upstreams is checked to exceed the
number of desired data copies before starting the write loop, and
because the parallelism of the write loop matches the number of desired
data copies, it's not possible for any thread to observe an empty
snapshot.
This commit removes the unreachable error condition for clarity.

Adds a property-based test of the RPC write handler's replication logic,
ensuring:

  1. If the number of healthy upstreams is 0, NoHealthyUpstreams is
     returned and no requests are attempted.
  2. Given N healthy upstreams (> 0) and a replication factor of R:
     if N < R, "not enough replicas" is returned and no requests are
     attempted.
  3. Upstreams that return an error are retried until the entire
     write succeeds or times out.
  4. Writes are replicated to R distinct upstreams successfully, or
     an error is returned.
  5. Once an upstream write is ack'd as successful, it is never
     requested again.
  6. An upstream reporting as unhealthy at the start of the write is
     never requested (excluding probe requests).

These properties describe a mixture of invariants (don't replicate your
two copies of a write to the same ingester) and expected behaviour of
the replication logic (optimisations like "don't try writes when you
already know they'll fail").
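
Properties 1 and 2 amount to a pair of checks made before any request
is sent. A minimal sketch follows, assuming a hypothetical preflight()
helper and PreflightError enum; only the NoHealthyUpstreams name and
the "not enough replicas" wording come from this change.

```rust
/// Hypothetical error type for the checks made before the write loop.
#[derive(Debug, PartialEq, Eq)]
enum PreflightError {
    /// Property 1: there are no healthy upstreams at all.
    NoHealthyUpstreams,
    /// Property 2: fewer healthy upstreams than desired data copies.
    NotEnoughReplicas,
}

/// Returns Ok only if a write could be replicated to
/// `replication_factor` distinct upstreams out of `healthy` candidates.
fn preflight(healthy: usize, replication_factor: usize) -> Result<(), PreflightError> {
    if healthy == 0 {
        return Err(PreflightError::NoHealthyUpstreams);
    }
    if healthy < replication_factor {
        return Err(PreflightError::NotEnoughReplicas);
    }
    Ok(())
}
```

Checking for zero first keeps the two errors distinct: zero healthy
upstreams always reports NoHealthyUpstreams, even though it is also
fewer than R.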

This passes for the single-threaded replication logic used at the time
of this commit, and will be used to validate the correctness of a
concurrent replication implementation: a concurrent approach should
uphold these properties the same way a single-threaded implementation
does.

Renames NoUpstreams to NoHealthyUpstreams. The old name was confusing
because we also have "not enough replicas", which could be read as
meaning there are no upstreams. The new name has a slightly clearer
meaning.