Changes the proptest for the router's partitioner handler to use a
timestamp generation strategy that more accurately models the
distribution of timestamps in real-world requests.
Asserts that the partitioning code within the router (which drives the
low-level partitioning logic) generates partitions whose rows have
timestamps that belong in those partitions.
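For illustration, a strategy of roughly this shape biases generated
timestamps towards recent values (the ranges, weights, and test body
here are made up, not the ones in the actual proptest):

    use proptest::prelude::*;

    fn arbitrary_timestamp() -> impl Strategy<Value = i64> {
        prop_oneof![
            // Most rows carry timestamps near "now" (nanoseconds since epoch).
            9 => 1_600_000_000_000_000_000_i64..1_700_000_000_000_000_000_i64,
            // A small fraction are arbitrary outliers.
            1 => any::<i64>(),
        ]
    }

    proptest! {
        #[test]
        fn rows_land_in_matching_partitions(ts in arbitrary_timestamp()) {
            // Partition a row carrying `ts` and assert it lands in a
            // partition whose time range contains `ts` (partitioning elided).
            let _ = ts;
        }
    }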
* feat: Allow passing service protection limits in create db gRPC call
* fix: Move the impl into the catalog namespace trait
- Create data_types::partition_template::ValidationError
- Make creation of NamespacePartitionTemplateOverride and
TablePartitionTemplateOverride fallible
- Move SerializationWrapper into a module to make its inner field
private to force creation through one fallible constructor; this is
where the validation logic will go to be shared among all uses of
partition templates
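As a rough sketch of the pattern (the types and the validation rule here
are illustrative, not the real IOx definitions):

    mod wrapper {
        #[derive(Debug)]
        pub enum ValidationError {
            EmptyTemplate,
        }

        pub struct SerializationWrapper {
            // Private: code outside this module cannot construct the
            // wrapper directly and bypass validation.
            parts: Vec<String>,
        }

        impl SerializationWrapper {
            /// The single, shared place where validation happens.
            pub fn try_new(parts: Vec<String>) -> Result<Self, ValidationError> {
                if parts.is_empty() {
                    return Err(ValidationError::EmptyTemplate);
                }
                Ok(Self { parts })
            }

            pub fn parts(&self) -> &[String] {
                &self.parts
            }
        }
    }

    fn main() {
        // Construction is only possible through the fallible constructor.
        assert!(wrapper::SerializationWrapper::try_new(vec!["region".into()]).is_ok());
        assert!(wrapper::SerializationWrapper::try_new(vec![]).is_err());
    }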
Adds a benchmark that exercises the router's partitioning DmlHandler
implementation against a set of three files (very small, small, medium)
with 4 different partitioning schemes:
* Single tag, which occurs in all rows
* Single tag, which does not occur in any row
* Default strftime formatter (YYYY-MM-DD)
* Long and complicated strftime formatter
This covers the entire partitioning overhead - building the formatters,
evaluating each row, grouping the values into per-partition buckets, and
returning to the caller, where it normally would be passed to the next
handler in the pipeline.
Note that only one template part is evaluated in each case - this
measures the overhead of each type of formatter. In reality, we'd expect
partitioning with custom schemes to utilise more than one part,
increasing the cost of partitioning proportionally. This is a
lower-bound measurement!
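For reference, the benchmark follows the usual criterion shape - roughly
like the sketch below, where partition_lp and the inline line protocol
stand in for the real partitioning DmlHandler and fixture files:

    use criterion::{criterion_group, criterion_main, Criterion};

    // Stand-in for the partitioning DmlHandler; the real benchmark builds
    // the formatters, evaluates every row, and groups rows into
    // per-partition buckets.
    fn partition_lp(lp: &str, template: &str) -> usize {
        lp.lines().count() + template.len()
    }

    fn bench_partitioner(c: &mut Criterion) {
        let lp = "m,env=prod v=1i 1\nm,env=dev v=2i 2\n";
        for template in ["tag_value(env)", "time(%Y-%m-%d)"] {
            c.bench_function(&format!("partition/{template}"), |b| {
                b.iter(|| partition_lp(lp, template))
            });
        }
    }

    criterion_group!(benches, bench_partitioner);
    criterion_main!(benches);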
An integration test asserting that a router returns an error when
attempting to partition a write with an invalid strftime partition
formatter, rather than panicking.
Changes the partitioning logic to be fallible. This prevents an invalid
partition template from causing a panic, previously possible through two
known code paths:
* TagValue formatter referencing a non-tag column
* Time formatter using an invalid strftime format string
If either occurs, the write attempt is now aborted and an error returned
to the user with an HTTP 500 status code.
Additionally, unexpected partitioner errors now map to a catch-all error
instead of panicking.
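Roughly, the failure modes now look like this (a sketch; the variant
names are illustrative, not the actual error type):

    // Sketch of the fallible partitioner's error surface.
    #[derive(Debug)]
    enum PartitionError {
        /// A TagValue formatter referenced a column that is not a tag.
        NotATagColumn { column: String },
        /// A Time formatter was given an invalid strftime format string.
        InvalidStrftime { format: String },
        /// Catch-all for unexpected partitioner errors (previously a panic).
        Unexpected(String),
    }

    /// Every variant aborts the write and maps to an HTTP 500 for the caller.
    fn status_code(_: &PartitionError) -> u16 {
        500
    }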
This commit changes the format of partition keys when generated with
non-default partition key templates ONLY. A prior fixture test is
unchanged by this commit, ensuring the default partition keys remain
the same.
When a custom partition key template is provided, it may specify one or
more parts, with the TagValue template causing values extracted from tag
columns to appear in the derived partition key.
This commit changes the generated partition key in the following ways:
* The delimiter of multi-part partition keys; the character used to
delimit partition key parts is changed from "/" to "|" (the pipe
character) as it is less likely to occur in user-provided input,
reducing the encoding overhead.
* The format of the extracted TagValue values (see below).
Building on the work of custom partition key overrides, where an
immutable partition template is resolved and set at table creation time,
the changes in this PR enable the derived partition key to be
unambiguously reversed into the set of tag (column_name, column_value)
tuples it was generated from for use in query pruning logic. This is
implemented by the build_column_values() method in this commit, which
requires both the template, and the derived partition key.
Prior to this commit, a partition key value extracted from a tag column
was in the form "tagname_x" where "x" is the value and "tagname" is the
name of the tag column it was extracted from. After this commit, the
partition key value is in the form "x"; the column name is removed from
the derived string to reduce the catalog storage overhead (a key driver
of COGS). In the case of a NULL tag value, the sentinel value "!" is
inserted instead of the prior "tagname_" marker. In the case of an empty
string tag value (""), the sentinel "^" value is inserted instead of the
"tagname_-" marker, ensuring the distinction between an empty value and
a not-present tag is preserved.
Additionally, tag values utilise percent encoding to encode reserved
characters (the part delimiter, the empty-value sentinel, and "%" itself),
eliminating deserialisation ambiguity.
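For example, with the percent-encoding crate, something along these lines
covers the reserved set named above (the exact AsciiSet in the real code
may differ):

    use percent_encoding::{utf8_percent_encode, AsciiSet, CONTROLS};

    // Reserve the part delimiter, the empty-value sentinel, and '%' itself.
    const RESERVED: &AsciiSet = &CONTROLS.add(b'|').add(b'^').add(b'%');

    fn encode_tag_value(v: &str) -> String {
        utf8_percent_encode(v, RESERVED).to_string()
    }

    fn main() {
        // A literal pipe in a tag value can no longer be confused with the
        // partition key part delimiter.
        assert_eq!(encode_tag_value("us|west"), "us%7Cwest");
    }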
Examples of how this has changed derived partition keys, for a template
of [Time(YYYY-MM-DD), TagValue(region), TagValue(bananas)]:
Write: time=1970-01-01,region=west,other=ignored
Old: "1970-01-01-region_west-bananas"
New: "1970-01-01|west|!"
Write: time=1970-01-01,other=ignored
Old: "1970-01-01-region-bananas"
New: "1970-01-01|!|!"
Move the import into the submodule itself, rather than re-exporting it
at the crate level.
This will make it possible to link to the specific module/logic.
This commit fixes loads of crates (47!) that had unused dependencies, or
mis-configured dependencies (test deps as normal deps).
I added the "unused_crate_dependencies" lint to all crates to help prevent
this mess from growing again!
https://doc.rust-lang.org/beta/nightly-rustc/rustc_lint_defs/builtin/static.UNUSED_CRATE_DEPENDENCIES.html
This has the minor downside of false positives when specifying
dev-dependencies for test/bench binaries - these are files in tests/ or
benches/ (not normal tests). This commit includes a workaround,
importing them in lib.rs (gated by a feature flag). I think the
trade-off of better dependency management is worth it!
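The workaround looks roughly like this (the lint name is real; the
dev-dependency and feature name are just examples):

    // lib.rs
    #![warn(unused_crate_dependencies)]

    // Dev-dependencies used only by files under benches/ are invisible to
    // the lint, so reference them here behind a feature enabled for those
    // targets.
    #[cfg(feature = "benches")]
    use criterion as _;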
The old tests used partial error string matching rather than comparing
the whole error message! So when I added more to the error message, the
fixture tests didn't fail.
This changes the tests to match the full string, and validate the
timestamps are included.
Include the minimum acceptable timestamp (the retention bound) and the
observed timestamp that exceeds this bound in the retention enforcement
write error response.
The replication flag now defines the total number of copies of each
write - slightly less confusing than the number of additional copies it
defined previously, and matches the actual code.
Changes the UpstreamSnapshot to be suitable for concurrent use. This
type contains the core logic to enable a caller to uphold the
responsibility of ensuring replicated writes land on distinct ingesters
in the presence of concurrent replication.
The clients within the snapshot are returned to at most one concurrent
caller at a time, by tracking the state of each client as an FSM:
      ┌────────────────┐
   ┌─▶│   Available    │
   │  └────────────────┘
   │          │
 drop       next()
   │          │
   │          ▼
   │  ┌────────────────┐
   └──│    Yielded     │
      └────────────────┘
              │
           remove
              │
              ▼
      ┌────────────────┐
      │      Used      │
      └────────────────┘
Once a client has been yielded it will not be yielded again until it is
dropped (transitioning the FSM from "yielded" to "available" again,
returning it to the candidate pool of clients) or removed (transitioning
to "used", permanently preventing it from being yielded to another
caller).
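In sketch form (illustrative names, not the actual UpstreamSnapshot
internals):

    // Per-client state machine, as described above.
    #[derive(Debug, PartialEq)]
    enum ClientState {
        Available, // in the candidate pool, may be yielded
        Yielded,   // held by exactly one concurrent caller
        Used,      // consumed by a successful write, never yielded again
    }

    impl ClientState {
        /// next(): hand the client to a caller iff it is currently available.
        fn try_yield(&mut self) -> bool {
            if *self == Self::Available {
                *self = Self::Yielded;
                true
            } else {
                false
            }
        }

        /// drop: the caller released the client without using it.
        fn release(&mut self) {
            if *self == Self::Yielded {
                *self = Self::Available;
            }
        }

        /// remove: the caller used the client; permanently retire it.
        fn remove(&mut self) {
            *self = Self::Used;
        }
    }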
Changes the UpstreamSnapshot to return owned clients, instead of
references to those clients.
This will allow the snapshot to have a 'static lifetime, suitable for
use across tasks.
Because the number of candidate upstreams is checked to meet or exceed
the number of desired data copies before starting the write loop, and
because the parallelism of the write loop matches the number of desired
data copies, it's not possible for any thread to observe an empty
snapshot.
This commit removes the unreachable error condition for clarity.
Adds a property-based test of the RPC write handler's replication logic,
ensuring:
1. If the number of healthy upstreams is 0, NoHealthyUpstreams is
returned and no requests are attempted.
2. Given N healthy upstreams (> 0) and a replication factor of R:
if N < R, "not enough replicas" is returned and no requests are
attempted.
3. Upstreams that return an error are retried until the entire
write succeeds or times out.
4. Writes are replicated to R distinct upstreams successfully, or
an error is returned.
5. Once an upstream write is ack'd as successful, it is never
requested again.
6. An upstream reporting as unhealthy at the start of the write is
never requested (excluding probe requests).
These properties describe a mixture of invariants (don't replicate your
two copies of a write to the same ingester) and expected behaviour of
the replication logic (optimisations like "don't try writes when you
already know they'll fail").
This passes for the single-threaded replication logic used at the time
of this commit, and will be used to validate correctness of a concurrent
replication implementation - a concurrent approach should uphold these
properties the same way a single-threaded implementation does.
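Properties (1) and (2) boil down to a pure precondition check of roughly
this shape (hypothetical names, checked before any RPC is attempted):

    #[derive(Debug, PartialEq)]
    enum WriteError {
        NoHealthyUpstreams,
        NotEnoughReplicas,
    }

    // Runs before any request is issued, so both error cases return
    // without attempting a single upstream write.
    fn check_preconditions(healthy: usize, replicas: usize) -> Result<(), WriteError> {
        if healthy == 0 {
            return Err(WriteError::NoHealthyUpstreams);
        }
        if healthy < replicas {
            return Err(WriteError::NotEnoughReplicas);
        }
        Ok(())
    }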
Renames NoUpstreams -> NoHealthyUpstreams; the old name was confusing
because we also have "not enough replicas", which could also be read as
"no upstreams". The new name has a slightly clearer meaning.