Commit Graph

118 Commits (6246275c4adb30ebacc21eb1bb38f92facb42260)

Author SHA1 Message Date
dependabot[bot] faa8d44492
chore(deps): Bump thiserror from 1.0.43 to 1.0.44 (#8315)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.43 to 1.0.44.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.43...1.0.44)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-07-24 10:18:44 +00:00
dependabot[bot] e33a078128
chore(deps): Bump paste from 1.0.13 to 1.0.14 (#8244)
Bumps [paste](https://github.com/dtolnay/paste) from 1.0.13 to 1.0.14.
- [Release notes](https://github.com/dtolnay/paste/releases)
- [Commits](https://github.com/dtolnay/paste/compare/1.0.13...1.0.14)

---
updated-dependencies:
- dependency-name: paste
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-17 16:10:02 +00:00
Carol (Nichols || Goulding) 10a0f8e3bf
fix: Remove ::default() when constructing unit structs
As recommended by https://rust-lang.github.io/rust-clippy/master/index.html#default_constructed_unit_structs
2023-07-14 10:50:55 -04:00
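A minimal sketch of the pattern the Clippy lint referenced in
10a0f8e3bf flags. `NoopObserver` and `make_observer` are hypothetical
names used only for illustration and do not come from the codebase.

```rust
/// Hypothetical unit struct (illustration only).
#[derive(Debug, Default, PartialEq)]
struct NoopObserver;

/// A unit struct has exactly one value, so it can be named directly.
/// Writing `NoopObserver::default()` produces the same value with more
/// noise, which is what `default_constructed_unit_structs` warns about.
fn make_observer() -> NoopObserver {
    NoopObserver
}
```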
dependabot[bot] 057ee40cb9
chore(deps): Bump thiserror from 1.0.41 to 1.0.43 (#8181)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.41 to 1.0.43.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.41...1.0.43)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-07 09:25:12 +00:00
dependabot[bot] 3827257f94
chore(deps): Bump thiserror from 1.0.40 to 1.0.41 (#8149)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.40 to 1.0.41.
- [Release notes](https://github.com/dtolnay/thiserror/releases)
- [Commits](https://github.com/dtolnay/thiserror/compare/1.0.40...1.0.41)

---
updated-dependencies:
- dependency-name: thiserror
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Dom <dom@itsallbroken.com>
2023-07-05 09:25:14 +00:00
dependabot[bot] 9a03d9c9fe
chore(deps): Bump paste from 1.0.12 to 1.0.13 (#8139)
Bumps [paste](https://github.com/dtolnay/paste) from 1.0.12 to 1.0.13.
- [Release notes](https://github.com/dtolnay/paste/releases)
- [Commits](https://github.com/dtolnay/paste/compare/1.0.12...1.0.13)

---
updated-dependencies:
- dependency-name: paste
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-04 07:57:41 +00:00
dependabot[bot] 74a48a8f63
chore(deps): Bump itertools from 0.10.5 to 0.11.0 (#8060)
* chore(deps): Bump itertools from 0.10.5 to 0.11.0

Bumps [itertools](https://github.com/rust-itertools/itertools) from 0.10.5 to 0.11.0.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.10.5...v0.11.0)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-06-23 08:11:56 +00:00
Dom 7b51aed69a
test: use run_len not max_run_len
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
2023-06-16 15:54:31 +01:00
Dom Dwyer 8dd159456a
test: assert partitioner row counts
Assert the number of rows yielded by the partitioner matches the number
of input rows.
2023-06-16 14:14:03 +02:00
Dom Dwyer f1058dccf6
test: proptest model timestamp distribution
Changes the partitioner proptest to use a timestamp generation strategy
that more accurately models the distribution of timestamps in real-world
requests.
2023-06-16 12:12:24 +02:00
Dom Dwyer b28a65c372
fix: clean up buffer after error
When rendering a timestamp fails, remove the "timestamp -> generated
key" mapping from the cache.

I'm certain this is impossible to reach for multiple reasons, but it
should do the right thing anyway, in case those reasons change.
2023-06-15 14:54:47 +02:00
Dom Dwyer 018bf79620
test: strftime integration
This property test generates randomised inputs and validates that all
rows within the ranges emitted by the partitioner render to the expected
timestamp-derived partition key when using a known-good implementation.

This includes generating timestamps that bypass the YYYY-MM-DD precision
reduction optimisation to increase coverage of cache miss code paths.
2023-06-15 14:54:45 +02:00
Dom Dwyer 2acbaefa18
fix: correct dedupe of strftime values
This fixes the root cause of influxdata/idpe#17765; the code was
performing a "is this the last value you saw" check by comparing it to
the last generated partition key which is not the same thing - a cache
hit would not generate a new key, and therefore would not return the
correct answer after.

The end result is that a subset of writes with a problematic
sequence of timestamps would be assigned the wrong partition key.
Because all users are using the default YYYY-MM-DD
partitioning scheme, the impact was relatively low, as most of the time
that partition key had the same YYYY-MM-DD representation as the last.
2023-06-15 14:54:45 +02:00
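The distinction described in 2acbaefa18 can be sketched as follows.
This is an illustration under stated assumptions, not the actual
formatter: `Formatter`, `render` and `equals_last` are hypothetical
names, and a `HashMap` stands in for the real cache. The point is that
the "last value seen" must be recorded on every call, including cache
hits, rather than only when a new key string is generated.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the strftime formatter (illustration only).
struct Formatter {
    cache: HashMap<i64, String>,
    last_seen: Option<i64>,
    renders: usize,
}

impl Formatter {
    fn new() -> Self {
        Self { cache: HashMap::new(), last_seen: None, renders: 0 }
    }

    /// Return the key for `ts`, generating it only on a cache miss.
    fn render(&mut self, ts: i64) -> &str {
        // Record the input unconditionally. Updating this only on a
        // cache miss is the kind of bug the commit describes.
        self.last_seen = Some(ts);
        if !self.cache.contains_key(&ts) {
            self.renders += 1;
            self.cache.insert(ts, format!("key-{}", ts));
        }
        &self.cache[&ts]
    }

    /// "Is this the last value you saw?"
    fn equals_last(&self, ts: i64) -> bool {
        self.last_seen == Some(ts)
    }
}
```

With the sequence 1, 2, 1 the third call is a cache hit. A check based
on the last generated key would wrongly report 2 as the last value.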
Dom Dwyer df88b542f1
revert: revert: "Merge pull request #7953 from influxdata/dom/partition-key-dedupe"
This reverts commit 3c0388fdea.
2023-06-15 14:54:44 +02:00
Dom Dwyer 8bb631e86b
refactor: fix revert conflicts
This fixes the non-compiling revert code.
2023-06-14 16:07:14 +02:00
Dom Dwyer 3c0388fdea
revert: "Merge pull request #7953 from influxdata/dom/partition-key-dedupe"
This reverts commit 5bce4477b7, reversing
changes made to 64fa17b3be.
2023-06-14 16:07:14 +02:00
Marco Neumann 335d9f7357
chore: minimize proptest features (#7993)
2023-06-14 12:28:18 +00:00
Carol (Nichols || Goulding) eef84d9df3
test: Use zip rather than indexing (also check that lengths match)
2023-06-12 12:21:05 -04:00
Carol (Nichols || Goulding) eb01d93d7f
fix: Clarify unit of value in a panic message
2023-06-12 12:02:56 -04:00
Carol (Nichols || Goulding) 0fd32706a3
fix: Improve test assertion failure messages
2023-06-12 12:01:15 -04:00
Carol (Nichols || Goulding) 5decbae0d5
docs: Clarify some partition template docs
2023-06-12 11:56:24 -04:00
Dom Dwyer fc49b3ec19
feat: restrict partition template length
Partition templates should not contain more than 8 parts, which when
combined with a per-part byte limit, bounds the maximum size of a
partition key.

This commit causes the router to refuse to service a write request that
contains > 8 parts in the template - this causes a panic, as it's a
broken system invariant and should be an unreachable state. Templates
are pre-validated at creation time to contain no more than 8 parts, and
are immutable:

    https://github.com/influxdata/influxdb_iox/pull/7930
2023-06-09 13:44:33 +02:00
Dom Dwyer 050093df1e
feat: truncate partition key parts at 200 bytes
This commit ensures all partition key parts are less than or equal to
200 bytes long.

If a string exceeds the 200 byte limit, it is truncated (avoiding
splitting unicode code-points or graphemes) and then a single "#"
sentinel value is appended. When reversed from the string, these column
values are indicated to be suitable for prefix-matching only - a
property that is encoded into the type system.

This commit takes a conservative approach of not splitting graphemes as
outlined in the module documentation, but this could be relaxed in the
future if needed.
2023-06-09 13:44:32 +02:00
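A rough sketch of the truncation described in 050093df1e, using only
the standard library. Assumptions: the "#" marker counts towards the
200 byte limit, and truncation backs off to a UTF-8 code-point boundary
only (the commit also avoids splitting graphemes, which would need a
segmentation library and is not shown). `truncate_part` and the
constant names are hypothetical. Values that already end in "#" are not
disambiguated here.

```rust
/// Maximum length of one partition key part, in bytes (from the commit).
const MAX_PART_BYTES: usize = 200;
/// Sentinel appended to a truncated part (from the commit).
const TRUNCATED_MARKER: char = '#';

/// Truncate `value` so the result is at most `MAX_PART_BYTES` long,
/// never cutting through a UTF-8 code point.
fn truncate_part(value: &str) -> String {
    if value.len() <= MAX_PART_BYTES {
        return value.to_string();
    }
    // Assumption: reserve one byte for the marker.
    let mut end = MAX_PART_BYTES - 1;
    // Index 0 is always a boundary, so this loop terminates.
    while !value.is_char_boundary(end) {
        end -= 1;
    }
    let mut out = String::with_capacity(end + 1);
    out.push_str(&value[..end]);
    out.push(TRUNCATED_MARKER);
    out
}
```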
Dom 93fe5949e9
Merge branch 'main' into dom/partition-key-dedupe
2023-06-08 16:12:31 +01:00
Dom Dwyer 60d3ae403f
fix: panic when using %#z time formatter
Props to proptesting for this one - the prop_arbitrary_strftime_format()
randomly generated the formatting sequence "%#z" which turns out to be
an undocumented way of causing a panic in chrono:

    088b69372e/src/format/mod.rs (L673)

In fact, the docs actually list it as a usable sequence!
2023-06-08 14:28:03 +02:00
Dom Dwyer 08ecb7fba3
perf: partition key generation dedupe
This commit changes the partitioner to skip generating partition keys
for successive rows that would generate identical partition keys.

Often successive rows in a batch will map to the same partition key -
for example, if multiple measurements are taken at the same time, then
the strftime formatter will output the same partition key part for each
row.

This commit changes the partitioner to only generate the first key
string in such a batch of identical keys. This is cheap to pre-compute,
as we only allow tag & time columns to be partitioned, both of which are
64-bit integers (dictionary key & timestamp respectively), making it
cheaper to check equality than to allocate & generate the partition key
string and check that.

Combined with the default YYYY-MM-DD precision reduction optimisation in
a prior commit, this optimisation is particularly effective for writes
with timestamps that span a single day (the typical case).

This change doubles the rows/s throughput for a modest 1,000 line batch,
with improvements across the board. I'd expect the performance benefit
to increase as the batch size increases, and/or as more partition
template parts are added.
2023-06-08 11:18:51 +02:00
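The idea in 08ecb7fba3 can be sketched as below, assuming a single
time-based template part. `partition_runs` is a hypothetical helper:
it compares the cheap 64-bit input of each row against the previous
row's input and only calls `render` when the input changes. This
simplified version does not merge runs whose inputs differ but whose
rendered keys happen to be equal.

```rust
/// Group successive rows into (partition key, run length) pairs,
/// rendering a key string only when the input value changes.
fn partition_runs<F>(timestamps: &[i64], mut render: F) -> Vec<(String, usize)>
where
    F: FnMut(i64) -> String,
{
    let mut runs: Vec<(String, usize)> = Vec::new();
    let mut last_input: Option<i64> = None;

    for &ts in timestamps {
        if last_input == Some(ts) {
            // Same input as the previous row: extend the current run
            // without allocating or formatting a new key.
            if let Some(run) = runs.last_mut() {
                run.1 += 1;
            }
            continue;
        }
        last_input = Some(ts);
        runs.push((render(ts), 1));
    }
    runs
}
```

For the inputs 10, 10, 10, 20, 20, 10 this renders three keys rather
than six.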
Dom Dwyer 60cbf53087
refactor: strftime last value equality matcher
Allows the StrftimeFormatter to perform an equality match against a
timestamp and the last rendered timestamp, potentially after applying
the precision reduction optimisation if appropriate.
2023-06-08 11:15:13 +02:00
Carol (Nichols || Goulding) d0db1194e2
feat: Validate custom partition templates on their creation
Make sure custom partition templates have:

- At least one part
- No more than 8 parts
- Only nonempty, valid strftime formats
2023-06-07 11:38:12 -04:00
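A minimal sketch of the three checks listed in d0db1194e2. The types
are simplified, hypothetical stand-ins: the error variants and the
`validate` function are assumptions, not the real
data_types::partition_template API. Checking that a strftime format is
valid would need a time formatting library and is omitted, so only the
"nonempty" half of that rule is shown.

```rust
/// Hypothetical, simplified template part.
#[derive(Debug, Clone, PartialEq)]
enum TemplatePart {
    TagValue(String),
    TimeFormat(String),
}

/// Hypothetical error variants (illustration only).
#[derive(Debug, PartialEq)]
enum ValidationError {
    NoParts,
    TooManyParts { specified: usize },
    EmptyTimeFormat,
}

/// Upper bound on template parts (from the commit message).
const MAX_PARTS: usize = 8;

fn validate(parts: &[TemplatePart]) -> Result<(), ValidationError> {
    if parts.is_empty() {
        return Err(ValidationError::NoParts);
    }
    if parts.len() > MAX_PARTS {
        return Err(ValidationError::TooManyParts { specified: parts.len() });
    }
    for part in parts {
        if let TemplatePart::TimeFormat(fmt) = part {
            if fmt.is_empty() {
                return Err(ValidationError::EmptyTimeFormat);
            }
            // Omitted: verify `fmt` is a valid strftime format string.
        }
    }
    Ok(())
}
```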
Carol (Nichols || Goulding) ac26ceef91
feat: Make a place to do partition template validation
- Create data_types::partition_template::ValidationError
- Make creation of NamespacePartitionTemplateOverride and
  TablePartitionTemplateOverride fallible
- Move SerializationWrapper into a module to make its inner field
  private to force creation through one fallible constructor; this is
  where the validation logic will go to be shared among all uses of
  partition templates
2023-06-07 11:38:12 -04:00
Dom Dwyer 0b5b6a8e19
perf: strftime partitioner caching
This commit extracts the strftime formatting logic into its own type,
and implements a small ring-buffer based cache containing the last 5
observed timestamps (lazily initialised).

This optimisation leverages the fact that the typical write to IOx is a
batch containing many hundreds or thousands of rows - often these rows
are measurements of multiple variables at the same timestamp; for
example, a metric scrape system will periodically read a set of metrics
and assign them all the same timestamp (the "scrape timestamp").

Because of the above, batches often contain multiple rows with the same
timestamp, so we can reduce the overhead by caching the resulting
partition key value for any given timestamp, eliminating the need to
re-compute it for these successive identical values. We retain the last
5 observed timestamps (FIFO) to provide a degree of "look back".

Alone the above is effective for measurements all with the same exact
timestamp, often a subset of a batch. However a further optimisation is
possible: because the default partitioning scheme (YYYY-MM-DD) operates
at a granularity of days, the timestamp precision can be reduced
(discarding hours, minutes, seconds, etc) without affecting the
resulting partition key. Therefore when the default partitioning scheme
is used, this commit will normalise timestamps to match this reduced
precision before caching, causing the cache hit & string re-use rate to
rise to 100% for batches containing measurements that span < 6 days.

This brings a ~5x improvement against a modest batch size of 1,000
lines, showing improvement across all batch sizes and partitioning
schemes (default & custom). I'd expect the performance improvement to be
even greater for larger batches.
2023-06-06 17:13:26 +02:00
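The two optimisations in 0b5b6a8e19 can be sketched together as a
small FIFO ring buffer keyed by a (possibly day-normalised) timestamp.
This is an illustration under assumptions: `KeyCache`, its fields and
`get_or_render` are hypothetical names, timestamps are taken to be
nanoseconds since the epoch, and the actual strftime rendering is
supplied by the caller as a closure.

```rust
/// Number of timestamps retained (from the commit message).
const CACHE_SIZE: usize = 5;
/// Assumption: timestamps are nanoseconds since the Unix epoch.
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Hypothetical ring-buffer cache of rendered partition key parts.
struct KeyCache {
    entries: [Option<(i64, String)>; CACHE_SIZE],
    next: usize,
    reduce_to_day: bool,
}

impl KeyCache {
    /// `reduce_to_day` enables the YYYY-MM-DD precision reduction.
    fn new(reduce_to_day: bool) -> Self {
        Self { entries: Default::default(), next: 0, reduce_to_day }
    }

    /// Return the cached key for the timestamp, calling `render` only
    /// on a cache miss. The oldest entry is overwritten first (FIFO).
    fn get_or_render<F>(&mut self, timestamp_ns: i64, render: F) -> &str
    where
        F: FnOnce(i64) -> String,
    {
        // Discard the time-of-day when only the date affects the key.
        let ts = if self.reduce_to_day {
            timestamp_ns - timestamp_ns.rem_euclid(NANOS_PER_DAY)
        } else {
            timestamp_ns
        };
        let hit = self
            .entries
            .iter()
            .position(|e| matches!(e, Some((cached, _)) if *cached == ts));
        let idx = match hit {
            Some(idx) => idx,
            None => {
                let idx = self.next;
                self.entries[idx] = Some((ts, render(ts)));
                self.next = (self.next + 1) % CACHE_SIZE;
                idx
            }
        };
        match &self.entries[idx] {
            Some((_, key)) => key.as_str(),
            None => unreachable!("slot was populated above"),
        }
    }
}
```

With day normalisation enabled, a batch spanning at most five distinct
days renders each day's key once, which matches the "< 6 days" figure
in the commit message.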
Dom Dwyer ea3dcba308
perf: preallocate partition key strings
Partition keys tend to be approximately the same size each time (and in
the default case, always exactly the same size).

This simple change reduces allocations by pre-sizing the next partition
key string to match that of the previous. This should reduce the number
of allocations needed to grow the string for ~10% throughput increase.
2023-06-05 11:31:03 +02:00
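The pre-sizing in ea3dcba308 amounts to something like the sketch
below. `build_keys` is a hypothetical helper that joins already
rendered parts with the "|" delimiter used by later commits. The only
point illustrated is seeding each new string's capacity from the
length of the previous key.

```rust
/// Join each row's parts into a key, pre-sizing every string to the
/// length of the previously generated key.
fn build_keys(rows: &[Vec<&str>]) -> Vec<String> {
    let mut out = Vec::with_capacity(rows.len());
    let mut last_len = 0;

    for parts in rows {
        // Keys tend to be the same size, so this usually avoids any
        // reallocation while the string grows.
        let mut key = String::with_capacity(last_len);
        for (i, part) in parts.iter().enumerate() {
            if i > 0 {
                key.push('|');
            }
            key.push_str(part);
        }
        last_len = key.len();
        out.push(key);
    }
    out
}
```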
Dom Dwyer 8e61dc5aef
refactor: remove InvalidStrftime value
It's big, it's annoying, it's already available to the user.
2023-06-05 11:31:02 +02:00
Dom Dwyer 47214ec9a0
fix: prevent panics in partitioning logic
Changes the partitioning logic to be fallible. This prevents an invalid
partition template from causing a panic, previously possible through two
known code paths:

    * TagValue formatter referencing a non-tag column
    * Time formatter using an invalid strftime format string

If either occurs, the write attempt is now aborted and an error returned
to the user with a HTTP 500 status code.

Additionally unexpected partitioner errors now map to a catch-all error
instead of panicking.
2023-06-01 17:44:44 +02:00
Dom Dwyer 6bb4f20d7c
refactor: remove redundant test
test_partition_key was recreated below via a test generator.
2023-06-01 17:44:43 +02:00
Dom Dwyer 37bb5e0585
test: arbitrary reversible partition keys
This test constructs a partition key from an arbitrary selection of
pre-defined parts, and uses the resulting template to partition a write
containing an arbitrary selection of pre-defined tag columns.

Once a partition key is derived, the test asserts build_column_values()
reverses it into the original set of tag (column_name, value) tuples
present in the write.
2023-05-30 15:58:26 +02:00
Dom Dwyer 27bef292a3
feat: unambiguously reversible partition keys
This commit changes the format of partition keys when generated with
non-default partition key templates ONLY. A prior fixture test is
unchanged by this commit, ensuring the default partition keys remain
the same.

When a custom partition key template is provided, it may specify one or
more parts, with the TagValue template causing values extracted from tag
columns to appear in the derived partition key.

This commit changes the generated partition key in the following ways:

    * The delimiter of multi-part partition keys; the character used to
      delimit partition key parts is changed from "/" to "|" (the pipe
      character) as it is less likely to occur in user-provided input,
      reducing the encoding overhead.

    * The format of the extracted TagValue values (see below).

Building on the work of custom partition key overrides, where an
immutable partition template is resolved and set at table creation time,
the changes in this PR enable the derived partition key to be
unambiguously reversed into the set of tag (column_name, column_value)
tuples it was generated from for use in query pruning logic. This is
implemented by the build_column_values() method in this commit, which
requires both the template, and the derived partition key.

Prior to this commit, a partition key value extracted from a tag column
was in the form "tagname_x" where "x" is the value and "tagname" is the
name of the tag column it was extracted from. After this commit, the
partition key value is in the form "x"; the column name is removed from
the derived string to reduce the catalog storage overhead (a key driver
of COGS). In the case of a NULL tag value, the sentinel value "!" is
inserted instead of the prior "tagname_" marker. In the case of an empty
string tag value (""), the sentinel "^" value is inserted instead of the
"tagname_-" marker, ensuring the distinction between an empty value and
a not-present tag is preserved.

Additionally tag values utilise percent encoding to encode reserved
characters (part delimiter, empty sentinel character, % itself) to
eliminate deserialisation ambiguity.

Examples of how this has changed derived partition keys, for a template
of [Time(YYYY-MM-DD), TagValue(region), TagValue(bananas)]:

    Write: time=1970-01-01,region=west,other=ignored
        Old: "1970-01-01-region_west-bananas"
        New: "1970-01-01|west|!"

    Write: time=1970-01-01,other=ignored
        Old: "1970-01-01-region-bananas"
        New: "1970-01-01|!|!"
2023-05-30 15:58:25 +02:00
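The encoding described in 27bef292a3 can be sketched as follows. The
delimiter and the two sentinels come from the commit message; the
function names are hypothetical, and treating "!" as a reserved
character alongside "|", "^" and "%" is an assumption made here so
that a literal "!" value cannot be confused with NULL. The time part
is passed through unchanged.

```rust
const DELIMITER: char = '|';
const NULL_SENTINEL: &str = "!";
const EMPTY_SENTINEL: &str = "^";

/// Encode one tag value: None is a missing tag, Some("") is empty.
fn encode_part(value: Option<&str>) -> String {
    match value {
        None => NULL_SENTINEL.to_string(),
        Some("") => EMPTY_SENTINEL.to_string(),
        Some(v) => {
            let mut out = String::with_capacity(v.len());
            for c in v.chars() {
                match c {
                    // Reserved characters are percent encoded.
                    '|' | '!' | '^' | '%' => {
                        out.push_str(&format!("%{:02X}", c as u32))
                    }
                    _ => out.push(c),
                }
            }
            out
        }
    }
}

/// Reverse `encode_part`. Ok(None) means the tag was not present.
fn decode_part(part: &str) -> Result<Option<String>, String> {
    match part {
        "!" => return Ok(None),
        "^" => return Ok(Some(String::new())),
        _ => {}
    }
    let bytes = part.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            let hex = part
                .get(i + 1..i + 3)
                .ok_or_else(|| format!("truncated escape in {:?}", part))?;
            out.push(u8::from_str_radix(hex, 16).map_err(|e| e.to_string())?);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out).map(Some).map_err(|e| e.to_string())
}

/// Build a key from a rendered time part followed by tag values.
fn build_key(time_part: &str, tags: &[Option<&str>]) -> String {
    let mut key = time_part.to_string();
    for tag in tags {
        key.push(DELIMITER);
        key.push_str(&encode_part(*tag));
    }
    key
}
```

Calling build_key("1970-01-01", &[Some("west"), None]) yields the
"New" key from the first example in the commit message.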
Dom Dwyer 57ba3c8cf5
test: default partition key fixture
This test asserts the partition key of a write derived from the default
partition key template (YYYY-MM-DD).

This test ensures that the default partition keys do not change with
subsequent changes, as these values are what are used today.
2023-05-30 15:55:08 +02:00
Dom Dwyer 9e0570f2bf
refactor: explicit submod for partition_template
Move the import into the submodule itself, rather than re-exporting it
at the crate level.

This will make it possible to link to the specific module/logic.
2023-05-30 15:13:20 +02:00
Carol (Nichols || Goulding) aab0acc16a
fix: Panic if attempting to partition on a non-tag column
2023-05-24 10:34:31 -04:00
Carol (Nichols || Goulding) 9c0faa66f0
feat: Set a table partition template explicitly or from the namespace
And use the table partition template when partitioning writes to that
table.
2023-05-24 10:34:30 -04:00
Carol (Nichols || Goulding) afb3838437
feat: Optionally supply the namespace partition template when creating a namespace
2023-05-24 10:10:34 -04:00
Dom Dwyer 928a4d163e
build: remove unused dependencies from crates
This commit fixes loads of crates (47!) that had unused dependencies, or
mis-configured dependencies (test deps as normal deps).

I added the "unused_crate_dependencies" to all crates to help prevent
this mess from growing again!

    https://doc.rust-lang.org/beta/nightly-rustc/rustc_lint_defs/builtin/static.UNUSED_CRATE_DEPENDENCIES.html

This has the minor downside of false-positives when specifying
dev-dependencies for test/bench binaries - these are files in /tests or
/benches (not normal tests). This commit includes a workaround,
importing them in lib.rs (gated by a feature flag). I think the
trade-off of better dependency management is worth it!
2023-05-23 14:55:43 +02:00
Andrew Lamb 6344fe8c3f
chore: Add rationale for `clippy::future_not_send` (#7822)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-05-18 16:58:56 +00:00
kayagokalp 81eb663122
refactor: accept impl Into<String> for schema methods
2023-05-11 01:44:14 +03:00
Carol (Nichols || Goulding) 2aa8713d1d
fix: Remove partition TemplatePart::Table; partitioning is already per-table
2023-05-09 14:54:57 +02:00
Carol (Nichols || Goulding) ef9ef75e56
fix: Remove unsupported TemplatePart variants (#7746)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-05-04 16:20:18 +00:00
Andrew Lamb d8b0139ea9
chore: Update datafusion + arrow/parquet/arrow-flight to 36 (#7354)
* chore: Update datafusion +  arrow/parquet/arrow-flight to 36

* refactor: update optimize for new API

* refactor: update parquet for new API

* chore: Update more dependencies

* chore: Update to use the new buffer creation APIs

* chore: Run cargo hakari tasks

* fix: bad len

* fix: update for API change

---------

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-03-29 13:41:59 +00:00
Carol (Nichols || Goulding) cc7c44f76a
chore: Upgrade to Rust 1.68 (#7175)
* chore: Upgrade to Rust 1.68

* fix: Remove unnecessary into_iter, thanks Clippy!

* fix: Use the size of the type, not a reference to the type... oops.

Thanks clippy!

* fix: Return block directly instead of creating a variable

Thanks clippy!

---------

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2023-03-12 13:22:20 +00:00
Carol (Nichols || Goulding) faae5eb438
chore: Rerun cargo hakari manage-deps
2023-02-27 11:56:15 +01:00
Andrew Lamb f93baf7693
chore: Update DataFusion and `arrow` / `arrow-flight` / `parquet` to `33.0.0` (#7045)
* chore: Update DataFusion and arrow/arrow-flight/parquet to 33.0.0

* fix: Update test output

* fix: update more test output

* fix: Update querier test output

* chore: Run cargo hakari tasks

* test: fix formatting

Fix formatting of batch pretty printing.

* test: fix formatting

Fix formatting of batch pretty printing.

* test: fix formatting for selector tests

---------

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Dom Dwyer <dom@itsallbroken.com>
Co-authored-by: Christopher Wolff <chris.wolff@influxdata.com>
2023-02-22 21:24:20 +00:00