influxdb

Commit Graph

Author	SHA1	Message	Date
Carol (Nichols || Goulding)	eef84d9df3	test: Use zip rather than indexing (also check that lengths match)	2023-06-12 12:21:05 -04:00
Carol (Nichols || Goulding)	eb01d93d7f	fix: Clarify unit of value in a panic message	2023-06-12 12:02:56 -04:00
Carol (Nichols || Goulding)	0fd32706a3	fix: Improve test assertion failure messages	2023-06-12 12:01:15 -04:00
Carol (Nichols || Goulding)	5decbae0d5	docs: Clarify some partition template docs	2023-06-12 11:56:24 -04:00
Dom Dwyer	fc49b3ec19	feat: restrict partition template length Partition templates should not contain more than 8 parts, which when combined with a per-part byte limit, bounds the maximum size of a partition key. This commit causes the router to refuse to service a write request that contains > 8 parts in the template - this causes a panic, as it's a broken system invariant and should be an unreachable state. Templates are pre-validated at creation time to contain no more than 8 parts, and are immutable: https://github.com/influxdata/influxdb_iox/pull/7930	2023-06-09 13:44:33 +02:00
Dom Dwyer	050093df1e	feat: truncate partition key parts at 200 bytes This commit ensures all partition key parts are less than or equal to 200 bytes long. If a string exceeds the 200 byte limit, it is truncated (avoiding splitting unicode code-points or graphemes) and then a single "#" sentinel value is appended. When reversed from the string, these column values are indicated to be suitable for prefix-matching only - a property that is encoded into the type system. This commit takes a conservative approach of not splitting graphemes as outlined in the module documentation, but this could be relaxed in the future if needed.	2023-06-09 13:44:32 +02:00
Dom	93fe5949e9	Merge branch 'main' into dom/partition-key-dedupe	2023-06-08 16:12:31 +01:00
Dom Dwyer	60d3ae403f	fix: panic when using %#z time formatter Props to proptesting for this one - the prop_arbitrary_strftime_format() randomly generated the formatting sequence "%#z" which turns out to be an undocumented way of causing a panic in chrono: `088b69372e/src/format/mod.rs (L673)` In fact, the docs actually list it as a usable sequence!	2023-06-08 14:28:03 +02:00
Dom Dwyer	08ecb7fba3	perf: partition key generation dedupe This commit changes the partitioner to skip generating partition keys for successive rows that would generate identical partition keys. Often successive rows in a batch will map to the same partition key - for example, if multiple measurements are taken at the same time, then the strftime formatter will output the same partition key part for each row. This commit changes the partitioner to only generate the first key string in such a batch of identical keys. This is cheap to pre-compute, as we only allow tag & time columns to be partitioned, both of which are 64-bit integers (dictionary key & timestamp respectively), making it cheaper to check equality than to allocate & generate the partition key string and check that. Combined with the default YYYY-MM-DD precision reduction optimisation in a prior commit, this optimisation is particularly effective for writes with timestamps that span a single day (the typical case). This change doubles the rows/s throughput for a modest 1,000 line batch, with improvements across the board. I'd expect the performance benefit to increase as the batch size increases, and/or as more partition template parts are added.	2023-06-08 11:18:51 +02:00
Dom Dwyer	60cbf53087	refactor: strftime last value equality matcher Allows the StrftimeFormatter to perform an equality match against a timestamp and the last rendered timestamp, potentially after applying the precision reduction optimisation if appropriate.	2023-06-08 11:15:13 +02:00
Carol (Nichols || Goulding)	d0db1194e2	feat: Validate custom partition templates on their creation Make sure custom partition templates have: - At least one part - No more than 8 parts - Only nonempty, valid strftime formats	2023-06-07 11:38:12 -04:00
Carol (Nichols || Goulding)	ac26ceef91	feat: Make a place to do partition template validation - Create data_types::partition_template::ValidationError - Make creation of NamespacePartitionTemplateOverride and TablePartitionTemplateOverride fallible - Move SerializationWrapper into a module to make its inner field private to force creation through one fallible constructor; this is where the validation logic will go to be shared among all uses of partition templates	2023-06-07 11:38:12 -04:00
Dom Dwyer	0b5b6a8e19	perf: strftime partitioner caching This commit extracts the strftime formatting logic into its own type, and implements a small ring-buffer based cache containing the last 5 observed timestamps (lazily initialised). This optimisation leverages the fact that the typical write to IOx is a batch containing many hundreds or thousands of rows - often these rows are measurements of multiple variables at the same timestamp; for example, a metric scrape system will periodically read a set of metrics and assign them all the same timestamp (the "scrape timestamp"). Because of the above, batches often contain multiple rows with the same timestamp, so we can reduce the overhead by caching the resulting partition key value for any given timestamp, eliminating the need to re-compute it for these successive identical values. We retain the last 5 observed timestamps (FIFO) to provide a degree of "look back". Alone the above is effective for measurements all with the same exact timestamp, often a subset of a batch. However a further optimisation is possible: because the default partitioning scheme (YYYY-MM-DD) operates at a granularity of days, the timestamp precision can be reduced (discarding hours, minutes, seconds, etc) without affecting the resulting partition key. Therefore when the default partitioning scheme is used, this commit will normalise timestamps to match this reduced precision before caching, causing the cache hit & string re-use rate to rise to 100% for batches containing measurements that span < 6 days. This brings a ~5x improvement against a modest batch size of 1,000 lines, showing improvement across all batch sizes and partitioning schemes (default & custom). I'd expect the performance improvement to be even greater for larger batches.	2023-06-06 17:13:26 +02:00
Dom Dwyer	ea3dcba308	perf: preallocate partition key strings Partition keys tend to be approximately the same size each time (and in the default case, always exactly the same size). This simple change reduces allocations by pre-sizing the next partition key string to match that of the previous. This should reduce the number of allocations needed to grow the string for ~10% throughput increase.	2023-06-05 11:31:03 +02:00
Dom Dwyer	8e61dc5aef	refactor: remove InvalidStrftime value It's big, it's annoying, it's already available to the user.	2023-06-05 11:31:02 +02:00
Dom Dwyer	47214ec9a0	fix: prevent panics in partitioning logic Changes the partitioning logic to be fallible. This prevents an invalid partition template from causing a panic, previously possible through two known code paths: * TagValue formatter referencing a non-tag column * Time formatter using an invalid strftime format string If either occurs, the write attempt is now aborted and an error returned to the user with a HTTP 500 status code. Additionally unexpected partitioner errors now map to a catch-all error instead of panicking.	2023-06-01 17:44:44 +02:00
Dom Dwyer	6bb4f20d7c	refactor: remove redundant test test_partition_key was recreated below via a test generator.	2023-06-01 17:44:43 +02:00
Dom Dwyer	37bb5e0585	test: arbitrary reversible partition keys This test constructs a partition key from an arbitrary selection of pre-defined parts, and uses the resulting template to partition a write containing an arbitrary selection of pre-defined tag columns. Once a partition key is derived, the test asserts build_column_values() reverses it into the original set of tag (column_name, value) tuples present in the write.	2023-05-30 15:58:26 +02:00
Dom Dwyer	27bef292a3	feat: unambiguously reversible partition keys This commit changes the format of partition keys when generated with non-default partition key templates ONLY. A prior fixture test is unchanged by this commit, ensuring the default partition keys remain the same. When a custom partition key template is provided, it may specify one or more parts, with the TagValue template causing values extracted from tag columns to appear in the derived partition key. This commit changes the generated partition key in the following ways: * The delimiter of multi-part partition keys; the character used to delimit partition key parts is changed from "/" to "|" (the pipe character) as it is less likely to occur in user-provided input, reducing the encoding overhead. * The format of the extracted TagValue values (see below). Building on the work of custom partition key overrides, where an immutable partition template is resolved and set at table creation time, the changes in this PR enable the derived partition key to be unambiguously reversed into the set of tag (column_name, column_value) tuples it was generated from for use in query pruning logic. This is implemented by the build_column_values() method in this commit, which requires both the template, and the derived partition key. Prior to this commit, a partition key value extracted from a tag column was in the form "tagname_x" where "x" is the value and "tagname" is the name of the tag column it was extracted from. After this commit, the partition key value is in the form "x"; the column name is removed from the derived string to reduce the catalog storage overhead (a key driver of COGS). In the case of a NULL tag value, the sentinel value "!" is inserted instead of the prior "tagname_" marker. In the case of an empty string tag value (""), the sentinel "^" value is inserted instead of the "tagname_-" marker, ensuring the distinction between an empty value and a not-present tag is preserved. Additionally tag values utilise percent encoding to encode reserved characters (part delimiter, empty sentinel character, % itself) to eliminate deserialisation ambiguity. Examples of how this has changed derived partition keys, for a template of [Time(YYYY-MM-DD), TagValue(region), TagValue(bananas)]: Write: time=1970-01-01,region=west,other=ignored Old: "1970-01-01-region_west-bananas" New: "1970-01-01|west|!" Write: time=1970-01-01,other=ignored Old: "1970-01-01-region-bananas" New: "1970-01-01|!|!"	2023-05-30 15:58:25 +02:00
Dom Dwyer	57ba3c8cf5	test: default partition key fixture This test asserts the partition key of a write derived from the default partition key template (YYYY-MM-DD). This test ensures that the default partition keys do not change with subsequent changes, as these values are what are used today.	2023-05-30 15:55:08 +02:00
Dom Dwyer	9e0570f2bf	refactor: explicit submod for partition_template Move the import into the submodule itself, rather than re-exporting it at the crate level. This will make it possible to link to the specific module/logic.	2023-05-30 15:13:20 +02:00
Carol (Nichols || Goulding)	aab0acc16a	fix: Panic if attempting to partition on a non-tag column	2023-05-24 10:34:31 -04:00
Carol (Nichols || Goulding)	9c0faa66f0	feat: Set a table partition template explicitly or from the namespace And use the table partition template when partitioning writes to that table.	2023-05-24 10:34:30 -04:00
Carol (Nichols || Goulding)	afb3838437	feat: Optionally supply the namespace partition template when creating a namespace	2023-05-24 10:10:34 -04:00
Dom Dwyer	928a4d163e	build: remove unused dependencies from crates This commit fixes loads of crates (47!) that had unused dependencies, or mis-configured dependencies (test deps as normal deps). I added the "unused_crate_dependencies" lint to all crates to help prevent this mess from growing again! https://doc.rust-lang.org/beta/nightly-rustc/rustc_lint_defs/builtin/static.UNUSED_CRATE_DEPENDENCIES.html This has the minor downside of false-positives when specifying dev-dependencies for test/bench binaries - these are files in /test or /benches (not normal tests). This commit includes a workaround, importing them in lib.rs (gated by a feature flag). I think the trade-off of better dependency management is worth it!	2023-05-23 14:55:43 +02:00
Andrew Lamb	6344fe8c3f	chore: Add rationale for `clippy::future_not_send` (#7822) Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2023-05-18 16:58:56 +00:00
kayagokalp	81eb663122	refactor: accept impl Into<String> for schema methods	2023-05-11 01:44:14 +03:00
Carol (Nichols || Goulding)	2aa8713d1d	fix: Remove partition TemplatePart::Table; partitioning is already per-table	2023-05-09 14:54:57 +02:00
Carol (Nichols || Goulding)	ef9ef75e56	fix: Remove unsupported TemplatePart variants (#7746) Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2023-05-04 16:20:18 +00:00
Andrew Lamb	d8b0139ea9	chore: Update datafusion + arrow/parquet/arrow-flight to 36 (#7354) * chore: Update datafusion + arrow/parquet/arrow-flight to 36 * refactor: update optimize for new API * refactor: update parquet for new API * chore: Update more dependencies * chore: Update to use the new buffer creation APIs * chore: Run cargo hakari tasks * fix: bad len * fix: update for API change --------- Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2023-03-29 13:41:59 +00:00
Carol (Nichols || Goulding)	cc7c44f76a	chore: Upgrade to Rust 1.68 (#7175) * chore: Upgrade to Rust 1.68 * fix: Remove unnecessary into_iter, thanks Clippy! * fix: Use the size of the type, not a reference to the type... oops. Thanks clippy! * fix: Return block directly instead of creating a variable Thanks clippy! --------- Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2023-03-12 13:22:20 +00:00
Carol (Nichols || Goulding)	faae5eb438	chore: Rerun cargo hakari manage-deps	2023-02-27 11:56:15 +01:00
Andrew Lamb	f93baf7693	chore: Update DataFusion and `arrow` / `arrow-flight` / `parquet` to `33.0.0` (#7045) * chore: Update DataFusion and arrow/arrow-flight/parquet to 33.0.0 * fix: Update test output * fix: update more test output * fix: Update querier test output * chore: Run cargo hakari tasks * test: fix formatting Fix formatting of batch pretty printing. * test: fix formatting Fix formatting of batch pretty printing. * test: fix formatting for selector tests --------- Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: Dom Dwyer <dom@itsallbroken.com> Co-authored-by: Christopher Wolff <chris.wolff@influxdata.com>	2023-02-22 21:24:20 +00:00
Dom Dwyer	a1764ee7cb	refactor: ExactSizeIterator for columns iter Adds ExactSizeIterator bounds to the MutableBatch::column() iter, allowing O(1) length discovery / pre-allocation optimisations for container collection.	2023-02-06 17:33:56 +01:00
Carol (Nichols || Goulding)	30fea67701	fix: Move variables within format strings. Thanks clippy! Changes made automatically using `cargo clippy --fix`.	2023-02-03 13:06:17 -05:00
Paul Dix	84698b3532	feat: add size_data to mutable batch (#6425) This method will be used in the new ingestion pipeline to approximate how much memory a mutable batch will take to convert to arrow and persist. It is meant only as a very rough estimate to trigger persistence for hot partitions. Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-12-16 17:20:16 +00:00
Jake Goulding	cc17e5a54b	refactor: use a workspace dependency for hashbrown	2022-11-11 13:25:39 -05:00
dependabot[bot]	5024523f00	chore(deps): Bump hashbrown from 0.12.3 to 0.13.1 Bumps [hashbrown](https://github.com/rust-lang/hashbrown) from 0.12.3 to 0.13.1. - [Release notes](https://github.com/rust-lang/hashbrown/releases) - [Changelog](https://github.com/rust-lang/hashbrown/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-lang/hashbrown/compare/v0.12.3...v0.13.1) --- updated-dependencies: - dependency-name: hashbrown dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2022-11-11 13:24:56 -05:00
Andrew Lamb	4fb2843d05	refactor: Rename `schema::selection::Selection` to `schema::projection::Projection` (#6037) * chore: Rename `schema::selection::Selection` to `schema::projection::Projection` * fix: docs Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-11-02 18:15:04 +00:00
Carol (Nichols || Goulding)	3145e2c05b	feat: Use workspace dep inheritance for the arrow crate	2022-10-26 10:34:29 -04:00
Carol (Nichols || Goulding)	2e83e04eab	feat: Use workspace package metadata to reduce differences and repetition	2022-10-24 13:04:09 -04:00
Andrew Lamb	d706f8221d	chore: Update datafusion and arrow / parquet / arrow-flight 25.0.0 (#5900) * chore: Update datafusion and `arrow` / `parquet` / `arrow-flight` 25.0.0 * chore: Update for structure changes * chore: Update for new projection pushdown * chore: Run cargo hakari tasks * fix: fmt Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-10-18 20:58:47 +00:00
Carol (Nichols || Goulding)	efb964c390	feat: Enforce table column limits from the schema cache (#5819) * fix: Avoid some allocations by collecting instead of inserting into a vec * refactor: Encode that adding columns is for one table at a time * test: Add another test of column limits * test: Add below/above limit tests for create_or_get_many * fix: Explicitly DO NOT check column limits when inserting many columns * feat: Cache the max_columns_per_table on the NamespaceSchema * feat: Add a function to validate column limits in-memory * fix: Provide more useful information when over column limits * fix: Swap types to remove intermediate allocation * docs: Explain the interactions of the cache and the column limits * test: Actually set up test that showcases column limit race condition * fix: Allow writing to existing columns even if table is over column limit Co-authored-by: Dom <dom@itsallbroken.com>	2022-10-14 11:34:17 +00:00
Andrew Lamb	d57c99638c	chore: Update datafusion + `arrow`, `arrow-flight`, and `parquet` to 24.0.0.0 (#5792) * chore: Update datafusion + `arrow`, `arrow-flight`, and `parquet` to 24.0.0.0 * fix: Update for coercion, fix explain plans for change in column name display * chore: Update datafusion lock * fix: Update for other API changes * chore: Update to latest datafusion pin * chore: Run cargo hakari tasks Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-10-12 16:19:14 +00:00
Dom Dwyer	cd4087e00d	style: add no todo!() or dbg!() lints Some crates had them, some not - let's be consistent and have the compiler spot dbg!() and todo!() macro calls - they should never be in prod code!	2022-09-29 13:10:07 +02:00
Andrew Lamb	66dbb9541f	chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to 23.0.0, `thrift` to 0.16.0 (#5694) * chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to 23.0.0 * chore: Update thrift / remove parquet_format * fix: Update APIs * chore: Update lock + Run cargo hakari tasks * fix: use patched version of arrow-rs to work around https://github.com/apache/arrow-rs/issues/2779 * chore: Run cargo hakari tasks Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-09-27 12:50:54 +00:00
Andrew Lamb	1fd31ee3bf	chore: Update datafusion / `arrow` / `arrow-flight` / `parquet` to version 22.0.0 (#5591) * chore: Update datafusion / `arrow` / `arrow-flight` / `parquet` to version 22.0.0 * fix: enable dynamic comparison flag * chore: derive Eq for clippy * chore: update explain plans * chore: Update sizes for ReadBuffer encoding * chore: update more tests Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-09-12 17:45:03 +00:00
Andrew Lamb	6669d85fb4	chore: Update datafusion + arrow/parquet to `21.0.0` (#5519) * chore: Update arrow/arrow-flight/parquet to 21.0.0 * chore: Update datafusion pin * chore: Fix arrow update script * chore: Update Cargo.lock * chore: Update for new API	2022-08-31 13:30:47 +00:00
Carol (Nichols || Goulding)	549a267e3c	fix: Use Self instead of unnecessary structure name repetition As now caught by clippy. https://rust-lang.github.io/rust-clippy/master/index.html#use_self	2022-08-11 15:21:02 -04:00
Andrew Lamb	16ddc5efc6	chore: Update datafusion / arrow/parquet/arrow-flight and prost/tonic ecosystem (#5360) * chore: Update datafusion and arrow * chore: Update Cargo.lock * chore: update to Decimal128 * chore: Update tonic/prost/pbjson/etc * chore: Run cargo hakari tasks * fix: doctest in generated types Co-authored-by: CircleCI[bot] <circleci@influxdata.com>	2022-08-09 17:30:44 +00:00
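
The reversible key format described in commit 27bef292a3 can be sketched roughly as follows. This is an illustrative sketch only, not the code from that commit: the `TagValue` enum and the functions `encode_part`, `decode_part`, `build_key` and `reverse_key` are hypothetical names, the time part is assumed to contain no reserved characters, and percent-encoding the "!" sentinel as well is an assumption made here so that a literal "!" tag value round-trips (the commit message lists only the delimiter, the empty sentinel and "%").

```rust
/// Hypothetical model of one tag column's value in a row.
#[derive(Debug, Clone, PartialEq)]
enum TagValue {
    /// The tag column is not present in the write.
    Null,
    /// The tag is present (possibly as the empty string).
    Value(String),
}

const DELIMITER: char = '|';
const NULL_SENTINEL: char = '!';
const EMPTY_SENTINEL: char = '^';

/// Characters that must be percent-encoded inside a tag value.
/// Encoding '!' is an assumption of this sketch.
fn is_reserved(c: char) -> bool {
    c == DELIMITER || c == NULL_SENTINEL || c == EMPTY_SENTINEL || c == '%'
}

/// Render one tag value as a partition key part.
fn encode_part(v: &TagValue) -> String {
    match v {
        TagValue::Null => NULL_SENTINEL.to_string(),
        TagValue::Value(s) if s.is_empty() => EMPTY_SENTINEL.to_string(),
        TagValue::Value(s) => {
            let mut out = String::with_capacity(s.len());
            for c in s.chars() {
                if is_reserved(c) {
                    // Every reserved character is ASCII, so it fits in one byte.
                    out.push_str(&format!("%{:02X}", c as u8));
                } else {
                    out.push(c);
                }
            }
            out
        }
    }
}

/// Reverse `encode_part`; returns None on a malformed escape sequence.
fn decode_part(part: &str) -> Option<TagValue> {
    match part {
        "!" => return Some(TagValue::Null),
        "^" => return Some(TagValue::Value(String::new())),
        _ => {}
    }
    let mut out = String::with_capacity(part.len());
    let mut chars = part.chars();
    while let Some(c) = chars.next() {
        if c == '%' {
            let hi = chars.next()?.to_digit(16)?;
            let lo = chars.next()?.to_digit(16)?;
            out.push(char::from_u32(hi * 16 + lo)?);
        } else {
            out.push(c);
        }
    }
    Some(TagValue::Value(out))
}

/// Join an already formatted time part with the encoded tag parts.
fn build_key(time_part: &str, tags: &[TagValue]) -> String {
    let mut key = String::from(time_part);
    for t in tags {
        key.push(DELIMITER);
        key.push_str(&encode_part(t));
    }
    key
}

/// Recover the tag values from a key, skipping the leading time part.
fn reverse_key(key: &str) -> Option<Vec<TagValue>> {
    key.split(DELIMITER).skip(1).map(decode_part).collect()
}
```

Under these assumptions, `build_key("1970-01-01", &[TagValue::Value("west".into()), TagValue::Null])` yields the "New" key from the first example in the commit message, and `reverse_key` on that string returns the original two tag values.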


101 Commits (1762172321582bc403875296a9fbaf9e585169cc)