influxdb

Commit Graph

Author	SHA1	Message	Date
Nga Tran	425b8a63cf	fix: avoid combining groups that overlap with other groups even if they are small (#5052 ) * fix: avoid combining groups that overlap with other groups even if they are small * chore: Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-07-06 14:03:15 +00:00
Nga Tran	d8b74f6af8	refactor: convert a panic into an error and throw a warning if we choose non-actionable compacting candidates (#5041 ) * refactor: convert a panic into an error and throw a warning if we choose non-actionable candidates * chore: Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * chore: run fmt Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2022-07-05 18:53:52 +00:00
Nga Tran	1de022136c	feat: add max desired file size config param (#5025 ) * feat: add max desired file size config param * fix: comment typos * chore: Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * chore: Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2022-07-05 15:32:45 +00:00
Marco Neumann	16bd3e67c0	refactor: unify `apply_predicate_to_metadata` (#5030 ) Instead of using some hand-rolled timestamp-based logic (or just "unknown") all over the place, just use logic introduced in #5017. This requires slightly improved table summaries within the querier that at least have min/max for the timestamp column. For that, the former `IngesterChunk`-specific `calculate_summary` method was extended to `create_basic_summary` to include that data and is now also used by `QuerierParquetChunk`. Note: `QuerierRBChunk` already has detailed metrics that are provided by the read buffer implementation. Should we ever need even better pruning for `QuerierParquetChunk` (or `IngesterChunk`) then we _only_ need to add extra data to the table summaries. Closes #4976. Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-07-05 12:51:59 +00:00
Nga Tran	153c262d63	fix: do not panic on chunks with same range of sequence numbers but are not time-overlapped (#5018 ) * fix: do not panic on chunks with same range of sequence numbers but are not time-overlapped * chore: remove unused comment * chore: fix typo	2022-07-01 15:58:09 +00:00
Marco Neumann	87a8579742	refactor: `ChunkOrder::new` cannot fail (#5004 ) Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-30 22:26:20 +00:00
Marco Neumann	be53716e4d	refactor: use IDs for `parquet_file.column_set` (#4965 ) * feat: `ColumnRepo::list_by_table_id` * refactor: use IDs for `parquet_file.column_set` Closes #4959. * refactor: introduce `TableSchema::column_id_map`	2022-06-30 15:08:41 +00:00
Raphael Taylor-Davies	835e1c91c7	chore: update object_store to 0.3.0 (#4707 ) * chore: update object_store to 0.3.0 * chore: review feedback Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-29 21:44:03 +00:00
Nga Tran	0cca975167	fix: Split overlapped files based on the order of sequence numbers and only group non-overlapped contiguous small files (#4968 ) * fix: Split overlapped files based on the order of sequence numbers and only group non-overlapped contiguous small files * test: add one more test for group contiguous files: * refactor: address review comments Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-29 20:09:51 +00:00
Nga Tran	cfcc4b8426	refactor: change level 1 to level 2 preparing for next design changes (#4954 ) * refactor: change level 1 to level 2 preparing for next design changes * fix: make level-2 consistent everywhere * chore: remove unused comments * refactor: change all the name level_1 to level_2 to completely replace 1 with 2 to make everything consistent * chore: add corresponding constants for the compaction levels in the comments Co-authored-by: Dom <dom@itsallbroken.com>	2022-06-29 14:08:58 +00:00
Marco Neumann	215f297162	refactor: parquet file metadata from catalog (#4949 ) * refactor: remove `ParquetFileWithMetadata` * refactor: remove `ParquetFileRepo::parquet_metadata` * refactor: parquet file metadata from catalog Closes #4124.	2022-06-27 15:38:39 +00:00
Marco Neumann	1a74f84494	refactor: remove `ParquetFileWithMetadata` usage outside the catalog (#4948 ) * refactor: remove `DecodedParquetFile` from `iox_tests` * refactor: remove `DecodedParquetFile` from querier Also pull out all the chunk schema and sort key handling into a function so that RB chunks and parquet chunks mostly use the same code path. * refactor: remove `DecodedParquetFile` * refactor: remove `ParquetFileWithMetadata` usage * fix: test data consistency	2022-06-27 15:19:29 +00:00
Marco Neumann	3b78bf1c48	refactor: remove binary parquet file MD from compactor (#4938 ) * refactor: simplify sort key calculation * refactor: use schema from catalog instead from file * refactor: do not request parquet file MD in compactor * test: ensure that `QueryableParquetChunk` works correctly	2022-06-27 15:11:15 +00:00
Marco Neumann	b9cbb3dfca	refactor: do not use in-parquet IOx metadata in compactor () (#4935 ) refactor: avoid feeding sort key from struct into same struct * feat: allow namespace schema query by ID * refactor: do not use binary parquet file MD in compactor tests * refactor: do not use in-parquet IOx metadata * refactor: reduce number of catalog queries	2022-06-27 08:06:11 +00:00
Nga Tran	3c0fb6e8ef	fix: avoid using min_time, which can be negative, for ChunkId. Using object store id which is uuid instead (#4942 ) * fix: avoid using min_time, which can be negative, for ChunkId. Using object store id which is uuid instead * chore: Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * chore: run fmt Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-23 19:00:13 +00:00
Nga Tran	35dacf388b	feat: Compact now can split compacted results into multiple non-overlapped files based on config max file size (#4918 ) * feat: split times of compacting results based on the max file size * feat: consider max file size while computing split time * test: tests for compute_split_time * feat: first step to teach the function split_the_steam to know how to split data into n streams using n-1 input PhysicalExprs * feat: make StreamSplitNode support a list of expression * docs: explain how StreamSplitNode works * feat: Teach compute_split_time to split a time range into many contiguous ranges and split compacted result into multiple non-overlapped files based on the config comapction_max_size_bytes * chore: cleanup * chore: clean up doc * chore: address review comments Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-23 18:54:03 +00:00
Marco Neumann	bd6c4659af	refactor: slim down parquet chunk (remove Metadata) (#4934 ) * feat: conversion from `ParquetFile` to `ParquetFilePath` * refactor: slim down parquet chunk - ensure it works without binary parquet metadata - timestamp range is no longer optional (ensured by the NG type system) - remove table summary: this is only needed for SOME API users. The compactor can perfectly work without statistics since it has the timestamp range which is sufficient for the current overlap check (we don't use any other primary key stats at the moment). The querier currently does NOT use parquet chunks (was replaced by read buffer) but if it does again in the future it will likely need to find a way to fetch and cache the statistics. - the schema is now provided by the API user since it can be reconstructed using the NG catalog only (and "wrong" column orders are tolerated as of #4921) Ref #4124 Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-23 10:55:16 +00:00
Marco Neumann	9591bed696	refactor: make querier internals private (#4922 ) Querier internals are not meant to be used by other crates. Only a handful of selected interfaces should be used by IOxD and the query tests. The compactor only used a very small subset just to read parquet files back into memory. It shall rather use the official `parquet_file` interface instead. Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-22 13:00:08 +00:00
Marco Neumann	c3912e34e9	refactor: store per-file column set in catalog (#4908 ) * refactor: store per-file column set in catalog Together with the table-wide schema and the partition-wide sort key, this should be everything we need to read a parquet file directly into memory without peeking any file-level metadata. The querier will use this to directly load parquet files into the read buffer. WARNING: This requires a catalog wipe! Ref #4124. * refactor: use proper `ColumnSet` type	2022-06-21 10:26:12 +00:00
Nga Tran	72c8cfa6ed	fix: make ChunkOrder i64 data type to accept min sequence number 0 and match with data type of sequence number (#4888 ) * fix: make ChunkOrder u64 data type to accept min sequence number 0 * fix: make ChunkOrder i64 to match with sequence number type Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-17 13:45:17 +00:00
Marco Neumann	0fbff981ec	chore(deps): Bump sqlx to 0.6.0 and uuid to 1 (#4894 ) Closes #4889. Closes #4890. Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-06-17 10:28:28 +00:00
Nga Tran	3ca74744bf	chore: debug info about sequence number while it gets converted into ChunkOrder (#4884 )	2022-06-16 18:40:55 +00:00
Nga Tran	d57b0eb1fa	chore: more info for i64-to-u128 panic message (#4881 ) * chore: more info for i64-to-u128 panic message * chore: Apply suggestions from code review Co-authored-by: Dom <dom@itsallbroken.com> * chore: fix fmt Co-authored-by: Dom <dom@itsallbroken.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-16 15:49:43 +00:00
Andrew Lamb	005610b172	refactor: remove some `&` use in iox_catalog (#4862 ) * refactor: remove some `&` use in iox_catalog * fix: Update data_types/src/lib.rs	2022-06-15 11:31:49 +00:00
Andrew Lamb	e91d00b10c	chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `16.0.0` (#4851 ) * chore: TEMP Update DataFusion to pre-release * chore: update arrow et al to 16.0.0 * chore: Run cargo hakari tasks * fix: update reader read_dictionary API * chore: Update to real Datafusion release * fix: Update parquet API * fix: update test Co-authored-by: CircleCI[bot] <circleci@influxdata.com>	2022-06-14 16:31:40 +00:00
Dom Dwyer	b41ea1d718	refactor: PartitionKey type This commit changes the code base to use a new reference-counted PartitionKey type wrapper, instead of passing a bare String around. This allows the compiler to type check & verify usage of the partition key, instead of passing a bare string around. By reference counting the underlying string, we reduce memory usage for some use cases.	2022-06-14 14:47:56 +01:00
Nga Tran	99f1f0a10c	chore: Revert "feat: compact all overlapped files no matter how large they are (#4779 )" (#4831 ) This reverts commit `3e89daa0d4`.	2022-06-10 15:52:00 +00:00
Carol (Nichols \|\| Goulding)	1c7cbaf5ae	refactor: Use DurationHistogram in more places	2022-06-09 14:20:51 -04:00
Andrew Lamb	f34282be2c	fix: Do not run DataFusion optimizer pass twice (#4809 ) * fix: Do not run DataFusion optimizer pass twice * docs: improve docstring and logging	2022-06-08 21:01:22 +00:00
Nga Tran	b60e1be0cf	chore: remove irrelevant comments (#4791 )	2022-06-07 00:43:56 +00:00
Nga Tran	3e89daa0d4	feat: compact all overlapped files no matter how large they are (#4779 ) * feat: add an option to compact all overlapped files no matter how large they are * chore: Apply suggestions from code review * feat: always compact overlapped files no matter how large they are * chore: cleanup	2022-06-06 23:39:09 +00:00
dependabot[bot]	04c685b3b7	chore(deps): Bump tokio-util from 0.7.2 to 0.7.3 (#4784 ) Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.7.2 to 0.7.3. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](https://github.com/tokio-rs/tokio/compare/tokio-util-0.7.2...tokio-util-0.7.3) --- updated-dependencies: - dependency-name: tokio-util dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-06-06 14:46:27 +00:00
dependabot[bot]	e03bf94420	chore(deps): Bump tokio from 1.18.2 to 1.19.1 (#4783 ) Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.18.2 to 1.19.1. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.18.2...tokio-1.19.1) --- updated-dependencies: - dependency-name: tokio dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-06 14:15:12 +00:00
Carol (Nichols \|\| Goulding)	aa510ae4e6	fix: Remove test uses of parquet chunks and document as unused The querier is now using read buffer chunks only, but we're leaving the parquet chunk code around for the moment.	2022-06-03 09:16:04 -04:00
Andrew Lamb	3592aa52d8	chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0` (#4743 ) * chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0` * chore: Update APIs * chore: Run cargo hakari tasks * feat: normalize parquet file metadata * chore: update size tests * chore: add docs on metadata stripping * chore: TEMP UPDATE TO DF BRANCH * chore: Update for new API * fix: Update to latest DF * fix: cargo hakari Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>	2022-06-03 10:32:26 +00:00
dependabot[bot]	9a21292db8	chore(deps): Bump async-trait from 0.1.53 to 0.1.56 (#4774 ) Bumps [async-trait](https://github.com/dtolnay/async-trait) from 0.1.53 to 0.1.56. - [Release notes](https://github.com/dtolnay/async-trait/releases) - [Commits](https://github.com/dtolnay/async-trait/compare/0.1.53...0.1.56) --- updated-dependencies: - dependency-name: async-trait dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-06-03 09:10:40 +00:00
Ryan Russell	d279deddad	docs(various): Improve Readability (#4768 ) Signed-off-by: Ryan Russell <git@ryanrussell.org> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-02 18:01:06 +00:00
Nga Tran	79895b995c	chore: add debug info to see how many concurrent partitions being compacted in each cycle (#4772 ) Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-02 15:19:08 +00:00
Dom Dwyer	9ae58c89b6	refactor: constructor for ParquetFileWithTombstone Use a constructor to initialise a ParquetFileWithTombstone struct, rather than making the fields pub. This allows IDEs to "go to" places where this is constructed when browsing the code, but also keeps the type closed for modification of internals (SOLID).	2022-06-01 15:58:06 +01:00
Nga Tran	79220720be	chore: increase size of a compactor job and level of concurrency (#4746 ) * fix: let us not compact no-data * fix: split time must be greater min_time, too * fix: resolve merge conflict * chore: increase size of a compactor job and level of concurrency Co-authored-by: Dom <dom@itsallbroken.com>	2022-05-31 19:57:06 +00:00
Nga Tran	dfd35c05a1	fix: let us not compact no-data (#4744 ) * fix: let us not compact no-data * fix: split time must be greater min_time, too * fix: resolve merge conflict Co-authored-by: Dom <dom@itsallbroken.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-05-31 17:02:14 +00:00
Dom Dwyer	70864b9f48	refactor: always use correct chunk sort key Don't use the same sort key for all files - sort keys may grow over time, and the information is already at hand.	2022-05-30 17:41:41 +01:00
Dom Dwyer	6aa2a6958a	refactor: assert consistent parquet file metadata Assert consistent metadata when evaluating candidate parquet files for compaction. Asserts all files have the same: * Sequencer ID * Namespace ID * Table ID * Partition ID * Sort key	2022-05-30 17:41:41 +01:00
Dom Dwyer	0f16d6cabb	refactor: consistent SortKey source Changes the compaction logic to always reference the same SortKey instance, rather than repeatedly querying for it. The Partition metadata is always read from the catalog as part of compact_partition(), where it previously threw away all metadata except the sort key, which was passed into compact(). Then compact() would always re-query the catalog to look up just the sort key again, and mix up the two instances during use - one passed into the fn, one freshly queried within the fn. Now the Partition metadata is resolved in compact_partition() as it was previously, but the entire Partition reference is passed to compact(), and this is consistently used to access the sort key. This also removes a catalog query per compaction call.	2022-05-30 17:41:41 +01:00
kodiakhq[bot]	842ef8e308	Merge branch 'main' into cn/fetch-from-parquet-file	2022-05-27 17:08:28 +00:00
Andrew Lamb	dde3c3922c	refactor: use consistent spelling of serialize (#4717 )	2022-05-27 14:42:59 +00:00
Nga Tran	ea81152fac	refactor: add partition ID into debug info and panic earlier to identify the bug easier (#4716 ) * chore: point tests to the new ticket * chore: cleanup * refactor: add partition ID into debug info and panic earlier to identify the bug easier	2022-05-27 12:20:36 +00:00
Carol (Nichols \|\| Goulding)	5fd3ffc17f	refactor: Rename ParquetChunkAdapter to only ChunkAdapter It might be creating chunks of different kinds other than ParquetChunks.	2022-05-26 16:52:14 -04:00
Carol (Nichols \|\| Goulding)	df10452e2e	refactor: Rename methods from new_querier_chunk to new_querier_parquet_chunk	2022-05-25 17:19:10 -04:00
Nga Tran	6cc767efcc	feat: teach compactor to compact smaller number of files (#4671 ) * refactor: split compact_partition into two functions to handle concurrency better * feat: limit number of files to compact * test: add test for limit num files * chore: fix clippy * feat: split group if over max size * fix: split the overlapped group to limit size or file num * chore: reduce config values * test: add tests and clearer comments for the split_overlapped_groups and test_limit_size_and_num_files * chore: more comments * chore: cleanup	2022-05-25 19:54:34 +00:00

164 Commits (a96976db46a68349539b92eb9e4aed05ed785014)