influxdb

Commit Graph

Author	SHA1	Message	Date
Carol (Nichols \|\| Goulding)	fe9c474620	fix: rustfmt	2022-08-29 14:06:45 -04:00
Carol (Nichols \|\| Goulding)	240946d8f5	fix: Deprecate proto sequencer_id fields; add shard_id fields	2022-08-29 14:06:44 -04:00
Jake Goulding	4abf21c724	refactor: Rename Sequencer (and its entourage) to Shard	2022-08-29 14:06:43 -04:00
Andrew Lamb	7f0ae53d6f	chore: Update to (almost) released object_store 0.4.0 (#5419 ) * chore: update object_store * chore: update hakari config * chore: Run cargo hakari tasks Co-authored-by: CircleCI[bot] <circleci@influxdata.com>	2022-08-17 13:44:48 +00:00
Carol (Nichols \|\| Goulding)	b982bdaf2f	fix: Derive Eq when we derive PartialEq and members can derive Eq Allow this in generated code that we don't control, though. Recommended by clippy now. https://rust-lang.github.io/rust-clippy/master/index.html#derive_partial_eq_without_eq	2022-08-11 15:04:06 -04:00
Andrew Lamb	16ddc5efc6	chore: Update datafusion / arrow/parquet/arrow-flight and prost/tonic ecosystem (#5360 ) * chore: Update datafusion and arrow * chore: Update Cargo.lock * chore: update to Decimal128 * chore: Update tonic/prost/pbjson/etc * chore: Run cargo hakari tasks * fix: doctest in generated types Co-authored-by: CircleCI[bot] <circleci@influxdata.com>	2022-08-09 17:30:44 +00:00
Marco Neumann	e24cecd926	fix: buffer allocation while reading parquet files (#5312 ) Work around https://github.com/apache/arrow-rs/issues/2321 by limiting reader batch size to number of rows (based on file-level metadata). Fixes https://github.com/influxdata/conductor/issues/1103 .	2022-08-04 14:21:05 +00:00
dependabot[bot]	55e1e2ec2b	chore(deps): Bump thiserror from 1.0.31 to 1.0.32 (#5294 ) * chore(deps): Bump thiserror from 1.0.31 to 1.0.32 Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.31 to 1.0.32. - [Release notes](https://github.com/dtolnay/thiserror/releases) - [Commits](https://github.com/dtolnay/thiserror/compare/1.0.31...1.0.32) --- updated-dependencies: - dependency-name: thiserror dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * chore: Run cargo hakari tasks Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-08-03 16:20:36 +00:00
Nga Tran	ee151c8b41	fix: make batch size back to 1024 to see if the OOM in the compactor go away (#5289 ) * fix: make batch size back to 1024 to see if the OOM in the compactor go away * fix: address review comments * chore: Apply suggestions from code review Co-authored-by: Marco Neumann <marco@crepererum.net> * fix: import needed constant Co-authored-by: Marco Neumann <marco@crepererum.net> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-08-03 15:13:04 +00:00
Nga Tran	8f1b6f2465	chore: reduce log info (#5254 ) Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-08-01 16:00:34 +00:00
Marco Neumann	3ae89de324	refactor: increase batch size when reading parquet (#5250 ) * refactor: increase batch size when reading parquet This reduced our overhead when reading parquet files quite a lot. In some internal benchmark, this reduces the size to perform a single series aggregation of a rather large series with cold caches from 58s to 48s for cold caches. No real difference could be measured for warm caches (~21ms for both). This should also help the compactor since the record batches should be larger. * refactor: ensure that parquet row group size is in-sync Ensure that we use the same row group size for reading and writing parquet files. This is the same value as upstream currently uses as a default, but let's make sure we don't diverge from that: `3032a521c9/parquet/src/file/properties.rs (L65)` Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-08-01 10:31:26 +00:00
Andrew Lamb	9215a534d0	chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to `19.0.0` (#5229 ) * chore: Update datafusion and `arrow`/`parquet`/`arrow-flight` to `19.0.0` * chore: Run cargo hakari tasks * fix: Update for API changes * fix: clippy Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-07-28 08:10:47 +00:00
Nga Tran	d05f383a98	refactor: reduce compacting size and compacted file size to prevent compactor from waiting for reading a large file forever (#5206 ) Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-07-25 20:08:11 +00:00
Nga Tran	69cb3f2b19	refactor: remove min_sequence_number from Compactor and Querier, add `count_by_overlaps_with_level_0` and `count_by_overlaps_with_level_1` to catalog (#5151 ) * refactor: remove min_sequnce_number * fix: typos * fix: remove min_sequencer_number from new files from merging main * fix: add back throwing error if the compactor compacts files persisted by the ingester after the ingester sends max seq_num back to querier * test: add test_compactor_collision back but modify the input to make it work woth new changes Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-07-21 13:51:54 +00:00
dependabot[bot]	278a7f91af	chore(deps): Bump bytes from 1.1.0 to 1.2.0 (#5156 ) Bumps [bytes](https://github.com/tokio-rs/bytes) from 1.1.0 to 1.2.0. - [Release notes](https://github.com/tokio-rs/bytes/releases) - [Changelog](https://github.com/tokio-rs/bytes/blob/master/CHANGELOG.md) - [Commits](https://github.com/tokio-rs/bytes/compare/v1.1.0...v1.2.0) --- updated-dependencies: - dependency-name: bytes dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-07-20 10:00:08 +00:00
Andrew Lamb	e2d871b00b	chore: Update datafusion and arrow/parquet/arrow-flight to `18.0.0` (#5079 ) * chore: Update datafusion to 10.0.0, arrow/parquet/arrow-flight to 18 * chore: Run cargo hakari tasks * fix: update cargo pin Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-07-18 15:01:03 +00:00
Jake Goulding	635f535e0e	refactor: replace level_2 with level_1	2022-07-16 21:49:45 -04:00
dependabot[bot]	9b67de2f43	chore(deps): Bump tokio from 1.19.2 to 1.20.0 Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.19.2 to 1.20.0. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.19.2...tokio-1.20.0) --- updated-dependencies: - dependency-name: tokio dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2022-07-14 01:21:43 +00:00
Carol (Nichols \|\| Goulding)	61c023139b	refactor: Switch compaction levels to an enum with values rather than separate consts Bonuses: - Type checking - Validation - Less casting - Exhaustiveness checking - Less use of the numerical value	2022-07-13 11:30:36 -04:00
Marco Neumann	b1b2cb5d4a	feat: load read buffer on demand (#5091 ) * refactor: extract `select_schema` * refactor: improve `InternalLostInputField` error message * test: improve SQL runner output * feat: load read buffer on demand Closes #5032. * refactor: move `[Half]OwnedSelection` to `schema` crate`	2022-07-13 08:51:40 +00:00
Carol (Nichols \|\| Goulding)	80b6c5c82f	fix: Correct typo in constant name so searching for COMPACTION_LEVEL returns all (#5077 )	2022-07-08 16:31:52 +00:00
Andrew Lamb	c46e1c6347	chore: Update datafusion + arrow/parquet/arrow-flight to `17.0.0` (#5021 ) * fix: correct nullability declaration of system tables * chore: Update datafusion and arrow/parquet/arrow-flight * chore: Run cargo hakari tasks * fix: Update tests * fix: Update tests * fix: predicate pruning * fix: add some tests * fix: query_functions * fix: fix read_buffer test * fix: fix clippy Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-07-07 19:22:15 +00:00
Marco Neumann	16bd3e67c0	refactor: unify `apply_predicate_to_metadata` (#5030 ) Instead of using some hand-rolled timestamp-based logic (or just "unknown") all over the place, just use logic introduced in #5017. This requires slightly improved table summaries within the querier that at least has min/max for the timestamp column. For that, the former `IngesterChunk`-specific `calculate_summary` method was extended to `create_basic_summary` to include that data and is now also used by `QuerierParquetChunk`. Note: `QuerierRBChunk` already has detailled metrics that are provided by the read buffer implementation. Should we ever need even better pruning for `QuerierParquetChunk` (or `IngesterChunk`) then we _only_ need add extra data to the table summaries. Closes #4976. Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-07-05 12:51:59 +00:00
Marco Neumann	016dd93d9c	feat: filter chunks before requesting read buffers (#4996 ) Fixes #4976.	2022-07-01 08:59:07 +00:00
Marco Neumann	be53716e4d	refactor: use IDs for `parquet_file.column_set` (#4965 ) * feat: `ColumnRepo::list_by_table_id` * refactor: use IDs for `parquet_file.column_set` Closes #4959. * refactor: introduce `TableSchema::column_id_map`	2022-06-30 15:08:41 +00:00
Marco Neumann	05189fdb00	fix: ensure panic is propagated throw `AdapterStream` (#4980 ) Prior to this change background tasks that we feed into `AdapterStream` can panic but that would just end the stream without any user-visible error (except for the panic message on stdout/stderr). This was found while developing #4964. I have proposed another fix in #4966 but found that I actually developed an existing solution a 2nd time: `watch_task`. But I also see a major issue with the existing API: one can create `AdapterStream` with ordinary tokio tasks that are not watched at all, leaving the burden to the implementor to check for that (and actually we forgot that in `parquet_file`). So this change takes a slightly different approach: The `AdapterStream` does NOT accept ordinary join handles any longer but requires that you pass a "watched task". The newly introduced `WatchedTask` does the same as we did manually before: wrapping a future into a tokio task, watch it and wrap the watcher into a task. It is now way more difficult to do anything stupid (sure you can still mix up the tasks and the channels, but we need at least some flexibility here to allow for "split" and potential future fan-in/out constructs). Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-30 07:57:58 +00:00
Raphael Taylor-Davies	835e1c91c7	chore: update object_store to 0.3.0 (#4707 ) * chore: update object_store to 0.3.0 * chore: review feedback Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-29 21:44:03 +00:00
Nga Tran	cfcc4b8426	refactor: change level 1 to level 2 preparing for next design changes (#4954 ) * refactor: change level 1 to level 2 preparing for next design changes * fix: make level-2 consistent everywhere * chore: remove unused comments * refactor: change all the name level_1 to level_2 to completely replace 1 with 2 to amke everything consistent * chore: add correspinding constants for the comapction levels in the comments Co-authored-by: Dom <dom@itsallbroken.com>	2022-06-29 14:08:58 +00:00
Marco Neumann	9b66d02229	fix: parquet reader sort order (#4964 )	2022-06-28 15:28:38 +00:00
Marco Neumann	215f297162	refactor: parquet file metadata from catalog (#4949 ) * refactor: remove `ParquetFileWithMetadata` * refactor: remove `ParquetFileRepo::parquet_metadata` * refactor: parquet file metadata from catalog Closes #4124.	2022-06-27 15:38:39 +00:00
Marco Neumann	1a74f84494	refactor: remove `ParquetFileWithMetadata` usage outside the catalog (#4948 ) * refactor: remove `DecodedParquetFile` from `iox_tests` * refactor: remove `DecodedParquetFile` from querier Also pull out all the chunk schema and sort key handling into a function so that RB chunks and parquet chunks mostly use the same code path. * refactor: remove `DecodedParquetFile` * refactor: remove `ParquetFileWithMetadata` usage * fix: test data consistency	2022-06-27 15:19:29 +00:00
Nga Tran	3c0fb6e8ef	fix: avoid using min_time, which can be negative, for ChunkId. Using object store id which is uuid instead (#4942 ) * fix: avoid using min_time, which can be negative, for ChunkId. Using object store id which is uuid instead * chore: Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * chore: run fmt Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-23 19:00:13 +00:00
Marco Neumann	bd6c4659af	refactor: slim down parquet chunk (remove Metadata) (#4934 ) * feat: conversion from `ParquetFile` to `ParquetFilePath` * refactor: slim down parquet chunk - ensure it works without binary parquet metadata - timestamp range is no longer optional (ensured by the NG type system) - remove table summary: this is only needed for SOME API users. The compactor can perfectly work without statistics since has the timestamp range which is sufficient for the current overlap check (we don't use any other primary key stats at the moment). The querier currently does NOT use parquet chunks (was replaced by read buffer) but if it will again in some future it will likely need to find a way to fetch and cache the statistics. - the schema is now provided by the API user since it can be reconstructed using the NG catalog only (and "wrong" column orders are tolerated as of #4921) Ref #4124 Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-23 10:55:16 +00:00
Marco Neumann	c899c3a0f4	fix: column handling when reading parquet files (#4921 ) * fix: column handling when reading parquet files This improves/fixes/tests a few aspects when reading parquet files: - fix usage of `Selection::Some(...)`. This was broken since #4912 but apparently no test caught that. - ensure that the order of `Selection::Some(...)` is preserved - ensure that schema metadata is attached to output batches - ignore parquet columns that we don't care about (i.e. do not select) - allow parquet file to have a different column order than our internal bookkeeping, this makes it way simpler to read parquet files w/o scanning the metadata first - extend the test coverage Ref #4124. * test: even more tests for parquet reader	2022-06-22 13:51:30 +00:00
Marco Neumann	59accfe862	refactor: assorted fixes and prep work for #4124 (#4912 ) * refactor: `TestPartition::update_sort_key` should return an `Arc` The whole test framework is built around `Arc`s, so let's fix this consistency issue. * fix: actually calculate correct column set in test framework * feat: check expected parquet file schema While working on the querier I made some mistakes regarding schemas and such a check would have greatly improved the debugging experience. * feat: namespace cache expiration * fix: improve parquet schema check * fix: remove clone	2022-06-21 16:08:28 +00:00
Marco Neumann	0f63be26c3	refactor: pass path instead of metadata around to load parquet files (#4909 )	2022-06-21 10:57:10 +00:00
Marco Neumann	c3912e34e9	refactor: store per-file column set in catalog (#4908 ) * refactor: store per-file column set in catalog Together with the table-wide schema and the partition-wide sort key, this should be everything we need to read a parquet file directly into memory without peeking any file-level metadata. The querier will use this to directly load parquet files into the read buffer. WARNING: This requires a catalog wipe! Ref #4124. * refactor: use proper `ColumnSet` type	2022-06-21 10:26:12 +00:00
Marco Neumann	0fbff981ec	chore(deps): Bump sqlx to 0.6.0 and uuid to 1 (#4894 ) Closes #4889. Closes #4890. Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-06-17 10:28:28 +00:00
Andrew Lamb	e91d00b10c	chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `16.0.0 (#4851 ) * chore: TEMP Update DataFusion to pre-release * chore: update arrow et al to 16.0.0 * chore: Run cargo hakari tasks * fix: update reader read_dictionary API * chore: Update to real Datafusion release * fix: Update parquet API * fix: update test Co-authored-by: CircleCI[bot] <circleci@influxdata.com>	2022-06-14 16:31:40 +00:00
Dom Dwyer	b41ea1d718	refactor: PartitionKey type This commit changes the code base to use a new reference-counted PartitionKey type wrapper, instead of passing a bare String around. This allows the compiler to type check & verify usage of the partition key, instead of passing a bare string around. By reference counting the underlying string, we reduce memory usage for some use cases.	2022-06-14 14:47:56 +01:00
dependabot[bot]	e03bf94420	chore(deps): Bump tokio from 1.18.2 to 1.19.1 (#4783 ) Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.18.2 to 1.19.1. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.18.2...tokio-1.19.1) --- updated-dependencies: - dependency-name: tokio dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-06 14:15:12 +00:00
Andrew Lamb	3592aa52d8	chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0` (#4743 ) * chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0` * chore: Update APIs * chore: Run cargo hakari tasks * feat: normalize parquet file metadata * chore: update size tests * chore: add docs on metadata stripping * chore: TEMP UPDATE TO DF BRANCH * chore: Update for new API * fix: Update to latest DF * fix: cargo hakari Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>	2022-06-03 10:32:26 +00:00
Ryan Russell	d279deddad	docs(various): Improve Readability (#4768 ) Signed-off-by: Ryan Russell <git@ryanrussell.org> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-06-02 18:01:06 +00:00
Carol (Nichols \|\| Goulding)	9328ba8c45	feat: Use new extra loading info to load read buffer chunks into cache	2022-06-02 09:22:44 -04:00
Carol (Nichols \|\| Goulding)	054c25de50	refactor: Add more methods to DecodedParquetFile I'm tired of trying to remember which info is on which metadata.	2022-06-02 09:22:44 -04:00
Dom Dwyer	f8b83c5085	test: assert panic behaviour Modifies the existing test added as part of #4695 to ensure a panic is emitted when serialising an empty parquet file.	2022-06-01 16:55:53 +01:00
Dom Dwyer	0bfc11f4a1	refactor: always panic for empty parquet files Moves the panic into the child call to_parquet() so all code paths are covered (i.e. not serialising into memory via to_parquet_bytes()).	2022-06-01 16:54:36 +01:00
Dom Dwyer	5d74ae2ac1	test: fix test_metadata_from_parquet_metadata Changes the test_metadata_from_parquet_metadata test to embed the IOx metadata before asserting it can be read back.	2022-05-31 17:34:04 +01:00
Marco Neumann	988bd38e93	refactor: remove unused code (#4742 )	2022-05-31 11:36:02 +00:00
Marco Neumann	5a95da7327	refactor: do NOT use ANY file IO for parquet reading (#4741 )	2022-05-31 11:18:24 +00:00
Marco Neumann	2bf03e57bf	feat: limit tmp parquet file count and size (#4737 ) Fixes #4736. Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-05-31 08:56:15 +00:00
Marco Neumann	79c054ffc9	fix: do NOT block in parquet file IO (#4727 ) * fix: do NOT block in parquet file IO I think for historical reason we were using blocking IO to read parquet files. With the current streaming `SendableRecordStream` approach this is technically NOT required anymore. Now one might think that the sync-async dance that we did is kinda harmless, but looking at our producition querier I think it is really bad. The querier seems to be stuck but looking at `strace` and other health signal it seems it is not entirely dead. Looking at GDB backtraces it seems that nearly all threads are busy in `download_and_scan_parquet`. Looking at the tokio docs (<https://docs.rs/tokio/1.18.2/tokio/task/fn.spawn_blocking.html>) for `spawn_blocking` (which is used to start the sync download) this makes sense: tokio only starts replacement threads for the current runtime thread (which calls `spawn_blocking`) if this does NOT exceed the runtime thread limit. However we set the runtime thread limit to the number of CPU cores available to IOx, so this is a limiting factor. This means that there are only a few threads left to do actual work (I've seen postgres data flowing back and forth for example) but tokio is not able to use its full potential anymore. This is esp. bad because the sync code in `download_and_scan_parquet` then uses `futures` `block_on` functionality to call back into async code, so it waits for tokio itself. The change is rather simple: just use async task spawns. * fix: use async IO to write stream to temp file * fix: do not block tokio thread during parquet file reading * refactor: ensure parquet IO tasks are cancelled if they are not needed anymore There is no REAL way to cancel sync tasks, but at least we can try our best.	2022-05-30 13:32:20 +00:00
Andrew Lamb	9f21512296	chore: reduce `debug!` log spew in `parquet_file` (#4718 ) * chore: reduce log spew * chore: trace another overly verbose message Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-05-27 20:57:10 +00:00
kodiakhq[bot]	842ef8e308	Merge branch 'main' into cn/fetch-from-parquet-file	2022-05-27 17:08:28 +00:00
Nga Tran	16e7a6d596	test: test that hits panic becasue of no column meta data (#4719 ) * test: test that hits panic becasue of no column meta data * chore: Apply suggestions from code review * chore: run format after applying changes * chore: Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * chore: run clippy Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>	2022-05-27 15:27:03 +00:00
Andrew Lamb	dde3c3922c	refactor: use consistent spelling of serialize (#4717 )	2022-05-27 14:42:59 +00:00
Nga Tran	ea81152fac	refactor: add partition ID into debug info and panic earlier to identify the bug easier (#4716 ) * chore: point tests to the new ticket * chore: cleanup * refactor: add partition ID into debug info and panic earlier to identify the bug easier	2022-05-27 12:20:36 +00:00
Nga Tran	09b55a209d	chore: point tests to the new ticket (#4715 ) * chore: point tests to the new ticket * chore: cleanup	2022-05-27 11:12:55 +00:00
Nga Tran	372b262f37	test: parquet meta decoded tests and more debug info (#4713 ) * test: reproducer for 4695 * chore: some debug info * test: test with many columns and rows * chore: cleanup and add debug info * chore: cleanup * chore: cleanup * chore: more debug info	2022-05-27 09:53:07 +00:00
Carol (Nichols \|\| Goulding)	b2905650aa	refactor: Extract extract_range to be a method on TableSummary So that other kinds of chunks can use this code too.	2022-05-26 16:52:14 -04:00
Nga Tran	05151d5c69	test: reproducer for 4695 (#4706 ) * test: reproducer for 4695 * chore: Apply suggestions from code review Co-authored-by: Carol (Nichols \|\| Goulding) <193874+carols10cents@users.noreply.github.com> Co-authored-by: Carol (Nichols \|\| Goulding) <193874+carols10cents@users.noreply.github.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-05-26 15:32:30 +00:00
Dom Dwyer	6aa626ef84	refactor: retry object store upload Changes the Storage::upload() method to endlessly retry uploading the generated Parquet file.	2022-05-24 11:29:42 +01:00
Andrew Lamb	e877a64462	feat: Add `ParquetFiles` cache and memory size estimation for ParquetMetadata (#4661 ) * feat: Add `ParquetFiles` cache * fix: Apply suggestions from code review Co-authored-by: Marko Mikulicic <mkm@influxdata.com> * fix: remove commented out debugging println * refactor: Improve size calculation * fix: mark `ParquetFileCache::clear` test only * fix: assert on metric count Co-authored-by: Marko Mikulicic <mkm@influxdata.com>	2022-05-23 17:11:38 +00:00
Dom Dwyer	2e6c49be83	refactor: remove IoxMetadata min & max timestamp Removes the min/max timestamp fields from the IoxMetadata proto structure embedded within a Parquet file's metadata. These values are redundant as they already exist within the Parquet column statistics, and precluded streaming serialisation as these removed min/max values were needed before serialising the file.	2022-05-23 16:27:08 +01:00
Dom Dwyer	a142a9eb57	refactor: remove row_count from IoxMetadata Remove the redundant row_count from the IoxMetadata structure that is serialised into the Parquet file. The reasoning is twofold: * The Parquet file's native metadata already contains a row count * Needing to know the number of rows up-front precludes streaming	2022-05-23 16:18:35 +01:00
Dom Dwyer	71555ee55c	test: Parquet metadata integration test Adds two integration tests covering validation of the embedded IOx metadata within the Parquet file metadata, and validation of the derived ParquetFileParams metadata used to populate the catalog.	2022-05-23 16:17:56 +01:00
Dom Dwyer	af6d3f4d48	docs: remove clone ref comment	2022-05-23 11:46:06 +01:00
Dom Dwyer	00dc95829d	style: enable more lints Enable more lints on the parquet_file crate to keep it a little cleaner - adds the following: clippy::clone_on_ref_ptr, unreachable_pub, missing_docs, clippy::todo, clippy::dbg_macro This commit includes fixes for any new lint failures.	2022-05-20 15:17:40 +01:00
Dom Dwyer	7df7c4844c	refactor: remove redundant ParquetChunk errors Eliminates unused / refactors away unnecessary errors for the parquet::chunk module.	2022-05-20 15:17:40 +01:00
Dom Dwyer	661f8599a6	refactor: internalise Parquet path generation Derive the ParquetFilePath from the IoxMetadata within the ParquetStorage::read_filter() call. This prevents the "put/get RecordBatches" abstraction from leaking out the object store path generation concern - an implementation detail of the ParquetStorage layer.	2022-05-20 15:17:40 +01:00
Dom Dwyer	cdb341d45a	test: ParquetStorage upload() and read_filter() Adds tests for the previously untested (directly at least) Parquet (de)serialisation & persistence layer, provided by the ParquetStorage type.	2022-05-20 15:17:40 +01:00
Dom Dwyer	302301659e	refactor: derive ParquetFilePath from IoxMetadata Allow directly converting an IoxMetadata to a ParquetFilePath.	2022-05-20 15:17:40 +01:00
Dom Dwyer	b9a745d42d	feat: RecordBatch stream to Parquet file upload Implements an upload() method on the ParquetStorage type, consuming a stream of RecordBatch, serialising the Parquet file, and uploading the result to object storage. Returns the IOx-specific file metadata. Currently while the upload() method accepts a stream of RecordBatch, the actual resulting Parquet file is buffered in memory before uploading to object store, due to lack of streaming upload functionality in the ObjectStore abstraction - this isn't the end of the world, as the files tend to be relatively small with our current usage. This impl should be easily modified to be fully streaming once streaming object store puts are implemented: https://github.com/influxdata/object_store_rs/issues/9	2022-05-20 15:17:40 +01:00
Dom Dwyer	76e08d14a3	perf: IoxParquetMetaData direct from file metadata Construct a IoxParquetMetaData instance directly from the FileMetaData instance returned by the ArrowWriter. This change will allow us to avoid the inefficient impl currently in use: * Serialise batches into memory * Wrap buffer in arrow cursor * Read parquet metadata with arrow file reader * Serialise schema with thrift * Serialise each row group's metadata with thrift * Construct our own FileMetaData instance * Serialise FileMetaData with thrift * zstd encode resulting thrift bytes * Wrap in IoxParquetMetaData Now we "only": * Stream batches into opaque Write impl * Serialise FileMetaData with thrift * zstd encode resulting thrift bytes * Wrap in IoxParquetMetaData Then accessing any data within the IoxParquetMetaData (as before this change) requires deserialising it first. There are still a number of easy performance improvements to be had w.r.t the metadata handling.	2022-05-20 15:17:40 +01:00
Dom Dwyer	70856a645f	feat: streaming RecordBatch -> parquet encoding Implements a streaming RecordBatch to Parquet file serialiser. This impl automatically discovers the schema of the RecordBatch stream, and accepts &mut destination types (internalising the handle cloning/etc) to simplify caller usage. This encoder returns the resulting FileMetaData to allow callers to inspect the resulting metadata without reading back the file. Currently unused / not yet plumbed in.	2022-05-20 15:09:26 +01:00
Marco Neumann	addc45327e	fix: ensure that query tokio background tasks are canceled (#4643 ) * fix: ensure that query tokio background tasks are canceled While I am not entirely sure if this explains some of the memory leaks I am seeing in prod, not canceling the tasks correctly certainly makes debugging way harder and also renders certain form of throttling (e.g. max. concurrent queries) somewhat ineffective. Note that parquet file downloads are currently NOT canceled because tokios `spawn_blocking` cannot be canceled. * refactor: `Vec` -> `Option` * refactor: `spawn_blocking` creates a join handle, even though it is useless Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-05-20 07:18:52 +00:00
Dom Dwyer	baa86d846f	refactor: use ParquetStore instead of ObjectStore Changes the code paths that interact with Parquet files in the object store to reference the ParquetStorage directly (DRY refactor). This change takes us from a dependency graph of: ┌─────────────────┐ │ │ ▼ │ Parquet Consumer │ │ ┌──────────────┐ ├────────▶│ParquetStorage│ ▼ └──────────────┘ ┌──────────────┐ │ ObjectStore │ └──────────────┘ │ ┌────┴────┐ ▼ ▼ File s3 System (etc) to: Parquet Consumer │ ▼ ┌──────────────┐ │ParquetStorage│ └──────────────┘ │ ▼ ┌──────────────┐ │ ObjectStore │ └──────────────┘ │ ┌────┴────┐ ▼ ▼ File s3 System (etc) With the ParquetStorage being solely responsible for managing interactions with the object store when dealing with Parquet files.	2022-05-19 13:52:51 +01:00
Dom Dwyer	d3548653d5	refactor: rename Storage -> ParquetStorage Renames the Storage type so the context is clear in usage (i.e. fn args), rather than having to rely on knowing the fully-qualified import path to know what the type stores.	2022-05-19 13:51:07 +01:00
Dom Dwyer	e20b02b914	refactor: tidy ParquetChunk constructor Removes two unused constructors for a ParquetChunk, and moves the bare fn constructor that is actually used to be an associated method (a conventional constructor).	2022-05-19 13:51:07 +01:00
Dom Dwyer	7a8e6d1a38	refactor: remove unused max_row_group_size The Parquet writer references an unused max_row_group_size property in the parquet file metadata.	2022-05-18 16:45:15 +01:00
Andrew Lamb	3a33e806c7	chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `14.0.0` (#4619 ) * chore: Update datafusion deps * chore: update arrow/parquet/arrow flight deps * chore: Run cargo hakari tasks * chore: Update location of utils * chore: Update some more APIs Co-authored-by: CircleCI[bot] <circleci@influxdata.com>	2022-05-17 14:13:03 +00:00
Nga Tran	9530e73925	chore: move noisy debug to trace and fix some comments (#4598 ) * chore: move noisy debug to trace and fix some comments * chore: Apply suggestions from code review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * chore: fix format Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-05-13 19:18:15 +00:00
Raphael Taylor-Davies	f2bb0fdf77	feat: update to crates.io object_store version (#4595 ) * feat: update to crates.io object_store version * chore: Run cargo hakari tasks * fix: tests * chore: remove object store integration test plumbing Co-authored-by: CircleCI[bot] <circleci@influxdata.com>	2022-05-13 16:26:07 +00:00
Carol (Nichols \|\| Goulding)	55313d290a	fix: Update or remove comments that mention NG or OG Connects to #4450.	2022-05-12 16:09:08 -04:00
Raphael Taylor-Davies	8b379c83cc	refactor: simplify object_store path handling (#4534 ) * refactor: simplify object_store path handling * fix: aws integration tests * chore: lint * fix: update gcs tests * refactor: move errors into submodules * chore: lint * chore: review feedback * refactor: replace provider with Display * fix: failing tests Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-05-09 18:43:22 +00:00
Jake Goulding	e07bcd40c2	refactor: Remove unused dependencies These were found by iterating over all of the dependencies of each Cargo.toml, then grepping that crate for the dependency's name. If it didn't show up, I attempted to remove it. I left a few dependencies that this process flagged: * generated_types - `pbjson`,`serde`. Apparently used by the generated code. * grpc-router-test-gen - `prost`. Apparently used by the generated code. * influxdb_iox - `heappy`. Doesn't appear used, but is behind enough feature flags that I don't care to reason about and it's already optional. - `tikv_jemalloc_sys`. Appears to be setting a feature flag of an indirect dependency. * iox_gitops_adapter - `k8s_openapi`. Appears to be setting a feature flag of an indirect dependency.	2022-05-06 15:57:58 -04:00
Carol (Nichols \|\| Goulding)	068096e7e1	fix: Rename data_types2 to data_types	2022-05-06 14:45:39 -04:00
Carol (Nichols \|\| Goulding)	0541c6e40f	fix: Remove data_types crate where it's no longer used	2022-05-06 14:45:39 -04:00
Carol (Nichols \|\| Goulding)	d2671355c3	fix: Move partition metadata types to data_types2	2022-05-06 14:45:37 -04:00
Carol (Nichols \|\| Goulding)	ea46830954	fix: Remove iox_object_store crate; move ParquetFilePath to parquet_file	2022-05-06 14:45:36 -04:00
Carol (Nichols \|\| Goulding)	b4894c2b46	fix: Remove unused parts of parquet_file	2022-05-06 11:30:36 -04:00
Carol (Nichols \|\| Goulding)	ba8191c1eb	fix: Remove persistence_windows	2022-05-06 11:30:35 -04:00
Andrew Lamb	02893e598c	chore: Update datafusion and upgrade arrow/parquet/arrow-flight to 13 (#4516 ) * chore: Tool for automating arrow version update * chore: Update datafusion and arrow/parquet/arrow-flight * fix: update for changes in Arrow API Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-05-05 00:21:02 +00:00
dependabot[bot]	420c306caa	chore(deps): Bump tokio from 1.17.0 to 1.18.0 (#4453 ) Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.17.0 to 1.18.0. - [Release notes](https://github.com/tokio-rs/tokio/releases) - [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.17.0...tokio-1.18.0) --- updated-dependencies: - dependency-name: tokio dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2022-04-28 08:21:17 +00:00
二手掉包工程师	4b47d723b1	refactor: Rename time to iox_time (#4416 ) Signed-off-by: hi-rustin <rustin.liu@gmail.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-04-26 00:19:59 +00:00
Marco Neumann	86e8f05ed1	fix: make all catalog IDs 64bit (#4418 ) Closes #4365. Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-04-25 16:49:34 +00:00
Nga Tran	d963110842	feat: group chunk overlaps based on time range only (#4389 ) * feat: overlap for NG querier * chore: cleanup * refactor: address review comments * fix: typo Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-04-25 13:32:07 +00:00
Andrew Lamb	73bed810da	chore: Update arrow, arrow-flight, parquet, tonic, prost, etc (#4357 ) * chore: Update datafusion * chore: Update arrow/arrow-flight/parquet to 12 * chore: update datafusion correctly * chore: Update prost, tonic, and dependents * fix: Fixup some api changes * fix: Update test output in db * fix: Update test output in parquet_file * fix: remove old pbjson types * fix: Add "--experimental_allow_proto3_optional" flag * chore: Run cargo hakari tasks * fix: compile error * chore: Update heappy Co-authored-by: CircleCI[bot] <circleci@influxdata.com> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>	2022-04-20 11:12:17 +00:00
Carol (Nichols \|\| Goulding)	94dcde4996	fix: Do fewer queries for metadata By adding another _with_metadata catalog function. Also introduce a new type rather than passing around tuples everywhere.	2022-04-13 10:43:20 -04:00
Carol (Nichols \|\| Goulding)	02fee3b84f	feat: Request parquet metadata from the catalog when needed only	2022-04-13 10:43:19 -04:00

1 2 3 4 5 ...

530 Commits (b15aebbddcc1c43fc417472177df3736bd614f35)