* refactor: use new ingester<>querier wire protocol
Use and document the new and more flexible ingester<>querier wire
protocol.
Note that the ingester does NOT stream the response data yet, but the
internal data structures would allow that. A follow-up change will
adjust the ingester code to stream the data.
Ref #4849.
* fix: typos
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: clarify naming and public interface
* test: add schema assertion to `ingester_response_to_record_batches`
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: prepare new ingester<>querier protocol on the querier side
This changes the querier internals to work with the new protocol. The
wire protocol stays the same (for now). There's a (somewhat hackish)
adapter in place on the querier side that converts the old to the new
protocol on-the-fly. This is an intermediate step before we actually
change the wire protocol (and in a step after that also take advantage
of the new possibilities on the ingester side).
Ref #4849.
* docs: explain adapter
* chore: TEMP Update DataFusion to pre-release
* chore: update arrow et al to 16.0.0
* chore: Run cargo hakari tasks
* fix: update reader read_dictionary API
* chore: Update to real Datafusion release
* fix: Update parquet API
* fix: update test
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
This commit changes the code base to use a new reference-counted
PartitionKey type wrapper instead of passing a bare String around.
This allows the compiler to type check & verify usage of the partition
key. By reference counting the underlying string, we also reduce memory
usage for some use cases.
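A rough sketch of what such a reference-counted key wrapper could look like (the trait impls and names here are illustrative, not the exact IOx type):

```rust
use std::sync::Arc;

/// Illustrative sketch of a reference-counted partition key newtype.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PartitionKey(Arc<str>);

impl From<&str> for PartitionKey {
    fn from(s: &str) -> Self {
        Self(Arc::from(s))
    }
}

impl std::fmt::Display for PartitionKey {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str(&self.0)
    }
}

fn main() {
    let key = PartitionKey::from("2022-06-07");
    // Cloning only bumps a refcount instead of copying the string bytes.
    let cheap_copy = key.clone();
    assert_eq!(key, cheap_copy);
    println!("{key}");
}
```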
* feat: Change data type of catalog Postgres partition's sort_key from a string to an array of string
* test: add column with comma
* fix: use new protobuf field to avoid incompatibility
* fix: ensure sort_key is an empty array rather than NULL
* refactor: address review comments
* refactor: address more comments
* chore: clearer comments
* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql
* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql
* fix: Rename migration so it will be applied after
Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
* fix: do not return readable until a write is completely readable
* docs: Add diagram with partially buffered write
* refactor: account for actively buffering during update rather than fixup
* fix: fixup
* fix: use checked_sub
Co-authored-by: Marco Neumann <marco@crepererum.net>
* fix: checked_sub calculation
Co-authored-by: Marco Neumann <marco@crepererum.net>
Reduces memory usage in the ingester during persist operations by
streaming the results of the snapshot merge/sort/dedupe directly to
the parquet file.
Prior to this commit the output of the compaction was buffered in memory
before being written to the parquet file.
* test: "optimize" ingesterrecord batches in query tests
It seems that I had the right idea in #4656 but wasn't able to trigger
https://github.com/influxdata/conductor/issues/955 because the query
tests do not "optimize" the record batches in the same way the actual
gRPC implementation does. If we apply the same transformation we indeed
end up with the same error.
* fix: all batches within the ingester flight response must have same schema
* refactor: simplify and reuse code
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: enable debugging of failed querier->ingester requests
- extend `query-ingester` CLI to allow usage of predicates
- on failed requests: log all information that is required for the CLI
- test the "ingester fails" scenario
* test: explain
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: improve
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: move b64 pred. serde into a single crate
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Removes the min/max timestamp fields from the IoxMetadata proto
structure embedded within a Parquet file's metadata.
These values are redundant as they already exist within the Parquet
column statistics, and precluded streaming serialisation as these
removed min/max values were needed before serialising the file.
Remove the redundant row_count from the IoxMetadata structure that is
serialised into the Parquet file.
The reasoning is twofold:
* The Parquet file's native metadata already contains a row count
* Needing to know the number of rows up-front precludes streaming
Ok, so... this needed lots of... channels. Channels everywhere.
The stream method on TestWriteBufferStreamHandler previously assumed it
would only be called once. In a test where reset_to_earliest is called,
stream might be called again to get the reset stream.
We want to be able to control which of the streams gets which
operations, so that's why the macro now takes a vec of vec of
operations-- one vec of operations per expected call to stream, and the
stream will send all the operations in its vec.
The test thread needs to wait for the handler stream to consume the last
item from the last receiver stream, so when the
TestWriteBufferStreamHandler has set up the last expected call to
stream, pass back the last transmitter and have it wait until it's at
full expected capacity (which means all operations have been consumed by
the receiver).
The default behavior of the ingester is to panic if the min unpersisted
sequence number in the catalog is unknown to the write buffer due to the
retention policies having evicted that sequence number.
Specifying `--skip-to-oldest-available` changes this behavior to skip to
the oldest sequence number the write buffer does have available and go
from there.
Fixes #4624.
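A minimal sketch of that decision, with hypothetical names and types (the real ingester wiring is more involved):

```rust
/// Stand-in for the catalog / write buffer sequence number type.
#[derive(Debug, Clone, Copy)]
struct SequenceNumber(u64);

fn resolve_start_offset(
    min_unpersisted: SequenceNumber,
    oldest_available: SequenceNumber,
    skip_to_oldest_available: bool,
) -> SequenceNumber {
    if min_unpersisted.0 >= oldest_available.0 {
        // The catalog's min unpersisted sequence number is still retained
        // by the write buffer: start there as usual.
        min_unpersisted
    } else if skip_to_oldest_available {
        // Data was evicted by retention; skip forward instead of panicking.
        oldest_available
    } else {
        // Default behaviour: refuse to silently skip over lost data.
        panic!(
            "sequence number {} unknown to write buffer (oldest available: {})",
            min_unpersisted.0, oldest_available.0
        );
    }
}

fn main() {
    let start = resolve_start_offset(SequenceNumber(10), SequenceNumber(42), true);
    println!("starting replay at {}", start.0);
}
```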
Implements an upload() method on the ParquetStorage type, consuming a
stream of RecordBatch, serialising the Parquet file, and uploading the
result to object storage. Returns the IOx-specific file metadata.
Currently while the upload() method accepts a stream of RecordBatch, the
actual resulting Parquet file is buffered in memory before uploading to
object store, due to lack of streaming upload functionality in the
ObjectStore abstraction - this isn't the end of the world, as the files
tend to be relatively small with our current usage.
This impl should be easily modified to be fully streaming once streaming
object store puts are implemented:
https://github.com/influxdata/object_store_rs/issues/9
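A trimmed-down sketch of the upload() shape described above; the batch, metadata, and storage types below are stand-ins, not the real Arrow / object store APIs:

```rust
use futures::{stream, StreamExt};

/// Stand-in for arrow's RecordBatch; one byte per "row" for the sketch.
type RecordBatch = Vec<u8>;

#[derive(Debug)]
struct IoxMetadata {
    row_count: usize,
    file_size_bytes: usize,
}

struct ParquetStorage;

impl ParquetStorage {
    /// Drain the batch stream, serialise (here: concatenate) into an
    /// in-memory buffer, then "put" the buffer to the object store in one
    /// call. A fully streaming put would remove the intermediate buffer.
    async fn upload<S>(&self, mut batches: S) -> IoxMetadata
    where
        S: futures::Stream<Item = RecordBatch> + Unpin,
    {
        let mut buffer = Vec::new();
        let mut row_count = 0;
        while let Some(batch) = batches.next().await {
            row_count += batch.len();
            buffer.extend_from_slice(&batch);
        }
        // object_store.put(&path, buffer.into()).await would go here.
        IoxMetadata {
            row_count,
            file_size_bytes: buffer.len(),
        }
    }
}

#[tokio::main]
async fn main() {
    let batches = stream::iter(vec![vec![1u8, 2, 3], vec![4u8, 5]]);
    let meta = ParquetStorage.upload(batches).await;
    println!("{meta:?}");
}
```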
Changes the code paths that interact with Parquet files in the object
store to reference the ParquetStorage directly (DRY refactor).
This change takes us from a dependency graph of:
┌─────────────────┐
│ │
▼ │
Parquet Consumer │
│ ┌──────────────┐
├────────▶│ParquetStorage│
▼ └──────────────┘
┌──────────────┐
│ ObjectStore │
└──────────────┘
│
┌────┴────┐
▼ ▼
File s3
System (etc)
to:
Parquet Consumer
│
▼
┌──────────────┐
│ParquetStorage│
└──────────────┘
│
▼
┌──────────────┐
│ ObjectStore │
└──────────────┘
│
┌────┴────┐
▼ ▼
File s3
System (etc)
With the ParquetStorage being solely responsible for managing
interactions with the object store when dealing with Parquet files.
Renames the Storage type so the context is clear in usage (i.e. fn
args), rather than having to rely on knowing the fully-qualified import
path to know what the type stores.
* ci: fix cargo deny
* chore: downgrade `socket2`, version 0.4.5 was yanked
* chore: rename `query` to `iox_query`
`query` is already taken on crates.io and yanked and I am getting tired
of working around that.
Emit a TRACE level log containing the op offset & other helpful fields.
This will allow us to identify which messages were last successfully
decoded, and which caused errors so we can pull them for analysis.
Adds a histogram metric "flight_query_duration_ms" that records the
duration a flight RPC query takes to complete. Broken down by query
result (success/error).
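A toy sketch of the recording pattern (the real code uses the IOx `metric` crate, which is not shown here):

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hand-rolled stand-in for a duration histogram keyed by result label.
#[derive(Default)]
struct Histograms {
    samples_ms: HashMap<&'static str, Vec<u128>>,
}

impl Histograms {
    fn record(&mut self, result_label: &'static str, ms: u128) {
        self.samples_ms.entry(result_label).or_default().push(ms);
    }
}

fn run_flight_query() -> Result<usize, String> {
    Ok(3) // pretend three record batches were returned
}

fn main() {
    let mut hist = Histograms::default();

    let start = Instant::now();
    let result = run_flight_query();
    let elapsed_ms = start.elapsed().as_millis();

    // Broken down by query result (success/error), as described above.
    let label = if result.is_ok() { "success" } else { "error" };
    hist.record(label, elapsed_ms);

    println!("{:?}", hist.samples_ms);
}
```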
These were found by iterating over all of the dependencies of each
Cargo.toml, then grepping that crate for the dependency's name. If it
didn't show up, I attempted to remove it.
I left a few dependencies that this process flagged:
* generated_types
- `pbjson`, `serde`. Apparently used by the generated code.
* grpc-router-test-gen
- `prost`. Apparently used by the generated code.
* influxdb_iox
- `heappy`. Doesn't appear used, but is behind enough feature
flags that I don't care to reason about and it's already optional.
- `tikv_jemalloc_sys`. Appears to be setting a feature flag of an
indirect dependency.
* iox_gitops_adapter
- `k8s_openapi`. Appears to be setting a feature flag of an indirect
dependency.
* chore: Tool for automating arrow version update
* chore: Update datafusion and arrow/parquet/arrow-flight
* fix: update for changes in Arrow API
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: use stored sort key to deduplicate data
* refactor: verify if one is a super sort key of the other
* test: unit tests for scan and deduplication plans
* fix: typo
* refactor: refactor and add comments
* feat: cache partition sort key to read during planning as needed
* test: tests for query plans with different overlap groups
* chore: cleanup
* chore: resolve merge conflicts
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: improve `IngesterData` public interface
* feat: impl `Debug` for `Test{Namespace,Sequencer}`
* refactor: trait interface for `LifecyleHandle`
This is required to mock the lifecycle for query tests.
* refactor: trait for partitioner
* feat: add per kafka partition durability reporting to write info response
* fix: buf lint + test cleanup
* fix: clean up protobuf
* refactor: pull out conversion of KafkaPartitionStatus into a function
* fix: fmt
* fix: typo
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Attaching the "batch => partition" mapping via per-batch schema KV
metadata does NOT work because flight will transmit the schema once for
all batches (even though on the Rust side we have a schema ref attached
to every batch, probably for convenience). Instead we now use the same
global protobuf metadata that we also use for the "partition => max
sequence number" information. This somewhat limits our ability to create
record batches lazily on the ingester side (since the global metadata is
sent before any actual payload) but I think we should not modify the
usage of the flight protocol too much right now (e.g. by sending more
schema messages). If this becomes an issue, we can always find a more
complex solution in the future.
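For illustration only, a hypothetical and heavily simplified shape of such global metadata; the field names below are made up and not the actual protobuf:

```rust
/// The partition assignment for the whole batch stream travels once, up
/// front, instead of as per-batch schema key/value metadata (which Flight
/// only transmits once anyway).
#[derive(Debug)]
struct IngesterResponseMetadata {
    /// Partition IDs, in the order their record batches appear in the stream.
    batch_partition_ids: Vec<i64>,
    /// Per-partition max persisted sequence number.
    max_persisted: Vec<(i64, u64)>,
}

fn main() {
    let metadata = IngesterResponseMetadata {
        batch_partition_ids: vec![11, 11, 42],
        max_persisted: vec![(11, 1_000), (42, 998)],
    };
    // The n-th record batch in the Flight stream belongs to partition
    // metadata.batch_partition_ids[n].
    println!("{metadata:?}");
}
```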
* fix: return "not found" gRPC error instead of "internal" when ingester does not know table
* fix: properly handle "namespace not found" in ingester queries
* fix: make `initialize_db` work with async code
* test: add custom step for NG tests
* fix: handle "unknown table/namespace" resp. in querier
* docs: explain test setup
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: querier<>ingester flight protocol adjustments
This makes a few adjustments to the querier<>ingester flight protocol.
Query Scope
===========
The querier will request data for ALL sequencer IDs for now. There is
no reason to have a request per sequencer ID. We can add a range/set
filter later if we want, but this is not required for now.
Partition-level
===============
The only time when the querier cares about sequencer IDs (i.e. sharding)
at all is when it selects which ingesters to ask for unpersisted data
(this is currently not implemented, it just asks all ingesters).
Afterwards the querier only cares about partitions (which are bound to
specific sequencers anyways) because this is the level where parquet
file persistence and compaction as well as deduplication happen. So we
make partitions a first-class citizen in the ingester response.
Metadata VS RecordBatches
=========================
The global app-metadata will list all partitions and their max
persisted parquet files and tombstones (theoretically tombstones are at
table-level, but the ingester could in the future break them down to the
partition-level). Then it receives a stream of record batches. Each
record batch is tagged (via key-value metadata in its schema) so it can
be assigned to a partition. At the moment the ingester returns 0 or 1
batches per unpersisted partition (0 in case we've filtered out all the
data via the predicate), but in the future it is free to return multiple
batches. This setup gives the ingester more freedom over memory
management and (potentially parallel) query processing, while at the
same time keeps the set of duplicated information minimal and allows
easy extensions (since the global metadata is a full-blown protobuf
message).
Querier
=======
At the moment the querier ignores all the metadata. Follow-up PRs will
change that.
* docs: improve
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: make code clearer
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Removes the old stream_in_sequenced_entries() write buffer handler,
replacing it with the SequencedStreamHandler introduced in #4203.
This change will affect the metrics emitted by an ingester as outlined
in #4243.
Removes the Sync bound from the SequencedStreamHandler input stream type, as the
BoxStream returned by the WriteBufferStreamHandler is not Sync.
This change means the SequencedStreamHandler is not Sync either, but is
still Send and therefore can be moved into tokio tasks.
This commit adds an adaptor (IngestSinkAdaptor) that provides a DmlSink
implementation for the existing write path (IngesterData). With this,
the existing write path becomes compatible with the new
op stream handler (SequencedStreamHandler).
This commit adds the SinkInstrumentation type that decorates an inner
DmlSink with call latency and write buffer metrics.
The write buffer / sink call metrics may be split apart into two
separate responsibilities in the future if there are multiple DmlSink
implementations that need instrumentation, but we defer adding more
types until they are needed.
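A condensed sketch of that adaptor/decorator layering; the type names follow the description above, but the trait shape and bodies are illustrative only:

```rust
use std::time::Instant;

/// Simplified stand-in for the DmlSink trait.
trait DmlSink {
    fn apply(&self, op: &str) -> Result<(), String>;
}

/// Stand-in for the existing write path (`IngesterData`).
struct IngesterData;

/// IngestSinkAdaptor: makes the existing write path look like a DmlSink.
struct IngestSinkAdaptor {
    inner: IngesterData,
}

impl DmlSink for IngestSinkAdaptor {
    fn apply(&self, op: &str) -> Result<(), String> {
        let _ = &self.inner;
        println!("buffering op: {op}");
        Ok(())
    }
}

/// SinkInstrumentation: wraps any DmlSink and records call latency.
struct SinkInstrumentation<T> {
    inner: T,
}

impl<T: DmlSink> DmlSink for SinkInstrumentation<T> {
    fn apply(&self, op: &str) -> Result<(), String> {
        let start = Instant::now();
        let res = self.inner.apply(op);
        println!("sink call took {:?}", start.elapsed());
        res
    }
}

fn main() {
    let sink = SinkInstrumentation {
        inner: IngestSinkAdaptor { inner: IngesterData },
    };
    sink.apply("write table=cpu ...").unwrap();
}
```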
* feat: Add `SequencerProgress` reporting to ingester
* refactor: Use KafkaPartition in write_summary
* fix: Update docstrings
* refactor: Change ingester to use KafkaPartition everywhere
* refactor: add SequencerProgress::combine
* refactor: return new SequencerProgress rather than updating
* fix: distinguish between yes/no/unknown in WriteSummary
* docs: Update data_types2/src/lib.rs
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Pass the sort key from the catalog through to compact_persisting_batch.
If the sort key is Some, use that. If the sort key is None, compute it
from the data's cardinality with compute_sort_key.
Connects to #4196.
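A small sketch of that decision; `compute_sort_key` below is a placeholder for the real cardinality-based computation:

```rust
/// Placeholder: the real code orders columns by ascending cardinality,
/// with the time column last.
fn compute_sort_key(columns: &[&str]) -> Vec<String> {
    columns.iter().map(|c| c.to_string()).collect()
}

fn sort_key_for_persist(
    catalog_sort_key: Option<Vec<String>>,
    columns: &[&str],
) -> Vec<String> {
    match catalog_sort_key {
        // The catalog already knows the partition's sort key: use it.
        Some(key) => key,
        // Otherwise derive one from the data's cardinality.
        None => compute_sort_key(columns),
    }
}

fn main() {
    let key = sort_key_for_persist(None, &["host", "region", "time"]);
    println!("{key:?}");
}
```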
Adds the PeriodicWatermarkFetcher type responsible for querying write
buffer / Kafka for the maximum sequence number / offset, surfacing any
errors via both logs & metrics.
This high watermark / max offset value is used within the ingest
instrumentation metrics. This use case is tolerant of caching / stale
values, and as such the value is periodically updated to minimise load
on the write buffer.
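A rough sketch of the fetch-and-cache pattern (the real fetcher talks to the write buffer and reports errors via logs & metrics; the fetch below is faked):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

/// Poll the write buffer on an interval and cache the last observed high
/// watermark so readers never block on Kafka.
struct PeriodicWatermarkFetcher {
    cached: Arc<AtomicU64>,
}

impl PeriodicWatermarkFetcher {
    fn new(interval: Duration) -> Self {
        let cached = Arc::new(AtomicU64::new(0));
        let cloned = Arc::clone(&cached);
        tokio::spawn(async move {
            loop {
                // Stand-in for querying the write buffer's max offset.
                let watermark = 42;
                cloned.store(watermark, Ordering::Relaxed);
                tokio::time::sleep(interval).await;
            }
        });
        Self { cached }
    }

    /// Possibly stale, but cheap to read.
    fn watermark(&self) -> u64 {
        self.cached.load(Ordering::Relaxed)
    }
}

#[tokio::main]
async fn main() {
    let fetcher = PeriodicWatermarkFetcher::new(Duration::from_secs(10));
    tokio::time::sleep(Duration::from_millis(10)).await;
    println!("cached watermark: {}", fetcher.watermark());
}
```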
Instruments the SequencedStreamHandler with a series of new metrics that
record the various error classes observable in the stream handler.
These metrics are labelled with potential_data_loss=true where relevant
to surface potential data loss events for alerting & further review.
Refactors the stream_in_sequenced_entries() into a new impl in the
SequencedStreamHandler type, decoupling the reading / decoding of ops
from Kafka (and associated error handling) from the "what happens to
those ops" concern to ease testing, encapsulate the specifics of "how to
get an op" and improve flexibility.
This is intended to provide robust error handling within what is
reasonably possible (unexpected errors are always unexpected!) while
retaining the existing metrics and functionality. I've also separated
out code that exists in the current impl specifically to drive tests
from the prod code path, instead driving those behaviours through mocks.
As of this commit, the handler is not used - this commit simply adds the
new impl.
Fix the ingester to track the max persisted sequence number per partition.
Ensure replay takes in data from unpersisted partitions.
Simplify the table persist info to not return a max persisted sequence number for the table as that information isn't needed.
Min/max values and distinct counts are already optional, so let's make
the null counts optional as well. This will be helpful for NG to deal w/
partial statistics (e.g. we only populate stats for the time column).
Note that the total count is still mandatory, but we normally have the
chunk/file-level row count at hand.
Set to_delete to the time the file was marked as deleted rather than
true.
Fixes #4059.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The sort key is optional and currently only produced by `iox_tests`.
Writing it within the ingester/compactor is tracked by #3968. The sort
key is read by the querier (and this will be verified by the query tests
and is required to merge #4103).
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Removed some unnecessary tests as they no longer apply with the new buffer structure. This will hopefully reduce the memory footprint of the ingesters significantly.
Closes #4072
This makes it way easier to dyn-type database implementations. The only
real change is that we make `QueryChunk::Error` opaque. Nobody is going
to inspect that anyways, it's just printed to the user.
This is a follow-up of #4053.
Ref #3934.
Emit a counter metric "ingest_paused_duration_ms_total" that records the
duration of time an ingester stream is paused with millisecond
granularity.
This metric will allow us to measure the frequency and severity of, and
alert on, an ingester stopping ingest due to memory limits enforced by
the LifecycleManager. This will help us tune these config params.
Changes all consumers of the object store to use the dynamically
dispatched DynObjectStore type, instead of using a hardcoded concrete
implementation type.
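A minimal sketch of the dynamic-dispatch pattern; the real `ObjectStore` trait lives in the object_store crate and is async, unlike this stand-in:

```rust
use std::sync::Arc;

/// Stand-in for the object_store trait.
trait ObjectStore: Send + Sync {
    fn name(&self) -> &'static str;
}

/// Alias consumers depend on instead of a concrete implementation type.
type DynObjectStore = dyn ObjectStore;

struct InMemory;
impl ObjectStore for InMemory {
    fn name(&self) -> &'static str {
        "memory"
    }
}

fn consumer(store: Arc<DynObjectStore>) {
    println!("using object store: {}", store.name());
}

fn main() {
    let store: Arc<DynObjectStore> = Arc::new(InMemory);
    consumer(store);
}
```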
* feat: initial implementation of compact a given list of overlapped parquet files
* feat: Add QueryableParquetChunk and some refactoring
* feat: build queryable parquet chunks for parquet files with tombstones
* feat: second half the implementation for Compactor's compact. Tests will be next
* fix: comments for trait functions of QueryChunkMeta
* test: add tests for compactor's compact function
* fix: typos
* refactor: address Jake's review comments
* refactor: address Andrew's comments and add one more test for files in different order in the vector
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
When created in the catalog, parquet files should always have compaction
level 0. Updating the compaction level should always happen in the
compactor.
Only the catalog should need to know about the initial compaction level
value.
This has the advantages of:
- Not needing to create fake parquet file IDs or fake deleted_at
values that aren't used by create before insertion
- Not needing too many arguments for create
- Naming the arguments so it's easier to see what value is what
argument, especially in tests
- Easier to reuse arguments or parts of arguments by using copies of
params, which makes it easier to see differences, especially in tests
This commit splits the API of the LifecycleManager into two:
* LifecycleManager: singleton responsible for evaluating partitions
and running persist tasks.
* LifecycleHandle: a handle used by each sequencer's ingester(s) to update
the global LifecycleManager state when applying ops.
This keeps the accessible API & responsibilities of each caller distinct
and allows us to leverage the type system to enforce linearisation of
calls to LifecycleManager::maybe_persist() without resorting to an
(unnecessary) mutex guard for serialisation.
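A sketch of that split with illustrative internals (the real manager tracks per-partition state rather than a single byte counter):

```rust
use std::sync::{Arc, Mutex};

/// Shared state updated by handles and read by the manager.
#[derive(Default)]
struct LifecycleState {
    bytes_buffered: usize,
}

/// Cheap, cloneable handle: one per sequencer's ingest path.
#[derive(Clone)]
struct LifecycleHandle {
    state: Arc<Mutex<LifecycleState>>,
}

impl LifecycleHandle {
    /// Called when applying an op.
    fn log_write(&self, bytes: usize) {
        self.state.lock().unwrap().bytes_buffered += bytes;
    }
}

/// Singleton: only the manager evaluates partitions and triggers persists.
struct LifecycleManager {
    state: Arc<Mutex<LifecycleState>>,
    persist_threshold: usize,
}

impl LifecycleManager {
    fn new(persist_threshold: usize) -> Self {
        Self {
            state: Arc::default(),
            persist_threshold,
        }
    }

    fn handle(&self) -> LifecycleHandle {
        LifecycleHandle {
            state: Arc::clone(&self.state),
        }
    }

    /// Taking `&mut self` means callers cannot overlap maybe_persist() calls.
    fn maybe_persist(&mut self) -> bool {
        let mut state = self.state.lock().unwrap();
        if state.bytes_buffered >= self.persist_threshold {
            state.bytes_buffered = 0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut manager = LifecycleManager::new(1024);
    let handle = manager.handle();
    handle.log_write(2048);
    assert!(manager.maybe_persist());
}
```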
This includes a bit of a refactor in the locking structure of the buffer data. Locking at the partition collection and within the partition data was making things more complex than they needed to be. The partitions in the buffer are there only temporarily until they get persisted. Locking on the table simplifies things a bit and makes it more clear when the table state is being modified since it no longer has any interior mutability. Having access to separate partitions without the same lock isn't something we need because queries will hit all partitions and data is brought in sequentially, regardless of which partition it is hitting in a sequencer.
Fixes #3850
Uses the new ColumnRepo::create_or_get_many() catalog method to perform
a bulk upsert of (potentially) new columns to the catalog during schema
validation.
* refactor: wire execution context to Deduplicator
* feat: example trace to chunk read_filter
* refactor: make execution context required
* refactor: expose metadata API
* refactor: more span context for chunk read_filter
* refactor: fix build
* refactor: push context into result stream
* refactor: make executor optional
* feat: detach dedicated exec jobs
* feat: async `DedicatedExecutor::join`
Now `DedicatedExecutor` follows the system we use for other server
components:
- `shutdown`: a quick sync call that signals the shutdown but doesn't
drop
- `join`: async awaits until the executor has finished shutdown
- `drop`: warn but still try to shut down
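A rough sketch of those three behaviours, using a oneshot channel as the shutdown signal; the real DedicatedExecutor drives a separate thread/runtime, which is omitted here:

```rust
use std::sync::{Arc, Mutex};
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

struct DedicatedExecutor {
    shutdown_tx: Mutex<Option<oneshot::Sender<()>>>,
    worker: Mutex<Option<JoinHandle<()>>>,
}

impl DedicatedExecutor {
    fn new() -> Arc<Self> {
        let (tx, rx) = oneshot::channel();
        let worker = tokio::spawn(async move {
            // Run until the shutdown signal arrives.
            let _ = rx.await;
        });
        Arc::new(Self {
            shutdown_tx: Mutex::new(Some(tx)),
            worker: Mutex::new(Some(worker)),
        })
    }

    /// Quick, synchronous: only signals shutdown, does not wait.
    fn shutdown(&self) {
        if let Some(tx) = self.shutdown_tx.lock().unwrap().take() {
            let _ = tx.send(());
        }
    }

    /// Async: waits until the executor has actually finished shutting down.
    async fn join(&self) {
        let handle = self.worker.lock().unwrap().take();
        if let Some(handle) = handle {
            let _ = handle.await;
        }
    }
}

impl Drop for DedicatedExecutor {
    fn drop(&mut self) {
        if self.shutdown_tx.lock().unwrap().is_some() {
            // Dropped without an explicit shutdown: warn, but still signal.
            eprintln!("DedicatedExecutor dropped without shutdown");
            self.shutdown();
        }
    }
}

#[tokio::main]
async fn main() {
    let exec = DedicatedExecutor::new();
    exec.shutdown();
    exec.join().await;
}
```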
* test: improve `detach_receiver` test
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: changes needed to apply tombstones correctly on the life-cycle ingest batches
* refactor: adjust the design after discussing with Paul
* feat: apply the incoming tombstone on all data but the persisting one
* chore: fmt
* fix: build on buffer tombstone
* test: delete & write tests for a parition and some cleanup
* feat: No need to add processed tombstones for a newly created parquet file in the ingester because all deletes issued before that parquet file was created have already been applied
* chore: cleanup
* feat: initial implementation for preparing data to send back to the Querier
* feat: full implementation of prepare_data_to_querier
* fix: apply filters for the batches
* chore: Apply suggestions from code review
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
* chore: cleanup
* fix: typos in comments
* fix: typos in comments
* fix: typos in comments
* test: create different scenarios and test them
* chore: fix typos
* test: add tests with deletes
* chore: make pub pub(crate)
* chore: Apply suggestions from code review
Co-authored-by: Jake Goulding <jake.goulding@integer32.com>
* refactor: address review comments
* fix: keep batches in their arrival order
* refactor: do not assign unnecessary values to enum
* refactor: use bitflags enum
* fix: use bitflags correctly
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: avoid using `use` at the end of the function
* chore: merge main to branch
* fix: fix downgrade versions
* refactor: address review comments
* chore: remove unnecessary comments
* refactor: Make the whole test_utils module test-only and bring paths into module scope
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: Jake Goulding <jake.goulding@integer32.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
I'm seeing some panics in our test bench, but the ingester happily
continues and thinks it persisted tasks even though it didn't. Let's at
least bail out if a persist task fails.
A quick change to perform the ColumnRepo::create_or_get() calls in
parallel (up to a maximum of 3 in-flight at any one time) in order to
mitigate the latency of the call and reduce the overall schema
validation call duration.
The in-flight limit is enforced to avoid starving the DB connection pool
of connections.
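A sketch of that bounded-concurrency pattern; the catalog call below is a stand-in, not the real ColumnRepo API:

```rust
use futures::{stream, StreamExt};

/// Stand-in for the per-column catalog round trip.
async fn create_or_get(column: &str) -> Result<u64, String> {
    Ok(column.len() as u64)
}

/// Issue the per-column calls in parallel, but never more than 3 in flight,
/// so the DB connection pool is not starved.
async fn upsert_columns(columns: Vec<&str>) -> Result<Vec<u64>, String> {
    const MAX_IN_FLIGHT: usize = 3;

    stream::iter(columns)
        .map(|column| async move { create_or_get(column).await })
        .buffered(MAX_IN_FLIGHT) // preserves input order
        .collect::<Vec<_>>()
        .await
        .into_iter()
        .collect()
}

#[tokio::main]
async fn main() {
    let ids = upsert_columns(vec!["host", "region", "time"]).await.unwrap();
    println!("{ids:?}");
}
```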
It's a bit of a duck-type hack, but if we wanna just reuse `ParquetFileChunk`
in the new architecture, we somehow need it to accept new-gen paths.
Also path handling should be somewhat centralized since
ingester/compactor/querier all need to construct them. So having a
`ParquetFilePath` that supports both path styles seems to be a
not-too-bad solution. This should obviously be cleaned up in some
not-too-distant future.
Fixes #3702. This pulls the min sequence tracking into the LifecycleManager. Because the number requires looking at all other partitions in memory, this was the most efficient place to put it. The manager updates the sequencer state after it calls persist. The number is meant to be a lower bound on the sequence number. Issue #3783 will add functionality for the ingester to ignore replayed data that has already been persisted.
* feat: changes needed to apply tombstones correctly on the life-cycle ingest batches
* refactor: adjust the design after discussing with Paul
* feat: apply the incoming tombstone on all data but the persisting one
* chore: fmt
* fix: build on buffer tombstone
* test: delete & write tests for a parition and some cleanup
* feat: No need to add processed tombstones for a newly created parquet file in the ingester because all deletes issued before that parquet file was created have already been applied
* chore: cleanup
Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: Adjust fields of IngesterQueryResponse
* feat: Adjust IngestHandler query method to call prepare_data_to_querier
* feat: Send ingest query result data back through Flight doGet
* feat: Send delete predicates and max sequencer number in metadata
* fix: greater_than_sequence_number should be of type SequenceNumber
* fix: Remove DeletePredicates from IngesterQueryResponse
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This adds persistence into the ingester with a lifecycle manager. The persist operation must still be updated to keep track of the min_unpersisted_sequence_number for each sequencer.
* feat: initial implementation of the Query Plan that queries QueryableBatch with filters
* fix: read_filter of QueryableBatch should provide the schema of the columns/projection it needs
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: address review comment
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: allow catalog access w/o a transaction
Now the caller has full control over whether they want to use a transaction or
not.
* fix: remove non-transaction-safe `create_many`
* fix: remove unnecessary transactions
* feat: projection pushdown for QueryableBatch
* chore: clean up and remove unwrap
* fix: Add Sync to a Snafu source to have the code compile
* chore: cleanup and add comments for tests
* refactor: Add tests for scanning non existing columns and fix related bugs
* chore: modify comment to trigger auto check in GitHub workflow
* feat: Add a way to run ingester with an in-memory catalog from the CLI
If you set the --catalog-dsn string to "mem", rather than using that as
a Postgres connection URL, create an in-memory catalog.
Planning on using this in tests, so not documenting.
* fix: Set default topic to the same value as SHARED_KAFKA_TOPIC
Namely, both should use an underscore. I don't think there's a way to
directly share these values between a constant and an annotation.
* feat: Add a flight API (handshake only) to ingester
* fix: Create partitions if using file-based write buffer
* fix: Change the server fixture to handle ingester server type
For now, the ingester doesn't implement the deployment API. Not sure if
it should or not.
* feat: Start implementing ingester do_get, namely decoding the query
Skip serialization of the predicate for the moment.
* refactor: Rename ingest protos to ingester to match crate name
* refactor: Rename QueryResults to QueryData
* feat: Move ingester flight client to new querier crate
* fix: Off by one error, different starting indexes in sequencers
* fix: Create new CLI argument to pick the catalog type
* fix: Create a CLI option to set the number of topics to auto-create in the write buffer
* fix: Check the arrow flight service's health to tell that the ingester gRPC is up
* fix: Set postgres as the default catalog type
* fix: Return an error rather than panicking if CLI args aren't right
This adds the lifecycle manager to the ingester. It will trigger based on a threshold for max partition size or age or based on keeping total memory under a certain threshold.
It defines a new interface for a persister, which is stubbed out for IngesterData. I'm not sure yet how persistence errors should be handled. The assumption here is that the persister continues to retry persistence forever until it succeeds.
There is one scenario I can think of that may cause this lifecycle manager problems. If a single partition is very high throughput, it could cause things to back up as persistence is not parallelized within a single partition. Any given partition can currently only run one persistence operation at a time. We can address this later.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: catalog Unit of Work (= transaction)
Set up an interface to handle Units of Work within our catalog. Previously
both the Postgres and the in-mem backend used "mini-transactions on
demand". Now the caller has a clear way to establish boundaries and
gets read and write isolation. A single `Arc<dyn Catalog>` can create as
many `Box<dyn UnitOfWork>` as you like, but note that depending on the
backend you may not scale infinitely (postgres will likely impose
certain limits and the in-mem backend limits concurrency to 1 to keep
things simple).
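A synchronous, simplified sketch of that catalog/transaction split (the real traits are async and fallible):

```rust
use std::sync::Arc;

/// Stand-in for the transaction (Unit of Work) interface.
trait Transaction {
    fn create_namespace(&mut self, name: &str);
    fn commit(self: Box<Self>);
    fn abort(self: Box<Self>);
}

/// A single shared catalog can hand out many boxed transactions,
/// each with its own read/write isolation.
trait Catalog: Send + Sync {
    fn start_transaction(&self) -> Box<dyn Transaction>;
}

struct MemTxn {
    staged: Vec<String>,
}

impl Transaction for MemTxn {
    fn create_namespace(&mut self, name: &str) {
        self.staged.push(name.to_string());
    }
    fn commit(self: Box<Self>) {
        println!("committing {} staged changes", self.staged.len());
    }
    fn abort(self: Box<Self>) {
        println!("dropping {} staged changes", self.staged.len());
    }
}

struct MemCatalog;

impl Catalog for MemCatalog {
    fn start_transaction(&self) -> Box<dyn Transaction> {
        Box::new(MemTxn { staged: Vec::new() })
    }
}

fn main() {
    let catalog: Arc<dyn Catalog> = Arc::new(MemCatalog);
    let mut txn = catalog.start_transaction();
    txn.create_namespace("my_org_my_bucket");
    txn.commit();
}
```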
* docs: improve wording
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: rename Unit of Work to Transaction
* test: improve `test_txn_isolation`
* feat: clarify transaction drop semantics
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
With this change write buffer ingestion metrics are showing up under
`/metrics`
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: improve write buffer consumer interface
The change looks huge but is actually rather simple. To
understand the interface change, let me first explain what we want:
- be able to fetch watermarks for any sequencer
- have streams:
- each stream tracks a sequencer and has an offset state (no read
multiplexing)
- we can seek a stream
- seeking and streaming cannot be done at the same time (that would be
weird and would likely lead to many bugs both in the write buffer and in the
user code)
- ideally we don't need to create streams of all sequencers but can
choose a subset
Before this change we had one mutable consumer struct where you can get
all streams and watermark functions (this mutable-borrows the consumer)
or you can seek a single stream (this also mutable-borrows the
consumer). This is a bit weird for multiple reasons:
- you cannot seek a single stream without dropping all of them
- the mutable-borrow construct makes it really difficult to pass the
streams into separate threads
- the consumer is boxed (because it's mutable) which makes it more
difficult to handle in a large-scale application
What this change does is the following:
- you have an immutable consumer (similar to the producer)
- the consumer offers the following methods:
- get the set of sequencer IDs
- get watermark for any sequencer
- get a stream handler (see next point) for any sequencer
- the stream handler captures the stream state (offset) and provides you
a standard `Stream<_>` interface as well as a seek function.
Mutable-borrows ensure that you cannot use both at the same time.
The stream handler provides you the stream via `handler.stream()`. It
doesn't implement `Stream<_>` itself because the way boxing, dynamic
dispatch work, and pinning interact (i.e. I couldn't get it to work
without the indirection).
As a bonus point (which we don't use however) you can now create
multiple streams for the same sequencer and they all have their own
offset.
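A condensed, synchronous sketch of that interface; the real consumer is async and the handler streams DML operations rather than plain offsets:

```rust
use std::collections::BTreeMap;

/// Immutable consumer (similar to the producer).
struct WriteBufferConsumer {
    // Watermark per sequencer, as a stand-in for talking to Kafka.
    watermarks: BTreeMap<u32, u64>,
}

/// Each handler owns its own offset state.
struct WriteBufferStreamHandler {
    sequencer_id: u32,
    offset: u64,
}

impl WriteBufferConsumer {
    fn sequencer_ids(&self) -> Vec<u32> {
        self.watermarks.keys().copied().collect()
    }

    /// Watermark lookups need only `&self`: the consumer stays immutable.
    fn fetch_watermark(&self, sequencer_id: u32) -> Option<u64> {
        self.watermarks.get(&sequencer_id).copied()
    }

    /// Multiple handlers per sequencer are possible and independent.
    fn stream_handler(&self, sequencer_id: u32) -> WriteBufferStreamHandler {
        WriteBufferStreamHandler { sequencer_id, offset: 0 }
    }
}

impl WriteBufferStreamHandler {
    /// Streaming mutable-borrows the handler; here the "stream" is just a
    /// few fake offsets.
    fn stream(&mut self) -> impl Iterator<Item = u64> {
        let start = self.offset;
        self.offset += 3;
        start..start + 3
    }

    /// Seeking also needs `&mut self`, keeping seek and stream mutually
    /// exclusive via the borrow checker.
    fn seek(&mut self, offset: u64) {
        self.offset = offset;
    }
}

fn main() {
    let consumer = WriteBufferConsumer {
        watermarks: BTreeMap::from([(0, 10), (1, 42)]),
    };
    println!("sequencers: {:?}", consumer.sequencer_ids());
    println!("watermark(1): {:?}", consumer.fetch_watermark(1));

    let mut handler = consumer.stream_handler(1);
    handler.seek(40);
    for op in handler.stream() {
        println!("sequencer {} -> offset {op}", handler.sequencer_id);
    }
}
```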
* fix: review comments
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This adds the scaffolding for the ingester server to consume data from Kafka. This ingests data in an in memory structure while creating records in the catalog for any partitions that don't yet exist.
I've removed catalog_update.rs in ingester for now. That was mostly a placeholder and will be going in a combination of handler.rs and data.rs on my next PR which will have some primitive lifecycle wired up.
There's one ugly bit here where the DML write is cloned because it's getting borrowed to output spans and metrics. I'll need to follow up with a refactor to make it so that the DML write's tables can be consumed without it gumming up the metrics stuff.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
If there aren't any record batches, there isn't any metadata, and vice
versa. Make this relationship clearer by putting the Option around both
the vec of record batches and the metadata.
* feat: Implement a snapshot method on DataBuffer
Fixes#3510.
* test: Add a test snapshotting batches with different but compatible schemas
* fix: Simplify min/max sequencer number collection
The first batch should always have the min sequencer number. The last
batch should always have the max sequencer number. The min should always
be less than (or equal to, in case there's only one batch) the max.
* refactor: have the deduplication work without chunk statistics
* test: more tests for duplicates data on different combinations of record batches
* refactor: address review comments
This updates the catalog API to make it easier to work with for consumers. I also found a bug in the MemCatalog implementation while refactoring the tests to work with the new API definition. Consumers will now be able to Arc wrap the catalog and use it across awaits.