Commit Graph

571 Commits (b521c68eefdea481a1ba7b1390de67baa850e89b)

Author SHA1 Message Date
Marco Neumann 66c7d95312
refactor: use new ingester<>querier wire protocol (#4867)
* refactor: use new ingester<>querier wire protocol

Use and document the new and more flexible ingester<>querier wire
protocol.

Note that the ingester does NOT stream the response data yet, but the
internal data structures would allow that. A follow-up change will
adjust the ingester code to stream the data.

Ref #4849.

* fix: typos

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* refactor: clarify naming and public interface

* test: add schema assertion to `ingester_response_to_record_batches`

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-06-16 08:02:28 +00:00
Andrew Lamb 6b771375bf
feat: log when partitions are written due to going over size (#4868) 2022-06-15 20:12:43 +00:00
Dom Dwyer 4df2964566 refactor: store PartitionKey in DmlWrite
Carry the PartitionKey in the DmlWrite, allowing the batch to be
associated with a specific partition key.
2022-06-15 15:48:54 +01:00
Marco Neumann 7c60edd38c
refactor: prepare new ingester<>querier protocol on the querier side (#4863)
* refactor: prepare new ingester<>querier protocol on the querier side

This changes the querier internals to work with the new protocol. The
wire protocol stays the same (for now). There's a (somewhat hackish)
adapter in place on the querier side that converts the old to the new
protocol on-the-fly. This is an intermediate step before we actually
change the wire protocol (and in a step after that also take advantage
of the new possibilites on the ingester side).

Ref #4849.

* docs: explain adapter
2022-06-15 14:32:24 +00:00
Andrew Lamb 005610b172
refactor: remove some `&` use in iox_catalog (#4862)
* refactor: remove some `&` use in iox_catalog

* fix: Update data_types/src/lib.rs
2022-06-15 11:31:49 +00:00
Nga Tran b682dbbc2e
chore: Add debug info of sort_key for ingester (#4859)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-14 20:39:17 +00:00
Andrew Lamb c8f70b8933
feat: log query from querier to ingester at `info` level (#4856)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-14 18:35:50 +00:00
Andrew Lamb eca3b6b9a1
fix: reduce memory usage in ingester with less buffering prior to query engine (#4830)
* refactor: remove another buffer copy in ingester

* docs: Update arrow_util/src/util.rs

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-14 18:22:55 +00:00
Andrew Lamb 7d2a5c299f
refactor: remove one buffer copy in the ingester (#4855)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-14 17:15:36 +00:00
Andrew Lamb e91d00b10c
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `16.0.0 (#4851)
* chore: TEMP Update DataFusion to pre-release

* chore: update arrow et al to 16.0.0

* chore: Run cargo hakari tasks

* fix: update reader read_dictionary API

* chore: Update to real Datafusion release

* fix: Update parquet API

* fix: update test

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-06-14 16:31:40 +00:00
Andrew Lamb 34e8659876
refactor: consolidate plan creation from `QueryChunk`s in `iox_query` (#4837)
* refactor: consolidate plan creation from Chunks

* docs: update docstrings

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-14 14:36:07 +00:00
Dom Dwyer b41ea1d718 refactor: PartitionKey type
This commit changes the code base to use a new reference-counted
PartitionKey type wrapper, instead of passing a bare String around.

This allows the compiler to type check & verify usage of the partition
key, instead of passing a bare string around. By reference counting the
underlying string, we reduce memory usage for some use cases.
2022-06-14 14:47:56 +01:00
Andrew Lamb 9fdbfb05e7
refactor: Use scan_and_filter in ReorgPlanner (#4822)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-10 17:31:25 +00:00
kodiakhq[bot] dd8d44e24f
Merge branch 'main' into cn/duration 2022-06-10 14:23:09 +00:00
Nga Tran 13c57d524a
feat: Change data type of catalog partition's sort_key from a string to an array of string (#4801)
* feat: Change data type of catalog Postgres partition's sort_key from a string to an array of string

* test: add column with comma

* fix: use new protonuf field to avoid incompactible

* fix: ensure sort_key is an empty array rather than NULL

* refactor: address review comments

* refactor: address more comments

* chore: clearer comments

* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql

* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql

* fix: Rename migration so it will be applied after

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
2022-06-10 13:31:31 +00:00
Andrew Lamb dc992209be
test: account for active writes when reporting readable status (#4782)
* test: account for active writes when reporting readable status

* fix: logical merge conflict
2022-06-10 12:59:09 +00:00
Andrew Lamb 11cec18edc
refactor: Move `scan_and_filter` into a `common` module for reuse (#4823)
* refactor: remove unused error variants

* refactor: move scan_and_filter into a module so it can be reused

* docs: update comments about pruning
2022-06-10 11:15:47 +00:00
Andrew Lamb 50697906b1
refactor: Make `DMLWrite::sequence_number` a `SequenceNumber` (#4817) 2022-06-09 19:36:37 +00:00
Carol (Nichols || Goulding) 1c7cbaf5ae
refactor: Use DurationHistogram in more places 2022-06-09 14:20:51 -04:00
Andrew Lamb 2ec7764fdd
refactor: rename builder like predicate methods to be `with_` (#4808)
* refactor: rename builder like predicate methods to be `with_`

* fix: merge conflict

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-09 11:26:03 +00:00
Andrew Lamb d8331e8679
fix: do not return 'readable' until a write is completely readable (#4778)
* fix: do not return readable until a write is completely readable

* docs: Add diagram with partially buffered write

* refactor: account for actively buffering during update rather than fixup

* fix: fixup

* fix: use checked_sub

Co-authored-by: Marco Neumann <marco@crepererum.net>

* fix: checked_sub calculation

Co-authored-by: Marco Neumann <marco@crepererum.net>
2022-06-09 11:15:15 +00:00
Andrew Lamb f34282be2c
fix: Do not run DataFusion optimizer pass twice (#4809)
* fix: Do not run DataFusion optimizer pass twice

* docs: improve docstring and logging
2022-06-08 21:01:22 +00:00
Andrew Lamb afc1c12062
refactor: consolidate `PredicateBuilder` into `Predicate` (#4799)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-08 12:21:24 +00:00
Dom Dwyer 1fc5596023 perf: streaming compaction in ingester
Reduces memory usage in the ingester during persist operations by
streaming the results of the snapshot merge/sort/dedupe directly to
the parquet file.

Prior to this commit the output of the compact was buffered in memory
before being wrote to the parquet file.
2022-06-07 12:01:26 +01:00
dependabot[bot] 04c685b3b7
chore(deps): Bump tokio-util from 0.7.2 to 0.7.3 (#4784)
Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.7.2 to 0.7.3.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-util-0.7.2...tokio-util-0.7.3)

---
updated-dependencies:
- dependency-name: tokio-util
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-06 14:46:27 +00:00
dependabot[bot] a1ea793e13
chore(deps): Bump tokio-stream from 0.1.8 to 0.1.9 (#4785)
Bumps [tokio-stream](https://github.com/tokio-rs/tokio) from 0.1.8 to 0.1.9.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-stream-0.1.8...tokio-stream-0.1.9)

---
updated-dependencies:
- dependency-name: tokio-stream
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-06 14:21:54 +00:00
dependabot[bot] e03bf94420
chore(deps): Bump tokio from 1.18.2 to 1.19.1 (#4783)
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.18.2 to 1.19.1.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.18.2...tokio-1.19.1)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-06 14:15:12 +00:00
Andrew Lamb 3592aa52d8
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0` (#4743)
* chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0`

* chore: Update APIs

* chore: Run cargo hakari tasks

* feat: normalize parquet file metadata

* chore: update size tests

* chore: add docs on metadata stripping

* chore: TEMP UPDATE TO DF BRANCH

* chore: Update for new API

* fix: Update to latest DF

* fix: cargo hakari

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
2022-06-03 10:32:26 +00:00
dependabot[bot] 9a21292db8
chore(deps): Bump async-trait from 0.1.53 to 0.1.56 (#4774)
Bumps [async-trait](https://github.com/dtolnay/async-trait) from 0.1.53 to 0.1.56.
- [Release notes](https://github.com/dtolnay/async-trait/releases)
- [Commits](https://github.com/dtolnay/async-trait/compare/0.1.53...0.1.56)

---
updated-dependencies:
- dependency-name: async-trait
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-03 09:10:40 +00:00
Ryan Russell d279deddad
docs(various): Improve Readability (#4768)
Signed-off-by: Ryan Russell <git@ryanrussell.org>

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-02 18:01:06 +00:00
Marco Neumann c91dbe062e
test: "optimize" ingesterrecord batches in query tests (#4700)
* test: "optimize" ingesterrecord batches in query tests

It seems that I had the right idea in #4656 but wasn't able to trigger
https://github.com/influxdata/conductor/issues/955 because the query
tests do not "optimize" the record batches in the same way the actual
gRPC implementation does. If we apply the same transformation we indeed
end up with the same error.

* fix: all batches within the ingester flight response must have same schema

* refactor: simplify and reuse code

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-01 07:37:11 +00:00
Paul Dix 6af32b7750
feat: add concurrency limit for ingester queries (#4703)
I've defaulted it to 20, we can adjust as needed.

Closes #4657

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-30 10:22:17 +00:00
Andrew Lamb dde3c3922c
refactor: use consistent spelling of serialize (#4717) 2022-05-27 14:42:59 +00:00
Nga Tran ea81152fac
refactor: add partition ID into debug info and panic earlier to identify the bug easier (#4716)
* chore: point tests to the new ticket

* chore: cleanup

* refactor: add partition ID into debug info and panic earlier to identify the bug easier
2022-05-27 12:20:36 +00:00
Marco Neumann 31d1b37d73
refactor: de-duplicate low-level arrow code (#4697)
It seems that during prototyping NG we've copied low level code (w/o
tests!) and never cleaned up. Let's not have this functionality twice.
2022-05-25 16:24:28 +00:00
Carol (Nichols || Goulding) 6ce6a38094
fix: Make metric names potentially less confusing 2022-05-25 10:04:39 -04:00
Dom 9cd1286051
Merge branch 'main' into dom/meta-remove-row-count 2022-05-23 16:39:38 +01:00
Marco Neumann 2029bd16ba
feat: enable debugging of failed querier->ingester requests (#4659)
* feat: enable debugging of failed querier->ingester requests

- extend `query-ingester` CLI to allow usage of predicates
- on failed requests: log all information that required for the CLI
- test the "ingester fails" scenario

* test: explain

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* docs: improve

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* refactor: move b64 pred. serde into a single crate

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-05-23 15:37:31 +00:00
Dom Dwyer 2e6c49be83 refactor: remove IoxMetadata min & max timestamp
Removes the min/max timestamp fields from the IoxMetadata proto
structure embedded within a Parquet file's metadata.

These values are redundant as they already exist within the Parquet
column statistics, and precluded streaming serialisation as these
removed min/max values were needed before serialising the file.
2022-05-23 16:27:08 +01:00
Dom Dwyer a142a9eb57 refactor: remove row_count from IoxMetadata
Remove the redundant row_count from the IoxMetadata structure that is
serialised into the Parquet file.

The reasoning is twofold:

    * The Parquet file's native metadata already contains a row count
    * Needing to know the number of rows up-front precludes streaming
2022-05-23 16:18:35 +01:00
Dom f0d0f1ba0c
Merge branch 'main' into dom/codec-object-store 2022-05-23 15:39:54 +01:00
kodiakhq[bot] a06746c715
Merge branch 'main' into cn/last-available 2022-05-23 13:08:19 +00:00
Carol (Nichols || Goulding) 05bd9de4d3
test: Add a test for the sequence number skipping metric
Ok, so... this needed lots of... channels. Channels everywhere.

The stream method on TestWriteBufferStreamHandler previously assumed it
would only be called once. In a test where reset_to_earliest is called,
stream might be called again to get the reset stream.

We want to be able to control which of the streams gets which
operations, so that's why the macro now takes a vec of vec of
operations-- one vec of operations per expected call to stream, and the
stream will send all the operations in its vec.

The test thread needs to wait for the handler stream to consume the last
item from the last receiver stream, so when the
TestWriteBufferStreamHandler has set up the last expected call to
stream, pass back the last transmitter and have it wait until it's at
full expected capacity (which means all operations have been consumed by
the receiver).
2022-05-20 20:50:02 -04:00
Carol (Nichols || Goulding) bda231051a
feat: Record metrics when resetting the write buffer and skipping sequence numbers 2022-05-20 20:48:17 -04:00
Carol (Nichols || Goulding) bcbf7b4f46
refactor: Move error handling logic to be all together 2022-05-20 20:48:17 -04:00
Carol (Nichols || Goulding) 549dd497ea
refactor: Extract an ingester verification function 2022-05-20 20:48:16 -04:00
Carol (Nichols || Goulding) 2aa76622c3
refactor: Extract a test setup function 2022-05-20 11:51:57 -04:00
Carol (Nichols || Goulding) ab72c93a5e
docs: Updating wrapping, content, and grammar of comments 2022-05-20 10:51:07 -04:00
Carol (Nichols || Goulding) c811bebdb7
feat: Add ingester CLI option to skip to oldest available WB seq num
The default behavior of the ingester is to panic if the min unpersisted
sequence number in the catalog is unknown to the write buffer due to the
retention policies having evicted that sequence number.

Specifying `--skip-to-oldest-available` changes this behavior to skip to
the oldest sequence number the write buffer does have available and go
from there.

Fixes #4624.
2022-05-20 10:51:07 -04:00
Carol (Nichols || Goulding) b3f97bdb9d
test: Capture existing behavior for unknown sequence number 2022-05-20 10:51:06 -04:00
Dom Dwyer b9a745d42d feat: RecordBatch stream to Parquet file upload
Implements an upload() method on the ParquetStorage type, consuming a
stream of RecordBatch, serialising the Parquet file, and uploading the
result to object storage. Returns the IOx-specific file metadata.

Currently while the upload() method accepts a stream of RecordBatch, the
actual resulting Parquet file is buffered in memory before uploading to
object store, due to lack of streaming upload functionality in the
ObjectStore abstraction - this isn't the end of the world, as the files
tend to be relatively small with our current usage.

This impl should be easily modified to be fully streaming once streaming
object store puts are implemented:

    https://github.com/influxdata/object_store_rs/issues/9
2022-05-20 15:17:40 +01:00
Carol (Nichols || Goulding) 4bad553dc6
fix: Use a method instead of holding a lock across an await point 2022-05-19 16:09:47 -04:00
Dom Dwyer baa86d846f refactor: use ParquetStore instead of ObjectStore
Changes the code paths that interact with Parquet files in the object
store to reference the ParquetStorage directly (DRY refactor).

This change takes us from a dependency graph of:

                    ┌─────────────────┐
                    │                 │
                    ▼                 │
            Parquet Consumer          │
                    │         ┌──────────────┐
                    ├────────▶│ParquetStorage│
                    ▼         └──────────────┘
            ┌──────────────┐
            │ ObjectStore  │
            └──────────────┘
                    │
               ┌────┴────┐
               ▼         ▼
             File       s3
            System    (etc)

to:

                Parquet Consumer
                        │
                        ▼
                ┌──────────────┐
                │ParquetStorage│
                └──────────────┘
                        │
                        ▼
                ┌──────────────┐
                │ ObjectStore  │
                └──────────────┘
                        │
                   ┌────┴────┐
                   ▼         ▼
                 File       s3
                System    (etc)

With the ParquetStorage being solely responsible for managing
interactions with the object store when dealing with Parquet files.
2022-05-19 13:52:51 +01:00
Dom Dwyer d3548653d5 refactor: rename Storage -> ParquetStorage
Renames the Storage type so the context is clear in usage (i.e. fn
args), rather than having to rely on knowing the fully-qualified import
path to know what the type stores.
2022-05-19 13:51:07 +01:00
Marco Neumann 52346642a0
ci: fix cargo deny (#4629)
* ci: fix cargo deny

* chore: downgrade `socket2`, version 0.4.5 was yanked

* chore: rename `query` to `iox_query`

`query` is already taken on crates.io and yanked and I am getting tired
of working around that.
2022-05-18 09:38:35 +00:00
Andrew Lamb 3a33e806c7
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `14.0.0` (#4619)
* chore: Update datafusion deps

* chore: update arrow/parquet/arrow flight deps

* chore: Run cargo hakari tasks

* chore: Update location of utils

* chore: Update some more APIs

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-05-17 14:13:03 +00:00
Carol (Nichols || Goulding) 9eb21095e7
feat: Add more logging in particular situations to debug flaky test 2022-05-16 16:46:29 -04:00
dependabot[bot] 259d2486c1
chore(deps): Bump tokio-util from 0.7.1 to 0.7.2 (#4605)
Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.7.1 to 0.7.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-util-0.7.1...tokio-util-0.7.2)

---
updated-dependencies:
- dependency-name: tokio-util
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-05-16 11:42:31 +00:00
Raphael Taylor-Davies f2bb0fdf77
feat: update to crates.io object_store version (#4595)
* feat: update to crates.io object_store version

* chore: Run cargo hakari tasks

* fix: tests

* chore: remove object store integration test plumbing

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-05-13 16:26:07 +00:00
kodiakhq[bot] 0f8f294319
Merge branch 'main' into cn/remove-chunk-addr 2022-05-13 13:54:44 +00:00
Carol (Nichols || Goulding) 07c7c75067
fix: Remove ng_chunk method
Connects to #4450.
2022-05-12 16:09:08 -04:00
Carol (Nichols || Goulding) faba90d992
fix: Remove ChunkAddr 2022-05-12 15:50:41 -04:00
Carol (Nichols || Goulding) 26170b7a07
refactor: Move gRPC conversion code to generated_types to share 2022-05-11 14:07:12 -04:00
Carol (Nichols || Goulding) 8545bb60c6
refactor: Move KafkaPartitionWriteStatus to data_types to share more 2022-05-11 14:07:06 -04:00
kodiakhq[bot] 5b866989ca
Merge branch 'main' into dom/query-metrics 2022-05-11 15:51:55 +00:00
Dom Dwyer 7f3473e19f refactor(ingester): emit per-op debugging info
Emit a TRACE level log containing the op offset & other helpful fields.

This will allow us to identify which messages were last successfully
decoded, and which caused errors so we can pull them from analysis.
2022-05-11 16:35:35 +01:00
Dom Dwyer 3890a8986b feat(ingester): emit query latency metrics
Adds a histogram metric "flight_query_duration_ms" that records the
duration a flight RPC query takes to complete. Broken down by query
result (success/error).
2022-05-11 16:14:47 +01:00
Raphael Taylor-Davies 8b379c83cc
refactor: simplify object_store path handling (#4534)
* refactor: simplify object_store path handling

* fix: aws integration tests

* chore: lint

* fix: update gcs tests

* refactor: move errors into submodules

* chore: lint

* chore: review feedback

* refactor: replace provider with Display

* fix: failing tests

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-09 18:43:22 +00:00
kodiakhq[bot] ebd078133c
Merge branch 'main' into cn/more-cleanup 2022-05-09 13:58:25 +00:00
YIXIAO SHI cbe9eb4b1e
chore: fix comment typo (#4533)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-09 11:24:12 +00:00
Carol (Nichols || Goulding) 4506bf3b8f
fix: Remove or rescope dead code in ingester 2022-05-06 16:58:03 -04:00
Jake Goulding e07bcd40c2 refactor: Remove unused dependencies
These were found by iterating over all of the dependencies of each
Cargo.toml, then grepping that crate for the dependency's name. If it
didn't show up, I attempted to remove it.

I left a few dependencies that this process flagged:

* generated_types
  - `pbjson`,`serde`. Apparently used by the generated code.

* grpc-router-test-gen
  - `prost`. Apparently used by the generated code.

* influxdb_iox
  - `heappy`. Doesn't appear used, but is behind enough feature
    flags that I don't care to reason about and it's already optional.
  - `tikv_jemalloc_sys`. Appears to be setting a feature flag of an
    indirect dependency.

* iox_gitops_adapter
  - `k8s_openapi`. Appears to be setting a feature flag of an indirect
    dependency.
2022-05-06 15:57:58 -04:00
Carol (Nichols || Goulding) fcd4815645
fix: Rename router2 to router 2022-05-06 14:51:52 -04:00
Carol (Nichols || Goulding) 068096e7e1
fix: Rename data_types2 to data_types 2022-05-06 14:45:39 -04:00
Carol (Nichols || Goulding) 0541c6e40f
fix: Remove data_types crate where it's no longer used 2022-05-06 14:45:39 -04:00
Carol (Nichols || Goulding) 2ef44f2024
fix: Move timestamp types to data_types2 2022-05-06 14:45:38 -04:00
Carol (Nichols || Goulding) eb31b347b0
refactor: Move tombstones_to_delete_predicates to the predicate crate 2022-05-06 14:45:37 -04:00
Carol (Nichols || Goulding) 485d6edb8f
refactor: Move IngesterQueryRequest to generated_types 2022-05-06 14:45:37 -04:00
Carol (Nichols || Goulding) ea46830954
fix: Remove iox_object_store crate; move ParquetFilePath to parquet_file 2022-05-06 14:45:36 -04:00
Andrew Lamb 02893e598c
chore: Update datafusion and upgrade arrow/parquet/arrow-flight to 13 (#4516)
* chore: Tool for automating arrow version update

* chore: Update datafusion and arrow/parquet/arrow-flight

* fix: update for changes in Arrow API

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-05 00:21:02 +00:00
dependabot[bot] 420c306caa
chore(deps): Bump tokio from 1.17.0 to 1.18.0 (#4453)
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.17.0 to 1.18.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.17.0...tokio-1.18.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-04-28 08:21:17 +00:00
Marco Neumann 59f6556483
fix: do not create empty batches in ingester (#4443)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-27 17:52:22 +00:00
Nga Tran fa2c1febf4
feat: use stored partition sort key to deduplicate data (#4360)
* feat: use stored sort key to deduplicate data

* refactor: verify if one is a super sort key of the other

* test: unit tests for scan and deduplication plans

* fix: typo

* refactor: refactor and add comments

* feat: cache partition sort key to read during planning as needed

* test: tests for query plans with different overlap groups

* chore: cleanup

* chore: resolve merge conflicts

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-26 20:36:32 +00:00
Marco Neumann 11f87cffdd
fix: memorize max persisted tombstone (#4430) 2022-04-26 16:13:09 +00:00
Marco Neumann bd600bbac6
refactor: allow ingester to be integrated into query tests (#4427)
* refactor: improve `IngesterData` public interface

* feat: impl `Debug` for `Test{Namespace,Sequencer}`

* refactor: trait interface for `LifecyleHandle`

This is required to mock the lifecycle for query tests.

* refactor: trait for partitioner
2022-04-26 13:44:30 +00:00
二手掉包工程师 4b47d723b1
refactor: Rename time to iox_time (#4416)
Signed-off-by: hi-rustin <rustin.liu@gmail.com>

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-26 00:19:59 +00:00
Marco Neumann 86e8f05ed1
fix: make all catalog IDs 64bit (#4418)
Closes #4365.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-25 16:49:34 +00:00
Nga Tran d963110842
feat: group chunk overlaps based on time range only (#4389)
* feat: overlap for NG querier

* chore: cleanup

* refactor: address review comments

* fix: typo

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-25 13:32:07 +00:00
Marco Neumann f444e63960
test: include materialized delete predicates in NG query tests (#4371)
* refactor: move `batch_filter` to `datafusion_util`

* fix: outdated docstring

* feat: allow passing record batches to `iox_tests` parquet files

* test: include materialized delete predicates in NG query tests

* docs: improve wording

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-21 13:00:13 +00:00
Andrew Lamb 73bed810da
chore: Update arrow, arrow-flight, parquet, tonic, prost, etc (#4357)
* chore: Update datafusion

* chore: Update arrow/arrow-flight/parquet to 12

* chore: update datafusion correctly

* chore: Update prost, tonic, and dependents

* fix: Fixup some api changes

* fix: Update test output in db

* fix: Update test output in parquet_file

* fix: remove old pbjson types

* fix: Add "--experimental_allow_proto3_optional" flag

* chore: Run cargo hakari tasks

* fix: compile error

* chore: Update heappy

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-20 11:12:17 +00:00
Andrew Lamb 5ea676d3f7
feat: add per kafka partition durability reporting to write info response (#4341)
* feat: add per kafka partition durability reporting to write info response

* fix: buf lint + test cleanup

* fix: clean up protobuf

* refactor: pull out conversion of KafkaPartitionStatus into a function

* fix: fmt

* fix: typo

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-19 16:46:20 +00:00
Marco Neumann 5b48675435
fix: actually transmit record-batch metadata from querier (#4347)
Attaching the "batch => partition" mapping via per-batch schema KV
metadata does NOT work because flight will transmit the schema once for
all batches (even though on the Rust side we have a schema ref attached
to every batch, probably for convenience). Instead we now use the same
global protobuf metadata that we also use for the "partition => max
sequence number" information. This somewhat limits our ability to create
record batches lazily on the ingester side (since the global metadata is
sent before any actual payload) but I think we should not modify the
usage of the flight protocol too much right now (e.g. by sending more
schema messages). If this becomes an issue, we can always find a more
complex solution in the future.
2022-04-19 10:54:23 +00:00
Nga Tran 2a601c3099
fix: Revert "chore: Revert "fx: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)" (#4327)" (#4328)
* fix: Revert "chore: Revert "fx: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)" (#4327)"

This reverts commit 7e5d719027.

* chore: resolve merge conflict

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-18 15:27:39 +00:00
Nga Tran 7e5d719027
chore: Revert "fix: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)" (#4327)
This reverts commit fe8d9948d5.
2022-04-14 17:11:55 +00:00
Carol (Nichols || Goulding) fe8d9948d5
fix: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)
This reverts commit 7ddbf7c025.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-14 15:42:28 +00:00
Marco Neumann 351b0d0c15
fix: unknown namespace/table in querier<>ingester flight protocol (#4307)
* fix: return "not found" gRPC error instead of "internal" when ingester does not know table

* fix: properly handle "namespace not found" in ingester queries

* fix: make `initialize_db` work with async code

* test: add custom step for NG tests

* fix: handle "unknown table/namespace" resp. in querier

* docs: explain test setup

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-04-14 12:36:15 +00:00
Marco Neumann 8bf2fbb7d3
fix: ingester min-unpersisted-sequence-number calc + doc (#4302)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-14 07:05:06 +00:00
Carol (Nichols || Goulding) 7ddbf7c025
fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-13 14:11:10 +00:00
kodiakhq[bot] 21f748062e
Merge branch 'main' into cn/sort-in-compactor 2022-04-13 12:43:31 +00:00
Marco Neumann 83f77712b1
refactor: querier<>ingester flight protocol adjustments (#4286)
* refactor: querier<>ingester flight protocol adjustments

This makes a few adjustments to the querier<>ingester flight protocol.

Query Scope
===========
The querier will request data for ALL sequencer IDs for now. There is
no reason to have a request per sequencer ID. We can add a range/set
filter later if we want, but this is not required for now.

Partition-level
===============
The only time when the querier cares about sequencer IDs (i.e. sharding)
at all is when it selects which ingesters to ask for unpersisted data
(this is currently not implemented, it just asks all ingesters).
Afterwards the querier only cares about partitions (which are bound to
specific sequencers anyways) because this is the level where parquet
file persistence and compaction as well as deduplication happen. So we
make partitions a first-class citizen in the ingester response.

Metadata VS RecordBatches
=========================
The global app-metadata will list all partitions and their max
persisted parquet files and tombstones (theoretically tombstones are at
table-level, but the ingester could in the future break them down to the
partition-level). Then it receives a stream of record batches. Each
record batch is tagged (via key-value metadata in its schema) so it can
be assigned to a partition. At the moment the ingester returns 0 or 1
batches per unpersisted partition (0 in case we've filtered out all the
data via the predicate), but in the future it is free to return multiple
batches. This setup gives the ingester more freedom over memory
management and (potentially parallel) query processing, while at the
same time keeps the set of duplicated information minimal and allows
easy extensions (since the global metadata is a full-blown protobuf
message).

Querier
=======
At the moment the querier ignores all the metdata. Follow-up PRs will
change that.

* docs: improve

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* refactor: make code clearer

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-04-12 16:48:40 +00:00
Carol (Nichols || Goulding) 87e4a1a51d
refactor: Move ingester sort key to schema sort key
This logic isn't actually ingester specific
2022-04-11 14:09:45 -04:00
Carol (Nichols || Goulding) a053077a05
refactor: Make compute_sort_key more general than the ingester
Enable computing sort keys for a schema and an iterator of record
batches.
2022-04-11 14:09:45 -04:00
Dom Dwyer 5c3cbb14b4 test: join ingester background tasks 2022-04-08 14:24:56 +01:00
Dom Dwyer dce939c580 refactor: use SequencedStreamHandler
Removes the old stream_in_sequenced_entries() write buffer handler,
replacing it with the SequencedStreamHandler introduced in #4203.

This change will affect the metrics emitted by an ingester as outlined
in #4243.
2022-04-08 11:28:39 +01:00
Dom Dwyer 71a278ac7e refactor: accept !Sync write buffer streams
Removes the Sync bound SequencedStreamHandler input stream type, as the
BoxStream returned by the WriteBufferStreamHandler is not Sync.

This change means the SequencedStreamHandler is not Sync either, but is
still Send and therefore can be moved into tokio tasks.
2022-04-08 11:28:39 +01:00
Dom Dwyer c2236fa3fb feat: impl DmlSink for IngesterData
This commit adds an adaptor (IngestSinkAdaptor) that provides a DmlSink
implementation for the existing write path (IngesterData). With this,
the existing write path becomes compatible with the new
op stream handler (SequencedStreamHandler).
2022-04-08 11:28:39 +01:00
Andrew Lamb a30a85e62c
feat: Add get_write_info service (#4227)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-07 19:24:58 +00:00
kodiakhq[bot] 8bd0bfb669
Merge branch 'main' into dom/ingester-op-instrumentation 2022-04-07 16:33:25 +00:00
kodiakhq[bot] f5996c5ab4
Merge branch 'main' into cn/sort-key-across-persists 2022-04-07 14:40:55 +00:00
Dom 998a66fd98
docs: Update ingester/src/stream_handler/sink_instrumentation.rs
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-04-07 12:18:14 +01:00
Carol (Nichols || Goulding) 30c3ef5aa6
fix: Only save relevant columns in parquet file's sort key 2022-04-06 14:09:08 -04:00
Dom Dwyer 24eeddce8a chore: fix lint warnings 2022-04-06 16:45:31 +01:00
Dom Dwyer 091640bb23 feat: emit tracing span for op apply
This commit uses the tracing metadata within the DmlOperation to emit a
tracing span from the ingester covering the DmlSink::apply() operation.
2022-04-06 16:32:00 +01:00
Dom Dwyer f6c65f52a3 refactor: impl WatermarkFetcher
Implement WatermarkFetcher for PeriodicWatermarkFetcher and remove
unnecessary async.
2022-04-06 16:32:00 +01:00
Dom Dwyer 436da19d9a feat: DmlSink instrumentation
This commit adds the SinkInstrumentation type that decorates an inner
DmlSink with call latency and write buffer metrics.

The write buffer / sink call metrics may be split apart into two
separate responsibilities in the future if there are multiple DmlSink
that need instrumentation, but deferring adding more types until it is
needed.
2022-04-06 16:32:00 +01:00
Andrew Lamb c244b03281
feat: Add `SequencerProgress` reporting to ingester (#4238)
* feat: Add `SequencerProgress` reporting to ingester

* refactor: Use KafkaPartition in write_summary

* fix: Update docstrings

* refactor: Change ingester to use KafkaPartition everywhere

* refactor: add SequencerProgress::combine

* refactor: return new SequencerProgress rather than updating

* fix: distinguish between yes/no/unknown in WriteSummary

* docs: Update data_types2/src/lib.rs

Co-authored-by: Paul Dix <paul@pauldix.net>

Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-06 15:13:21 +00:00
Carol (Nichols || Goulding) bf3cb45723
refactor: Pass PartitionInfo as argument 2022-04-06 09:31:42 -04:00
Carol (Nichols || Goulding) f0d5987317
feat: Update partition sort_key in catalog after persist
Connects to #4196.
2022-04-06 09:31:42 -04:00
Carol (Nichols || Goulding) b16fcc284d
feat: Add new columns to the sort key during compaction
Connects to #4196.
2022-04-06 09:31:42 -04:00
Carol (Nichols || Goulding) 98d052dba7
feat: Use catalog sort key if specified
Pass the sort key from the catalog through to compact_persisting_batch.
If the sort key is Some, use that. If the sort key is None, compute it
from the data's cardinality with compute_sort_key.

Connects to #4196.
2022-04-06 09:31:42 -04:00
Dom Dwyer 891d2e1368 feat: periodic kafka max watermark offset fetcher
Adds the PeriodicWatermarkFetcher type responsible for querying write
buffer / Kafka for the maximum sequence number / offset, surfacing any
errors via both logs & metrics.

This high watermark / max offset value is used within the ingest
instrumentation metrics. This use case is tolerant of caching / stale
values, and as such the value is periodically updated to minimise load
on the write buffer.
2022-04-05 12:02:07 +01:00
Dom Dwyer aaa677dec8 docs: describe graceful shutdown behaviour 2022-04-05 11:31:55 +01:00
Dom Dwyer 8edefc415d refactor: rename ttbr -> write_time in tests 2022-04-05 11:31:55 +01:00
Dom Dwyer a387ec361d refactor: use self.deref() instead of **self 2022-04-05 11:31:55 +01:00
Dom Dwyer f15275cf96 feat: expose ingest sequencer errors
Instruments the SequencedStreamHandler with a series of new metrics that
record the various error classes observable in the stream handler.

These metrics are labelled with potential_data_loss=true where relevant
to surface potential data loss events for alerting & further review.
2022-04-05 11:31:55 +01:00
Dom Dwyer 083ff1f8e3 refactor: ingest stream handler
Refactors the stream_in_sequenced_entries() into a new impl in the
SequencedStreamHandler type, decoupling the reading / decoding of ops
from Kafka (and associated error handling) from the "what happens to
those ops" concern to ease testing, encapsulate the specifics of "how to
get an op" and improve flexibility.

This is intended to provide robust error handling within what is
reasonably possible (unexpected errors are always unexpected!) while
retaining the existing metrics and functionality. I've also separated
out code that exists in the current impl specifically to drive tests
from the prod code path, instead driving those behaviours through mocks.

As of this commit, the handler is not used - this commit simply adds the
new impl.
2022-04-05 11:31:54 +01:00
Paul Dix 81d41f81a1
fix: ingester replay logic (#4212)
Fix the ingester to track the max persisted sequence number per partition.
Ensure replay takes in data from unpersisted partitions.
Simplify the table persist info to not return a max persisted sequence number for the table as that information isn't needed.
2022-04-04 18:04:34 +00:00
Carol (Nichols || Goulding) d41adf074f
test: Add assertions for sort keys 2022-04-01 13:13:04 -04:00
Carol (Nichols || Goulding) f4b5fa1b5e
feat: Implement distinct counts in terms of distinct values
For one record batch.

Connects to #4194.
2022-03-31 16:46:27 -04:00
Carol (Nichols || Goulding) 832495a7c9
feat: Implement ingester compute_sort_key similarly to query compute_sort_key
And add a test that currently fails because this implementation doesn't
include actually computing the cardinalities.

Connects to #4194.
2022-03-31 16:35:16 -04:00
Carol (Nichols || Goulding) 9d83554f20
feat: Get the sort key from the schema and data in the QueryableBatch
Connects to #4194.
2022-03-31 16:34:48 -04:00
Carol (Nichols || Goulding) 9043966443
docs: Fix some typos in comments as I noticed them 2022-03-31 16:34:47 -04:00
Andrew Lamb a384448b92
refactor: rename Sequence::id and Sequence::number field names (#4190)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-31 15:17:58 +00:00
Nga Tran ddc2c8304f
fix: have the compaction level set correctly (#4184)
* fix: have the compaction level set correctly, especially for compacted file from the compactor

* fix: typo
2022-03-30 21:23:40 +00:00
Marco Neumann 2b76c31157
refactor: make statistics null counts optional (#4160)
Min/max values and distinct counts are already optional, so let's make
the null counts optional as well. This will be helpful for NG to deal w/
partial statistics (e.g. we only populate stats for the time column).

Note that the total count is still mandatory, but we normally have the
chunk/file-level row count at hand.
2022-03-29 17:47:57 +00:00
Carol (Nichols || Goulding) f3f792fd08
feat: Add namespace_id to the parquet_files table; object store paths need it 2022-03-29 08:15:26 -04:00
Carol (Nichols || Goulding) a373c90415
refactor: Extract the list_all function to object store
I'm about to use this in a third file, so time to extract this.

Make it clear that this is appropriate for tests only.
2022-03-29 08:15:24 -04:00
dependabot[bot] 17af5fcbd1
chore(deps): Bump tokio-util from 0.7.0 to 0.7.1 (#4154)
* chore(deps): Bump tokio-util from 0.7.0 to 0.7.1

Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.7.0 to 0.7.1.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-util-0.7.0...tokio-util-0.7.1)

---
updated-dependencies:
- dependency-name: tokio-util
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: Run cargo hakari tasks

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-29 08:39:02 +00:00
dependabot[bot] 4f9515ffba
chore(deps): Bump async-trait from 0.1.52 to 0.1.53 (#4141)
Bumps [async-trait](https://github.com/dtolnay/async-trait) from 0.1.52 to 0.1.53.
- [Release notes](https://github.com/dtolnay/async-trait/releases)
- [Commits](https://github.com/dtolnay/async-trait/compare/0.1.52...0.1.53)

---
updated-dependencies:
- dependency-name: async-trait
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-03-28 08:55:24 +00:00
Andrew Lamb 5c69a3f43b
chore: Update deps: datafusion, arrow/arrow-flight/parquet to 11, zstd to 0.11 (#4119)
* chore: update datafusion

* chore(deps): Bump arrow from 10.0.0 to 11.0.0

Bumps [arrow](https://github.com/apache/arrow-rs) from 10.0.0 to 11.0.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/10.0.0...11.0.0)

---
updated-dependencies:
- dependency-name: arrow
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): Bump arrow-flight from 10.0.0 to 11.0.0

Bumps [arrow-flight](https://github.com/apache/arrow-rs) from 10.0.0 to 11.0.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/10.0.0...11.0.0)

---
updated-dependencies:
- dependency-name: arrow-flight
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: update parquet to 11.0.0

* fix: error on create schema, test for same

* fix: upgrade zstd

* chore: Run cargo hakari tasks

* fix: fix logical merge conflict

* fix: hakari

* fix: hakari

* fix: update newly introduced dep

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-24 15:27:36 +00:00
Andrew Lamb 204dd7c8e9
refactor: Fix some random clippy lints from the future (#4118)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-24 09:21:29 +00:00
Carol (Nichols || Goulding) 67e13a7c34
fix: Change to_delete column on parquet_files to be a time (#4117)
Set to_delete to the time the file was marked as deleted rather than
true.

Fixes #4059.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-23 18:47:27 +00:00
Marco Neumann 51da6dd7fa
feat: store sort key in NG metadata (#4110)
The sort key is optional and currently only produced by `iox_tests`.
Writing it within the ingester/compactor is tracked by #3968. The sort
key is read by the querier (and this will be verified by the query tests
and is required to merge #4103).

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-23 18:24:46 +00:00
Paul Dix b18b18afd9
fix: have ingester use single mutable batch for buffer (#4095)
Removed some unnecessary tests as they no longer apply with the new buffer structure. This will hopefully reduce the memory footprint of the ingesters significantly.

Closes #4072
2022-03-22 13:42:52 +00:00
Andrew Lamb b83b000590
chore: Update datafusion (#4071)
* chore: update to datafusion 5936edc2a94d5fb20702a41eab2b80695961b9dc

* chore: Update apis to match datafusion changes
2022-03-22 13:17:41 +00:00
kodiakhq[bot] 67939fb37d
Merge branch 'main' into crepererum/issue3934b 2022-03-19 06:56:30 +00:00
Paul Dix 85287abc4e
feat: add ttbr metric to ingester (#4068) 2022-03-18 17:35:30 +00:00
Marco Neumann 169fa2fb2f refactor: make `QueryChunk` object-safe
This makes it way easier to dyn-type database implementations. The only
real change is that we make `QueryChunk::Error` opaque. Nobody is going
to inspect that anyways, it's just printed to the user.

This is a follow-up of #4053.

Ref #3934.
2022-03-18 11:40:31 +01:00
Dom Dwyer d9900f661b refactor: ingester table & namespace count metrics
Record the number of tables / namespaces an ingester process has
observed.
2022-03-17 17:20:30 +00:00
Dom Dwyer c0d5c6a559 feat: ingester pause metrics
Emit a counter metric "ingest_paused_duration_ms_total" that records the
duration of time an ingester stream is paused with millisecond
granularity.

This metric will allow us to measure the frequency and severity of, and
alert on, an ingester stopping ingest due to memory limits enforced by
the LifecycleManager. This will help us tune these config params.
2022-03-17 17:20:30 +00:00
Paul Dix d3ea361337
feat: add ingester lifecycle metrics (#4031) 2022-03-15 19:32:58 +00:00
Dom Dwyer 5585dd3c21 refactor: switch to using DynObjectStore
Changes all consumers of the object store to use the dynamically
dispatched DynObjectStore type, instead of using a hardcoded concrete
implementation type.
2022-03-15 16:32:52 +00:00
Dom Dwyer 1d5066c421 refactor: rename ObjectStore -> ObjectStoreImpl
Frees up the name for so we can use `dyn ObjectStore` throughout the
code instead of `ObjectStoreApi`.
2022-03-15 16:29:43 +00:00
Andrew Lamb 9b3f946c10
feat: all in 1 IOx NG mode (#3965)
* feat: Add all_in_one mode

* fix: doc

* docs: fix truncated docs

* refactor: correctly identify PG connections

* refactor: resolve failed merge

Co-authored-by: Dom Dwyer <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-15 16:28:37 +00:00
Marko Mikulicic 4c674b931a
fix: Remove partition_id from metric attributes (#4028) 2022-03-14 14:12:34 +00:00
Nga Tran 5a29d070ea
feat: Implement the compact function for NG Compactor (#4001)
* feat: initial implementation of compact a given list of overlapped parquet files

* feat: Add QueryableParquetChunk and some refactoring

* feat:  build queryable parquet chunks for parquet files with tombstones

* feat: second half the implementation for Compactor's compact. Tests will be next

* fix: comments for trait funnctions fof QueryChunkMeta

* test: add tests for compactor's compact function

* fix: typos

* refactor: address Jake's review comments

* refactor: address Andrew's comments and add one more test for files in different order in the vector

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-11 20:25:19 +00:00
Andrew Lamb cc4875cca0
refactor: decouple ingester setup and creation logic from the config structs (#4020)
* refactor: decouple ingester setup and creation logic from the config structs

* fix: clippy

* refactor: remove comments
2022-03-11 19:25:50 +00:00
kodiakhq[bot] 30fc29c296
Merge branch 'main' into dom/ingester-lock-fix 2022-03-11 10:32:26 +00:00
Dom Dwyer 4a7364d63f refactor: rename lifecycle_manager args 2022-03-11 10:31:58 +00:00
Carol (Nichols || Goulding) ecd06c6ec3
fix: ParquetFileRepo create should be responsible for setting INITIAL_COMPACTION_LEVEL
When created in the catalog, parquet files should always have compaction
level 0. Updating the compaction level should always happen in the
compactor.

Only the catalog should need to know about the initial compaction level
value.
2022-03-10 13:51:18 -05:00
Carol (Nichols || Goulding) ff31407dce
refactor: Extract a ParquetFileParams type for create
This has the advantages of:

- Not needing to create fake parquet file IDs or fake deleted_at
  values that aren't used by create before insertion
- Not needing too many arguments for create
- Naming the arguments so it's easier to see what value is what
  argument, especially in tests
- Easier to reuse arguments or parts of arguments by using copies of
  params, which makes it easier to see differences, especially in tests
2022-03-10 13:51:18 -05:00
Paul Dix 27999ff72f
feat: add compaction_level and created_at to parquet_file (#3972) 2022-03-10 15:56:57 +00:00
Andrew Lamb 2c3d30ca32
chore: Update datafusion, arrow, flight and parquet (#4000)
* chore: Update datafusion, arrow, flight and parquet

* fix: api change

* fix: fmt

* fix: update test metadata size

* fix: Update sizes in parquet test

* fix: more metadata size update
2022-03-10 12:24:47 +00:00
Dom Dwyer 874be0097d refactor: eliminate unnecessary lock
This commit splits the API of the LifecycleManager into two:

    * LifecycleManager: singleton responsible for evaluating partitions
        and running persist tasks.

    * LifecycleHandle: a handle for each sequencer ingester(s) to update
        the global LifecycleManager state when applying ops.

This keeps the accessible API & responsibilities of each caller distinct
and allows us to leverage the type system to enforce linearisation of
calls to LifecycleManager::maybe_persist() without resorting to an
(unnecessary) mutex guard for serialisation.
2022-03-09 11:16:44 +00:00
Paul Dix 96100635c3
feat: ingester seeks kafka partition on initialization (#3940)
Fixes #3851
2022-03-08 19:53:35 +00:00
Dom Dwyer 384133aac7 refactor: impl Debug for ingester types
Derive / implement Debug for all types in the ingester crate to help
with debugging, and add a lint at "warn" level to keep everything in
sync.
2022-03-08 15:19:29 +00:00
Paul Dix 337e432a0f
feat: ingester persists on cold partitions (#3942)
Add configuration and lifecycle to trigger partition persistence if it hasn't received a write in a given number of secods.

Fixes #3869
2022-03-07 18:55:56 +00:00
Carol (Nichols || Goulding) 9961efd702
feat: Send parquet and tombstone seq nums with ingester query response (#3925)
Fixes #3867.
2022-03-04 15:22:29 +00:00
Paul Dix 6ba5e51897
feat: update max_persisted_sequence_number in the buffered table on persist (#3868)
This includes a bit of a refactor in the locking structure of the buffer data. Locking at the partition collection and within the partition data was making things more complex than they needed to be. The partitions in the buffer are there only temporarily until they get persisted. Locking on the table simplifies things a bit and makes it more clear when the table state is being modified since it no longer has any interior mutability. Having access to separate partitions without the same lock isn't something we need because queries will hit all partitions and data is brought in sequentially, regardless of which partition it is hitting in a sequencer.

Fixes #3850
2022-03-03 23:52:31 +00:00
Edd Robinson 787a848bf5 feat: add tracing for tag_values 2022-03-03 14:27:01 +00:00
Edd Robinson 6a6fbf73ae feat: add tracing support tag_keys 2022-03-03 14:27:01 +00:00
Dom Dwyer 8de453edd1 feat: batch column upsert for schema validation
Uses the new ColumnRepo::create_or_get_many() catalog method to perform
a bulk upsert of (potentially) new columns to the catalog during schema
validation.
2022-03-03 11:18:29 +00:00
kodiakhq[bot] caba3e9fd2
Merge branch 'main' into cn/querier-flight-request 2022-03-02 20:30:00 +00:00
Edd Robinson 3d047073b9
feat: add tracing down to the chunk level (#3804)
* refactor: wire exectution context to Deduplicator

* feat: example trace to chunk read_filter

* refactor: make execution context required

* refactor: expose metadata API

* refactor: more span context for chunk read_filter

* refactor: fix build

* refactor: push context into result stream

* refactor: make executor optional
2022-03-02 19:08:22 +00:00
Carol (Nichols || Goulding) 3f2a58b47f
refactor: pub use data_types from data_types2
So it's clearer which parts of data_types the NG design is using, and
which types can be cleaned up eventually.
2022-03-02 13:55:31 -05:00
Carol (Nichols || Goulding) 2a90841715
refactor: Move IngesterQueryRequest to data_types2
So that querier doesn't need to depend on ingester.
2022-03-02 13:52:13 -05:00
Carol (Nichols || Goulding) 8f3e44bf76
refactor: Extract a crate for shared data types in the new design 2022-03-02 12:16:15 -05:00
Carol (Nichols || Goulding) 16d86ed05b
feat: Deserialize metadata to get max_sequencer_number
And add an end-to-end test for the flight request to the ingester.
2022-03-02 11:50:47 -05:00
Carol (Nichols || Goulding) 141a6087d0
feat: Querier able to send Flight queries to Ingester
Fixes #3773.
2022-03-02 11:50:45 -05:00
Marco Neumann ace4af1b66
feat: `DedicatedExecutor` async `join` and job `detach`. (#3835)
* feat: detach dedicated exec jobs

* feat: async `DedicatedExecutor::join`

Now `DedicatedExecutor` follows the system we use for other server
components:

- `shutdown`:  a quick sync call that signals the shutdown but doesn't
  drop
- `join`: async awaits until the executor has finished shutdown
- `drop`: warn but still try to shut down

* test: irmpove `detach_receiver` test

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-03-01 15:25:31 +00:00
Marco Neumann 48722783f9
feat: offer metrics for in-mem catalog (#3876)
This can be quite helpful to test certain caching behavior w/o writing
yet-another abstraction layer.
2022-03-01 11:33:54 +00:00
Nga Tran 0e0dc500f6
feat: prepare data to send to querier (#3825)
* feat: changes needed to apply tombstones correctly on the life-cycle ingest bacthes

* refactor: adjust the  design after discussing with Paul

* feat: apply the coming tombstone on all data but persiting one

* chore: fmt

* fix: build on buffer tombstone

* test: delete & write tests for a parition and some cleanup

* feat: No need add processed tombstones for newly created parquet file in the ingester becasue all deletes before that parquet file is created were applied

* chore: cleanup

* feat: intitial implementation for preparing data to send back to the Querier

* feat: full implementation of prepare_data_to_querier

* fix: apply filters for the batches

* chore: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

* chore: cleanup

* fix: typos in comments

* fix: typos in comments

* fix: typos in comments

* test: create different scenarios and test them

* chore: fix typos

* test: add tests with deletes

* chore: make pub pub(crate)

* chore: Apply suggestions from code review

Co-authored-by: Jake Goulding <jake.goulding@integer32.com>

* refactor: address review comments

* fix: keep batches in their arrival order

* refactor: not assign unecessary values to enum

* refactor: use bitflags enum

* fix: use bitflags correctly

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* refactor: avoid using use at the end of the function

* chore: merge main to branch

* fix: fix downgrade versions

* refactor: address review comments

* chore: remove unnecessary comments

* refactor: Make the whole test_utils module test-only and bring paths into module scope

Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: Jake Goulding <jake.goulding@integer32.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
2022-03-01 01:00:45 +00:00
Marco Neumann 6bb18672a4
fix: do not ignore failed persist tasks (#3866)
I'm seeing some panics in our test bench, but it the ingester happily
continues and thinks it persisted tasks even though it didn't. Let's at
least bail out if a persist task fails.
2022-02-28 09:30:42 +00:00
Raphael Taylor-Davies 2a842fbb1a
feat: correctly sort data and store in catalog metadata (#3864)
* feat: respect sort order in ChunkTableProvider (#3214)

feat: persist sort order in catalog (#3845)

refactor: owned SortKey (#3845)

* fix: size tests

* refactor: immutable SortKey

* test: test sort order restart (#3845)

* chore: explicit None for sort key

* chore: test cleanup

* fix: handling of sort keys containing fields

* chore: remove unused selected_sort_key

* chore: more docs

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-25 17:56:27 +00:00
Nga Tran 8edc462c37
fix: while executing deduplication, do not return empty record batches as a result of deduplication (#3861)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-25 15:00:13 +00:00
Paul Dix c965221df1
feat: have ingester ignore already persisted data (#3849) 2022-02-25 00:08:33 +00:00
Dom Dwyer b07f15bec7 refactor: parallel column resolution
A quick change to perform the ColumnRepo::create_or_get() calls in
parallel (up to a maximum of 3 in-flight at any one time) in order to
mitigate the latency of the call and reduce the overall schema
validation call duration.

The in-flight limit is enforced to avoid starving the DB connection pool
of connections.
2022-02-24 21:04:25 +00:00
Carol (Nichols || Goulding) 723a0c659f
fix: Remove greater_than_sequence_number from IngesterQueryRequest (#3856) 2022-02-24 19:23:44 +00:00
Marco Neumann 49d1be30e7
feat: wire up `ParquetFilePath` for NG (#3853)
It's a bit of a duck-type hack, but if we wanna just `ParquetFileChunk`
in the new architecture, we somehow need it to accept new-gen paths.
Also path handling should be somewhat centralized since
ingester/compactor/querier all need to construct them. So having a
`ParquetFilePath` that supports both path styles seems to be a
not-to-bad solution. This should obviously be cleaned up in some
not-to-distant future.
2022-02-24 16:05:38 +00:00
Carol (Nichols || Goulding) 252ced7adf
feat: Add row count to the parquet_file record in the catalog (#3847)
Fixes #3842.
2022-02-24 15:20:50 +00:00
Marco Neumann d62a052394
feat: extend catalog so we can recover `ParquetChunk`s from it (#3852)
* refactor: less parquet data copying

* feat: `PartitionRepo::get_by_id`

* feat: `TableRepo::get_by_id`

* feat: `ParquetFile::file_size_bytes`

* feat: `ParquetFile::parquet_metadata`

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-24 13:16:15 +00:00
Marco Neumann 9079e6ddb0
feat: backoff retries in ingester (#3841)
* feat: add `backoff` crate

* feat: backoff retries in ingester

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-23 17:58:16 +00:00
Carol (Nichols || Goulding) 71f62eee68
fix: Remove min_time and max_time from IngesterQueryRequest (#3839)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-23 15:46:31 +00:00
Marco Neumann 657ac249e9
feat: track ingester jobs (#3836) 2022-02-23 15:33:47 +00:00
dependabot[bot] b63f920d4c
chore(deps): Bump parquet from 9.0.2 to 9.1.0 (#3828)
* chore(deps): Bump parquet from 9.0.2 to 9.1.0

Bumps [parquet](https://github.com/apache/arrow-rs) from 9.0.2 to 9.1.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/9.0.2...9.1.0)

---
updated-dependencies:
- dependency-name: parquet
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: update chunk size test

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-23 11:25:15 +00:00
dependabot[bot] 5a79b3a68b
chore(deps): Bump arrow-flight from 9.0.2 to 9.1.0 (#3829)
Bumps [arrow-flight](https://github.com/apache/arrow-rs) from 9.0.2 to 9.1.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/9.0.2...9.1.0)

---
updated-dependencies:
- dependency-name: arrow-flight
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-23 11:03:22 +00:00
dependabot[bot] 3b7d31c88a
chore(deps): Bump arrow from 9.0.2 to 9.1.0 (#3826)
Bumps [arrow](https://github.com/apache/arrow-rs) from 9.0.2 to 9.1.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/9.0.2...9.1.0)

---
updated-dependencies:
- dependency-name: arrow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-23 09:25:46 +00:00
Paul Dix 276d9b123a
feat: Add min_sequence_number tracking for sequencers in ingester (#3785)
Fixes #3702. This pulls the min sequence tracking into the LifecycleManager. Because the number requires looking at all other partitions in memory, this was the most efficient place to put it. The manager updates the sequencer state after it calls persist. The number is meant to be a lower bound on the sequence number. Issue #3783 will add functionality for the ingester to ignore replayed data that has already been persisted.
2022-02-22 21:53:33 +00:00
Nga Tran a91e2eadc7
feat: apply tombstones to the batches of the ingest life-cycle (#3770)
* feat: changes needed to apply tombstones correctly on the life-cycle ingest bacthes

* refactor: adjust the  design after discussing with Paul

* feat: apply the coming tombstone on all data but persiting one

* chore: fmt

* fix: build on buffer tombstone

* test: delete & write tests for a parition and some cleanup

* feat: No need add processed tombstones for newly created parquet file in the ingester becasue all deletes before that parquet file is created were applied

* chore: cleanup

Co-authored-by: Paul Dix <paul@pauldix.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-22 18:54:21 +00:00
dependabot[bot] ad3868ed7c
chore(deps): Bump tokio from 1.16.1 to 1.17.0 (#3814)
* chore(deps): Bump tokio from 1.16.1 to 1.17.0

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.16.1 to 1.17.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.16.1...tokio-1.17.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* build: update workspace-hack

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Dom Dwyer <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-22 16:27:43 +00:00
Carol (Nichols || Goulding) 1b9212540b
feat: Send IngesterQueryResponse data back as response of doGet Flight request (#3772)
* fix: Adjust fields of IngesterQueryResponse

* feat: Adjust IngestHandler query method to call prepare_data_to_querier

* feat: Send ingest query result data back through Flight doGet

* feat: Send delete predicates and max sequencer number in metadata

* fix: greater_than_sequence_number should be of type SequenceNumber

* fix: Remove DeletePredicates from IngesterQueryResponse

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-18 17:42:49 +00:00
Marco Neumann f54ef92b77
fix: supervise and shutdown ingester background tasks (#3769)
* fix: supervise and shutdown ingester background tasks

Closes #3761.
Closes #3762.

* docs: improve wording

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>

* test: join/shutdown handling for ingester

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
2022-02-18 09:35:29 +00:00
Paul Dix 23b3942306
fix: compact persisting panics on single row (#3784) 2022-02-17 18:33:04 +00:00
Carol (Nichols || Goulding) 90da060156
feat: Add namespace and sequencer id fields to IngesterQueryRequest protobuf (#3766)
Fixes #3753.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-16 19:21:15 +00:00
Nga Tran ea814e9aa4
feat: API and steps to prepare data to send back to the Querier per its request (#3756)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-16 02:45:58 +00:00
Paul Dix f542045485
feat: wire up persistence in ingester (#3685)
This adds persistence into the ingester with a lifecycle manager. The persist operation must still be updated to keep track of the min_unpersisted_sequence_number for each sequencer.
2022-02-16 00:13:40 +00:00
Nga Tran 0b3f76462d
feat: build Query Plan that queries QueryableBatch with filters (#3742)
* feat: initial implementaion the Query Plan that query QueryableBatch with filters

* fix: read_filter of QueryableBatch should provide the shema of the columns/projection it needs

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* chore: address review comment

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-02-15 16:06:26 +00:00
Marco Neumann 44ee0166a0
fix: start Kafka write buffer stream at "earliest" offset, not at "0" (#3748) 2022-02-15 13:36:59 +00:00
Andrew Lamb a30803e692
chore: Update datafusion, update `arrow`/`parquet`/`arrow-flight` to 9.0 (#3733)
* chore: Update datafusion

* chore: Update arrow

* fix: missing updates

* chore: Update cargo.lock

* fix: update for smaller parquet size

* fix: update test for smaller parquet files

* test: ensure parquet_file tests write multiple row groups

* fix: update callsite

* fix: Update for tests

* fix: harkari

* fix: use IoxObjectStore::existing

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-15 12:10:24 +00:00
dependabot[bot] 89105ccfab
chore(deps): Bump tokio-util from 0.6.9 to 0.7.0 (#3743)
Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.6.9 to 0.7.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/commits)

---
updated-dependencies:
- dependency-name: tokio-util
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-15 11:33:41 +00:00
Marco Neumann c6e374a025
feat: allow catalog access w/o a transaction (#3735)
* feat: allow catalog access w/o a transaction

Now the caller has the full control if they want to use a transaction or
not.

* fix: remove non-transaction-safe `create_many`

* fix: remove unnecessary transactions
2022-02-15 10:15:36 +00:00
Carol (Nichols || Goulding) 85aa019f50
feat: Turn protobuf predicates into predicate::Predicate (#3707)
* feat: Turn protobuf predicates into predicate::Predicate

* fix: Take buf lint's suggestions

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-14 17:56:56 +00:00
Nga Tran d1c71ba5d8
feat: predicate pushdown for Ingester's QueryableBatch (#3728)
* feat: predicate pushdown for Ingester's QueryableBatch

* chore: comment cleanup

* chore: Apply suggestions from code review

Co-authored-by: Edd Robinson <me@edd.io>

* refactor: address review comments

Co-authored-by: Edd Robinson <me@edd.io>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-14 17:28:52 +00:00
Nga Tran d3bd03e37a
feat: Support Projection Pushdown for a QueryableBatch (#3712)
* feat: projection pushdown for QueryableBatch

* chore: clean up and remove unwrap

* fix: Add Sync to a Snafu source to have the code compile

* chore: cleanup and add comments for tests

* refactor: Add tests for scanning non existing columns and fix related bugs

* chore: modify comment to trigger auto check in github work
2022-02-10 19:29:21 +00:00
Carol (Nichols || Goulding) 73828323ac
feat: Ingester Flight gRPC API (#3623)
* feat: Add a way to run ingester with an in-memory catalog from the CLI

If you set the --catalog-dsn string to "mem", rather than using that as
a Postgres connection URL, create an in-memory catalog.

Planning on using this in tests, so not documenting.

* fix: Set default topic to the same value as SHARED_KAFKA_TOPIC

Namely, both should use an underscore. I don't think there's a way to
directly share these values between a constant and an annotation.

* feat: Add a flight API (handshake only) to ingester

* fix: Create partitions if using file-based write buffer

* fix: Change the server fixture to handle ingester server type

For now, the ingester doesn't implement the deployment API. Not sure if
it should or not.

* feat: Start implementing ingester do_get, namely decoding the query

Skip serialization of the predicate for the moment.

* refactor: Rename ingest protos to ingester to match crate name

* refactor: Rename QueryResults to QueryData

* feat: Move ingester flight client to new querier crate

* fix: Off by one error, different starting indexes in sequencers

* fix: Create new CLI argument to pick the catalog type

* fix: Create a CLI option to set the number of topics to auto-create in the write buffer

* fix: Check the arrow flight service's health to tell that the ingester gRPC is up

* fix: Set postgres as the default catalog type

* fix: Return an error rather than panicking if CLI args aren't right
2022-02-09 19:07:44 +00:00
Paul Dix 59b2141c0b
feat: Add lifecycle manager to ingester (#3645)
This adds the lifecycle manager to the ingester. It will trigger based on a threshold for max partition size or age or based on keeping total memory under a certain threshold.

It defines a new interface for a persister, which is stubbed out for IngesterData. I'm not sure yet how persistence errors should be handled. The assumption here is that the persister continues to retry persistence forever until it succeeds.

There is one scenario I can think of that may cause this lifecycle manager problems. If a single partition is very high throughput, it could cause things to back up as persistence is not parallelized within a single partition. Any given partition can currently only run one persistence operation at a time. We can address this later.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-08 15:23:40 +00:00
Marco Neumann 5de4d6203f
refactor: catalog transaction (#3660)
* refactor: catalog Unit of Work (= transaction)

Setup an inteface to handle Units of Work within our catalog. Previously
both the Postgres and the in-mem backend used "mini-transactions on
demand". Now the caller has a clear way to establish boundaries and
gets read and write isolation. A single `Arc<dyn Catalog>` can create as
many `Box<dyn UnitOfWork>` as you like, but note that depending on the
backend you may not scale infinitely (postgres will likely impose
certain limits and the in-mem backend limits concurrency to 1 to keep
things simple).

* docs: improve wording

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* refactor: rename Unit of Work to Transaction

* test: improve `test_txn_isolation`

* feat: clearify transaction drop semantics

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-08 13:38:33 +00:00
Marco Neumann 977ccc1989
fix: use a single metric registry for ingester (#3652)
With this change write buffer ingestion metrics are showing up under
`/metrics`

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-07 15:56:54 +00:00
Carol (Nichols || Goulding) 2e30483f1f
refactor: Remove predicate module from predicate crate (#3648)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-07 14:54:07 +00:00
Marco Neumann e2db1df11f
refactor: improve writer buffer consumer interface (#3631)
* refactor: improve writer buffer consumer interface

The change looks huge but is actually rather simple. To
understand the interface change, let me first explain what we want:

- be able to fetch watermarks for any sequencer
- have streams:
  - each streams tracks a sequencer and has an offset state (no read
    multiplexing)
  - we can seek a stream
  - seeking and streaming cannot be done at the same time (that would be
    weird and likely leads to many bugs both in write buffer and in the
    user code)
- ideally we don't need to create streams of all sequencers but can
  choose a subset

Before this change we had one mutable consumer struct where you can get
all streams and watermark functions (this mutable-borrows the consumer)
or you can seek a single stream (this also mutable-borrows the
consumer). This is a bit weird for multiple reasons:

- you cannot seek a single stream without dropping all of them
- the mutable-borrow construct makes it really difficult to pass the
  streams into separate threads
- the consumer is boxed (because its mutable) which makes it more
  difficult to handle in a large-scale application

What this change does is the following:

- you have an immutable consumer (similar to the producer)
- the consumer offers the following methods:
  - get the set of sequencer IDs
  - get watermark for any sequencer
  - get a stream handler (see next point) for any sequencer
- the stream handler captures the stream state (offset) and provides you
  a standard `Stream<_>` interface as well as a seek function.
  Mutable-borrows ensure that you cannot use both at the same time.

The stream handler provides you the stream via `handler.stream()`. It
doesn't implement `Stream<_>` itself because the way boxing, dynamic
dispatch work, and pinning interact (i.e. I couldn't get it to work
without the indirection).

As a bonus point (which we don't use however) you can now create
multiple streams for the same sequencer and they all have their own
offset.

* fix: review comments

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-07 12:24:17 +00:00
Paul Dix ce46bbaada
feat: wire up the write buffer to the ingester process (#3533)
This adds the scaffolding for the ingester server to consume data from Kafka. This ingests data in an in memory structure while creating records in the catalog for any partitions that don't yet exist.

I've removed catalog_update.rs in ingester for now. That was mostly a placeholder and will be going in a combination of handler.rs and data.rs on my next PR which will have some primitive lifecycle wired up.

There's one ugly bit here where the DML write is cloned because it's getting borrowed to output spans and metrics. I'll need to follow up with a refactor to make it so that the DML write's tables can be consumed without it gumming up the metrics stuff.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-03 11:47:28 +00:00
Marco Neumann 22778a3a80
chore: upgrade rskafka and parking_lot (#3592) 2022-02-01 11:50:42 +00:00
kodiakhq[bot] 8bef2c105c
Merge branch 'main' into cn/persist 2022-01-31 18:50:45 +00:00
Andrew Lamb 7b96a37165
chore: Update datafusion (#3586)
* chore: update DataFusion to f849968057ddddccc9aa19915ef3ea56bf14d80d

* fix: reduce overhead of creating physical expressions

* chore: use MemTrackingMetrics

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-01-31 18:15:28 +00:00
Carol (Nichols || Goulding) 4006dc14b3
fix: Correct typo in function name 2022-01-31 10:48:30 -05:00
Carol (Nichols || Goulding) 749989a937
refactor: Simplify type, eliminating empty vec creation
If there aren't any record batches, there isn't any metadata, and vice
versa. Make this relationship clearer by putting the Option around both
the vec of recordbatches and the metadata.
2022-01-31 10:48:30 -05:00
Carol (Nichols || Goulding) 093d5acfd4
fix: Unify temporary multiple definitions of IoxMetadata 2022-01-31 10:48:29 -05:00
Carol (Nichols || Goulding) 8f81ce5501
refactor: Share parquet_file::storage code between new and old metadata 2022-01-31 10:36:33 -05:00
Carol (Nichols || Goulding) bf89162fa5
refactor: Move IoxMetadata to parquet_file 2022-01-31 10:36:33 -05:00
Carol (Nichols || Goulding) dd9620da0c
feat: Create a new proto definition for the new design's IoxMetadata 2022-01-31 10:36:32 -05:00
Carol (Nichols || Goulding) 81647f253c
feat: Use IoxMetadata and a list of RecordBatches 2022-01-31 10:36:32 -05:00
Carol (Nichols || Goulding) fef968f75c
fix: Remove catalog insertion; will be handled elsewhere 2022-01-31 10:36:32 -05:00
Carol (Nichols || Goulding) 8b47ad6885
test: Add more tests 2022-01-31 10:36:32 -05:00
Carol (Nichols || Goulding) d413157b99
feat: Extract a fn for creating the parquet file paths and test it 2022-01-31 10:36:32 -05:00
Carol (Nichols || Goulding) 5e0e0d8aa7
feat: Write parquet to object storage in a similar way as parquet_file::Storage 2022-01-31 10:36:32 -05:00
Carol (Nichols || Goulding) ea18c71e6d
feat: Create an object store path for a new parquet file 2022-01-31 10:36:32 -05:00
Carol (Nichols || Goulding) c633c9bc5c
feat: Wire object store into ingester persistence 2022-01-31 10:36:30 -05:00
Nga Tran ac247e4de5
feat: update catalog after persistence (#3581)
* feat: update catalog after persistence

* test: add a negative test for the update catalog

* chore: add IDs into the messages

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

* chore: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

* refactor: address review comments

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-01-31 15:23:16 +00:00
Nga Tran 8735ede74f
feat: IoxMetadata for parquet file (#3547)
* feat: IoxMetadata for parquet file

* fix: typos

* refactor: address review comments

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-01-28 14:41:59 +00:00
Nga Tran fb33a88dc8
test: Delete application during Ingester's compaction (#3542)
* test: Delete application during Ingester's compaction

* fix: typos

Co-authored-by: Andrew Lamb <alamb@influxdata.com>

* chore: remove comments

Co-authored-by: Andrew Lamb <alamb@influxdata.com>

Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-01-27 16:53:17 +00:00
Andrew Lamb 5488c257d1
chore: Update datafusion, upgrade to arrow/parqet/arrow-flight 8.0.0 (#3517)
* chore: Update datafusion

* chore: update to arrow 8

* fix: update to use new DataFusion APIs

* fix: update case for sortedness

* fix: cargo hakari
2022-01-27 13:33:27 +00:00
Carol (Nichols || Goulding) bc44d33108
feat: Implement a snapshot method on DataBuffer (#3518)
* feat: Implement a snapshot method on DataBuffer

Fixes #3510.

* test: Add a test snapshotting batches with different but compatible schemas

* fix: Simplify min/max sequencer number collection

The first batch should always have the min sequencer number. The last
batch should always have the max sequencer number. The min should always
be less than (or equal to, in case there's only one batch) the max.
2022-01-26 15:22:51 +00:00
Nga Tran 52866fe6a9
fix: merge record batches into one batch (#3535)
* fix: merge record batches into one batch

refactor: address review comments

* chore: update test output
2022-01-25 23:29:16 +00:00
Nga Tran d559561fd7
refactor: have the deduplicate work without chunk statistics (#3519)
* refactor: have the deduplicate work without chunk statistics

* test: more tests for duplicates data on different combinations of record batches

* refactor: address review comments
2022-01-25 17:00:25 +00:00
NGA-TRAN c6a195b0e6 refactor: address review comments 2022-01-24 13:05:44 -05:00
NGA-TRAN 797ba459b9 chore: merge main to branch 2022-01-24 12:06:23 -05:00
NGA-TRAN 939ea536d4 feat: add but ignore a few compaction tests 2022-01-24 12:00:23 -05:00
NGA-TRAN ee0a468b4d feat: a few tests for compaction 2022-01-21 18:15:23 -05:00
Paul Dix bb893510a0 feat: Add scaffolding for ingester server
* Adds a new ingester command to start an ingester server
* Moves previous ingester server over to handler
* Skeleton for gRPC and HTTP handlers
2022-01-21 18:02:19 -05:00
NGA-TRAN fa41067e3d refactor: for paul 2022-01-21 16:50:49 -05:00
NGA-TRAN cd01b141f3 refactor: for paul 2022-01-21 16:49:02 -05:00
Paul Dix bfa54033bd refactor: Clean up the Catalog API
This updates the catalog API to make it easier to work with for consumers. I also found a bug in the MemCatalog implementation while refactoring the tests to work with the new API definition. Consumers will now be able to Arc wrap the catalog and use it across awaits.
2022-01-21 16:01:13 -05:00
NGA-TRAN 191adc9fc7 feat: initial implementation for ingester's compaction 2022-01-20 18:22:41 -05:00
NGA-TRAN 029f4bb41e fix: comment 2022-01-19 18:11:00 -05:00
NGA-TRAN dcf952bb27 chore: merge main to branch 2022-01-19 17:59:05 -05:00
NGA-TRAN 4ede10b3a0 refactor: add new fields and comments in ingest data buffer 2022-01-19 17:53:58 -05:00
Paul Dix 860e5a30ca refactor: update ingester to get sequencer record and not attempt to create 2022-01-19 17:15:10 -05:00
NGA-TRAN be3e523312 fix: use PersistingBatch 2022-01-19 13:25:03 -05:00
NGA-TRAN 9977f174b7 refactor: use wrapper ID 2022-01-19 12:51:04 -05:00
NGA-TRAN edb97f51cf refactor: add persisting struct 2022-01-19 12:36:18 -05:00
NGA-TRAN 8a17e1c132 refactor: address review comments 2022-01-19 11:20:20 -05:00
NGA-TRAN fe9a41ee9a chore: remove non-longer needed dependency 2022-01-18 21:45:20 -05:00
NGA-TRAN b89c250ccc refactor: use RepoColection instead of MemCatalog 2022-01-18 21:39:22 -05:00
NGA-TRAN b57f027e35 refactor: address review comments 2022-01-18 20:57:13 -05:00
NGA-TRAN 367a9fb812 fix: add workspace-hack 2022-01-18 18:10:42 -05:00
NGA-TRAN 1c970a2064 fix: format 2022-01-18 18:01:47 -05:00
NGA-TRAN 667ec5bfc5 fix: the code is now compile without warnings 2022-01-18 18:01:06 -05:00
NGA-TRAN b20d1757d0 feat: initialize ingester data 2022-01-18 17:43:03 -05:00
NGA-TRAN 125285ae9a feat: commit in order to pull and merge new commit from main 2022-01-18 16:11:25 -05:00
NGA-TRAN 23290fd2ff fix: new data structures suggested by reviewers 2022-01-18 14:04:07 -05:00
NGA-TRAN ef336b4659 feat: add ingester crate and a few basic data structures for its data lifecycle 2022-01-17 15:38:03 -05:00