Commit Graph

49 Commits (4509e3db57096587abb27108495d5595c61afb3c)

Author SHA1 Message Date
Andrew Lamb 3592aa52d8
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0` (#4743)
* chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0`

* chore: Update APIs

* chore: Run cargo hakari tasks

* feat: normalize parquet file metadata

* chore: update size tests

* chore: add docs on metadata stripping

* chore: TEMP UPDATE TO DF BRANCH

* chore: Update for new API

* fix: Update to latest DF

* fix: cargo hakari

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
2022-06-03 10:32:26 +00:00
Andrew Lamb 633117e595
feat: avoid catalog access on each query (#4650)
* feat: cache catalog access on query

* fix: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
2022-05-26 20:44:22 +00:00
Marco Neumann a08a91c5ba
fix: ensure querier cache is refreshed for partition sort key (#4660)
* test: call `maybe_start_logging` in auto-generated cases

* fix: ensure querier cache is refreshed for partition sort key

Fixes #4631.

* docs: explain querier sort key handling and test

* test: test another version of issue 4631

* fix: correctly invalidate partition sort keys

* fix: fix `table_not_found_on_ingester`
2022-05-25 10:44:42 +00:00
Andrew Lamb e877a64462
feat: Add `ParquetFiles` cache and memory size estimation for ParquetMetadata (#4661)
* feat: Add `ParquetFiles` cache

* fix: Apply suggestions from code review

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>

* fix: remove commented out debugging println

* refactor: Improve size calculation

* fix: mark `ParquetFileCache::clear` test only

* fix: assert on metric count

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
2022-05-23 17:11:38 +00:00
Dom Dwyer 2e6c49be83 refactor: remove IoxMetadata min & max timestamp
Removes the min/max timestamp fields from the IoxMetadata proto
structure embedded within a Parquet file's metadata.

These values are redundant as they already exist within the Parquet
column statistics, and precluded streaming serialisation as these
removed min/max values were needed before serialising the file.
2022-05-23 16:27:08 +01:00
Dom Dwyer a142a9eb57 refactor: remove row_count from IoxMetadata
Remove the redundant row_count from the IoxMetadata structure that is
serialised into the Parquet file.

The reasoning is twofold:

    * The Parquet file's native metadata already contains a row count
    * Needing to know the number of rows up-front precludes streaming
2022-05-23 16:18:35 +01:00
Dom Dwyer b9a745d42d feat: RecordBatch stream to Parquet file upload
Implements an upload() method on the ParquetStorage type, consuming a
stream of RecordBatch, serialising the Parquet file, and uploading the
result to object storage. Returns the IOx-specific file metadata.

Currently while the upload() method accepts a stream of RecordBatch, the
actual resulting Parquet file is buffered in memory before uploading to
object store, due to lack of streaming upload functionality in the
ObjectStore abstraction - this isn't the end of the world, as the files
tend to be relatively small with our current usage.

This impl should be easily modified to be fully streaming once streaming
object store puts are implemented:

    https://github.com/influxdata/object_store_rs/issues/9
2022-05-20 15:17:40 +01:00
Dom Dwyer baa86d846f refactor: use ParquetStore instead of ObjectStore
Changes the code paths that interact with Parquet files in the object
store to reference the ParquetStorage directly (DRY refactor).

This change takes us from a dependency graph of:

                    ┌─────────────────┐
                    │                 │
                    ▼                 │
            Parquet Consumer          │
                    │         ┌──────────────┐
                    ├────────▶│ParquetStorage│
                    ▼         └──────────────┘
            ┌──────────────┐
            │ ObjectStore  │
            └──────────────┘
                    │
               ┌────┴────┐
               ▼         ▼
             File       s3
            System    (etc)

to:

                Parquet Consumer
                        │
                        ▼
                ┌──────────────┐
                │ParquetStorage│
                └──────────────┘
                        │
                        ▼
                ┌──────────────┐
                │ ObjectStore  │
                └──────────────┘
                        │
                   ┌────┴────┐
                   ▼         ▼
                 File       s3
                System    (etc)

With the ParquetStorage being solely responsible for managing
interactions with the object store when dealing with Parquet files.
2022-05-19 13:52:51 +01:00
Dom Dwyer d3548653d5 refactor: rename Storage -> ParquetStorage
Renames the Storage type so the context is clear in usage (i.e. fn
args), rather than having to rely on knowing the fully-qualified import
path to know what the type stores.
2022-05-19 13:51:07 +01:00
Marco Neumann 52346642a0
ci: fix cargo deny (#4629)
* ci: fix cargo deny

* chore: downgrade `socket2`, version 0.4.5 was yanked

* chore: rename `query` to `iox_query`

`query` is already taken on crates.io and yanked and I am getting tired
of working around that.
2022-05-18 09:38:35 +00:00
Andrew Lamb 3a33e806c7
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `14.0.0` (#4619)
* chore: Update datafusion deps

* chore: update arrow/parquet/arrow flight deps

* chore: Run cargo hakari tasks

* chore: Update location of utils

* chore: Update some more APIs

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-05-17 14:13:03 +00:00
Raphael Taylor-Davies f2bb0fdf77
feat: update to crates.io object_store version (#4595)
* feat: update to crates.io object_store version

* chore: Run cargo hakari tasks

* fix: tests

* chore: remove object store integration test plumbing

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-05-13 16:26:07 +00:00
Carol (Nichols || Goulding) 55313d290a
fix: Update or remove comments that mention NG or OG
Connects to #4450.
2022-05-12 16:09:08 -04:00
Raphael Taylor-Davies 8b379c83cc
refactor: simplify object_store path handling (#4534)
* refactor: simplify object_store path handling

* fix: aws integration tests

* chore: lint

* fix: update gcs tests

* refactor: move errors into submodules

* chore: lint

* chore: review feedback

* refactor: replace provider with Display

* fix: failing tests

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-09 18:43:22 +00:00
Carol (Nichols || Goulding) 077884c925
fix: Remove allow dead_code annotations from undead code 2022-05-06 16:58:02 -04:00
Carol (Nichols || Goulding) 068096e7e1
fix: Rename data_types2 to data_types 2022-05-06 14:45:39 -04:00
Carol (Nichols || Goulding) ea46830954
fix: Remove iox_object_store crate; move ParquetFilePath to parquet_file 2022-05-06 14:45:36 -04:00
Andrew Lamb 02893e598c
chore: Update datafusion and upgrade arrow/parquet/arrow-flight to 13 (#4516)
* chore: Tool for automating arrow version update

* chore: Update datafusion and arrow/parquet/arrow-flight

* fix: update for changes in Arrow API

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-05 00:21:02 +00:00
Nga Tran fa2c1febf4
feat: use stored partition sort key to deduplicate data (#4360)
* feat: use stored sort key to deduplicate data

* refactor: verify if one is a super sort key of the other

* test: unit tests for scan and deduplication plans

* fix: typo

* refactor: refactor and add comments

* feat: cache partition sort key to read during planning as needed

* test: tests for query plans with different overlap groups

* chore: cleanup

* chore: resolve merge conflicts

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-26 20:36:32 +00:00
Marco Neumann bd600bbac6
refactor: allow ingester to be integrated into query tests (#4427)
* refactor: improve `IngesterData` public interface

* feat: impl `Debug` for `Test{Namespace,Sequencer}`

* refactor: trait interface for `LifecyleHandle`

This is required to mock the lifecycle for query tests.

* refactor: trait for partitioner
2022-04-26 13:44:30 +00:00
Marco Neumann 2337935660
test: chunks in ingester stage (#4415)
* refactor: document and improve `MockIngesterConnection`

* refactor: split `OldOneMeasurementFourChunksWithDuplicates` for `EXPLAIN` queries

* fix: mark "IngsterPartition" chunks as unsorted

* fix: "group by" queries may require sorted comparison

* refactor: re-export a few more types from querier

* fix: ensure that test parquet files are de-duped

* test: chunks in ingester stage

* docs: explain test code
2022-04-26 07:55:19 +00:00
二手掉包工程师 4b47d723b1
refactor: Rename time to iox_time (#4416)
Signed-off-by: hi-rustin <rustin.liu@gmail.com>

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-26 00:19:59 +00:00
Marco Neumann f444e63960
test: include materialized delete predicates in NG query tests (#4371)
* refactor: move `batch_filter` to `datafusion_util`

* fix: outdated docstring

* feat: allow passing record batches to `iox_tests` parquet files

* test: include materialized delete predicates in NG query tests

* docs: improve wording

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-21 13:00:13 +00:00
Andrew Lamb 73bed810da
chore: Update arrow, arrow-flight, parquet, tonic, prost, etc (#4357)
* chore: Update datafusion

* chore: Update arrow/arrow-flight/parquet to 12

* chore: update datafusion correctly

* chore: Update prost, tonic, and dependents

* fix: Fixup some api changes

* fix: Update test output in db

* fix: Update test output in parquet_file

* fix: remove old pbjson types

* fix: Add "--experimental_allow_proto3_optional" flag

* chore: Run cargo hakari tasks

* fix: compile error

* chore: Update heappy

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-20 11:12:17 +00:00
Marco Neumann d711816548
feat: add sequencer ID and correct partition key to `IngesterPartition` (#4348)
* feat: impl `Debug` for `TestCatalog`

* feat: add sequencer ID and correct partition key to `IngesterPartition`

- simplifies debugging (parquet chunks and ingester chunks now use the
  same partition key naming)
- the sequencer ID is required to correctly reason about tombstones (to
  be implemented in a later PR)

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-19 15:59:49 +00:00
Nga Tran 2a601c3099
fix: Revert "chore: Revert "fx: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)" (#4327)" (#4328)
* fix: Revert "chore: Revert "fx: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)" (#4327)"

This reverts commit 7e5d719027.

* chore: resolve merge conflict

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-18 15:27:39 +00:00
Nga Tran 7e5d719027
chore: Revert "fix: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)" (#4327)
This reverts commit fe8d9948d5.
2022-04-14 17:11:55 +00:00
Carol (Nichols || Goulding) fe8d9948d5
fix: Revert "fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)" (#4303)
This reverts commit 7ddbf7c025.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-14 15:42:28 +00:00
Carol (Nichols || Goulding) 94dcde4996
fix: Do fewer queries for metadata
By adding another _with_metadata catalog function. Also introduce a new
type rather than passing around tuples everywhere.
2022-04-13 10:43:20 -04:00
Carol (Nichols || Goulding) 02fee3b84f
feat: Request parquet metadata from the catalog when needed only 2022-04-13 10:43:19 -04:00
Carol (Nichols || Goulding) 7ddbf7c025
fix: Revert "feat: Use the sort key stored in the catalog during compaction" (#4299)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-13 14:11:10 +00:00
Carol (Nichols || Goulding) d75b161f44
feat: Add some logging when changing sort key in tests 2022-04-11 14:09:46 -04:00
Carol (Nichols || Goulding) 628d54836a
refactor: Use a more specific function name 2022-04-11 14:09:46 -04:00
Carol (Nichols || Goulding) 55fe3b8d50
feat: Use the sort key stored in the catalog during compaction
Fixes #4249.
2022-04-11 14:09:45 -04:00
Nga Tran a6eb83d47d
feat: compact small contiguous files of the same partition even if they do not overlap (#4197)
* feat: compact small contiguous files of the same partition even if they do not overlap

* test: more tests

* chore: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

* refactor: address review comments

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
2022-04-01 15:26:43 +00:00
Nga Tran ddc2c8304f
fix: have the compaction level set correctly (#4184)
* fix: have the compaction level set correctly, especially for compacted file from the compactor

* fix: typo
2022-03-30 21:23:40 +00:00
Carol (Nichols || Goulding) f3f792fd08
feat: Add namespace_id to the parquet_files table; object store paths need it 2022-03-29 08:15:26 -04:00
Marco Neumann ec88df6cb7
feat: cache namespace schemas (#4142)
* feat: `TestColumn` for IOx tests

* feat: cache namespace schemas

In preparation to remove the querier sync loop completely.

Ref #4123.
2022-03-29 08:19:21 +00:00
Nga Tran 80b7e9cce1
feat: delete fully processed tombstones & integration tests for find_and_compact (#4116)
* feat: remove fully processed tombstones

* test: first few tests

* fix: delete SQL

* fix: test how IN (...) works in PG

* fix: test how IN (?) works in PG

* fix: test how IN (?) works in PG

* fix: dynamically add  IN (?, ?, ...)

* fix: dynamically add  IN (?, ?, ...) & its dynamic values

* fix: add argument directly in the SQL

* test: more tests for catalog read and update functions

* chore: move a subfunction to make it easier to read)

* test: first test for find_can_compact but disabled due to bug

* test: integration tests and a bug fix for find_and_compact

* chore: cleanup

* refactor: address review comments

* fix: put 2 delete processed  tombstones and tombstones in a transaction
2022-03-28 18:35:54 +00:00
Andrew Lamb 5c69a3f43b
chore: Update deps: datafusion, arrow/arrow-flight/parquet to 11, zstd to 0.11 (#4119)
* chore: update datafusion

* chore(deps): Bump arrow from 10.0.0 to 11.0.0

Bumps [arrow](https://github.com/apache/arrow-rs) from 10.0.0 to 11.0.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/10.0.0...11.0.0)

---
updated-dependencies:
- dependency-name: arrow
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): Bump arrow-flight from 10.0.0 to 11.0.0

Bumps [arrow-flight](https://github.com/apache/arrow-rs) from 10.0.0 to 11.0.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/10.0.0...11.0.0)

---
updated-dependencies:
- dependency-name: arrow-flight
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: update parquet to 11.0.0

* fix: error on create schema, test for same

* fix: upgrade zstd

* chore: Run cargo hakari tasks

* fix: fix logical merge conflict

* fix: hakari

* fix: hakari

* fix: update newly introduced dep

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-24 15:27:36 +00:00
Marco Neumann 51da6dd7fa
feat: store sort key in NG metadata (#4110)
The sort key is optional and currently only produced by `iox_tests`.
Writing it within the ingester/compactor is tracked by #3968. The sort
key is read by the querier (and this will be verified by the query tests
and is required to merge #4103).

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-23 18:24:46 +00:00
Nga Tran 886f9dc8c1
feat: split compacted data into 2 compacted sets (#4088)
* feat: split compacted data into 2 compacted sets

* chore: clean up

* refactor: address review comments

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-22 13:28:32 +00:00
Dom Dwyer 5585dd3c21 refactor: switch to using DynObjectStore
Changes all consumers of the object store to use the dynamically
dispatched DynObjectStore type, instead of using a hardcoded concrete
implementation type.
2022-03-15 16:32:52 +00:00
Dom Dwyer 1d5066c421 refactor: rename ObjectStore -> ObjectStoreImpl
Frees up the name for so we can use `dyn ObjectStore` throughout the
code instead of `ObjectStoreApi`.
2022-03-15 16:29:43 +00:00
Marco Neumann 4b5cf6a70e
feat: cache processed tombstones (#4030)
* test: allow to mock time in `iox_test`

* feat: cache processed tombstones

For #3974.

* refactor: introduce `TTL_NOT_PROCESSED`
2022-03-15 10:28:08 +00:00
Nga Tran 5a29d070ea
feat: Implement the compact function for NG Compactor (#4001)
* feat: initial implementation of compact a given list of overlapped parquet files

* feat: Add QueryableParquetChunk and some refactoring

* feat:  build queryable parquet chunks for parquet files with tombstones

* feat: second half the implementation for Compactor's compact. Tests will be next

* fix: comments for trait funnctions fof QueryChunkMeta

* test: add tests for compactor's compact function

* fix: typos

* refactor: address Jake's review comments

* refactor: address Andrew's comments and add one more test for files in different order in the vector

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-11 20:25:19 +00:00
Carol (Nichols || Goulding) ecd06c6ec3
fix: ParquetFileRepo create should be responsible for setting INITIAL_COMPACTION_LEVEL
When created in the catalog, parquet files should always have compaction
level 0. Updating the compaction level should always happen in the
compactor.

Only the catalog should need to know about the initial compaction level
value.
2022-03-10 13:51:18 -05:00
Carol (Nichols || Goulding) ff31407dce
refactor: Extract a ParquetFileParams type for create
This has the advantages of:

- Not needing to create fake parquet file IDs or fake deleted_at
  values that aren't used by create before insertion
- Not needing too many arguments for create
- Naming the arguments so it's easier to see what value is what
  argument, especially in tests
- Easier to reuse arguments or parts of arguments by using copies of
  params, which makes it easier to see differences, especially in tests
2022-03-10 13:51:18 -05:00
Nga Tran f03ebd79ab
refactor: move querier's test utils to a new crate to get reused by tests in other crates (#4013)
* refactor: move querier's test utils to a new crate to be able resued by tests in other crates

* chore: remove unused import
2022-03-10 18:17:58 +00:00