Commit Graph

135 Commits (4509e3db57096587abb27108495d5595c61afb3c)

Author SHA1 Message Date
Nga Tran b60e1be0cf
chore: remove irrelaevant comments (#4791) 2022-06-07 00:43:56 +00:00
Nga Tran 3e89daa0d4
feat: compact all overlapped files no matter how large they are (#4779)
* feat: add an option to compact all overlapped files no matter how large they are

* chore: Apply suggestions from code review

* feat: always compact oerlapped files no matter how large they are

* chore: cleaup
2022-06-06 23:39:09 +00:00
dependabot[bot] 04c685b3b7
chore(deps): Bump tokio-util from 0.7.2 to 0.7.3 (#4784)
Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.7.2 to 0.7.3.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-util-0.7.2...tokio-util-0.7.3)

---
updated-dependencies:
- dependency-name: tokio-util
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-06 14:46:27 +00:00
dependabot[bot] e03bf94420
chore(deps): Bump tokio from 1.18.2 to 1.19.1 (#4783)
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.18.2 to 1.19.1.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.18.2...tokio-1.19.1)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-06 14:15:12 +00:00
Carol (Nichols || Goulding) aa510ae4e6
fix: Remove test uses of parquet chunks and document as unused
The querier is now using read buffer chunks only, but we're leaving the
parquet chunk code around for the moment.
2022-06-03 09:16:04 -04:00
Andrew Lamb 3592aa52d8
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0` (#4743)
* chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `15.0.0`

* chore: Update APIs

* chore: Run cargo hakari tasks

* feat: normalize parquet file metadata

* chore: update size tests

* chore: add docs on metadata stripping

* chore: TEMP UPDATE TO DF BRANCH

* chore: Update for new API

* fix: Update to latest DF

* fix: cargo hakari

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
2022-06-03 10:32:26 +00:00
dependabot[bot] 9a21292db8
chore(deps): Bump async-trait from 0.1.53 to 0.1.56 (#4774)
Bumps [async-trait](https://github.com/dtolnay/async-trait) from 0.1.53 to 0.1.56.
- [Release notes](https://github.com/dtolnay/async-trait/releases)
- [Commits](https://github.com/dtolnay/async-trait/compare/0.1.53...0.1.56)

---
updated-dependencies:
- dependency-name: async-trait
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-03 09:10:40 +00:00
Ryan Russell d279deddad
docs(various): Improve Readability (#4768)
Signed-off-by: Ryan Russell <git@ryanrussell.org>

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-02 18:01:06 +00:00
Nga Tran 79895b995c
chore: add debug info to see how many concurrent partitions being compacted in each cycle (#4772)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-02 15:19:08 +00:00
Dom Dwyer 9ae58c89b6 refactor: constructor for ParquetFileWithTombstone
Use a constructor to initialise a ParquetFileWithTombstone struct,
rather than making the fields pub.

This allows IDEs to "go to" places where this is constructed when
browsing the code, but also keeps the type closed for modification of
internals (SOLID).
2022-06-01 15:58:06 +01:00
Nga Tran 79220720be
chore: increase size of a compactor job and level of concurrency (#4746)
* fix: let us not compact no-data

* fix: split time must be greater min_time, too

* fix: resolve merge conflict

* chore: increase size of a compactor job and level of concurrency

Co-authored-by: Dom <dom@itsallbroken.com>
2022-05-31 19:57:06 +00:00
Nga Tran dfd35c05a1
fix: let us not compact no-data (#4744)
* fix: let us not compact no-data

* fix: split time must be greater min_time, too

* fix: resolve merge conflict

Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-31 17:02:14 +00:00
Dom Dwyer 70864b9f48 refactor: always use correct chunk sort key
Don't use the same sort key for all files - sort keys may grow over
time, and the information is already at hand.
2022-05-30 17:41:41 +01:00
Dom Dwyer 6aa2a6958a refactor: assert consistent parquet file metadata
Assert consistent metadata when evaluating candidate parquet files for
compaction.

Asserts all files have the same:
    * Sequencer ID
    * Namespace ID
    * Table ID
    * Partition ID
    * Sort key
2022-05-30 17:41:41 +01:00
Dom Dwyer 0f16d6cabb refactor: consistent SortKey source
Changes the compaction logic to always reference the same SortKey
instance, rather than repeatedly querying for it.

The Partition metadata is always read from the catalog as part of
compact_partition(), where it previously threw away all metadata except
the sort key, which was passed into compact(). Then compact() would
always re-query the catalog to look up just the sort key again, and mix
up the two instances during use - one passed into the fn, one freshly
queried within the fn.

Now the Partition metadata is resolved in compact_partition() as it was
previously, but the entire Partition reference is passed to compact(),
and this is consistently used do access the sort key. This also removes
a catalog query per compaction call.
2022-05-30 17:41:41 +01:00
kodiakhq[bot] 842ef8e308
Merge branch 'main' into cn/fetch-from-parquet-file 2022-05-27 17:08:28 +00:00
Andrew Lamb dde3c3922c
refactor: use consistent spelling of serialize (#4717) 2022-05-27 14:42:59 +00:00
Nga Tran ea81152fac
refactor: add partition ID into debug info and panic earlier to identify the bug easier (#4716)
* chore: point tests to the new ticket

* chore: cleanup

* refactor: add partition ID into debug info and panic earlier to identify the bug easier
2022-05-27 12:20:36 +00:00
Carol (Nichols || Goulding) 5fd3ffc17f
refactor: Rename ParquetChunkAdapter to only ChunkAdapter
It might be creating chunks of different kinds other than ParquetChunks.
2022-05-26 16:52:14 -04:00
Carol (Nichols || Goulding) df10452e2e
refactor: Rename methods from new_querier_chunk to new_querier_parquet_chunk 2022-05-25 17:19:10 -04:00
Nga Tran 6cc767efcc
feat: teach compactor to compact smaller number of files (#4671)
* refactor: split compact_partition into two functions to handle concurrency better

* feat: limit number of files to compact

* test: add test for limit num files

* chore: fix cipply

* feat: split group if over max size

* fix: split the overlapped group to limit size or file num

* chore: reduce config values

* test: add tests and clearer comments for the split_overlapped_groups and test_limit_size_and_num_files

* chore: more comments

* chore: cleanup
2022-05-25 19:54:34 +00:00
Andrew Lamb 935743b525
refactor: Implement `new_querier_chunk` and `new_querier_chunk_from_file_with_metadata` (#4685) 2022-05-24 21:58:27 +00:00
Dom Dwyer c885b845dc refactor: concurrent StreamSplitExec execution
Changes the compactor to consume both StreamSplitExec output partitions
concurrently.

Practically speaking this means both Parquet files will be generated
concurrently, and uploaded to object store concurrently.
2022-05-24 14:10:46 +01:00
Dom Dwyer 8f05250c96 feat: steaming compaction
This commit changes the Compactor::compact() method to stream the
RecordBatch instances directly to the parquet serialiser, before being
uploaded directly to object storage.
2022-05-24 14:09:10 +01:00
Dom Dwyer 2e6c49be83 refactor: remove IoxMetadata min & max timestamp
Removes the min/max timestamp fields from the IoxMetadata proto
structure embedded within a Parquet file's metadata.

These values are redundant as they already exist within the Parquet
column statistics, and precluded streaming serialisation as these
removed min/max values were needed before serialising the file.
2022-05-23 16:27:08 +01:00
Dom Dwyer a142a9eb57 refactor: remove row_count from IoxMetadata
Remove the redundant row_count from the IoxMetadata structure that is
serialised into the Parquet file.

The reasoning is twofold:

    * The Parquet file's native metadata already contains a row count
    * Needing to know the number of rows up-front precludes streaming
2022-05-23 16:18:35 +01:00
Dom f0d0f1ba0c
Merge branch 'main' into dom/codec-object-store 2022-05-23 15:39:54 +01:00
Dom Dwyer 7df7c4844c refactor: remove redundant ParquetChunk errors
Eliminates unused / refactors away unnecessary errors for the
parquet::chunk module.
2022-05-20 15:17:40 +01:00
Dom Dwyer b9a745d42d feat: RecordBatch stream to Parquet file upload
Implements an upload() method on the ParquetStorage type, consuming a
stream of RecordBatch, serialising the Parquet file, and uploading the
result to object storage. Returns the IOx-specific file metadata.

Currently while the upload() method accepts a stream of RecordBatch, the
actual resulting Parquet file is buffered in memory before uploading to
object store, due to lack of streaming upload functionality in the
ObjectStore abstraction - this isn't the end of the world, as the files
tend to be relatively small with our current usage.

This impl should be easily modified to be fully streaming once streaming
object store puts are implemented:

    https://github.com/influxdata/object_store_rs/issues/9
2022-05-20 15:17:40 +01:00
Carol (Nichols || Goulding) 5fcf18cc02
fix: Add missing assert call around contains tests
`contains` is now must_use. Thanks Rust!
2022-05-19 14:39:51 -04:00
Dom Dwyer baa86d846f refactor: use ParquetStore instead of ObjectStore
Changes the code paths that interact with Parquet files in the object
store to reference the ParquetStorage directly (DRY refactor).

This change takes us from a dependency graph of:

                    ┌─────────────────┐
                    │                 │
                    ▼                 │
            Parquet Consumer          │
                    │         ┌──────────────┐
                    ├────────▶│ParquetStorage│
                    ▼         └──────────────┘
            ┌──────────────┐
            │ ObjectStore  │
            └──────────────┘
                    │
               ┌────┴────┐
               ▼         ▼
             File       s3
            System    (etc)

to:

                Parquet Consumer
                        │
                        ▼
                ┌──────────────┐
                │ParquetStorage│
                └──────────────┘
                        │
                        ▼
                ┌──────────────┐
                │ ObjectStore  │
                └──────────────┘
                        │
                   ┌────┴────┐
                   ▼         ▼
                 File       s3
                System    (etc)

With the ParquetStorage being solely responsible for managing
interactions with the object store when dealing with Parquet files.
2022-05-19 13:52:51 +01:00
Dom Dwyer d3548653d5 refactor: rename Storage -> ParquetStorage
Renames the Storage type so the context is clear in usage (i.e. fn
args), rather than having to rely on knowing the fully-qualified import
path to know what the type stores.
2022-05-19 13:51:07 +01:00
Dom Dwyer e20b02b914 refactor: tidy ParquetChunk constructor
Removes two unused constructors for a ParquetChunk, and moves the bare
fn constructor that is actually used to be an associated method (a
conventional constructor).
2022-05-19 13:51:07 +01:00
Marco Neumann 770293a973
feat: add LRU cache metrics (#4632)
* refactor: require `Resource`s to be convertible to `u64`

* refactor: require `Resource`s to have a unit name

* refactor: make LRU cache IDs static

* feat: add LRU cache metrics

* docs: improve type names in LRU doctest

* docs: epxlain `MeasuredT`

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* docs: explain `test_metrics`

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-05-19 08:05:17 +00:00
Marco Neumann 52346642a0
ci: fix cargo deny (#4629)
* ci: fix cargo deny

* chore: downgrade `socket2`, version 0.4.5 was yanked

* chore: rename `query` to `iox_query`

`query` is already taken on crates.io and yanked and I am getting tired
of working around that.
2022-05-18 09:38:35 +00:00
Andrew Lamb 3a33e806c7
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `14.0.0` (#4619)
* chore: Update datafusion deps

* chore: update arrow/parquet/arrow flight deps

* chore: Run cargo hakari tasks

* chore: Update location of utils

* chore: Update some more APIs

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-05-17 14:13:03 +00:00
Marco Neumann 779f0e9cdf
feat: querier RAM pool (#4593)
* feat: `SortKey::size`

* feat: `FunctionEstimator`

* feat: querier RAM pool

Let's put all the caches into a single RAM pool, so we can at least
somewhat control RAM usage. Note that this does NOT limit the peak
memory during query execution though, but should at least stop unlimited
cache growth. A follow-up PR will add metrics.

* refactor: improve some size calculations

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-17 13:11:20 +00:00
dependabot[bot] 259d2486c1
chore(deps): Bump tokio-util from 0.7.1 to 0.7.2 (#4605)
Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.7.1 to 0.7.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-util-0.7.1...tokio-util-0.7.2)

---
updated-dependencies:
- dependency-name: tokio-util
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-05-16 11:42:31 +00:00
Nga Tran 9530e73925
chore: move noisy debug to trace and fix some comments (#4598)
* chore: move noisy debug to trace and fix some comments

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* chore: fix format

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-13 19:18:15 +00:00
Raphael Taylor-Davies f2bb0fdf77
feat: update to crates.io object_store version (#4595)
* feat: update to crates.io object_store version

* chore: Run cargo hakari tasks

* fix: tests

* chore: remove object store integration test plumbing

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-05-13 16:26:07 +00:00
kodiakhq[bot] 0f8f294319
Merge branch 'main' into cn/remove-chunk-addr 2022-05-13 13:54:44 +00:00
Carol (Nichols || Goulding) 55313d290a
fix: Update or remove comments that mention NG or OG
Connects to #4450.
2022-05-12 16:09:08 -04:00
Carol (Nichols || Goulding) 07c7c75067
fix: Remove ng_chunk method
Connects to #4450.
2022-05-12 16:09:08 -04:00
Carol (Nichols || Goulding) b581a42fde
fix: Rename new_id_for_ng to new_id
Connects to #4450.
2022-05-12 16:09:07 -04:00
Carol (Nichols || Goulding) faba90d992
fix: Remove ChunkAddr 2022-05-12 15:50:41 -04:00
Nga Tran f9e3495e47
feat: add more metrics for compactor (#4575)
* feat: add more metrics for compactor

* chore: clearer comment
2022-05-12 13:20:43 +00:00
Raphael Taylor-Davies 8b379c83cc
refactor: simplify object_store path handling (#4534)
* refactor: simplify object_store path handling

* fix: aws integration tests

* chore: lint

* fix: update gcs tests

* refactor: move errors into submodules

* chore: lint

* chore: review feedback

* refactor: replace provider with Display

* fix: failing tests

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-09 18:43:22 +00:00
Carol (Nichols || Goulding) d458e390ad
fix: Move allow dead code to be more specific in compactor; remove actually dead code 2022-05-06 16:58:02 -04:00
Jake Goulding e07bcd40c2 refactor: Remove unused dependencies
These were found by iterating over all of the dependencies of each
Cargo.toml, then grepping that crate for the dependency's name. If it
didn't show up, I attempted to remove it.

I left a few dependencies that this process flagged:

* generated_types
  - `pbjson`,`serde`. Apparently used by the generated code.

* grpc-router-test-gen
  - `prost`. Apparently used by the generated code.

* influxdb_iox
  - `heappy`. Doesn't appear used, but is behind enough feature
    flags that I don't care to reason about and it's already optional.
  - `tikv_jemalloc_sys`. Appears to be setting a feature flag of an
    indirect dependency.

* iox_gitops_adapter
  - `k8s_openapi`. Appears to be setting a feature flag of an indirect
    dependency.
2022-05-06 15:57:58 -04:00
Carol (Nichols || Goulding) 068096e7e1
fix: Rename data_types2 to data_types 2022-05-06 14:45:39 -04:00