8326 Commits (16a8d29b9ff0a39c54148289aa58c2ffe61edd4b)

Author SHA1 Message Date
Marko Mikulicic 16a8d29b9f
fix: Fix typo in const name (#4993) 2022-06-30 07:51:39 +00:00
Raphael Taylor-Davies 835e1c91c7
chore: update object_store to 0.3.0 (#4707)
* chore: update object_store to 0.3.0

* chore: review feedback

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-29 21:44:03 +00:00
Nga Tran 0cca975167
fix: Split overlapped files based on the order of sequence numbers and only group non-overlapped contiguous small files (#4968)
* fix: Split overlapped files based on the order of sequence numbers and only group non-overlapped contiguous small files

* test: add one more test for group contiguous files:

* refactor: address review comments

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-29 20:09:51 +00:00
Jacob Marble bacd2ea470
chore: unsuppress a few security notifications (#4967)
Helps #2884

- RUSTSEC-2020-0159 (withdrawn)
- RUSTSEC-2021-0127 (cargo deny says this isn't needed)
- "query" (cargo deny says this isn't needed)

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-29 19:49:50 +00:00
Andrew Lamb 01fb2e132d
chore: Update datafusion pin (#4969)
* chore: Update datafusion pin

* fix: Update for api

* fix: Explicitly set coalesce batch size

* fix: Update batch size as well

* fix: update tests for new explain plan, and improved coercion
2022-06-29 17:52:37 +00:00
Markus Westerlind 002dfb4702
Merge pull request #4979 from Marwes/once_cell
refactor: Replace all uses of lazy_static with once_cell
2022-06-29 17:33:31 +02:00
Markus Westerlind edf3f08e81 refactor: Replace all uses of lazy_static with once_cell
Went through and replaced all lazy_static uses with once_cell (while waiting for the project to compile). There are still dependencies using lazy_static, so it is still in the crate graph, but at least there isn't an explicit dependency on it (and it is easier to update to `std::lazy::Lazy` once that is stable).
2022-06-29 16:22:02 +02:00
Nga Tran cfcc4b8426
refactor: change level 1 to level 2 preparing for next design changes (#4954)
* refactor: change level 1 to level 2 preparing for next design changes

* fix: make level-2 consistent everywhere

* chore: remove unused comments

* refactor: change all the names level_1 to level_2 to completely replace 1 with 2 and make everything consistent

* chore: add corresponding constants for the compaction levels in the comments

Co-authored-by: Dom <dom@itsallbroken.com>
2022-06-29 14:08:58 +00:00
Marco Neumann fba58dbc5f
ci: IOx still needs to push its image tags (#4977)
This will be replaced by a pull-based approach soon, but for the time
being we still need to perform the final push.
2022-06-29 09:52:44 +00:00
Marco Neumann 847c84a6b4
ci: fix `docker load` paths (#4961)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-29 08:03:02 +00:00
Marco Neumann 9b66d02229
fix: parquet reader sort order (#4964) 2022-06-28 15:28:38 +00:00
Ryan Russell 106d84c6e1
chore: readability improvements (#4955)
* chore(arrow_util): readability improvements

Signed-off-by: Ryan Russell <git@ryanrussell.org>

* chore(tracker): readability improvements

Signed-off-by: Ryan Russell <git@ryanrussell.org>

* chore(cache_system): improve readability

Signed-off-by: Ryan Russell <git@ryanrussell.org>

* refactor(lru test): rename `test_panic_id_collision`

Signed-off-by: Ryan Russell <git@ryanrussell.org>

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-28 14:28:14 +00:00
Marco Neumann c69225c5a1
Merge pull request #4960 from influxdata/crepererum/fix_ci3
ci: fix `docker save` paths
2022-06-28 15:33:14 +02:00
Marco Neumann cede296eb1 ci: fix `docker save` paths 2022-06-28 10:41:04 +02:00
Marco Neumann 1eac304305
refactor: fetch RB chunks in parallel (#4952)
Currently the querier fetches RB in a serial manner, which is probably
not good since each cache miss takes between 10ms and 250ms.

Let's try to fetch 2 in parallel and if that works well, make this a
proper config.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-28 07:54:58 +00:00
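The idea in the commit above (fetch with a small, fixed parallelism instead of serially) can be sketched roughly as follows. This is not the querier's code: `fetch_all` and its parameters are hypothetical names, and scoped OS threads from the standard library stand in for whatever async machinery the querier really uses. For simplicity the sketch works in batches, so each batch waits for its slowest fetch.

```rust
use std::thread;

/// Hypothetical helper: runs `fetch` over `items`, at most `concurrency`
/// calls at a time, and returns the results in input order.
fn fetch_all<T, R, F>(items: &[T], concurrency: usize, fetch: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    // Share the closure by reference so every spawned thread can call it.
    let fetch = &fetch;
    let mut results = Vec::with_capacity(items.len());

    // Process the input in batches of `concurrency` (at least 1).
    for batch in items.chunks(concurrency.max(1)) {
        let batch_results: Vec<R> = thread::scope(|s| {
            // Spawn one thread per item in the batch...
            let handles: Vec<_> = batch
                .iter()
                .map(|item| s.spawn(move || fetch(item)))
                .collect();
            // ...then wait for all of them, keeping input order.
            handles
                .into_iter()
                .map(|h| h.join().expect("fetch panicked"))
                .collect()
        });
        results.extend(batch_results);
    }
    results
}
```

Calling `fetch_all(&chunk_ids, 2, load_chunk)` (both argument names are placeholders) would correspond to the "fetch 2 in parallel" setting mentioned in the message.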
Marco Neumann 2ebb7b195b
ci: fix image deploy (#4953)
The Influx deployment pipeline was changed so that an image push is used
as a signal for deployment (instead of a magic script that was used
before). So we need to adapt our CI to only push images when all tests
pass.

Old workflow:
- build release: builds docker images and push commit-based tags to
  registry
- deploy release: pulls built images from registry, adds+pushes branch
  tags, calls magic deploy script

New workflow:
- build release: builds docker images, saves them to disk
- deploy release: load image files, tags them, pushes tags

You may wonder why there are two steps if we could just use a single
one. The reason is: time-to-deploy. We can already build the image while we
are waiting for the tests. If the tests fail, the image will just not be
published.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-28 07:48:41 +00:00
Andrew Lamb bfddb032ce
docs: improve docs for `persist_partition_size_threshold_bytes` / `INFLUXDB_IOX_PERSIST_PARTITION_SIZE_THRESHOLD_BYTES` (#4877)
* docs: improve docs for `persist_partition_size_threshold_bytes` / `INFLUXDB_IOX_PERSIST_PARTITION_SIZE_THRESHOLD_BYTES`

* docs: improve comments about LifecycleConfig::partition_size_threshold

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-27 21:52:40 +00:00
Ryan Russell 77a4246432
docs: Readability improvements (#4946)
Signed-off-by: Ryan Russell <git@ryanrussell.org>

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-27 21:46:18 +00:00
kodiakhq[bot] b28cecf26e
Merge pull request #4951 from influxdata/dom/schema-api
refactor(schema-api): column data type enum
2022-06-27 21:40:07 +00:00
kodiakhq[bot] c22aed4347
Merge branch 'main' into dom/schema-api 2022-06-27 21:34:07 +00:00
Marco Neumann 215f297162
refactor: parquet file metadata from catalog (#4949)
* refactor: remove `ParquetFileWithMetadata`

* refactor: remove `ParquetFileRepo::parquet_metadata`

* refactor: parquet file metadata from catalog

Closes #4124.
2022-06-27 15:38:39 +00:00
Marco Neumann 9b8086df74
fix: size estimates (#4950)
* fix: `Tombstone::size` must include serialized predicate

* fix: `CachedPartition::size` must include `Arc` heap allocation

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-27 15:25:32 +00:00
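A rough illustration of the kind of size accounting these two fixes describe: a size estimate has to add heap data owned by a value to the value's inline size. The struct layouts below are assumptions made for the example and do not mirror the real `Tombstone` or `CachedPartition` types; the `Arc` reference-count words are also ignored here.

```rust
use std::mem::size_of_val;
use std::sync::Arc;

/// Illustrative stand-in, not the real catalog type.
struct Tombstone {
    id: i64,
    serialized_predicate: String,
}

impl Tombstone {
    /// Inline size plus the heap buffer owned by the predicate string.
    fn size(&self) -> usize {
        size_of_val(self) + self.serialized_predicate.capacity()
    }
}

/// Illustrative stand-in, not the real querier cache type.
struct CachedPartition {
    sort_key: Arc<str>,
}

impl CachedPartition {
    /// Inline size (the `Arc` pointer) plus the bytes the `Arc` points to.
    fn size(&self) -> usize {
        size_of_val(self) + size_of_val(&*self.sort_key)
    }
}
```

Without the second term in each `size` method, a value holding a large predicate or sort key would report only its small inline footprint.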
Marco Neumann 1a74f84494
refactor: remove `ParquetFileWithMetadata` usage outside the catalog (#4948)
* refactor: remove `DecodedParquetFile` from `iox_tests`

* refactor: remove `DecodedParquetFile` from querier

Also pull out all the chunk schema and sort key handling into a function
so that RB chunks and parquet chunks mostly use the same code path.

* refactor: remove `DecodedParquetFile`

* refactor: remove `ParquetFileWithMetadata` usage

* fix: test data consistency
2022-06-27 15:19:29 +00:00
Dom e84529af2c
Merge branch 'main' into dom/schema-api 2022-06-27 16:18:21 +01:00
Dom Dwyer 75c425f375 refactor(schema-api): column data type enum
Previously the column data type was exposed using an internal i32 value.
This commit changes the Schema API to use a self-descriptive proto enum
for the column data type.
2022-06-27 16:14:49 +01:00
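To illustrate the difference between a raw `i32` and a self-descriptive enum, here is a minimal Rust sketch. The variant names and numeric values are made up for the example and are not taken from the IOx proto definitions; `UnknownColumnType` is likewise a hypothetical error type.

```rust
use std::convert::TryFrom;

/// Illustrative column data types; names and values are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnType {
    I64 = 1,
    U64 = 2,
    F64 = 3,
    Bool = 4,
    String = 5,
    Time = 6,
    Tag = 7,
}

/// Returned when an integer does not map to any known column type.
#[derive(Debug, PartialEq, Eq)]
struct UnknownColumnType(i32);

impl TryFrom<i32> for ColumnType {
    type Error = UnknownColumnType;

    /// Convert the internal integer into the descriptive enum,
    /// rejecting values that have no meaning.
    fn try_from(value: i32) -> Result<Self, Self::Error> {
        match value {
            1 => Ok(ColumnType::I64),
            2 => Ok(ColumnType::U64),
            3 => Ok(ColumnType::F64),
            4 => Ok(ColumnType::Bool),
            5 => Ok(ColumnType::String),
            6 => Ok(ColumnType::Time),
            7 => Ok(ColumnType::Tag),
            other => Err(UnknownColumnType(other)),
        }
    }
}
```

With an enum at the API boundary, a client sees `Tag` rather than `7`, and an out-of-range value is rejected at conversion time instead of being passed along silently.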
Marco Neumann 3b78bf1c48
refactor: remove binary parquet file MD from compactor (#4938)
* refactor: simplify sort key calculation

* refactor: use schema from catalog instead of from file

* refactor: do not request parquet file MD in compactor

* test: ensure that `QueryableParquetChunk` works correctly
2022-06-27 15:11:15 +00:00
Marco Neumann b9cbb3dfca
refactor: do not use in-parquet IOx metadata in compactor (*) (#4935)
* refactor: avoid feeding sort key from struct into same struct

* feat: allow namespace schema query by ID

* refactor: do not use binary parquet file MD in compactor tests

* refactor: do not use in-parquet IOx metadata

* refactor: reduce number of catalog queries
2022-06-27 08:06:11 +00:00
dependabot[bot] 7546476e15
chore(deps): Bump smallvec from 1.8.0 to 1.8.1 (#4947)
Bumps [smallvec](https://github.com/servo/rust-smallvec) from 1.8.0 to 1.8.1.
- [Release notes](https://github.com/servo/rust-smallvec/releases)
- [Commits](https://github.com/servo/rust-smallvec/compare/v1.8.0...v1.8.1)

---
updated-dependencies:
- dependency-name: smallvec
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-27 07:17:32 +00:00
Nga Tran 92eeb5b232
chore: remove unused sort_key_old from catalog partition (#4944)
* chore: remove unused sort_key_old from catalog partition

* chore: add new line at the end of the SQL file
2022-06-24 15:02:38 +00:00
Marco Neumann 994bc5fefd
refactor: ensure that SQL parquet file column sets are not NULL (#4937)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-24 14:26:18 +00:00
Nga Tran 3c0fb6e8ef
fix: avoid using min_time, which can be negative, for ChunkId. Using object store id which is uuid instead (#4942)
* fix: avoid using min_time, which can be negative, for ChunkId. Using object store id which is uuid instead

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* chore: run fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 19:00:13 +00:00
Nga Tran 35dacf388b
feat: Compact now can split compacted results into multiple non-overlapped files based on config max file size (#4918)
* feat: split times of compacting results based on the max file size

* feat: consider max file size while computing split time

* test: tests for compute_split_time

* feat: first step to teach the function split_the_steam to know how to split data into n streams using n-1 input PhysicalExprs

* feat: make StreamSplitNode support a list of expression

* docs: explain how StreamSplitNode works

* feat: Teach compute_split_time to split a time range into many contiguous ranges and split compacted result into multiple non-overlapped files based on the config compaction_max_size_bytes

* chore: cleanup

* chore: clean up doc

* chore: address review comments

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 18:54:03 +00:00
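The splitting described in this commit (n output files from n-1 split times, chosen from a maximum file size) might look roughly like the sketch below. It is not the compactor's implementation: the function names, signatures, and the assumption that data is spread evenly across the time range are all illustrative.

```rust
/// Hypothetical sketch: returns the split times for a compacted result that
/// covers `[min_time, max_time]` and holds `total_size` bytes, so that each
/// output file is at most roughly `max_file_size` bytes. Assumes the data is
/// evenly distributed over time.
fn compute_split_times(
    min_time: i64,
    max_time: i64,
    total_size: u64,
    max_file_size: u64,
) -> Vec<i64> {
    // Nothing to split: everything fits in one file, or the range is a point.
    if max_file_size == 0 || total_size <= max_file_size || min_time >= max_time {
        return vec![];
    }
    // Number of output files, rounded up.
    let num_files = (total_size + max_file_size - 1) / max_file_size;
    // Use i128 so that ranges with negative timestamps cannot overflow.
    let range = max_time as i128 - min_time as i128;

    let mut splits = Vec::new();
    for i in 1..num_files {
        let t = (min_time as i128 + range * i as i128 / num_files as i128) as i64;
        // Skip duplicates when the range is narrower than the file count.
        if t < max_time && splits.last() != Some(&t) {
            splits.push(t);
        }
    }
    splits
}

/// With n-1 split times there are n outputs: a row goes to the first output
/// whose split time is greater than or equal to the row's timestamp.
fn output_index(split_times: &[i64], time: i64) -> usize {
    split_times.iter().take_while(|&&s| time > s).count()
}
```

For a range of 0 to 100 holding 250 bytes with a 100-byte limit, this yields split times 33 and 66, so rows fall into three non-overlapping time ranges.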
kodiakhq[bot] ce906354f9
Merge pull request #4941 from influxdata/dom/persist-memory-accounting
fix: account for partition memory until persist is completed
2022-06-23 17:36:02 +00:00
Andrew Lamb 49b34e1135 test: add appropriate tests 2022-06-23 11:50:55 -04:00
Andrew Lamb fb4c3ed294 fix: revert test change 2022-06-23 11:34:59 -04:00
Dom Dwyer 9a79d16585 fix: account for partition memory until persisted
The ingester maintains a rough "total memory in use" counter it uses to
try and limit the amount of memory the ingester is using overall.

When a partition is persisted, this total memory usage value is adjusted
to account for releasing the partition memory. Prior to this commit, the
ordering was:

* Writes increase the memory counter
* maybe_persist() is called to trigger persistence
* A partition is identified for persistence
* Partition memory usage is released back to the total memory counter
* Persistence starts

This meant that the partitions in the process of being persisted were
not accounted for in the ingester's total memory counter, and therefore
we could significantly overrun the configured memory limit.

After this commit, the ordering is:

* Writes increase the memory counter
* maybe_persist() is called to trigger persistence
* A partition is identified for persistence
* Persistence starts
* Persistence completes
* Partition memory usage is released back to the total memory counter

This ensures persisting partitions are still tracked in the total memory
counter, causing pauses to correctly fire.
2022-06-23 15:40:51 +01:00
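The corrected ordering from this commit can be sketched as below. `MemoryTracker` and `persist_partition` are hypothetical names for illustration only; the real ingester types and its `maybe_persist()` logic are not shown in the message and are not reproduced here.

```rust
/// Hypothetical stand-in for the ingester's "total memory in use" counter.
#[derive(Debug)]
struct MemoryTracker {
    total_bytes: usize,
    limit_bytes: usize,
}

impl MemoryTracker {
    fn new(limit_bytes: usize) -> Self {
        Self { total_bytes: 0, limit_bytes }
    }

    /// Writes increase the memory counter.
    fn record_write(&mut self, bytes: usize) {
        self.total_bytes += bytes;
    }

    /// Partition memory is released back to the counter.
    fn release(&mut self, bytes: usize) {
        self.total_bytes = self.total_bytes.saturating_sub(bytes);
    }

    /// Ingest should pause while usage is at or above the limit.
    fn should_pause(&self) -> bool {
        self.total_bytes >= self.limit_bytes
    }
}

/// Persist one partition using the new ordering: the memory is released
/// only after `persist` has returned, so the partition stays accounted
/// for while persistence is in progress.
fn persist_partition<F>(tracker: &mut MemoryTracker, partition_bytes: usize, persist: F)
where
    F: FnOnce(&MemoryTracker),
{
    persist(&*tracker); // persistence starts and completes
    tracker.release(partition_bytes); // only now is the memory released
}
```

Under the old ordering, `release` would run before `persist`, so `should_pause()` could return `false` while the partition's memory was still held.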
Marco Neumann bd6c4659af
refactor: slim down parquet chunk (remove Metadata) (#4934)
* feat: conversion from `ParquetFile` to `ParquetFilePath`

* refactor: slim down parquet chunk

- ensure it works without binary parquet metadata
- timestamp range is no longer optional (ensured by the NG type system)
- remove table summary: this is only needed for SOME API users. The
  compactor can perfectly work without statistics since it has the timestamp
  range which is sufficient for the current overlap check (we don't use
  any other primary key stats at the moment). The querier currently does
  NOT use parquet chunks (was replaced by read buffer) but if it will
  again in some future it will likely need to find a way to fetch and
  cache the statistics.
- the schema is now provided by the API user since it can be
  reconstructed using the NG catalog only (and "wrong" column orders are
  tolerated as of #4921)

Ref #4124

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 10:55:16 +00:00
kodiakhq[bot] fd5aa201d2
Merge pull request #4933 from influxdata/dom/remove-unnecessary-errors
refactor: remove unused errors
2022-06-23 10:35:25 +00:00
Dom Dwyer 87af3848d1 refactor: remove unused errors
These errors are not referenced, but are hidden from the "unused" lint
because of the macro magic code generation.
2022-06-23 11:24:30 +01:00
Andrew Lamb 16c558e11e
refactor: Make some structures in `LifecycleManager` non pub (#4929)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 09:55:39 +00:00
dependabot[bot] 6f9d8b54cf
chore(deps): Bump integer-encoding from 3.0.3 to 3.0.4 (#4932)
Bumps [integer-encoding](https://github.com/dermesser/integer-encoding-rs) from 3.0.3 to 3.0.4.
- [Release notes](https://github.com/dermesser/integer-encoding-rs/releases)
- [Commits](https://github.com/dermesser/integer-encoding-rs/compare/v3.0.3...v3.0.4)

---
updated-dependencies:
- dependency-name: integer-encoding
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-23 09:42:34 +00:00
Andrew Lamb 776c34e03d
chore: Update datafusion (#4927)
* chore: Update datafusion

* fix: Update for API changes

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 09:30:43 +00:00
Andrew Lamb 47def89670
docs: Update tracing.md for NG (#4916)
tracing instructions referred to OG -- update for NG

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-23 09:24:38 +00:00
Marco Neumann 463d430d43
refactor: do not fetch parquet MD from catalog in querier (#4926)
Ref #4124
2022-06-23 09:03:19 +00:00
Marco Neumann 4b7d02fad1
feat: do not rely on encoded parquet metadata for RB chunks (#4924)
* fix: use proper sort key in tests

* feat: do not rely on encoded parquet metadata for RB chunks

Ref #4124.

* refactor: allocate less strings

* refactor: use upstream PK calculation

* fix: cache expiration w/o a good reason

* refactor: make namespace cache safer to use

* refactor: make partition cache safer to use
2022-06-23 08:55:52 +00:00
Marco Neumann c899c3a0f4
fix: column handling when reading parquet files (#4921)
* fix: column handling when reading parquet files

This improves/fixes/tests a few aspects when reading parquet files:

- fix usage of `Selection::Some(...)`. This was broken since #4912 but
  apparently no test caught that.
- ensure that the order of `Selection::Some(...)` is preserved
- ensure that schema metadata is attached to output batches
- ignore parquet columns that we don't care about (i.e. do not select)
- allow parquet file to have a different column order than our internal
  bookkeeping, this makes it way simpler to read parquet files w/o
  scanning the metadata first
- extend the test coverage

Ref #4124.

* test: even more tests for parquet reader
2022-06-22 13:51:30 +00:00
Marco Neumann 0534b80886
fix: `ParquetFile::size` must include column set (#4925)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-22 13:06:02 +00:00
Marco Neumann 9591bed696
refactor: make querier internals private (#4922)
Querier internals are not meant to be used by other crates. Only a
handful of selected interfaces should be used by IOxD and the query tests.

The compactor only used a very small subset just to read parquet files
back into memory. It shall rather use the official `parquet_file`
interface instead.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-22 13:00:08 +00:00
Marco Neumann 751bdce88a
fix: pass write buffer tests w/o Kafka (#4923)
Fixes interaction of `maybe_skip_kafka_integration!` and `should_panic`
by ensuring that `maybe_skip_kafka_integration!` panics to skip
`should_panic` tests.

Without that it is not possible to just run `cargo test -p write_buffer`.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-06-22 10:41:40 +00:00
dependabot[bot] f7d83ea581
chore(deps): Bump clap from 3.2.5 to 3.2.6 (#4920)
Bumps [clap](https://github.com/clap-rs/clap) from 3.2.5 to 3.2.6.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](https://github.com/clap-rs/clap/compare/v3.2.5...v3.2.6)

---
updated-dependencies:
- dependency-name: clap
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-06-22 10:28:44 +00:00