Commit Graph

524 Commits (ef6eda639912290414a21b31708bad9696c2ab8b)

Author SHA1 Message Date
Nga Tran ea81152fac
refactor: add partition ID into debug info and panic earlier to identify the bug easier (#4716)
* chore: point tests to the new ticket

* chore: cleanup

* refactor: add partition ID into debug info and panic earlier to identify the bug easier
2022-05-27 12:20:36 +00:00
Nga Tran 09b55a209d
chore: point tests to the new ticket (#4715)
* chore: point tests to the new ticket

* chore: cleanup
2022-05-27 11:12:55 +00:00
Nga Tran 372b262f37
test: parquet meta decoded tests and more debug info (#4713)
* test: reproducer for 4695

* chore: some debug info

* test: test with many columns and rows

* chore: cleanup and add debug info

* chore: cleanup

* chore: cleanup

* chore: more debug info
2022-05-27 09:53:07 +00:00
Carol (Nichols || Goulding) b2905650aa
refactor: Extract extract_range to be a method on TableSummary
So that other kinds of chunks can use this code too.
2022-05-26 16:52:14 -04:00
Nga Tran 05151d5c69
test: reproducer for 4695 (#4706)
* test: reproducer for 4695

* chore: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-26 15:32:30 +00:00
Dom Dwyer 6aa626ef84 refactor: retry object store upload
Changes the Storage::upload() method to endlessly retry uploading the
generated Parquet file.
2022-05-24 11:29:42 +01:00
Andrew Lamb e877a64462
feat: Add `ParquetFiles` cache and memory size estimation for ParquetMetadata (#4661)
* feat: Add `ParquetFiles` cache

* fix: Apply suggestions from code review

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>

* fix: remove commented out debugging println

* refactor: Improve size calculation

* fix: mark `ParquetFileCache::clear` test only

* fix: assert on metric count

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
2022-05-23 17:11:38 +00:00
Dom Dwyer 2e6c49be83 refactor: remove IoxMetadata min & max timestamp
Removes the min/max timestamp fields from the IoxMetadata proto
structure embedded within a Parquet file's metadata.

These values are redundant as they already exist within the Parquet
column statistics, and precluded streaming serialisation as these
removed min/max values were needed before serialising the file.
2022-05-23 16:27:08 +01:00
Dom Dwyer a142a9eb57 refactor: remove row_count from IoxMetadata
Remove the redundant row_count from the IoxMetadata structure that is
serialised into the Parquet file.

The reasoning is twofold:

    * The Parquet file's native metadata already contains a row count
    * Needing to know the number of rows up-front precludes streaming
2022-05-23 16:18:35 +01:00
Dom Dwyer 71555ee55c test: Parquet metadata integration test
Adds two integration tests covering validation of the embedded IOx
metadata within the Parquet file metadata, and validation of the derived
ParquetFileParams metadata used to populate the catalog.
2022-05-23 16:17:56 +01:00
Dom Dwyer af6d3f4d48 docs: remove clone ref comment 2022-05-23 11:46:06 +01:00
Dom Dwyer 00dc95829d style: enable more lints
Enable more lints on the parquet_file crate to keep it a little cleaner
- adds the following:

    clippy::clone_on_ref_ptr,
    unreachable_pub,
    missing_docs,
    clippy::todo,
    clippy::dbg_macro

This commit includes fixes for any new lint failures.
2022-05-20 15:17:40 +01:00
Dom Dwyer 7df7c4844c refactor: remove redundant ParquetChunk errors
Eliminates unused / refactors away unnecessary errors for the
parquet::chunk module.
2022-05-20 15:17:40 +01:00
Dom Dwyer 661f8599a6 refactor: internalise Parquet path generation
Derive the ParquetFilePath from the IoxMetadata within the
ParquetStorage::read_filter() call.

This prevents the "put/get RecordBatches" abstraction from leaking out
the object store path generation concern - an implementation detail of
the ParquetStorage layer.
2022-05-20 15:17:40 +01:00
Dom Dwyer cdb341d45a test: ParquetStorage upload() and read_filter()
Adds tests for the previously untested (directly at least) Parquet
(de)serialisation & persistence layer, provided by the ParquetStorage
type.
2022-05-20 15:17:40 +01:00
Dom Dwyer 302301659e refactor: derive ParquetFilePath from IoxMetadata
Allow directly converting an IoxMetadata to a ParquetFilePath.
2022-05-20 15:17:40 +01:00
Dom Dwyer b9a745d42d feat: RecordBatch stream to Parquet file upload
Implements an upload() method on the ParquetStorage type, consuming a
stream of RecordBatch, serialising the Parquet file, and uploading the
result to object storage. Returns the IOx-specific file metadata.

Currently while the upload() method accepts a stream of RecordBatch, the
actual resulting Parquet file is buffered in memory before uploading to
object store, due to lack of streaming upload functionality in the
ObjectStore abstraction - this isn't the end of the world, as the files
tend to be relatively small with our current usage.

This impl should be easily modified to be fully streaming once streaming
object store puts are implemented:

    https://github.com/influxdata/object_store_rs/issues/9
2022-05-20 15:17:40 +01:00
Dom Dwyer 76e08d14a3 perf: IoxParquetMetaData direct from file metadata
Construct a IoxParquetMetaData instance directly from the FileMetaData
instance returned by the ArrowWriter.

This change will allow us to avoid the inefficient impl currently in
use:

    * Serialise batches into memory
    * Wrap buffer in arrow cursor
    * Read parquet metadata with arrow file reader
    * Serialise schema with thrift
    * Serialise each row group's metadata with thrift
    * Construct our own FileMetaData instance
    * Serialise FileMetaData with thrift
    * zstd encode resulting thrift bytes
    * Wrap in IoxParquetMetaData

Now we "only":

    * Stream batches into opaque Write impl
    * Serialise FileMetaData with thrift
    * zstd encode resulting thrift bytes
    * Wrap in IoxParquetMetaData

Then accessing any data within the IoxParquetMetaData (as before this
change) requires deserialising it first.

There are still a number of easy performance improvements to be had
w.r.t the metadata handling.
2022-05-20 15:17:40 +01:00
Dom Dwyer 70856a645f feat: streaming RecordBatch -> parquet encoding
Implements a streaming RecordBatch to Parquet file serialiser.

This impl automatically discovers the schema of the RecordBatch stream,
and accepts &mut destination types (internalising the handle
cloning/etc) to simplify caller usage.

This encoder returns the resulting FileMetaData to allow callers to
inspect the resulting metadata without reading back the file.

Currently unused / not yet plumbed in.
2022-05-20 15:09:26 +01:00
Marco Neumann addc45327e
fix: ensure that query tokio background tasks are canceled (#4643)
* fix: ensure that query tokio background tasks are canceled

While I am not entirely sure if this explains some of the memory leaks I
am seeing in prod, not canceling the tasks correctly certainly makes
debugging way harder and also renders certain form of throttling (e.g.
max. concurrent queries) somewhat ineffective.

Note that parquet file downloads are currently NOT canceled because
tokios `spawn_blocking` cannot be canceled.

* refactor: `Vec` -> `Option`

* refactor: `spawn_blocking` creates a join handle, even though it is useless

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-20 07:18:52 +00:00
Dom Dwyer baa86d846f refactor: use ParquetStore instead of ObjectStore
Changes the code paths that interact with Parquet files in the object
store to reference the ParquetStorage directly (DRY refactor).

This change takes us from a dependency graph of:

                    ┌─────────────────┐
                    │                 │
                    ▼                 │
            Parquet Consumer          │
                    │         ┌──────────────┐
                    ├────────▶│ParquetStorage│
                    ▼         └──────────────┘
            ┌──────────────┐
            │ ObjectStore  │
            └──────────────┘
                    │
               ┌────┴────┐
               ▼         ▼
             File       s3
            System    (etc)

to:

                Parquet Consumer
                        │
                        ▼
                ┌──────────────┐
                │ParquetStorage│
                └──────────────┘
                        │
                        ▼
                ┌──────────────┐
                │ ObjectStore  │
                └──────────────┘
                        │
                   ┌────┴────┐
                   ▼         ▼
                 File       s3
                System    (etc)

With the ParquetStorage being solely responsible for managing
interactions with the object store when dealing with Parquet files.
2022-05-19 13:52:51 +01:00
Dom Dwyer d3548653d5 refactor: rename Storage -> ParquetStorage
Renames the Storage type so the context is clear in usage (i.e. fn
args), rather than having to rely on knowing the fully-qualified import
path to know what the type stores.
2022-05-19 13:51:07 +01:00
Dom Dwyer e20b02b914 refactor: tidy ParquetChunk constructor
Removes two unused constructors for a ParquetChunk, and moves the bare
fn constructor that is actually used to be an associated method (a
conventional constructor).
2022-05-19 13:51:07 +01:00
Dom Dwyer 7a8e6d1a38 refactor: remove unused max_row_group_size
The Parquet writer references an unused max_row_group_size property in
the parquet file metadata.
2022-05-18 16:45:15 +01:00
Andrew Lamb 3a33e806c7
chore: Update datafusion + `arrow`/`parquet`/`arrow-flight` to `14.0.0` (#4619)
* chore: Update datafusion deps

* chore: update arrow/parquet/arrow flight deps

* chore: Run cargo hakari tasks

* chore: Update location of utils

* chore: Update some more APIs

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-05-17 14:13:03 +00:00
Nga Tran 9530e73925
chore: move noisy debug to trace and fix some comments (#4598)
* chore: move noisy debug to trace and fix some comments

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* chore: fix format

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-13 19:18:15 +00:00
Raphael Taylor-Davies f2bb0fdf77
feat: update to crates.io object_store version (#4595)
* feat: update to crates.io object_store version

* chore: Run cargo hakari tasks

* fix: tests

* chore: remove object store integration test plumbing

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-05-13 16:26:07 +00:00
Carol (Nichols || Goulding) 55313d290a
fix: Update or remove comments that mention NG or OG
Connects to #4450.
2022-05-12 16:09:08 -04:00
Raphael Taylor-Davies 8b379c83cc
refactor: simplify object_store path handling (#4534)
* refactor: simplify object_store path handling

* fix: aws integration tests

* chore: lint

* fix: update gcs tests

* refactor: move errors into submodules

* chore: lint

* chore: review feedback

* refactor: replace provider with Display

* fix: failing tests

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-09 18:43:22 +00:00
Jake Goulding e07bcd40c2 refactor: Remove unused dependencies
These were found by iterating over all of the dependencies of each
Cargo.toml, then grepping that crate for the dependency's name. If it
didn't show up, I attempted to remove it.

I left a few dependencies that this process flagged:

* generated_types
  - `pbjson`,`serde`. Apparently used by the generated code.

* grpc-router-test-gen
  - `prost`. Apparently used by the generated code.

* influxdb_iox
  - `heappy`. Doesn't appear used, but is behind enough feature
    flags that I don't care to reason about and it's already optional.
  - `tikv_jemalloc_sys`. Appears to be setting a feature flag of an
    indirect dependency.

* iox_gitops_adapter
  - `k8s_openapi`. Appears to be setting a feature flag of an indirect
    dependency.
2022-05-06 15:57:58 -04:00
Carol (Nichols || Goulding) 068096e7e1
fix: Rename data_types2 to data_types 2022-05-06 14:45:39 -04:00
Carol (Nichols || Goulding) 0541c6e40f
fix: Remove data_types crate where it's no longer used 2022-05-06 14:45:39 -04:00
Carol (Nichols || Goulding) d2671355c3
fix: Move partition metadata types to data_types2 2022-05-06 14:45:37 -04:00
Carol (Nichols || Goulding) ea46830954
fix: Remove iox_object_store crate; move ParquetFilePath to parquet_file 2022-05-06 14:45:36 -04:00
Carol (Nichols || Goulding) b4894c2b46
fix: Remove unused parts of parquet_file 2022-05-06 11:30:36 -04:00
Carol (Nichols || Goulding) ba8191c1eb
fix: Remove persistence_windows 2022-05-06 11:30:35 -04:00
Andrew Lamb 02893e598c
chore: Update datafusion and upgrade arrow/parquet/arrow-flight to 13 (#4516)
* chore: Tool for automating arrow version update

* chore: Update datafusion and arrow/parquet/arrow-flight

* fix: update for changes in Arrow API

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-05-05 00:21:02 +00:00
dependabot[bot] 420c306caa
chore(deps): Bump tokio from 1.17.0 to 1.18.0 (#4453)
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.17.0 to 1.18.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.17.0...tokio-1.18.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-04-28 08:21:17 +00:00
二手掉包工程师 4b47d723b1
refactor: Rename time to iox_time (#4416)
Signed-off-by: hi-rustin <rustin.liu@gmail.com>

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-26 00:19:59 +00:00
Marco Neumann 86e8f05ed1
fix: make all catalog IDs 64bit (#4418)
Closes #4365.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-25 16:49:34 +00:00
Nga Tran d963110842
feat: group chunk overlaps based on time range only (#4389)
* feat: overlap for NG querier

* chore: cleanup

* refactor: address review comments

* fix: typo

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-25 13:32:07 +00:00
Andrew Lamb 73bed810da
chore: Update arrow, arrow-flight, parquet, tonic, prost, etc (#4357)
* chore: Update datafusion

* chore: Update arrow/arrow-flight/parquet to 12

* chore: update datafusion correctly

* chore: Update prost, tonic, and dependents

* fix: Fixup some api changes

* fix: Update test output in db

* fix: Update test output in parquet_file

* fix: remove old pbjson types

* fix: Add "--experimental_allow_proto3_optional" flag

* chore: Run cargo hakari tasks

* fix: compile error

* chore: Update heappy

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-04-20 11:12:17 +00:00
Carol (Nichols || Goulding) 94dcde4996
fix: Do fewer queries for metadata
By adding another _with_metadata catalog function. Also introduce a new
type rather than passing around tuples everywhere.
2022-04-13 10:43:20 -04:00
Carol (Nichols || Goulding) 02fee3b84f
feat: Request parquet metadata from the catalog when needed only 2022-04-13 10:43:19 -04:00
Dom Dwyer 6131381b8d refactor: extra debug in compactor
Continues pushing more debug through the compaction processing loop.
2022-04-08 11:20:19 +01:00
Dom Dwyer 3706ac042d refactor: add debug in compaction path
Adds debug!() and friends through the compaction path.
2022-04-07 17:13:45 +01:00
dependabot[bot] 438e739344
chore(deps): Bump parquet from 11.0.0 to 11.1.0 (#4240)
* chore(deps): Bump parquet from 11.0.0 to 11.1.0

Bumps [parquet](https://github.com/apache/arrow-rs) from 11.0.0 to 11.1.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/11.0.0...11.1.0)

---
updated-dependencies:
- dependency-name: parquet
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix: Update tests

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
2022-04-06 14:51:01 +00:00
Nga Tran ddc2c8304f
fix: have the compaction level set correctly (#4184)
* fix: have the compaction level set correctly, especially for compacted file from the compactor

* fix: typo
2022-03-30 21:23:40 +00:00
Marco Neumann 20bbb88dc5
refactor: remove table name from `TableSummary` (#4170)
This allows us to remove the table name from the low-level chunk
representations (like `ParquetFile`, RUB, ...) since table names are
already tracked by the higher-level data structures (e.g. catalog,
catalog chunk) that manage the low-level chunk representations.

This is similar to #4167.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-30 13:24:00 +00:00
Marco Neumann 036626a576
refactor: remove partition key from `ParquetChunk` (#4167)
The parquet chunk is always wrapped into some higher-level data
structure (e.g. a catalog chunk, a partition, ...) that knows exactly
"where" the chunk is located. There is no need for the parquet chunk to
back-reference container-level attributes. In the contrary:
double-bookkeeping makes the code more complex and costs additional
memory.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-30 09:24:56 +00:00
Marco Neumann 2b76c31157
refactor: make statistics null counts optional (#4160)
Min/max values and distinct counts are already optional, so let's make
the null counts optional as well. This will be helpful for NG to deal w/
partial statistics (e.g. we only populate stats for the time column).

Note that the total count is still mandatory, but we normally have the
chunk/file-level row count at hand.
2022-03-29 17:47:57 +00:00
Carol (Nichols || Goulding) f3f792fd08
feat: Add namespace_id to the parquet_files table; object store paths need it 2022-03-29 08:15:26 -04:00
Andrew Lamb 5c69a3f43b
chore: Update deps: datafusion, arrow/arrow-flight/parquet to 11, zstd to 0.11 (#4119)
* chore: update datafusion

* chore(deps): Bump arrow from 10.0.0 to 11.0.0

Bumps [arrow](https://github.com/apache/arrow-rs) from 10.0.0 to 11.0.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/10.0.0...11.0.0)

---
updated-dependencies:
- dependency-name: arrow
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore(deps): Bump arrow-flight from 10.0.0 to 11.0.0

Bumps [arrow-flight](https://github.com/apache/arrow-rs) from 10.0.0 to 11.0.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/10.0.0...11.0.0)

---
updated-dependencies:
- dependency-name: arrow-flight
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: update parquet to 11.0.0

* fix: error on create schema, test for same

* fix: upgrade zstd

* chore: Run cargo hakari tasks

* fix: fix logical merge conflict

* fix: hakari

* fix: hakari

* fix: update newly introduced dep

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-24 15:27:36 +00:00
Marco Neumann 51da6dd7fa
feat: store sort key in NG metadata (#4110)
The sort key is optional and currently only produced by `iox_tests`.
Writing it within the ingester/compactor is tracked by #3968. The sort
key is read by the querier (and this will be verified by the query tests
and is required to merge #4103).

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-03-23 18:24:46 +00:00
Dom Dwyer 1d5066c421 refactor: rename ObjectStore -> ObjectStoreImpl
Frees up the name for so we can use `dyn ObjectStore` throughout the
code instead of `ObjectStoreApi`.
2022-03-15 16:29:43 +00:00
Carol (Nichols || Goulding) ecd06c6ec3
fix: ParquetFileRepo create should be responsible for setting INITIAL_COMPACTION_LEVEL
When created in the catalog, parquet files should always have compaction
level 0. Updating the compaction level should always happen in the
compactor.

Only the catalog should need to know about the initial compaction level
value.
2022-03-10 13:51:18 -05:00
Carol (Nichols || Goulding) ff31407dce
refactor: Extract a ParquetFileParams type for create
This has the advantages of:

- Not needing to create fake parquet file IDs or fake deleted_at
  values that aren't used by create before insertion
- Not needing too many arguments for create
- Naming the arguments so it's easier to see what value is what
  argument, especially in tests
- Easier to reuse arguments or parts of arguments by using copies of
  params, which makes it easier to see differences, especially in tests
2022-03-10 13:51:18 -05:00
Paul Dix 27999ff72f
feat: add compaction_level and created_at to parquet_file (#3972) 2022-03-10 15:56:57 +00:00
Andrew Lamb 2c3d30ca32
chore: Update datafusion, arrow, flight and parquet (#4000)
* chore: Update datafusion, arrow, flight and parquet

* fix: api change

* fix: fmt

* fix: update test metadata size

* fix: Update sizes in parquet test

* fix: more metadata size update
2022-03-10 12:24:47 +00:00
Nga Tran c6cab3538f
refactor: move parquet chunk's new and decode to parquet_file crate (#3987) 2022-03-08 22:04:32 +00:00
Andrew Lamb e09f39d6a0
chore: Update datafusion (#3943)
* chore: Update datafusion

* refactor: update for new datafusion

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
2022-03-04 19:37:46 +00:00
Andrew Lamb 677a272095
refactor: Clean up some future clippy warnings from nightly (#3892)
* refactor: clean up new clippy lints

* refactor: complete other cleanups

* fix: ignore overzealous clippy

* fix: re-remove old code
2022-03-03 19:14:27 +00:00
Carol (Nichols || Goulding) 8f3e44bf76
refactor: Extract a crate for shared data types in the new design 2022-03-02 12:16:15 -05:00
Marco Neumann 33851be3a5
chore: upgrade Rust to 1.59 (#3875)
Mostly a few new clippy crates around `flat_map`, `and_then`, and
"underscore locks" (!!!):
https://rust-lang.github.io/rust-clippy/master/index.html#let_underscore_lock

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-28 15:14:19 +00:00
Raphael Taylor-Davies 2a842fbb1a
feat: correctly sort data and store in catalog metadata (#3864)
* feat: respect sort order in ChunkTableProvider (#3214)

feat: persist sort order in catalog (#3845)

refactor: owned SortKey (#3845)

* fix: size tests

* refactor: immutable SortKey

* test: test sort order restart (#3845)

* chore: explicit None for sort key

* chore: test cleanup

* fix: handling of sort keys containing fields

* chore: remove unused selected_sort_key

* chore: more docs

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-25 17:56:27 +00:00
Marco Neumann f966f4c7a4
feat: create `ParquetChunk` in querier (#3857)
Adds a small adapter that is able to produce `ParquetChunk`s for NG.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-25 08:54:16 +00:00
Marco Neumann 49d1be30e7
feat: wire up `ParquetFilePath` for NG (#3853)
It's a bit of a duck-type hack, but if we wanna just `ParquetFileChunk`
in the new architecture, we somehow need it to accept new-gen paths.
Also path handling should be somewhat centralized since
ingester/compactor/querier all need to construct them. So having a
`ParquetFilePath` that supports both path styles seems to be a
not-to-bad solution. This should obviously be cleaned up in some
not-to-distant future.
2022-02-24 16:05:38 +00:00
Carol (Nichols || Goulding) 252ced7adf
feat: Add row count to the parquet_file record in the catalog (#3847)
Fixes #3842.
2022-02-24 15:20:50 +00:00
Marco Neumann d62a052394
feat: extend catalog so we can recover `ParquetChunk`s from it (#3852)
* refactor: less parquet data copying

* feat: `PartitionRepo::get_by_id`

* feat: `TableRepo::get_by_id`

* feat: `ParquetFile::file_size_bytes`

* feat: `ParquetFile::parquet_metadata`

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-24 13:16:15 +00:00
dependabot[bot] b63f920d4c
chore(deps): Bump parquet from 9.0.2 to 9.1.0 (#3828)
* chore(deps): Bump parquet from 9.0.2 to 9.1.0

Bumps [parquet](https://github.com/apache/arrow-rs) from 9.0.2 to 9.1.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/9.0.2...9.1.0)

---
updated-dependencies:
- dependency-name: parquet
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* chore: update chunk size test

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-23 11:25:15 +00:00
dependabot[bot] 3b7d31c88a
chore(deps): Bump arrow from 9.0.2 to 9.1.0 (#3826)
Bumps [arrow](https://github.com/apache/arrow-rs) from 9.0.2 to 9.1.0.
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/apache/arrow-rs/compare/9.0.2...9.1.0)

---
updated-dependencies:
- dependency-name: arrow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-23 09:25:46 +00:00
dependabot[bot] ad3868ed7c
chore(deps): Bump tokio from 1.16.1 to 1.17.0 (#3814)
* chore(deps): Bump tokio from 1.16.1 to 1.17.0

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.16.1 to 1.17.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.16.1...tokio-1.17.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* build: update workspace-hack

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Dom Dwyer <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-22 16:27:43 +00:00
Andrew Lamb a30803e692
chore: Update datafusion, update `arrow`/`parquet`/`arrow-flight` to 9.0 (#3733)
* chore: Update datafusion

* chore: Update arrow

* fix: missing updates

* chore: Update cargo.lock

* fix: update for smaller parquet size

* fix: update test for smaller parquet files

* test: ensure parquet_file tests write multiple row groups

* fix: update callsite

* fix: Update for tests

* fix: harkari

* fix: use IoxObjectStore::existing

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-15 12:10:24 +00:00
Carol (Nichols || Goulding) 73828323ac
feat: Ingester Flight gRPC API (#3623)
* feat: Add a way to run ingester with an in-memory catalog from the CLI

If you set the --catalog-dsn string to "mem", rather than using that as
a Postgres connection URL, create an in-memory catalog.

Planning on using this in tests, so not documenting.

* fix: Set default topic to the same value as SHARED_KAFKA_TOPIC

Namely, both should use an underscore. I don't think there's a way to
directly share these values between a constant and an annotation.

* feat: Add a flight API (handshake only) to ingester

* fix: Create partitions if using file-based write buffer

* fix: Change the server fixture to handle ingester server type

For now, the ingester doesn't implement the deployment API. Not sure if
it should or not.

* feat: Start implementing ingester do_get, namely decoding the query

Skip serialization of the predicate for the moment.

* refactor: Rename ingest protos to ingester to match crate name

* refactor: Rename QueryResults to QueryData

* feat: Move ingester flight client to new querier crate

* fix: Off by one error, different starting indexes in sequencers

* fix: Create new CLI argument to pick the catalog type

* fix: Create a CLI option to set the number of topics to auto-create in the write buffer

* fix: Check the arrow flight service's health to tell that the ingester gRPC is up

* fix: Set postgres as the default catalog type

* fix: Return an error rather than panicking if CLI args aren't right
2022-02-09 19:07:44 +00:00
Carol (Nichols || Goulding) 2e30483f1f
refactor: Remove predicate module from predicate crate (#3648)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-07 14:54:07 +00:00
Nga Tran 17fbeaaade
feat: insert the persisted info into the catalog in one transaction (#3636)
* feat: add ProcessedTombstoneRepo

* feat: add function add_parquet_file_with_tombstones

* fix: remove unecessary use

* feat: handling transaction when adding parquet file and its processed tombstones

* feat: tests update catalog for parquet file and processed tombstones

* fix: make add parquet file & its processed tombstones fully transactional

* chore: cleanup

* test: add integration tests for new catalog update functions

* chore: remove catalog_update.rs

* chore: cleanup

* fix: assert the right values

* fix: create unique namespace

* fix: support non transaction create_many

* test: remove tests that do not work in a transaction

* fix: one more case with unique namespace

* chore: more verification around for better understanding why certain tests fail

* fix: compare difference rather than absolute becasue the DB already has data

* fix: fix the argument provided to SQL

* fix: return non-empty processed tombstones

* fix: insert the right parquet file

* chore: remove unsed file

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-07 14:44:15 +00:00
Carol (Nichols || Goulding) 62a2ad289b
feat: Implement deserializing IoxMetadata from protobuf (#3589)
Fixes #3587.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-02-02 16:05:21 +00:00
Marco Neumann 22778a3a80
chore: upgrade rskafka and parking_lot (#3592) 2022-02-01 11:50:42 +00:00
Carol (Nichols || Goulding) 093d5acfd4
fix: Unify temporary multiple definitions of IoxMetadata 2022-01-31 10:48:29 -05:00
Carol (Nichols || Goulding) 8f81ce5501
refactor: Share parquet_file::storage code between new and old metadata 2022-01-31 10:36:33 -05:00
Carol (Nichols || Goulding) bf89162fa5
refactor: Move IoxMetadata to parquet_file 2022-01-31 10:36:33 -05:00
Carol (Nichols || Goulding) 0f72a881ef
refactor: Rename Rust struct parquet_file::IoxMetadata to be IoxMetadataOld 2022-01-31 10:36:33 -05:00
Carol (Nichols || Goulding) 1b298bb5bd
refactor: Alias the old proto definitions to make clearer the new ones coming in 2022-01-31 10:36:33 -05:00
Dom 32d7c4cbfe
refactor: remove InfluxColumnType::IOx (#3565)
* refactor: remove InfluxColumnType::IOx

Remove unused column variant - see #3554 for context.

* refactor: reserve SEMANTIC_TYPE_IOX name in proto

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-01-27 21:15:36 +00:00
Andrew Lamb 5488c257d1
chore: Update datafusion, upgrade to arrow/parqet/arrow-flight 8.0.0 (#3517)
* chore: Update datafusion

* chore: update to arrow 8

* fix: update to use new DataFusion APIs

* fix: update case for sortedness

* fix: cargo hakari
2022-01-27 13:33:27 +00:00
Andrew Lamb dd23056efd
chore: update datafusion, arrow, prost, tonic, pbjson, etc (#3455)
* chore: update datafusion, arrow, prost, tonic, etc

* fix: update pprof as well

* chore: update hakari

* fix: update pbjson

* chore: update heappy

* fix: hakari

* fix: workaround https://github.com/influxdata/influxdb_iox/issues/3458

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-01-13 17:07:15 +00:00
Andrew Lamb cdf5c21cd4
fix: Fix max timestamp value comparison in chunk metadata (#3453)
* fix: Fix max timestamp value comparison in chunk metadata

* refactor: rename contains to overlaps

Co-authored-by: Edd Robinson <me@edd.io>
2022-01-13 16:58:30 +00:00
Raphael Taylor-Davies c5cf03511c
fix: parquet column count statistics (#2124) (#3444)
* fix: parquet metadata total_count (#2124)

* chore: review feedback

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-01-11 21:56:24 +00:00
Marco Neumann f3f6f335a9
chore: upgrade to snafu 0.7 (#3440) 2022-01-11 19:22:36 +00:00
Marco Neumann 37bb7f2120 chore: `cargo update`
dependabot currently doesn't work due to
https://github.com/dependabot/dependabot-core/issues/4574

Excluded `quote` due to
https://github.com/dtolnay/quote/issues/204
2022-01-11 14:57:51 +01:00
Nga Tran ec8644a39a refactor: return clearer error message 2021-12-07 12:24:28 -05:00
Nga Tran 561c5ed8e7 refactor: make checking no data happen during reading inout stream 2021-12-07 12:03:41 -05:00
Nga Tran c992c82582 chore: Merge branch 'main' into ntran/compact_os_tests 2021-12-07 11:08:12 -05:00
Raphael Taylor-Davies 5fdaa5b4ab
chore: don't panic with invalid parquet (#3309)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-12-06 21:15:35 +00:00
Carol (Nichols || Goulding) 7499eac067
fix: Disable uuid serde feature; we're not actually serializing any UUIDs
Connects to #3117.
2021-12-06 09:37:31 -05:00
Carol (Nichols || Goulding) 02c297e850
fix: Always specify the parking_lot feature of tokio to get potential perf boost 2021-12-06 09:37:15 -05:00
Carol (Nichols || Goulding) 0b24b3c227
fix: Use a consistent version specifier when depending on the futures crate 2021-12-06 09:37:12 -05:00
Raphael Taylor-Davies bca561366b
feat: don't copy parquet files out of disk object store (#3282) (#3293)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-12-05 16:31:40 +00:00
Raphael Taylor-Davies 11067bfe3f
feat: simplify parquet reader (#3282) (#3291)
* feat: simplify parquet reader (#3282)

* chore: add back log line
2021-12-03 23:21:58 +00:00
Nga Tran 86f9fe0bcb refactor: no longer need to create and test no-row-groups parquet files 2021-12-03 15:14:04 -05:00
Nga Tran 152281e428 fix: Capture the right 'no data' while parquet has no data 2021-12-03 12:19:48 -05:00
kodiakhq[bot] 2857b6a990
Merge branch 'main' into er/feat/load_chunk_cli 2021-12-02 20:20:56 +00:00
Edd Robinson b4ea9887ba refactor: error name 2021-12-02 20:14:02 +00:00
Carol (Nichols || Goulding) 5d0fd1c603
fix: Allow dead code on fields that are now detected as never read 2021-12-02 11:52:01 -05:00
Edd Robinson 88aedc556e feat: add FromStr implementation 2021-12-02 12:59:52 +00:00
Nga Tran bf74608dc8 docs: not persist of the input stream is empty 2021-12-01 17:53:19 -05:00
Nga Tran f085af034e
refactor: not persist empty chunk resulting from deleting & deduplicating (#3274)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-12-01 20:57:30 +00:00
Nga Tran f53cdca010 feat: handling empty compacted stream 2021-11-30 18:13:36 -05:00
Raphael Taylor-Davies 197634ed50
feat: reload chunk back into read buffer (#3209) (#3216)
* feat: reload chunk back into read buffer (#3209)

* chore: fix logical conflict

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-29 11:34:55 +00:00
kodiakhq[bot] d16a7759ca
Merge branch 'main' into cn/workspace-hack 2021-11-22 17:05:31 +00:00
Raphael Taylor-Davies 73d60539ad
refactor: use ChunkGenerator in parquet_catalog (#2209) (#3167)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-22 10:29:33 +00:00
Carol (Nichols || Goulding) 9fd4a560f5
feat: Results of running cargo hakari manage-deps 2021-11-19 09:21:57 -05:00
Raphael Taylor-Davies ca4e0ad13b
refactor: add parquet chunk generator (#2209) (#3163)
* refactor: add parquet chunk generator (#2209)

* fix: tests

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-19 12:35:18 +00:00
Carol (Nichols || Goulding) c8d80e5c28
fix: Change database paths to be under /dbs/ instead of under /[server id]/ 2021-11-05 10:14:06 -04:00
Andrew Lamb 1902c4f8a9
chore: Update DataFusion (#3012)
* chore: Update DataFusion

* fix: restore Cargo.log

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-02 18:06:21 +00:00
Marco Neumann 4c9570b519 refactor: move `catalog` protobuf to `preserved_catalog`
This makes it clearer what's going since the contained messages are
only for the preserved part, not the in-mem catalog and its management.
2021-11-01 18:07:25 +01:00
dependabot[bot] c540b40f05
chore(deps): bump tokio from 1.12.0 to 1.13.0
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.12.0 to 1.13.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.12.0...tokio-1.13.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2021-11-01 11:21:59 +00:00
Carol (Nichols || Goulding) 990f768cda
fix: Assign a UUID when creating a database 2021-10-28 13:20:28 -04:00
Carol (Nichols || Goulding) 8198c1ff2a
refactor: Rename IoxObjectStore constructors to better match what server does with Databases 2021-10-28 13:20:27 -04:00
Marco Neumann bc7244c48e chore: use Rust edition 2021 2021-10-25 10:58:20 +02:00
Andrew Lamb a82dc6f5f0
chore: Update datafusion + arrow (#2903)
* chore: Update datafusion to latest, arrow to 6.0.0

* fix: Update tests

* fix: bubble internal error

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-10-19 17:14:08 +00:00
Marco Neumann d8f35d8ee9 chore: remove unused `parquet_file` => `chrono` dep 2021-10-19 14:45:56 +02:00
Marco Neumann 28195b9c0c chore: new `parquet_catalog` crate 2021-10-14 14:34:59 +02:00
Andrew Lamb 0568452a0c
chore: Update datafusion (#2838)
* chore: update datafusion version

* refactor: Update to use new datafusion apis

* fix: do not upgrade other packages
2021-10-13 20:51:19 +00:00
Marco Neumann 1523e0edcd refactor: clean up preserved catalog interface
1. Remove `new_empty` logic. It's a leftover from the time when the
   `PreservedCatalog` owned the in-memory catalog.
2. Make `db_name` a part of the `PreservedCatalogConfig`.
2021-10-13 13:58:11 +02:00
Raphael Taylor-Davies 8414e6edbb
feat: migrate preserved catalog to TimeProvider (#2722) (#2808)
* feat: migrate preserved catalog to TimeProvider (#2722)

* fix: deterministic catalog prune tests

* fix: failing test

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-10-12 14:43:05 +00:00
Raphael Taylor-Davies 3dfe400e6b
feat: migrate write path to TimeProvider (#2722) (#2807) 2021-10-12 12:09:08 +00:00
Raphael Taylor-Davies b39e01f7ba
feat: migrate PersistenceWindows to TimeProvider (#2722) (#2798) 2021-10-11 20:40:00 +00:00
Raphael Taylor-Davies 06c2c23322
refactor: create PreservedCatalogConfig struct (#2793)
* refactor: create PreservedCatalogConfig struct

* chore: fmt

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-10-11 15:43:05 +00:00
Carol (Nichols || Goulding) 5da2f7b1b0
Merge branch 'main' into cn/less-database-name 2021-10-11 10:35:42 -04:00
Raphael Taylor-Davies afe34751e7
refactor: split out schema crate (#2781)
* refactor: split out schema crate

* chore: fix doc
2021-10-11 09:45:08 +00:00
Carol (Nichols || Goulding) 8407735e00 fix: Pass the database name into PreservedCatalog 2021-10-08 15:25:10 -04:00
Carol (Nichols || Goulding) 276aef69c9 refactor: Move PreservedCatalog test helper functions to test helpers and use them more 2021-10-08 15:25:10 -04:00
Carol (Nichols || Goulding) 3aff4fcb07 refactor: Extract test helper functions for common catalog operations
This will make the next change easier, and I think it makes the tests
easier to read.
2021-10-08 15:25:10 -04:00
kodiakhq[bot] 559a7e0221
Merge branch 'main' into cn/chunk-addr-smaller 2021-10-08 17:26:20 +00:00
Carol (Nichols || Goulding) fbe76935f4 fix: Remove some calls to iox_object_store.database_name 2021-10-08 09:50:14 -04:00
Marco Neumann 64bda1fc08 feat: improve `Debug`/`Display` for test `ChunkId`s 2021-10-08 13:55:56 +02:00
Marco Neumann d3de6bb6e4 refactor: `max_persisted_timestamp` => `flush_timestamp`
There might be data left before this timestamp that wasn't persisted
(e.g. incoming data while the persistence was running).
2021-10-08 12:36:23 +02:00
Marco Neumann 63a932fa37 refactor: "min unpersisted ts" => "max persisted ts"
Store the "maximum persisted timestamp" instead of the "minimum
unpersisted timestamp". This avoids the need to calculate the next
timestamp from the current one (which was done via "max TS + 1ns").

The old calculation was prone to overflow panics. Since the
timestamps in this calculation originate from user-provided data (and
not the wall clock), this was an easy DoS vector that could be triggered
via the following line protocol:

```text
table_1 foo=1 <i64::MAX>
```

which is

```text
table_1 foo=1 9223372036854775807
```

Bonus points: the timestamp persisted in the partition
checkpoints is now the very same that was used by the split query during
persistence. Consistence FTW!

Fixes #2225.
2021-10-08 11:52:49 +02:00
kodiakhq[bot] 7d6be3f500
Merge branch 'main' into crepererum/issue2748 2021-10-07 09:04:18 +00:00
Marco Neumann 63d74be490 refactor: make `ChunkId` a UUID 2021-10-07 10:23:27 +02:00
Marco Neumann 2a52fd90d9 fix: transaction pruning logic for "nothing to do" 2021-10-07 10:14:42 +02:00
kodiakhq[bot] d72a494198
Merge branch 'main' into crepererum/in_mem_expr_part5 2021-10-05 16:20:24 +00:00
Marco Neumann b8aa4c33ce refactor: use protobuf bytes for transaction UUIDs 2021-10-05 12:27:48 +02:00
Marco Neumann bb7a27e5ed refactor: use proper sets during delete predicate collection
We no longer need hacky pointer tricks to de-duplicate delete predicates
when collecting them for catalog checkpoints. This was once required
when the delete predicates didn't implement `Eq` and `Hash` but now it's
all way easier.
2021-10-05 10:37:34 +02:00
Marco Neumann 28ccf2a8c3 refactor: `TransactionHandle::delete_predicate` cannot fail 2021-10-05 09:41:46 +02:00
Marco Neumann 10c1a72402 refactor: remove unused fields from `DeletePredicate` 2021-10-05 09:29:24 +02:00
Marco Neumann 97881079e8 refactor: make `ChunkOrder` non-zero
This will make it easier to handle missing values.

Helps with #2633.
2021-10-04 17:49:12 +02:00
Marco Neumann 75ac6e8646 refactor: make `DeletePredicate::range` non-optional 2021-10-04 16:36:20 +02:00
Marco Neumann d1835a3eee fix: doc links 2021-10-04 16:36:20 +02:00