Commit Graph

138 Commits (a449d5ef7433fcadcffe5991971c28849be69541)

Author SHA1 Message Date
kodiakhq[bot] b57f397057
Merge branch 'main' into crepererum/checkpoint_during_restore 2021-06-14 13:54:03 +00:00
Marco Neumann 0a7dcc3779 test: adjust read-write parquet test to newest test data 2021-06-14 14:24:24 +02:00
Marco Neumann d6f6ddfdaa fix: fix NULL handling in parquet stats 2021-06-14 14:24:09 +02:00
Marco Neumann eae56630fb test: add test for all-NULL float column metadata 2021-06-14 13:48:34 +02:00
Marco Neumann 3f9bcf7cd9 fix: fix NaN handling in parquet stats 2021-06-14 13:44:52 +02:00
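The NaN fix above matters because every comparison against NaN is false, so a naive min/max fold silently produces wrong statistics. A minimal sketch (the function name and logic are illustrative, not the repository's actual code) of skipping NaNs when computing a column minimum:

```rust
// NaN poisons naive min/max aggregation because every comparison with NaN
// is false; a stats computation has to filter NaNs out explicitly.
fn min_ignoring_nan(values: &[f64]) -> Option<f64> {
    values
        .iter()
        .copied()
        .filter(|v| !v.is_nan())
        .fold(None, |acc, v| {
            Some(match acc {
                Some(m) if m <= v => m,
                _ => v,
            })
        })
}

fn main() {
    assert_eq!(min_ignoring_nan(&[2.0, f64::NAN, 1.0]), Some(1.0));
    // An all-NaN (or all-NULL) column yields no statistics value at all,
    // which is what the all-NULL metadata test below exercises.
    assert_eq!(min_ignoring_nan(&[f64::NAN]), None);
}
```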
Marco Neumann ea96210e98 test: enable unblocked test 2021-06-14 13:44:52 +02:00
Marco Neumann 518f7c6f15 refactor: wrap upstream parquet MD into struct + clean up interface
This prevents users of `parquet_file::metadata` from also depending on
`parquet` directly. Furthermore, they don't need to import dozens of
functions and can instead just use `IoxParquetMetaData` directly.
2021-06-14 13:17:01 +02:00
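The wrapping described in this commit is the classic newtype pattern: the upstream type stays private and callers see only the curated interface. A hypothetical sketch (the `upstream_parquet` module and `num_rows` field are stand-ins, not the real `parquet` crate API):

```rust
// Stand-in for the upstream `parquet` crate's metadata type.
mod upstream_parquet {
    pub struct ParquetMetaData {
        pub num_rows: i64,
    }
}

// Newtype wrapper: callers depend on `IoxParquetMetaData` only and never
// see the upstream type, so the `parquet` dependency stays internal.
pub struct IoxParquetMetaData(upstream_parquet::ParquetMetaData);

impl IoxParquetMetaData {
    pub fn new(md: upstream_parquet::ParquetMetaData) -> Self {
        Self(md)
    }

    // Only curated accessors are exported, instead of dozens of free functions.
    pub fn row_count(&self) -> i64 {
        self.0.num_rows
    }
}

fn main() {
    let md = IoxParquetMetaData::new(upstream_parquet::ParquetMetaData { num_rows: 42 });
    assert_eq!(md.row_count(), 42);
}
```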
Marco Neumann 030d0d2b9a feat: create checkpoint during catalog rebuild 2021-06-14 10:55:56 +02:00
Marco Neumann df866f72e0 refactor: store parquet metadata in chunk
This will be useful for #1381.

At the moment we parse schema and stats eagerly and store them alongside
the parquet metadata in memory. Technically this is not required since
this is basically duplicate data. In the future we might trade off some
of this memory against CPU consumption by parsing schema and stats on
demand.
2021-06-14 10:08:31 +02:00
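The eager-versus-on-demand trade-off described above can be sketched as follows (the struct shape and the string-based "schema" are purely illustrative):

```rust
// Hypothetical sketch of the eager-parse trade-off: the schema is decoded
// once at construction and cached in memory alongside the raw metadata,
// duplicating what `raw` already encodes.
struct ChunkMetadata {
    raw: Vec<u8>,   // serialized parquet metadata
    schema: String, // eagerly parsed, cached copy
}

impl ChunkMetadata {
    fn new(raw: Vec<u8>) -> Self {
        // Stand-in for real schema decoding.
        let schema = format!("schema({} bytes)", raw.len());
        Self { raw, schema }
    }

    // An on-demand alternative would re-derive the schema from `raw` here,
    // trading CPU time for the memory the cached copy occupies.
    fn schema(&self) -> &str {
        &self.schema
    }
}

fn main() {
    let md = ChunkMetadata::new(vec![0u8; 16]);
    assert_eq!(md.schema(), "schema(16 bytes)");
    assert_eq!(md.raw.len(), 16);
}
```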
Marco Neumann e6699ff15a test: ensure that `find_last_transaction_timestamp` considers checkpoints 2021-06-14 10:04:50 +02:00
Marco Neumann f8a518bbed refactor: inline `Table` into `parquet_file::chunk::Chunk`
Note that the resulting size estimations are different because we were
double-counting `Table`. `mem::size_of::<Self>()` is recursive for
non-boxed types since the child will be part of the parent structure.

Issue: #1295.
2021-06-11 11:54:31 +02:00
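The double-counting described above follows from how `mem::size_of` works: an inline (non-boxed) child is laid out inside the parent, so its bytes are already included in the parent's size, while a boxed child contributes only a pointer. A small demonstration (struct names are illustrative):

```rust
use std::mem::size_of;

struct Table {
    _data: [u8; 64],
}

// Inline child: `Table`'s 64 bytes are part of `Chunk`'s own size.
struct Chunk {
    _table: Table,
}

// Boxed child: only a thin pointer is stored inline.
struct BoxedChunk {
    _table: Box<Table>,
}

fn main() {
    // size_of::<Self>() already covers non-boxed children, so adding the
    // child's size again would double-count it.
    assert_eq!(size_of::<Chunk>(), size_of::<Table>());
    assert_eq!(size_of::<BoxedChunk>(), size_of::<usize>());
}
```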
Marco Neumann 28d1dc4da1 chore: bump preserved catalog version 2021-06-10 16:01:13 +02:00
Marco Neumann 80ee36cd1a refactor: slightly streamline path parsing code in pres. catalog 2021-06-10 15:59:28 +02:00
Marco Neumann 7e7332c9ce refactor: make comparison a bit less confusing 2021-06-10 15:42:21 +02:00
Marco Neumann fd581e2ec9 docs: fix confusing wording in `CatalogState::files` 2021-06-10 15:42:21 +02:00
Marco Neumann be9b3a4853 fix: protobuf lint fixes 2021-06-10 15:42:21 +02:00
Marco Neumann 294c304491 feat: impl catalog checkpointing infrastructure
This implements a way to add checkpoints to the preserved catalog and
speed up replay.

Note: This leaves the "hook it up into the actual DB" for a future PR.

Issue: #1381.
2021-06-10 15:42:21 +02:00
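Why a checkpoint speeds up replay: instead of re-applying every transaction from the beginning, replay starts from the latest checkpointed state and applies only the transactions recorded after it. A hypothetical sketch (the state shape and "apply" step are stand-ins for the real preserved-catalog logic):

```rust
#[derive(Clone, Default, PartialEq, Debug)]
struct CatalogState {
    files: Vec<u32>,
}

// Replay either from scratch, or from a checkpoint taken after `start`
// transactions; only the remaining transactions are applied.
fn replay(checkpoint: Option<(usize, CatalogState)>, txns: &[u32]) -> CatalogState {
    let (start, mut state) = checkpoint.unwrap_or((0, CatalogState::default()));
    for &t in &txns[start..] {
        state.files.push(t); // stand-in for applying one transaction
    }
    state
}

fn main() {
    let txns = [1, 2, 3, 4];
    let full = replay(None, &txns);

    // A checkpoint taken after the first two transactions reaches the same
    // final state while doing half the work.
    let ckpt = CatalogState { files: vec![1, 2] };
    assert_eq!(replay(Some((2, ckpt)), &txns), full);
}
```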
Marco Neumann 188cacec54 refactor: use `Arc` to pass `ParquetFileMetaData`
This will be handy when the catalog state must be able to return
metadata objects so that we can create checkpoints, especially once we
use multi-chunk parquet files in the medium term.
2021-06-10 15:42:21 +02:00
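Passing the metadata behind an `Arc` means cloning is cheap reference-count bookkeeping rather than a deep copy, so the catalog state and a checkpoint writer can share one object. A minimal sketch (the struct and its field are hypothetical):

```rust
use std::sync::Arc;

struct ParquetFileMetaData {
    row_count: usize, // hypothetical payload
}

fn main() {
    let md = Arc::new(ParquetFileMetaData { row_count: 1000 });

    // Cloning an Arc copies only the pointer and bumps the refcount;
    // the metadata itself is shared, not duplicated.
    let for_checkpoint = Arc::clone(&md);
    assert_eq!(Arc::strong_count(&md), 2);
    assert_eq!(for_checkpoint.row_count, 1000);
}
```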
Marco Neumann c7412740e4 refactor: prepare to read and write multiple file types for catalog
Prepares #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 33e364ed78 feat: add encoding info to transaction protobuf
This should help with #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 4fe2d7af9c chore: enforce `clippy::future_not_send` for `parquet_file` 2021-06-09 18:18:27 +02:00
Andrew Lamb ab0aed0f2e
refactor: Remove a layer of channels in parquet read stream (#1648)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-07 16:47:04 +00:00
Raphael Taylor-Davies 1e7ef193a6
refactor: use field metadata to store influx types (#1642)
* refactor: use field metadata to store influx types

make SchemaBuilder non-consuming

* chore: remove unused variants

* chore: fix lints
2021-06-07 13:26:39 +00:00
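Storing the influx type in field metadata means each schema field carries its logical type as a key/value annotation instead of a separate side structure. A rough sketch of the idea using a plain map (the metadata key and value shown are hypothetical, not the project's actual constants):

```rust
use std::collections::HashMap;

fn main() {
    // Hypothetical field-level metadata: the logical InfluxDB column type
    // rides along with the schema field as a key/value pair.
    let mut field_metadata: HashMap<String, String> = HashMap::new();
    field_metadata.insert("influx_column_type".to_string(), "Tag".to_string());

    assert_eq!(
        field_metadata.get("influx_column_type").map(String::as_str),
        Some("Tag")
    );
}
```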
Marco Neumann c830542464 feat: add info log when cleanup limit is reached 2021-06-04 11:12:29 +02:00
Marco Neumann 91df8a30e7 feat: limit number of files during storage cleanup
Since the number of parquet files is potentially unbounded (i.e. very
large), and we neither want to hold the transaction lock for too long
nor let the cleanup routine consume too much memory, let's limit the
number of files that we collect for cleanup.
2021-06-03 17:43:11 +02:00
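The bounded collection described above amounts to capping how many candidates one cleanup pass gathers while the lock is held. A hypothetical sketch (function name and file naming are illustrative):

```rust
// Cap how many candidate files a single cleanup pass collects, so the
// transaction lock is held briefly and memory use stays bounded; the
// remaining files are picked up by later passes.
fn collect_cleanup_candidates(
    all_files: impl Iterator<Item = String>,
    limit: usize,
) -> Vec<String> {
    all_files.take(limit).collect()
}

fn main() {
    let files = (0..1_000).map(|i| format!("file-{}.parquet", i));
    let batch = collect_cleanup_candidates(files, 100);
    assert_eq!(batch.len(), 100);
    assert_eq!(batch[0], "file-0.parquet");
}
```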
Marco Neumann 85139abbbb fix: use structured logging for cleanup logs 2021-06-03 11:23:29 +02:00
Andrew Lamb 32c6ed1f34
refactor: More cleanup related to multi-table chunks (#1604)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-02 17:00:23 +00:00
Marco Neumann e5b65e10ac test: ensure that `find_last_transaction_timestamp` indeed returns the last timestamp 2021-06-02 10:15:06 +02:00
Marco Neumann 98e413d5a9 fix: do not unwrap broken timestamps in serialized catalog 2021-06-02 10:15:06 +02:00
Marco Neumann fc0a74920f fix: use clearer error text 2021-06-02 09:41:19 +02:00
Marco Neumann 2a0b2698c6 fix: use structured logging
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-02 09:41:19 +02:00
Marco Neumann 64bf8c5182 docs: add code comment explaining why we parse transaction timestamps
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-02 09:41:19 +02:00
Marco Neumann 77aeb5ca5d refactor: use protobuf-native Timestamp instead of string 2021-06-02 09:41:19 +02:00
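The string-to-Timestamp switch in this group of commits trades a parse-on-every-read representation for typed fields, which also removes a class of unwrap panics on malformed input (see the "do not unwrap broken timestamps" fix above). A hypothetical sketch of the contrast; the struct mirrors protobuf's well-known `Timestamp` shape (`seconds`/`nanos`), and the string parser is a deliberately simplified stand-in:

```rust
// Typed timestamp in the protobuf well-known-type shape.
#[derive(Debug, PartialEq)]
struct PbTimestamp {
    seconds: i64,
    nanos: i32,
}

// The string form must be parsed and validated on every read, and can fail.
fn parse_string_ts(s: &str) -> Option<PbTimestamp> {
    let seconds: i64 = s.parse().ok()?;
    Some(PbTimestamp { seconds, nanos: 0 })
}

fn main() {
    // The typed field is usable directly: no parse step, no parse failure.
    let native = PbTimestamp { seconds: 1_622_620_879, nanos: 0 };
    assert_eq!(native.seconds, 1_622_620_879);

    // A broken string timestamp surfaces as a recoverable error rather than
    // an unwrap panic.
    assert!(parse_string_ts("not-a-timestamp").is_none());
}
```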
Marco Neumann 9b9400803b refactor!: bump transaction version to 2 2021-06-02 09:41:19 +02:00
Marco Neumann 5f77b7b92b feat: add `parquet_file::catalog::find_last_transaction_timestamp` 2021-06-02 09:41:19 +02:00
Marco Neumann 9aee961e2a test: test loading catalogs from broken protobufs 2021-06-02 09:41:19 +02:00
Marco Neumann 0a625b50e6 feat: store transaction timestamp in preserved catalog 2021-06-02 09:41:19 +02:00
Andrew Lamb d8fbb7b410
refactor: Remove last vestiges of multi-table chunks from PartitionChunk API (#1588)
* refactor: Remove last vestiges of multi-table chunks from PartitionChunk API

* fix: remove test that can no longer fail

* fix: update tests + code review comments

* fix: clippy

* fix: clippy

* fix: restore test_measurement_fields_error test
2021-06-01 16:12:33 +00:00
Andrew Lamb d3711a5591
refactor: Use ParquetExec from DataFusion to read parquet files (#1580)
* refactor: use ParquetExec to read parquet files

* fix: test

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-01 14:44:07 +00:00
Andrew Lamb 64328dcf1c
feat: cache schema on catalog chunks too (#1575) 2021-06-01 12:42:46 +00:00
Andrew Lamb 00e735ef0d
chore: remove unused dependencies (#1583) 2021-05-29 10:31:57 +00:00
Raphael Taylor-Davies db432de137
feat: add distinct count to StatValues (#1568) 2021-05-28 17:41:34 +00:00
kodiakhq[bot] 6098c7cd00
Merge branch 'main' into crepererum/issue1376 2021-05-28 07:13:15 +00:00
Andrew Lamb f3bec93ef1
feat: Cache TableSummary in Catalog rather than computing it on demand (#1569)
* feat: Cache `TableSummary` in catalog Chunks

* refactor: use consistent table summary
2021-05-27 16:03:05 +00:00
Marco Neumann dd2a976907 feat: add a flag to ignore metadata errors during catalog rebuild 2021-05-27 13:10:14 +02:00
Marco Neumann bc7389dc38 fix: fix typo
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-05-27 12:51:01 +02:00
Marco Neumann 48307e4ab2 docs: adjust error description to reflect internal errors
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-05-27 12:51:01 +02:00
Marco Neumann d6f0dc7059 feat: implement catalog rebuilding from files
Closes #1376.
2021-05-27 12:51:01 +02:00
Marco Neumann 024323912a docs: explain what `PreservedCatalog::wipe` offers 2021-05-27 12:48:41 +02:00
Raphael Taylor-Davies 4fcc04e6c9
chore: enable arrow prettyprint feature (#1566) 2021-05-27 10:28:14 +00:00