234 Commits (e3e801d29aa31b019b8e3ebaff6875617b9a01a6)

Author SHA1 Message Date
Marco Neumann 3f9bcf7cd9 fix: fix NaN handling in parquet stats 2021-06-14 13:44:52 +02:00
Marco Neumann ea96210e98 test: enable unblocked test 2021-06-14 13:44:52 +02:00
Marco Neumann 518f7c6f15 refactor: wrap upstream parquet MD into struct + clean up interface
This spares users of `parquet_file::metadata` from also depending on
`parquet` directly. Furthermore, they don't need to import dozens of
functions and can instead just use `IoxParquetMetaData` directly.
2021-06-14 13:17:01 +02:00
Marco Neumann 030d0d2b9a feat: create checkpoint during catalog rebuild 2021-06-14 10:55:56 +02:00
Marco Neumann df866f72e0 refactor: store parquet metadata in chunk
This will be useful for #1381.

At the moment we parse schema and stats eagerly and store them alongside
the parquet metadata in memory. Technically this is not required since
this is basically duplicate data. In the future we might trade off some
of this memory against CPU consumption by parsing schema and stats on
demand.
2021-06-14 10:08:31 +02:00
Marco Neumann e6699ff15a test: ensure that `find_last_transaction_timestamp` considers checkpoints 2021-06-14 10:04:50 +02:00
Marco Neumann f8a518bbed refactor: inline `Table` into `parquet_file::chunk::Chunk`
Note that the resulting size estimations are different because we were
double-counting `Table`. `mem::size_of::<Self>()` is recursive for
non-boxed types since the child will be part of the parent structure.

Issue: #1295.
2021-06-11 11:54:31 +02:00
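
The size-estimation note in the commit above can be illustrated with a minimal sketch. The types `Table`, `InlineChunk`, and `BoxedChunk` below are hypothetical stand-ins, not the actual `parquet_file` structs; only `std::mem::size_of` is a real API. The point is that an inlined child is already included in the parent's `size_of`, so adding it again double-counts, whereas a boxed child contributes only a pointer to the parent.

```rust
use std::mem;

// Hypothetical stand-in for the inlined table data (not the real IOx type).
#[allow(dead_code)]
struct Table {
    row_count: u64,
    byte_size: u64,
}

// Child stored inline: its bytes are part of the parent structure.
#[allow(dead_code)]
struct InlineChunk {
    id: u32,
    table: Table,
}

// Child stored behind a Box: the parent only holds a pointer.
#[allow(dead_code)]
struct BoxedChunk {
    id: u32,
    table: Box<Table>,
}

/// Correct estimate for the inline layout: the parent already contains the child.
fn inline_size() -> usize {
    mem::size_of::<InlineChunk>()
}

/// Incorrect estimate for the inline layout: the child is counted twice.
fn double_counted_size() -> usize {
    mem::size_of::<InlineChunk>() + mem::size_of::<Table>()
}

/// Estimate for the boxed layout: the heap-allocated child must be added separately.
fn boxed_size() -> usize {
    mem::size_of::<BoxedChunk>() + mem::size_of::<Table>()
}

fn main() {
    println!("inline:         {} bytes", inline_size());
    println!("double-counted: {} bytes", double_counted_size());
    println!("boxed + heap:   {} bytes", boxed_size());
}
```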
Marco Neumann 28d1dc4da1 chore: bump preserved catalog version 2021-06-10 16:01:13 +02:00
Marco Neumann 80ee36cd1a refactor: slightly streamline path parsing code in pres. catalog 2021-06-10 15:59:28 +02:00
Marco Neumann 7e7332c9ce refactor: make comparison a bit less confusing 2021-06-10 15:42:21 +02:00
Marco Neumann fd581e2ec9 docs: fix confusing wording in `CatalogState::files` 2021-06-10 15:42:21 +02:00
Marco Neumann be9b3a4853 fix: protobuf lint fixes 2021-06-10 15:42:21 +02:00
Marco Neumann 294c304491 feat: impl catalog checkpointing infrastructure
This implements a way to add checkpoints to the preserved catalog and
speed up replay.

Note: This leaves the "hook it up into the actual DB" for a future PR.

Issue: #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 188cacec54 refactor: use `Arc` to pass `ParquetFileMetaData`
This will be handy when the catalog state must be able to return
metadata objects so that we can create checkpoints, esp. when we use
multi-chunk parquet files in some midterm future.
2021-06-10 15:42:21 +02:00
Marco Neumann c7412740e4 refactor: prepare to read and write multiple file types for catalog
Prepares #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 33e364ed78 feat: add encoding info to transaction protobuf
This should help with #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 4fe2d7af9c chore: enforce `clippy::future_not_send` for `parquet_file` 2021-06-09 18:18:27 +02:00
Andrew Lamb ab0aed0f2e
refactor: Remove a layer of channels in parquet read stream (#1648)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-07 16:47:04 +00:00
Raphael Taylor-Davies 1e7ef193a6
refactor: use field metadata to store influx types (#1642)
* refactor: use field metadata to store influx types

make SchemaBuilder non-consuming

* chore: remove unused variants

* chore: fix lints
2021-06-07 13:26:39 +00:00
Marco Neumann c830542464 feat: add info log when cleanup limit is reached 2021-06-04 11:12:29 +02:00
Marco Neumann 91df8a30e7 feat: limit number of files during storage cleanup
Since the number of parquet files is potentially unbounded (i.e. very
large), and we want neither to hold the transaction lock for too long
nor to let the cleanup routine consume too much memory, limit the
number of files that we collect for cleanup.
2021-06-03 17:43:11 +02:00
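
The idea of capping the cleanup batch can be sketched as follows. This is an illustration under assumptions, not the actual `parquet_file::cleanup` code: the function name `collect_cleanup_candidates`, the `max_files` parameter, and the use of plain `String` paths are all hypothetical. It lazily walks a listing, keeps only parquet files that the catalog does not reference, and stops as soon as the limit is reached, so both the work done under the lock and the memory held stay bounded.

```rust
use std::collections::HashSet;

/// Collect at most `max_files` unreferenced parquet paths for deletion.
/// Hypothetical helper; names and types are assumptions for illustration.
fn collect_cleanup_candidates<I>(
    listing: I,
    referenced: &HashSet<String>,
    max_files: usize,
) -> Vec<String>
where
    I: IntoIterator<Item = String>,
{
    listing
        .into_iter()
        // Leave non-parquet files alone.
        .filter(|path| path.ends_with(".parquet"))
        // Keep files that the catalog still references.
        .filter(|path| !referenced.contains(path))
        // Stop pulling from the listing once the limit is reached.
        .take(max_files)
        .collect()
}

fn main() {
    let referenced: HashSet<String> = vec!["a.parquet".to_string()].into_iter().collect();
    let listing = vec!["a.parquet", "b.parquet", "notes.txt", "c.parquet", "d.parquet"]
        .into_iter()
        .map(String::from);

    let to_delete = collect_cleanup_candidates(listing, &referenced, 2);
    println!("{:?}", to_delete);
}
```

Files beyond the limit (here `d.parquet`) are simply left for a later cleanup run.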
Marco Neumann 85139abbbb fix: use structured logging for cleanup logs 2021-06-03 11:23:29 +02:00
Andrew Lamb 32c6ed1f34
refactor: More cleanup related to multi-table chunks (#1604)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-02 17:00:23 +00:00
Marco Neumann e5b65e10ac test: ensure that `find_last_transaction_timestamp` indeed returns the last timestamp 2021-06-02 10:15:06 +02:00
Marco Neumann 98e413d5a9 fix: do not unwrap broken timestamps in serialized catalog 2021-06-02 10:15:06 +02:00
Marco Neumann fc0a74920f fix: use clearer error text 2021-06-02 09:41:19 +02:00
Marco Neumann 2a0b2698c6 fix: use structured logging
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-02 09:41:19 +02:00
Marco Neumann 64bf8c5182 docs: add code comment explaining why we parse transaction timestamps
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-02 09:41:19 +02:00
Marco Neumann 77aeb5ca5d refactor: use protobuf-native Timestamp instead of string 2021-06-02 09:41:19 +02:00
Marco Neumann 9b9400803b refactor!: bump transaction version to 2 2021-06-02 09:41:19 +02:00
Marco Neumann 5f77b7b92b feat: add `parquet_file::catalog::find_last_transaction_timestamp` 2021-06-02 09:41:19 +02:00
Marco Neumann 9aee961e2a test: test loading catalogs from broken protobufs 2021-06-02 09:41:19 +02:00
Marco Neumann 0a625b50e6 feat: store transaction timestamp in preserved catalog 2021-06-02 09:41:19 +02:00
Andrew Lamb d8fbb7b410
refactor: Remove last vestiges of multi-table chunks from PartitionChunk API (#1588)
* refactor: Remove last vestiges of multi-table chunks from PartitionChunk API

* fix: remove test that can no longer fail

* fix: update tests + code review comments

* fix: clippy

* fix: clippy

* fix: restore test_measurement_fields_error test
2021-06-01 16:12:33 +00:00
Andrew Lamb d3711a5591
refactor: Use ParquetExec from DataFusion to read parquet files (#1580)
* refactor: use ParquetExec to read parquet files

* fix: test

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-01 14:44:07 +00:00
Andrew Lamb 64328dcf1c
feat: cache schema on catalog chunks too (#1575) 2021-06-01 12:42:46 +00:00
Andrew Lamb 00e735ef0d
chore: remove unused dependencies (#1583) 2021-05-29 10:31:57 +00:00
Raphael Taylor-Davies db432de137
feat: add distinct count to StatValues (#1568) 2021-05-28 17:41:34 +00:00
kodiakhq[bot] 6098c7cd00
Merge branch 'main' into crepererum/issue1376 2021-05-28 07:13:15 +00:00
Andrew Lamb f3bec93ef1
feat: Cache TableSummary in Catalog rather than computing it on demand (#1569)
* feat: Cache `TableSummary` in catalog Chunks

* refactor: use consistent table summary
2021-05-27 16:03:05 +00:00
Marco Neumann dd2a976907 feat: add a flag to ignore metadata errors during catalog rebuild 2021-05-27 13:10:14 +02:00
Marco Neumann bc7389dc38 fix: fix typo
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-05-27 12:51:01 +02:00
Marco Neumann 48307e4ab2 docs: adjust error description to reflect internal errors
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-05-27 12:51:01 +02:00
Marco Neumann d6f0dc7059 feat: implement catalog rebuilding from files
Closes #1376.
2021-05-27 12:51:01 +02:00
Marco Neumann 024323912a docs: explain what `PreservedCatalog::wipe` offers 2021-05-27 12:48:41 +02:00
Raphael Taylor-Davies 4fcc04e6c9
chore: enable arrow prettyprint feature (#1566) 2021-05-27 10:28:14 +00:00
Marco Neumann 9f451423d5 feat: log files that are deleted 2021-05-26 12:49:44 +02:00
Marco Neumann 24ec1a472e fix: do NOT delete parquet files that are reachable by time travel 2021-05-26 12:38:54 +02:00
Marco Neumann 5983336366 refactor: rename `parquet_file::{utils => test_utils}` 2021-05-26 11:09:29 +02:00
Marco Neumann d7e3bc569e refactor: shorten time we hold the transaction lock during clean-up 2021-05-26 11:04:57 +02:00
Marco Neumann 18f5dd9ae1 test: ensure transaction lock exists during cleanup planning 2021-05-26 11:04:57 +02:00
Marco Neumann b55eae98da fix: do not delete non-parquet files during catalog-driven cleanup 2021-05-26 11:04:57 +02:00
Marco Neumann 5ed16ff294 refactor: improve error message in `parquet_file::cleanup` 2021-05-26 11:04:57 +02:00
Marco Neumann 14fdf3b7c7 feat: implement object store cleanup core routine 2021-05-26 11:02:40 +02:00
Marco Neumann cc78b5317d feat: add method to get all parquet files from catalog state 2021-05-26 11:02:40 +02:00
Marco Neumann 953114af2e feat: add method to abort catalog transaction 2021-05-26 11:02:40 +02:00
Marco Neumann 92fcd7e940 feat: add a way to get OS, server ID and DB name from catalog 2021-05-26 11:02:40 +02:00
Marco Neumann 9daa4d00d6 test: re-organize `parquet_file` test utils a bit 2021-05-26 11:02:39 +02:00
Marco Neumann 38183928c8 refactor: extract path generator for data location 2021-05-26 10:59:40 +02:00
Marco Neumann 19a2733d30 feat: preserve transaction metadata in parquets 2021-05-25 09:56:12 +02:00
Marco Neumann fe8e6301fe refactor: move `read_schema_from_parquet_metadata` back to `parquet_file::metadata`
Let us pool all metadata handling in a single module, which makes it
easier to review.
2021-05-25 09:37:53 +02:00
Marco Neumann ac83d99f66 feat: add a way to get current revision and UUID from transaction handle 2021-05-25 09:37:53 +02:00
Marco Neumann fdc553b257 refactor: replace unwrap with expect 2021-05-25 09:37:53 +02:00
Andrew Lamb c464ffadad
refactor: remove special case timestamp_range in parquet chunk (#1543)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-05-24 16:19:44 +00:00
Andrew Lamb 14ba25f86d
chore: Update datafusion and use released version of arrow crates (#1546)
* chore: Update datafusion and use released version of arrow crate

* fix: Update for change in API
2021-05-24 15:37:22 +00:00
Andrew Lamb 27e5b8fabf
refactor: Remove multiple table support from Parquet Chunk (#1541) 2021-05-24 08:40:31 -04:00
Marco Neumann 8bdddfd475 docs: mention that catalog wiping does not delete parquet files 2021-05-20 10:22:20 +02:00
Marco Neumann b1a06246d6 feat: implement function to wipe a preserved catalog 2021-05-20 10:22:20 +02:00
Marco Neumann 6c405aa6f9 feat: check if preserved catalog exists when creating an empty one 2021-05-20 10:22:20 +02:00
Marco Neumann c6a6005f65 feat: add `PreservedCatalog.exists` 2021-05-20 10:22:20 +02:00
Raphael Taylor-Davies 37880ee89a
refactor: store chunk IDs only in catalog (#1521)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-05-20 04:07:14 +00:00
Marco Neumann 8db26485a4 refactor: empty transaction during catalog creation
That involves some refactoring which we are going to need anyway for
hooking up the "read" path of the catalog into the DB startup, namely:

- make `Db::new` require a preserved catalog
- introduce a helper function that can provide that
- as a consequence, all test-creations of a Db are now async

This prepares for #1382.
2021-05-18 17:42:07 +02:00
Marco Neumann cdf0ada6a6 test: test preserved catalog <-> Db write wiring 2021-05-17 13:57:31 +02:00
Marco Neumann 68729dd5ee refactor: avoid string allocation 2021-05-17 12:32:34 +02:00
Marco Neumann adcd8132e7 docs: more comments regarding catalog transaction handling 2021-05-17 12:05:08 +02:00
Marco Neumann a99d53e771 docs: document `OpenTransaction::handle_action*` 2021-05-17 11:48:51 +02:00
Marco Neumann 4fb800c7a6 refactor: make PreservedCatalog easier to integrate 2021-05-17 11:33:22 +02:00
Marco Neumann f4d7154746 fix: table summaries must include timestamp as well 2021-05-17 11:33:22 +02:00
Marco Neumann 7cced3242f feat: add a way to parse infos from parquet paths 2021-05-17 11:33:22 +02:00
Marco Neumann 5969caccb0 feat: return parquet metadata from `write_to_object_store` 2021-05-17 11:33:22 +02:00
Raphael Taylor-Davies f9178dbb5f
feat: push metrics into catalog (#1488)
* feat: push metrics into catalog

* chore: minor cleanup

* fix: include db labels in chunk metric domains

* chore: fmt

* fix: don't allow dropping moving chunks

* chore: further tweaks

* chore: review feedback

* feat: use new_unregistered() for metric instruments instead of default

* chore: use &[KeyValue] instead of &Vec<KeyValue>

* refactor: make GauageValue non default constructible
2021-05-14 17:37:39 +00:00
Nga Tran 9583636748 feat: we can now read parquet files from all kinds of object stores 2021-05-12 18:05:34 -04:00
Marco Neumann 795f5bfcb7 refactor: make `StatValues::{min,max}` optional + handle NaNs
This will allow us to:

- handle all-NULL columns correctly
- be in-line with Parquet (where min/max are optional)
- handle NaNs at least somewhat sanely (they do not "poison" stats
  anymore)
2021-05-10 17:12:25 +02:00
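
A minimal sketch of what optional min/max with NaN handling might look like. The `FloatStats` type and its `update` method are hypothetical and are not the actual `StatValues` implementation; the sketch assumes that NULLs and NaNs are simply skipped, so an all-NULL column ends up with `None` for both bounds and a NaN never replaces a real value.

```rust
/// Hypothetical float statistics with optional bounds (not the real `StatValues`).
#[derive(Debug, Default, PartialEq)]
struct FloatStats {
    min: Option<f64>,
    max: Option<f64>,
}

impl FloatStats {
    /// Fold one (possibly NULL) value into the statistics.
    fn update(&mut self, value: Option<f64>) {
        // Skip NULLs and NaNs so they cannot influence min/max.
        let v = match value {
            Some(v) if !v.is_nan() => v,
            _ => return,
        };
        self.min = Some(self.min.map_or(v, |m| m.min(v)));
        self.max = Some(self.max.map_or(v, |m| m.max(v)));
    }
}

fn main() {
    // All-NULL column: both bounds stay unset.
    let mut all_null = FloatStats::default();
    all_null.update(None);
    all_null.update(None);
    println!("{:?}", all_null);

    // Mixed column: the NaN is ignored.
    let mut mixed = FloatStats::default();
    for v in [Some(1.0), Some(f64::NAN), None, Some(-2.0)] {
        mixed.update(v);
    }
    println!("{:?}", mixed);
}
```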
Nga Tran c6b933eb63 chore: merge main to branch 2021-05-07 18:40:17 -04:00
Nga Tran f2c19ec080 refactor: further address Carol's comment 2021-05-07 17:40:40 -04:00
Nga Tran 971500681f refactor: address Andrew's and Carol's comment 2021-05-07 17:33:19 -04:00
Carol (Nichols || Goulding) e2cc4634bf fix: Use PathBuf rather than debug formatting and back to String
This is the same fix I made in 54c5f98, just found a few more spots :)
2021-05-07 15:58:11 -04:00
Nga Tran 31d49db0ed chore: a little more cleanup 2021-05-07 09:38:41 -04:00
Nga Tran ba015ee4df refactor: clean up and add comments 2021-05-07 09:31:41 -04:00
Marco Neumann 1a998d4116 feat: preserve parquet metadata in catalog
Closes #1380.
2021-05-07 09:51:44 +02:00
Marco Neumann c3d523fc4f refactor: add col prefixes to make_chunk & Co 2021-05-07 09:51:44 +02:00
Marco Neumann 5db504300d refactor: use parsed paths instead of raw strings for catalog paths 2021-05-07 09:51:44 +02:00
Nga Tran 55bf848bd2 feat: Now we can query directly from files in object store 2021-05-06 18:02:17 -04:00
Andrew Lamb 884baf7329
feat: add column_type and influxdb_column_type, remove row_count from system.columns (#1415)
* feat: add column_type and influxdb_column_type, remove row_count from system.columns

* fix: update tests

* fix: more test update

* fix: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

* fix: fmt

* fix: copy/paste type conversion to avoid cross dependency between data_types and internal_types

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
2021-05-06 12:59:30 +00:00
Andrew Lamb 86771ea629
chore: update arrow/datafusion deps (#1433)
* chore: update datafusion deps

* chore: update arrow deps

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-05-05 22:37:31 +00:00
Nga Tran a5c92fae8a chore: merge main to branch 2021-05-05 13:48:42 -04:00
Nga Tran 3bdb451529 chore: merge main to branch 2021-05-05 13:18:39 -04:00
Raphael Taylor-Davies 411cf134e9
refactor: explode arrow_deps (#1425)
* refactor: explode arrow_deps

* chore: workaround doctest bug
2021-05-05 16:59:12 +00:00
Nga Tran 2b46f51e5b chore: address Dom's comment 2021-05-05 12:55:41 -04:00
Nga Tran a1f3413c89 refactor: move private test helpers to utils module to be used by many modules 2021-05-05 11:41:46 -04:00
Nga Tran fcb37a0b1d feat: more testing scenarios for querying parquet files 2021-05-05 10:57:02 -04:00
Marco Neumann 1f42eb89cd feat: implement parquet metadata handling
Closes #1379 and contributes to #1380.
2021-05-05 13:29:16 +02:00
Marco Neumann 056c29aaa2 feat: add a way to retrieve timestamp range from parquet chunk 2021-05-05 13:29:16 +02:00
Marco Neumann c54109113e feat: add a way to retrieve storage path from parquet chunks 2021-05-05 13:29:16 +02:00
Marco Neumann 136c35cb88 feat: implement transaction handling for catalog
Closes #1253.
2021-05-03 10:04:35 +02:00
Nga Tran 34a3388a49 feat: unload chunks from read buffer but keep them in object store 2021-04-30 16:12:02 -04:00
Nga Tran e87973babe refactor: address review comments 2021-04-29 13:15:43 -04:00
Nga Tran 402d9c748c chore: cargo fmt 2021-04-28 16:52:52 -04:00
Nga Tran 2a2760bd18 feat: complete tests where data in both RUB and OS 2021-04-28 16:14:07 -04:00
Nga Tran 140d96dbea feat: tests for loading data to object store and making sure we still query read buffer 2021-04-28 15:59:17 -04:00
Marco Neumann eddc9319ff docs: deny broken intradoc links 2021-04-27 13:22:28 +02:00
Carol (Nichols || Goulding) 272cdb85ce fix: Use the ServerId type everywhere, for writing, querying, anything 2021-04-26 18:44:32 +00:00
Carol (Nichols || Goulding) b8face3335 refactor: Organize use statements 2021-04-26 18:44:32 +00:00
Jake Goulding 67f5ad841d refactor: Introduce ServerId and CurrentServerId types 2021-04-26 18:44:32 +00:00
Nga Tran 657bfa1b20 refactor: address Andrew's comments 2021-04-16 17:44:46 -04:00
Nga Tran b3e110a241 refactor: address Jake's comment 2021-04-16 17:27:40 -04:00
Nga Tran 4c23ca8888 feat: full implementation of parquet's read_filter for review 2021-04-16 16:03:24 -04:00
Andrew Lamb e226b5a820
feat: Use TimestampNanosecondArray for timestamps in IOx (#1230)
* refactor: Create Arrow arrays using iterators

* feat: use Timestamp64(TimeUnit::Nanosecond) for timestamps

* feat: add support for timestamp array

* fix: update more tests

* fix: remove unnecessary code

Co-authored-by: Edd Robinson <me@edd.io>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-04-16 15:55:33 +00:00
Nga Tran 231ebb54d4 chore: fix a format 2021-04-14 16:32:25 -04:00
Nga Tran 4e2d59d9a5 feat: implement a few more functions as part of supporting query from parquet files 2021-04-14 16:06:47 -04:00
Nga Tran 05bf28ce85 feat: Add 2 main functions table_schema and table_names for Parquet Chunk to lay a foundation for querying it 2021-04-13 18:23:55 -04:00
Nga Tran 4a6d6bd7ad feat: initial work for querying data from parquet file in object store 2021-04-13 13:57:46 -04:00
Raphael Taylor-Davies 1997324344
feat: mutable buffer snapshotting (#1179)
* feat: mutable buffer snapshotting

* chore: review feedback
2021-04-13 12:14:54 +00:00
Nga Tran 453aeaf1a0 feat: Add tests for writing RB chunks to Object Store 2021-04-09 17:39:23 -04:00
Nga Tran f501a74aea refactor: Address review comments 2021-04-07 21:28:03 -04:00
Nga Tran be6e1e48e4 feat: add writer_id and object_store in Db 2021-04-07 18:36:07 -04:00
Raphael Taylor-Davies c2355aca6d
feat: add basic memory tracking (#1125)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-04-07 15:38:24 +00:00
Nga Tran 6e01fbc382 feat: use TableSummary as metadata for parquet chunk's tables and read buffer's read_filter to get data 2021-04-05 15:37:34 -04:00
Nga Tran 4bdf8963e6 feat: continue building foundation for writing RB chunks to parquet files 2021-04-02 16:06:25 -04:00
Nga Tran 49267114d3 chore: merge main into branch and resolve conflicts 2021-04-01 13:22:49 -04:00
Nga Tran 1463c6645f feat: Add ChunkState::ObjectStore and rename ParquetChunk to Chunk 2021-04-01 11:53:03 -04:00
Nga Tran 19a453a483 feat: finally have some framework with clear todos for writing a chunk into parquet files 2021-03-31 16:21:53 -04:00
Nga Tran cd409b471f feat: continue the implementation 2021-03-30 21:31:51 -04:00
Nga Tran 0bcd52d5c9 feat: Add more changes 2021-03-30 18:31:09 -04:00