Commit Graph

234 Commits (e3e801d29aa31b019b8e3ebaff6875617b9a01a6)

Author SHA1 Message Date
Marco Neumann 18893e76e0 refactor: convert some table name and part. key String to Arcs
This has the (somewhat nice) side effect that it shrinks the in-mem
catalog a bit as well because nw `ParquetChunk` is a bit smaller making
the chunk stage enum smaller as well.
2021-07-08 14:34:28 +02:00
Marco Neumann b528ac2b55 feat: store schemas per table
This way we can:

- check for schema matches even for writes going into different
  partitions
- solve #1768 and #1884 in some future PR

Closes #1897.
2021-07-08 09:18:09 +02:00
Andrew Lamb e6d995cbd8
chore: Update to Rust 1.53.0 (#1922)
* chore: Update to Rust 1.53.0

* fix: Update to latest clippy standards

* fix: bad refactor

* fix: Update escaping

* test: update test output

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-07-07 18:02:03 +00:00
Marco Neumann 4ca2d3e148 chore: move persistence windows related code into own crate
The entire persistence windows data structures (including the
checkpoints) have nothing to do with the mutable buffer per se. So lets
move them into their own crate. This also makes `parquet_file` not
longer depend on `mutable_buffer`.
2021-07-05 10:23:58 +02:00
Marco Neumann d96e15c3f7 docs: explain why we store checkpoints in parquet files 2021-07-05 09:42:46 +02:00
Marco Neumann cdab1bed05 feat: persist part+db checkpoint in parquets and catalog
This will be required for replay on server startup.
2021-07-05 09:42:46 +02:00
Jacob Marble 0779b0d9bd
feat: add gRPC listener for new write protocol (#1842)
* feat: add gRPC listener for new write protocol

* chore: clippy happy

* chore: lint

* chore: cargo fmt --all

* chore: cargo clippy

* chore: protobuf-lint

* chore: more formatting

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-07-01 16:15:12 +00:00
Marco Neumann 4204127b05 refactor: use protobuf for in-parquet metadata 2021-06-30 16:51:37 +02:00
Marco Neumann ddc9cd49ca chore: bump preserved catalog version 2021-06-29 14:23:06 +02:00
Marco Neumann 3ebb6a3037 refactor: do not capture txn-specific information in parquet files
This helps with #1821.
2021-06-29 14:22:36 +02:00
kodiakhq[bot] eda9532eb2
Merge branch 'main' into crepererum/issue1821-cleanup-lock 2021-06-29 10:48:43 +00:00
Marco Neumann 48df13de05 refactor: use parking lot for catalog cleanup 2021-06-29 12:47:29 +02:00
Marco Neumann f824f235b4
fix: fix info log message
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-29 12:35:05 +02:00
Marco Neumann 778a611fb8 docs: add clarifying comment for rebuild test 2021-06-29 11:58:19 +02:00
Marco Neumann 17f89ea8d0 docs: fix comment about lock downgrade 2021-06-29 11:53:55 +02:00
Marco Neumann 2cd5ce98be refactor: do not pass locks around for catalog cleanup 2021-06-29 10:21:41 +02:00
Marco Neumann 730a23faa3 refactor: improve locking around the parquet file cleanup
Instead of (ab)using the transaction lock to prevent the cleanup job
from removing just-written parquet files, use a dedicated lock. This
will later allow us to write parquet files before starting a transaction
(i.e. w/o holding the transaction lock).

This will help with #1821.
2021-06-29 10:20:03 +02:00
Marco Neumann 6ec24353bf refactor: only rebuild a single txn for pres. catalogs
Stop relying on in-parquet transaction information during catalog
rebuilds. This has some downsides (no fork detection, only a single
transaction hence no time travel) but will allow that we remove
transaction information from parquet files, so that we can finally move
the actual parquet file storage out of the transaction lock.

This will help with #1821.
2021-06-28 15:10:44 +02:00
Andrew Lamb 0a03605bbc
refactor: pull Channel --> Stream adapater into its own module (#1793)
* refactor: pull Channel --> Stream adapater into its own module

* docs: Update query/src/exec/stream.rs

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
2021-06-24 10:35:45 +00:00
kodiakhq[bot] 59993e8b8f
Merge branch 'main' into crepererum/issue1623 2021-06-23 12:40:05 +00:00
Marco Neumann c395409b51 feat: include UUIDv4 into parquet file names
Change schema from

```text
<server_id>/<db_name>/data/<part_key>/<chunk_id>/<table_name>.parquet
```

to

```text
<server_id>/<db_name>/data/<table_name>/<part_key>/<chunk_id>.<uuid>.parquet
```

So parquet files will NEVER be overwritten. This is especially helpful
when dealing with old catalog leftovers (i.e. a parquet file that
belonged to an old but wiped catalog). It also simplifies the reasoning
about file references in the future and follows what other dataset
formats are usually doing (i.e. never replace files).

Also use `ChunkAddr` where it makes sense.
2021-06-23 14:30:28 +02:00
kodiakhq[bot] 70817a474c
Merge branch 'main' into crepererum/issue1740-d 2021-06-23 12:29:54 +00:00
Raphael Taylor-Davies 5cd911c74a
fix: correct row count for object store chunks (#1789) 2021-06-23 12:06:49 +00:00
Marco Neumann 1636f47565 refactor: remove dead code 2021-06-23 10:51:22 +02:00
Marco Neumann cf55df68b5 refactor: remove some `Arc`s around the in-mem catalog
This is for #1740.
2021-06-23 10:51:22 +02:00
Marco Neumann e36b6f9c7a docs: fix intra-doc link 2021-06-23 10:25:05 +02:00
Marco Neumann 67508094b4 fix: double ref
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
2021-06-23 10:25:05 +02:00
Marco Neumann d2be641864 refactor: make checkpointing easier to use
Don't mix commit+checkpoint in a single call so that the caller has to
reason about the error type and which of the two operations has failed.
Splitting it also makes it easier to create the correct checkpoint data.
2021-06-23 10:25:05 +02:00
Marco Neumann 4a961694ec refactor: make caller sync mem<>OS view during catalog transactions
This is for #1740. Greatly simplifies the integration of the persisted
catalog into the DB.
2021-06-23 10:25:05 +02:00
Marco Neumann d1db0dfaeb refactor: remove type parameter from preserved catalog
For #1740.
2021-06-22 10:53:10 +02:00
Marco Neumann ff60627500 refactor: make preserved catalog NOT own the in-mem catalog
Works towards #1740.
2021-06-21 18:39:43 +02:00
Marco Neumann 881729bd23 refactor: make caller responsible to create checkpoint data
This decouples the in-mem and preserved catalog a bit and works
towards #1740.
2021-06-21 18:33:23 +02:00
Marco Neumann aba973a6e1 refactor: make catalog `wipe` a freestanding function
It does not interact with the `CatalogState` so users can call this
function without that type.
2021-06-21 09:31:23 +02:00
Andrew Lamb 258a6b1956
chore: remove more dead code (#1760)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-18 21:28:22 +00:00
Andrew Lamb de67bd3efe
refactor: Remove PartitionChunk::table_schema (#1756)
* refactor: Remove PartitionChunk::table_schema

* docs: update comments
2021-06-18 16:13:16 +00:00
Raphael Taylor-Davies f6dbc8d6f2
refactor: add ChunkAddr to describe location of chunk in catalog (#1745)
* refactor: add ChunkPath to describe location of chunk in catalog

* refactor: rename ChunkPath to ChunkAddr

* chore: further renames

* chore: even more renames

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-17 12:04:37 +00:00
Marco Neumann e056d97cf6 test: always test transaction aborts 2021-06-16 11:01:14 +02:00
Marco Neumann caaf95c6ec refactor: remove lock from `TestCatalogState` 2021-06-16 10:51:15 +02:00
Marco Neumann c8c412f6fe refactor: rework catalog state interface
This now allows not only for copy-based transaction handling but also
for eager exec and rollbacks. This will be useful to properly implement
transaction aborts for the "real" catalog.
2021-06-16 10:51:15 +02:00
Marco Neumann e064a6bbba test: add test suite for `CatalogState` impls
This makes it easier to check if `CatalogState` correctly implement all
features, including transaction aborting.
2021-06-16 10:50:47 +02:00
Andrew Lamb b756e09904
refactor: Rename parquet_file::Chunk --> ParquetChunk (#1722)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-15 11:21:49 +00:00
Marco Neumann 64c815dd50
fix: bump catalog version (#1726)
This should have been done in #1714. Also add a note so that future devs
might hopefully not forget. In any case though the code also works w/o
this bump, it's just that the error message is a bit less nice ("cannot
parse IOxMetadata" instead of "unsupported catalog version").

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-15 10:26:30 +00:00
Marco Neumann 55fc5e564b refactor: remove serverID and DB name args from catalog state
They are no longer required.
2021-06-15 09:35:41 +02:00
Marco Neumann 776b6c011c feat: remove path parsing functionality
Paths to parquet files are an implementation detail and should not be
parsed.

Closes #1506.
2021-06-14 16:24:50 +02:00
Marco Neumann 250ccdcdcd refactor: use `IOxMetadata` instead of path parsing for parquet chunks 2021-06-14 16:24:50 +02:00
Marco Neumann d51e7a127c feat: include table name, partition key, and chunk ID in `IoxMetadata` 2021-06-14 16:24:50 +02:00
kodiakhq[bot] b57f397057
Merge branch 'main' into crepererum/checkpoint_during_restore 2021-06-14 13:54:03 +00:00
Marco Neumann 0a7dcc3779 test: adjust read-write parquet test to newest test data 2021-06-14 14:24:24 +02:00
Marco Neumann d6f6ddfdaa fix: fix NULL handling in parquet stats 2021-06-14 14:24:09 +02:00
Marco Neumann eae56630fb test: add test for all-NULL float column metadata 2021-06-14 13:48:34 +02:00
Marco Neumann 3f9bcf7cd9 fix: fix NaN handling in parquet stats 2021-06-14 13:44:52 +02:00
Marco Neumann ea96210e98 test: enable unblocked test 2021-06-14 13:44:52 +02:00
Marco Neumann 518f7c6f15 refactor: wrap upstream parquet MD into struct + clean up interface
This prevents users from `parquet_file::metadata` to also depend on
`parquet` directly. Furthermore they don't need to important dozend of
functions and can instead just use `IoxParquetMetaData` directly.
2021-06-14 13:17:01 +02:00
Marco Neumann 030d0d2b9a feat: create checkpoint during catalog rebuild 2021-06-14 10:55:56 +02:00
Marco Neumann df866f72e0 refactor: store parquet metadata in chunk
This will be useful for #1381.

At the moment we parse schema and stats eagerly and store them alongside
the parquet metadata in memory. Technically this is not required since
this is basically duplicate data. In the future we might trade-off some
of this memory against CPU consumption by parsing schema and stats on
demand.
2021-06-14 10:08:31 +02:00
Marco Neumann e6699ff15a test: ensure that `find_last_transaction_timestamp` considers checkpoints 2021-06-14 10:04:50 +02:00
Marco Neumann f8a518bbed refactor: inline `Table` into `parquet_file::chunk::Chunk`
Note that the resulting size estimations are different because we were
double-counting `Table`. `mem::size_of::<Self>()` is recursive for
non-boxed types since the child will be part of the parent structure.

Issue: #1295.
2021-06-11 11:54:31 +02:00
Marco Neumann 28d1dc4da1 chore: bump preserved catalog version 2021-06-10 16:01:13 +02:00
Marco Neumann 80ee36cd1a refactor: slightly streamline path parsing code in pres. catalog 2021-06-10 15:59:28 +02:00
Marco Neumann 7e7332c9ce refactor: make comparison a bit less confusing 2021-06-10 15:42:21 +02:00
Marco Neumann fd581e2ec9 docs: fix confusion wording in `CatalogState::files` 2021-06-10 15:42:21 +02:00
Marco Neumann be9b3a4853 fix: protobuf lint fixes 2021-06-10 15:42:21 +02:00
Marco Neumann 294c304491 feat: impl catalog checkpointing infrastructure
This implements a way to add checkpoints to the preserved catalog and
speed up replay.

Note: This leaves the "hook it up into the actual DB" for a future PR.

Issue: #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 188cacec54 refactor: use `Arc` to pass `ParquetFileMetaData`
This will be handy when the catalog state must be able to return
metadata objects so that we can create checkpoints, esp. when we use
multi-chunk parquet files in some midterm future.
2021-06-10 15:42:21 +02:00
Marco Neumann c7412740e4 refactor: prepare to read and write multiple file types for catalog
Prepares #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 33e364ed78 feat: add encoding info to transaction protobuf
This should help with #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 4fe2d7af9c chore: enforce `clippy::future_not_send` for `parquet_file` 2021-06-09 18:18:27 +02:00
Andrew Lamb ab0aed0f2e
refactor: Remove a layer of channels in parquet read stream (#1648)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-07 16:47:04 +00:00
Raphael Taylor-Davies 1e7ef193a6
refactor: use field metadata to store influx types (#1642)
* refactor: use field metadata to store influx types

make SchemaBuilder non-consuming

* chore: remove unused variants

* chore: fix lints
2021-06-07 13:26:39 +00:00
Marco Neumann c830542464 feat: add info log when cleanup limit is reached 2021-06-04 11:12:29 +02:00
Marco Neumann 91df8a30e7 feat: limit number of files during storage cleanup
Since the number of parquet files can potentially be unbound (aka very
very large) and we do not want to hold the transaction lock for too
long and also want to limit memory consumption of the cleanup routine,
let's limit the number of files that we collect for cleanup.
2021-06-03 17:43:11 +02:00
Marco Neumann 85139abbbb fix: use structured logging for cleanup logs 2021-06-03 11:23:29 +02:00
Andrew Lamb 32c6ed1f34
refactor: More cleanup related to multi-table chunks (#1604)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-02 17:00:23 +00:00
Marco Neumann e5b65e10ac test: ensure that `find_last_transaction_timestamp` indeed returns the last timestamp 2021-06-02 10:15:06 +02:00
Marco Neumann 98e413d5a9 fix: do not unwrap broken timestamps in serialized catalog 2021-06-02 10:15:06 +02:00
Marco Neumann fc0a74920f fix: use clearer error text 2021-06-02 09:41:19 +02:00
Marco Neumann 2a0b2698c6 fix: use structured logging
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-02 09:41:19 +02:00
Marco Neumann 64bf8c5182 docs: add code comment explaining why we parse transaction timestamps
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-02 09:41:19 +02:00
Marco Neumann 77aeb5ca5d refactor: use protobuf-native Timestamp instead of string 2021-06-02 09:41:19 +02:00
Marco Neumann 9b9400803b refactor!: bump transaction version to 2 2021-06-02 09:41:19 +02:00
Marco Neumann 5f77b7b92b feat: add `parquet_file::catalog::find_last_transaction_timestamp` 2021-06-02 09:41:19 +02:00
Marco Neumann 9aee961e2a test: test loading catalogs from broken protobufs 2021-06-02 09:41:19 +02:00
Marco Neumann 0a625b50e6 feat: store transaction timestamp in preserved catalog 2021-06-02 09:41:19 +02:00
Andrew Lamb d8fbb7b410
refactor: Remove last vestiges of multi-table chunks from PartitionChunk API (#1588)
* refactor: Remove last vestiges of multi-table chunks from PartitionChunk API

* fix: remove test that can no longer fail

* fix: update tests + code review comments

* fix: clippy

* fix: clippy

* fix: restore test_measurement_fields_error test
2021-06-01 16:12:33 +00:00
Andrew Lamb d3711a5591
refactor: Use ParquetExec from DataFusion to read parquet files (#1580)
* refactor: use ParquetExec to read parquet files

* fix: test

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-01 14:44:07 +00:00
Andrew Lamb 64328dcf1c
feat: cache schema on catalog chunks too (#1575) 2021-06-01 12:42:46 +00:00
Andrew Lamb 00e735ef0d
chore: remove unused dependencies (#1583) 2021-05-29 10:31:57 +00:00
Raphael Taylor-Davies db432de137
feat: add distinct count to StatValues (#1568) 2021-05-28 17:41:34 +00:00
kodiakhq[bot] 6098c7cd00
Merge branch 'main' into crepererum/issue1376 2021-05-28 07:13:15 +00:00
Andrew Lamb f3bec93ef1
feat: Cache TableSummary in Catalog rather than computing it on demand (#1569)
* feat: Cache `TableSummary` in catalog Chunks

* refactor: use consistent table summary
2021-05-27 16:03:05 +00:00
Marco Neumann dd2a976907 feat: add a flag to ignore metadata errors during catalog rebuild 2021-05-27 13:10:14 +02:00
Marco Neumann bc7389dc38 fix: fix typo
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-05-27 12:51:01 +02:00
Marco Neumann 48307e4ab2 docs: adjust error description to reflect internal errors
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-05-27 12:51:01 +02:00
Marco Neumann d6f0dc7059 feat: implement catalog rebuilding from files
Closes #1376.
2021-05-27 12:51:01 +02:00
Marco Neumann 024323912a docs: explain what `PreservedCatalog::wipe` offers 2021-05-27 12:48:41 +02:00
Raphael Taylor-Davies 4fcc04e6c9
chore: enable arrow prettyprint feature (#1566) 2021-05-27 10:28:14 +00:00
Marco Neumann 9f451423d5 feat: log files that are deleted 2021-05-26 12:49:44 +02:00
Marco Neumann 24ec1a472e fix: do NOT delete parquet files that are reachable by time travel 2021-05-26 12:38:54 +02:00
Marco Neumann 5983336366 refactor: rename `parquet_file::{utils => test_utils}` 2021-05-26 11:09:29 +02:00
Marco Neumann d7e3bc569e refactor: shorten time we hold the transaction lock during clean-up 2021-05-26 11:04:57 +02:00