Commit Graph

579 Commits (c8242c74696bd849e8b296f7b255d909babd7bd5)

Author SHA1 Message Date
Carol (Nichols || Goulding) 7246f2702a fix: Bump transaction version because of a change in the Parquet files 2021-08-19 09:32:37 -04:00
Raphael Taylor-Davies 5a841600d9 feat: make catalog state test deterministic (#2349) 2021-08-19 14:04:27 +01:00
Carol (Nichols || Goulding) 6390156c0e fix: Remove error types not used anywhere 2021-08-18 11:32:39 -04:00
Carol (Nichols || Goulding) ef0e1a3f60 refactor: Extract a transaction file path type 2021-08-18 11:32:39 -04:00
Carol (Nichols || Goulding) 6d5cb9c117 refactor: Extract a ParquetFilePath to handle paths to parquet files in a db's object store 2021-08-18 11:32:39 -04:00
Ning Sun c012e996ab
refactor: remove display methods, use fmt::Display instead. (#2272)
* refactor: remove display methods, use fmt::Display instead.

Signed-off-by: Ning Sun <sunng@protonmail.com>

* refactor: update a few calls from .display to .to_string()

* fix: consistently use `Path` rather than occasionally `DirsAndFileName`

* fix: fixup for merge conflicts

* fix: update test

* fix: Catch another case or two

* fix: fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-08-16 18:00:22 +00:00
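As a rough illustration of the refactor in c012e996ab: implementing `std::fmt::Display` gives a type `.to_string()` for free, so a dedicated `display()` method is no longer needed. `ExamplePath` below is a hypothetical stand-in and not the actual path type from the repository.

```rust
use std::fmt;

/// Hypothetical stand-in for a path type (assumption, not the real type).
struct ExamplePath {
    parts: Vec<String>,
}

impl fmt::Display for ExamplePath {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Render the path parts joined by `/`.
        write!(f, "{}", self.parts.join("/"))
    }
}
```

Callers can then write `path.to_string()` or `format!("{}", path)` instead of calling a custom method.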
Carol (Nichols || Goulding) 564238ad8c refactor: Organize uses 2021-08-12 15:05:32 -04:00
Carol (Nichols || Goulding) ae6b0e669b refactor: Extract a database persister type that wraps object store
Connects to #2193.
2021-08-12 15:05:32 -04:00
Carol (Nichols || Goulding) daa534ee32 refactor: Incorporate Path parsing into the TransactionFile type 2021-08-12 09:06:14 -04:00
Carol (Nichols || Goulding) ee3173efb1 refactor: Simplify implementation of parse_file_path 2021-08-12 09:06:14 -04:00
Carol (Nichols || Goulding) dbd1718fd2 refactor: Use the TransactionKey type 2021-08-12 09:06:14 -04:00
Carol (Nichols || Goulding) 7f7a911a9a refactor: Extract a TransactionFile type to manage transaction paths 2021-08-12 09:06:06 -04:00
Dom 3de6b44e23
build: use new rustdoc lint name (#2261)
* fix: nocache feature code rot

The MBChunk::snapshot code when using the "nocache" option no longer
compiles - this commit updates it to match the not(nocache) code.

* build: use updated broken_intra_doc_links name

The broken_intra_doc_links lint was renamed
rustdoc::broken_intra_doc_links

https://doc.rust-lang.org/rustdoc/lints.html
2021-08-11 19:48:51 +00:00
Marco Neumann 8721c5fcd6 fix: improve error messages 2021-08-09 10:54:23 +02:00
Marco Neumann 950286e5b7 feat: make replay planning work w/ unordered checkpoints 2021-08-09 10:54:23 +02:00
Andrew Lamb d41b44d312
feat: use zstd compression when writing parquet files (#2218)
* feat: use ZSTD when writing parquet files

* fix: test
2021-08-06 18:45:55 +00:00
Andrew Lamb e92e94caad
chore: Update deps (including arrow 5.1.0, tonic -> 0.5, and prost 0.5) (#2172)
* chore: Update deps (including arrow 5.0.0 --> arrow 5.1.0)

* chore: update all the things

* refactor: Update serving readiness check due to change in Tonic API

* chore: update more deps

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-08-05 15:57:38 +00:00
Andrew Lamb 1ccaa433e8
fix: Temporarily disable parquet predicate pushdown (#2164) 2021-07-30 20:24:30 +00:00
Carol (Nichols || Goulding) 9d15798288 fix: Address or allow Clippy warnings new with Rust 1.54 2021-07-30 09:59:59 -04:00
kodiakhq[bot] 545222303f
Merge branch 'main' into cn/cc-only 2021-07-29 17:18:16 +00:00
Carol (Nichols || Goulding) ad0a9549de fix: Avoid an unnecessary parsing of iox metadata
In one case where ParquetChunk::new was being called, the calling code
had just parsed the IoxMetadata too. In the other case, the calling code
had just *created* the IoxMetadata being parsed. In both cases, this
re-parsing wasn't actually needed; the two bits of info
ParquetChunk::new can be easily passed in.
2021-07-28 14:25:56 -04:00
Carol (Nichols || Goulding) af7866a638 refactor: Remove first/last write times from ParquetFile chunks 2021-07-28 14:12:36 -04:00
Marco Neumann 04e797c706 refactor: pass sequencer numbers directly to DB checkpoint
First of all using a partition checkpoint as some kind of intermediate
representation was kinda a hack because partition checkpoints should
only be created for to-be-persisted partitions, not for the others.
API-wise it should only be possible to construct a partition checkpoint
from a flush handle.

Also we were only able to construct partition checkpoints for partitions
that had unpersisted data, otherwise there was no sane way to fill the
`min_unpersisted_timestamp`. We must however scan all partitions no
matter if there is unpersisted data so that we can determine the maximum
seen sequence numbers. This was caught by a replay test resulting in a
catalog state where the last database checkpoint had lower maximum seen
sequence numbers than some partition checkpoint, bailing out with an
error.

So overall it turns out that passing the sequencer numbers directly
instead of wrapping them into a partition checkpoint is the better
implementation.
2021-07-28 17:28:34 +02:00
Andrew Lamb 5fb3e00f2a
fix: Properly record total_count and null_count in statistics (#2103)
* fix: Properly record total_count and null_count in statistics

* fix: fix statistics calculation in mutable_buffer

* refactor: expose null counts in read_buffer

* refactor: expose null_count in parquet_file

* fix: update server crate tests

* fix: update query_tests tests

* docs: tweak comments

* refactor: Use storage_stats rather than adding `null_count`

* refactor: rename test data field for clarity

* fix: fixup merge conflicts

* refactor: rename initial_non_null_count to initial_total_count

* refactor: calculate null_count as row_count - to_add
2021-07-26 18:13:36 +00:00
Carol (Nichols || Goulding) 0acb0efbc9 fix: Bump METADATA and TRANSACTION versions 2021-07-26 10:52:42 -04:00
Jake Goulding d928bc84e6 feat: Thread time_of_{first,last}_write through Parquet metadata 2021-07-23 14:07:35 -04:00
Carol (Nichols || Goulding) 9604ce7084 fix: Don't pass table name around when it's only returned back
The read_statistics, read_statistics_from_parquet_row_group,
load_parquet_from_store, and load_parquet_from_store_for_chunk functions
weren't ever using table name, they just passed it around and passed it
back.
2021-07-23 13:48:16 -04:00
Carol (Nichols || Goulding) 3c794153dd refactor: Organize uses 2021-07-23 13:48:15 -04:00
kodiakhq[bot] 5b5453a020
Merge branch 'main' into pd/add-parquet-cache 2021-07-22 20:21:53 +00:00
Paul Dix 88e29dede9 chore: remove extraneous example code from parquet storage 2021-07-22 16:21:13 -04:00
Andrew Lamb 01c79f1a1a
fix: Print all timestamps using RFC3339 format (#2098)
* fix: Use IOx pretty printer rather than arrow pretty printer

* chore: update tests in the query crate

* chore: update influxdb_iox tests

* chore: Update end to end tests

* chore: update query_tests

* chore: update mutable_buffer tests

* refactor: update parquet_file tests

* refactor: update db tests

* chore: update kafka integration test output

* fix: merge conflict
2021-07-22 19:04:52 +00:00
Marco Neumann 50241bae9e refactor: do not abuse `uint64::MAX` as sentinel for `None` 2021-07-22 12:51:43 +02:00
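A minimal sketch of why an `Option` is preferable to a sentinel value, as in 50241bae9e. Both function names are hypothetical and only illustrate the idea. With a sentinel, "nothing seen" and "saw `u64::MAX`" cannot be told apart.

```rust
/// Hypothetical helper: largest sequence number seen so far, if any.
fn max_seen(sequence_numbers: &[u64]) -> Option<u64> {
    sequence_numbers.iter().copied().max()
}

/// Sentinel-based variant for comparison: `u64::MAX` doubles as "nothing seen",
/// so an empty input and a genuine `u64::MAX` give the same result.
fn max_seen_with_sentinel(sequence_numbers: &[u64]) -> u64 {
    sequence_numbers.iter().copied().max().unwrap_or(u64::MAX)
}
```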
Paul Dix d95b5df03e refactor: move cache to ObjectStore
Since the consumers of ObjectStore always use the concrete type rather than the ObjectStoreApi trait, it makes more sense to just change the concrete type to have a pointer to the cache. This removes the cache from the ObjectStoreApi trait and changes the ObjectStore to be a regular struct rather than a tuple around the ObjectStoreIntegration. Future work will have the server configure the cache on the ObjectStore struct when its options are set.
2021-07-21 18:27:56 -04:00
Paul Dix d0ea812041 feat: add skeleton for object store file cache 2021-07-21 18:27:56 -04:00
Marco Neumann 57a9d5ade0 refactor: correctly track "seen" ranges in persistence checkpoints
Now we can handle all these cases:

There are two partitions w/ a single write each:

1. A reads sequence number 1
2. B reads sequence number 2
3. we persist A which only knows the sequences up until 1
=> the DB checkpoint needs the global max, otherwise we forget sequences
   during replay (2 in this case, so B would be gone)

1. B reads sequence number 1
2. A reads sequence number 2
3. we persist A which (w/o this commit) would not track the sequencer at
   all in this checkpoint (since there is nothing to replay)
=> we MUST also remember that we already read up until 2, otherwise we'll
   re-read 2 after replay
=> the partition checkpoint needs the local seen max (no matter if there's
   something to persist)
2021-07-21 19:19:49 +02:00
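The two cases in 57a9d5ade0 can be modelled roughly as follows. This is a sketch under assumptions: the map type and both function names are invented for illustration, and a single sequencer is assumed. The partition checkpoint records the local "seen" maximum, while the DB checkpoint records the global one.

```rust
use std::collections::BTreeMap;

/// Hypothetical model: sequence numbers each partition has read from one sequencer.
type SeenByPartition = BTreeMap<String, Vec<u64>>;

/// Local "seen" maximum of one partition (what its partition checkpoint would record).
fn partition_seen_max(seen: &SeenByPartition, partition: &str) -> Option<u64> {
    seen.get(partition)?.iter().copied().max()
}

/// Global "seen" maximum over all partitions (what the DB checkpoint would record).
fn db_seen_max(seen: &SeenByPartition) -> Option<u64> {
    seen.values().flatten().copied().max()
}
```

In the first case (A reads 1, B reads 2) the local maximum for A is 1 but the global maximum is 2. In the second case (B reads 1, A reads 2) both are 2.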
Marco Neumann a5fc1c7d38 fix: collect min AND max in database checkpoints
This is required to correctly handle the following case:

1. There are two partitions A and B w/ a single write each (from the same
   sequencer).
2. We persist A:
   - The partition checkpoint for A will be empty because after persistence
     there will be nothing to replay (the single write is persisted and
     we're ready).
   - The database checkpoint that contains the global minimum of all ranges
     recognizes that for the sequencer there is indeed something left (the
     minimum sequence number from B).
3. DB restart happens, replay starts
4. We scan all persisted files, figure out that we have a DB checkpoint
   with a sequence minimum but (w/o the change in this commit) there is no
   maximum. Only partition checkpoints contain maxima, and the only partition
   checkpoint that was persisted was the one for partition A and that one was
   empty (see above).
5. So now how do we recover partition B?
2021-07-21 14:48:29 +02:00
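A hedged sketch of the scenario in a5fc1c7d38, again for a single sequencer. `PartitionState` and `db_checkpoint_range` are hypothetical names. Returning `None` when nothing is left unpersisted is an assumption of this sketch, not something the commit states.

```rust
/// Hypothetical per-partition state for a single sequencer.
struct PartitionState {
    /// Sequence numbers read into this partition but not yet persisted.
    unpersisted: Vec<u64>,
    /// Highest sequence number this partition has seen, persisted or not.
    seen_max: Option<u64>,
}

/// (min, max) range a database checkpoint could record:
/// min = smallest unpersisted sequence number, max = largest seen overall.
fn db_checkpoint_range(partitions: &[PartitionState]) -> Option<(u64, u64)> {
    let min = partitions
        .iter()
        .flat_map(|p| p.unpersisted.iter().copied())
        .min()?;
    let max = partitions.iter().filter_map(|p| p.seen_max).max()?;
    Some((min, max))
}
```

After A is persisted, only B contributes an unpersisted sequence number, yet the range still has a maximum because it is taken over everything seen.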
Andrew Lamb 4da8a16c18
chore: update to arrow 5.0 and master datafusion (#2049)
* chore: update to arrow 5.0 and master datafusion

* fix: Update test for change in object size
2021-07-19 12:49:51 +00:00
Jake Goulding 42b56ad657 refactor: Use SNAFU's context instead of `ok_or_else` 2021-07-16 09:59:54 -04:00
Jake Goulding 939d15a21f perf: Avoid clone when an error doesn't occur 2021-07-16 09:59:54 -04:00
Marco Neumann f57ba6afdb
fix: use fixed-size timestamps for parquet metadata (#2032)
This fixes flaky tests that rely on predictable file sizes.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-07-16 13:14:02 +00:00
Andrew Lamb 0c86d1dccf
feat: Record parquet bytes size in catalog / parquet_file (#2006)
* feat: Store object store size in parquet_file

* fix: update TRANSACTION_VERSION to 8

* refactor: rename os_bytes --> file_size_bytes
2021-07-15 12:07:11 +00:00
Marco Neumann 40047a76bc refactor: `remove_parquet` cannot fail 2021-07-15 12:07:56 +02:00
Raphael Taylor-Davies 1d00fa2fd8
refactor: track memory metrics in catalog (#1995)
* refactor: track memory metrics in catalog

* chore: update comment
2021-07-14 16:23:00 +00:00
Andrew Lamb d35b74c226
fix: Fix doc build warnings (#1945)
* fix: Fix doc build warnings

* refactor: add deny bare_urls to crates

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-07-13 08:03:42 +00:00
Andrew Lamb 670826daf9
refactor: make object_store construction interface consistent (#1944)
* refactor: make object_store construction interface consistent

* fix: benchmarks

* fix: doc build

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-07-12 12:56:36 +00:00
Marco Neumann 18893e76e0 refactor: convert some table name and part. key String to Arcs
This has the (somewhat nice) side effect that it shrinks the in-mem
catalog a bit as well because now `ParquetChunk` is a bit smaller making
the chunk stage enum smaller as well.
2021-07-08 14:34:28 +02:00
Marco Neumann b528ac2b55 feat: store schemas per table
This way we can:

- check for schema matches even for writes going into different
  partitions
- solve #1768 and #1884 in some future PR

Closes #1897.
2021-07-08 09:18:09 +02:00
Andrew Lamb e6d995cbd8
chore: Update to Rust 1.53.0 (#1922)
* chore: Update to Rust 1.53.0

* fix: Update to latest clippy standards

* fix: bad refactor

* fix: Update escaping

* test: update test output

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-07-07 18:02:03 +00:00
Marco Neumann 4ca2d3e148 chore: move persistence windows related code into own crate
The entire persistence windows data structures (including the
checkpoints) have nothing to do with the mutable buffer per se. So let's
move them into their own crate. This also makes `parquet_file` no
longer depend on `mutable_buffer`.
2021-07-05 10:23:58 +02:00
Marco Neumann d96e15c3f7 docs: explain why we store checkpoints in parquet files 2021-07-05 09:42:46 +02:00
Marco Neumann cdab1bed05 feat: persist part+db checkpoint in parquets and catalog
This will be required for replay on server startup.
2021-07-05 09:42:46 +02:00
Jacob Marble 0779b0d9bd
feat: add gRPC listener for new write protocol (#1842)
* feat: add gRPC listener for new write protocol

* chore: clippy happy

* chore: lint

* chore: cargo fmt --all

* chore: cargo clippy

* chore: protobuf-lint

* chore: more formatting

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-07-01 16:15:12 +00:00
Marco Neumann 4204127b05 refactor: use protobuf for in-parquet metadata 2021-06-30 16:51:37 +02:00
Marco Neumann ddc9cd49ca chore: bump preserved catalog version 2021-06-29 14:23:06 +02:00
Marco Neumann 3ebb6a3037 refactor: do not capture txn-specific information in parquet files
This helps with #1821.
2021-06-29 14:22:36 +02:00
kodiakhq[bot] eda9532eb2
Merge branch 'main' into crepererum/issue1821-cleanup-lock 2021-06-29 10:48:43 +00:00
Marco Neumann 48df13de05 refactor: use parking lot for catalog cleanup 2021-06-29 12:47:29 +02:00
Marco Neumann f824f235b4
fix: fix info log message
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-29 12:35:05 +02:00
Marco Neumann 778a611fb8 docs: add clarifying comment for rebuild test 2021-06-29 11:58:19 +02:00
Marco Neumann 17f89ea8d0 docs: fix comment about lock downgrade 2021-06-29 11:53:55 +02:00
Marco Neumann 2cd5ce98be refactor: do not pass locks around for catalog cleanup 2021-06-29 10:21:41 +02:00
Marco Neumann 730a23faa3 refactor: improve locking around the parquet file cleanup
Instead of (ab)using the transaction lock to prevent the cleanup job
from removing just-written parquet files, use a dedicated lock. This
will later allow us to write parquet files before starting a transaction
(i.e. w/o holding the transaction lock).

This will help with #1821.
2021-06-29 10:20:03 +02:00
Marco Neumann 6ec24353bf refactor: only rebuild a single txn for pres. catalogs
Stop relying on in-parquet transaction information during catalog
rebuilds. This has some downsides (no fork detection, only a single
transaction hence no time travel) but will allow us to remove
transaction information from parquet files, so that we can finally move
the actual parquet file storage out of the transaction lock.

This will help with #1821.
2021-06-28 15:10:44 +02:00
Andrew Lamb 0a03605bbc
refactor: pull Channel --> Stream adapter into its own module (#1793)
* refactor: pull Channel --> Stream adapter into its own module

* docs: Update query/src/exec/stream.rs

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>

Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
2021-06-24 10:35:45 +00:00
kodiakhq[bot] 59993e8b8f
Merge branch 'main' into crepererum/issue1623 2021-06-23 12:40:05 +00:00
Marco Neumann c395409b51 feat: include UUIDv4 into parquet file names
Change schema from

```text
<server_id>/<db_name>/data/<part_key>/<chunk_id>/<table_name>.parquet
```

to

```text
<server_id>/<db_name>/data/<table_name>/<part_key>/<chunk_id>.<uuid>.parquet
```

So parquet files will NEVER be overwritten. This is especially helpful
when dealing with old catalog leftovers (i.e. a parquet file that
belonged to an old but wiped catalog). It also simplifies the reasoning
about file references in the future and follows what other dataset
formats are usually doing (i.e. never replace files).

Also use `ChunkAddr` where it makes sense.
2021-06-23 14:30:28 +02:00
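The new layout from c395409b51 could be produced by a helper along these lines. This is only a sketch: the function name, the parameter types, and passing the UUID as a string (to avoid an external crate) are all assumptions.

```rust
/// Builds a path following the layout
/// `<server_id>/<db_name>/data/<table_name>/<part_key>/<chunk_id>.<uuid>.parquet`.
///
/// Hypothetical helper; the UUID is taken as a string so the sketch stays std-only.
fn parquet_path(
    server_id: u32,
    db_name: &str,
    table_name: &str,
    part_key: &str,
    chunk_id: u32,
    uuid: &str,
) -> String {
    format!(
        "{}/{}/data/{}/{}/{}.{}.parquet",
        server_id, db_name, table_name, part_key, chunk_id, uuid
    )
}
```

Because the UUID is part of the file name, writing the same chunk twice with fresh UUIDs yields two distinct paths, so no existing file is overwritten.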
kodiakhq[bot] 70817a474c
Merge branch 'main' into crepererum/issue1740-d 2021-06-23 12:29:54 +00:00
Raphael Taylor-Davies 5cd911c74a
fix: correct row count for object store chunks (#1789) 2021-06-23 12:06:49 +00:00
Marco Neumann 1636f47565 refactor: remove dead code 2021-06-23 10:51:22 +02:00
Marco Neumann cf55df68b5 refactor: remove some `Arc`s around the in-mem catalog
This is for #1740.
2021-06-23 10:51:22 +02:00
Marco Neumann e36b6f9c7a docs: fix intra-doc link 2021-06-23 10:25:05 +02:00
Marco Neumann 67508094b4 fix: double ref
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
2021-06-23 10:25:05 +02:00
Marco Neumann d2be641864 refactor: make checkpointing easier to use
Don't mix commit+checkpoint in a single call so that the caller has to
reason about the error type and which of the two operations has failed.
Splitting it also makes it easier to create the correct checkpoint data.
2021-06-23 10:25:05 +02:00
Marco Neumann 4a961694ec refactor: make caller sync mem<>OS view during catalog transactions
This is for #1740. Greatly simplifies the integration of the persisted
catalog into the DB.
2021-06-23 10:25:05 +02:00
Marco Neumann d1db0dfaeb refactor: remove type parameter from preserved catalog
For #1740.
2021-06-22 10:53:10 +02:00
Marco Neumann ff60627500 refactor: make preserved catalog NOT own the in-mem catalog
Works towards #1740.
2021-06-21 18:39:43 +02:00
Marco Neumann 881729bd23 refactor: make caller responsible to create checkpoint data
This decouples the in-mem and preserved catalog a bit and works
towards #1740.
2021-06-21 18:33:23 +02:00
Marco Neumann aba973a6e1 refactor: make catalog `wipe` a freestanding function
It does not interact with the `CatalogState` so users can call this
function without that type.
2021-06-21 09:31:23 +02:00
Andrew Lamb 258a6b1956
chore: remove more dead code (#1760)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-18 21:28:22 +00:00
Andrew Lamb de67bd3efe
refactor: Remove PartitionChunk::table_schema (#1756)
* refactor: Remove PartitionChunk::table_schema

* docs: update comments
2021-06-18 16:13:16 +00:00
Raphael Taylor-Davies f6dbc8d6f2
refactor: add ChunkAddr to describe location of chunk in catalog (#1745)
* refactor: add ChunkPath to describe location of chunk in catalog

* refactor: rename ChunkPath to ChunkAddr

* chore: further renames

* chore: even more renames

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-17 12:04:37 +00:00
Marco Neumann e056d97cf6 test: always test transaction aborts 2021-06-16 11:01:14 +02:00
Marco Neumann caaf95c6ec refactor: remove lock from `TestCatalogState` 2021-06-16 10:51:15 +02:00
Marco Neumann c8c412f6fe refactor: rework catalog state interface
This now allows not only for copy-based transaction handling but also
for eager exec and rollbacks. This will be useful to properly implement
transaction aborts for the "real" catalog.
2021-06-16 10:51:15 +02:00
Marco Neumann e064a6bbba test: add test suite for `CatalogState` impls
This makes it easier to check that `CatalogState` impls correctly implement all
features, including transaction aborting.
2021-06-16 10:50:47 +02:00
Andrew Lamb b756e09904
refactor: Rename parquet_file::Chunk --> ParquetChunk (#1722)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-15 11:21:49 +00:00
Marco Neumann 64c815dd50
fix: bump catalog version (#1726)
This should have been done in #1714. Also add a note so that future devs
might hopefully not forget. In any case though the code also works w/o
this bump, it's just that the error message is a bit less nice ("cannot
parse IOxMetadata" instead of "unsupported catalog version").

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-15 10:26:30 +00:00
Marco Neumann 55fc5e564b refactor: remove serverID and DB name args from catalog state
They are no longer required.
2021-06-15 09:35:41 +02:00
Marco Neumann 776b6c011c feat: remove path parsing functionality
Paths to parquet files are an implementation detail and should not be
parsed.

Closes #1506.
2021-06-14 16:24:50 +02:00
Marco Neumann 250ccdcdcd refactor: use `IOxMetadata` instead of path parsing for parquet chunks 2021-06-14 16:24:50 +02:00
Marco Neumann d51e7a127c feat: include table name, partition key, and chunk ID in `IoxMetadata` 2021-06-14 16:24:50 +02:00
kodiakhq[bot] b57f397057
Merge branch 'main' into crepererum/checkpoint_during_restore 2021-06-14 13:54:03 +00:00
Marco Neumann 0a7dcc3779 test: adjust read-write parquet test to newest test data 2021-06-14 14:24:24 +02:00
Marco Neumann d6f6ddfdaa fix: fix NULL handling in parquet stats 2021-06-14 14:24:09 +02:00
Marco Neumann eae56630fb test: add test for all-NULL float column metadata 2021-06-14 13:48:34 +02:00
Marco Neumann 3f9bcf7cd9 fix: fix NaN handling in parquet stats 2021-06-14 13:44:52 +02:00
Marco Neumann ea96210e98 test: enable unblocked test 2021-06-14 13:44:52 +02:00
Marco Neumann 518f7c6f15 refactor: wrap upstream parquet MD into struct + clean up interface
This saves users of `parquet_file::metadata` from also depending on
`parquet` directly. Furthermore, they don't need to import dozens of
functions and can instead just use `IoxParquetMetaData` directly.
2021-06-14 13:17:01 +02:00
Marco Neumann 030d0d2b9a feat: create checkpoint during catalog rebuild 2021-06-14 10:55:56 +02:00
Marco Neumann df866f72e0 refactor: store parquet metadata in chunk
This will be useful for #1381.

At the moment we parse schema and stats eagerly and store them alongside
the parquet metadata in memory. Technically this is not required since
this is basically duplicate data. In the future we might trade-off some
of this memory against CPU consumption by parsing schema and stats on
demand.
2021-06-14 10:08:31 +02:00
Marco Neumann e6699ff15a test: ensure that `find_last_transaction_timestamp` considers checkpoints 2021-06-14 10:04:50 +02:00
Marco Neumann f8a518bbed refactor: inline `Table` into `parquet_file::chunk::Chunk`
Note that the resulting size estimations are different because we were
double-counting `Table`. `mem::size_of::<Self>()` is recursive for
non-boxed types since the child will be part of the parent structure.

Issue: #1295.
2021-06-11 11:54:31 +02:00
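The double-counting mentioned in f8a518bbed can be reproduced with a small std-only sketch. The struct and function names are hypothetical. `mem::size_of` of a parent already includes any child stored inline, so adding the child's size again counts it twice.

```rust
use std::mem;

/// Hypothetical child struct stored inline in its parent.
#[allow(dead_code)]
struct Inner {
    a: u64,
    b: u64,
}

/// Parent that embeds `Inner` directly (no `Box`).
#[allow(dead_code)]
struct Outer {
    inner: Inner,
    c: u64,
}

/// Correct estimate: the inline `Inner` is already part of `Outer`.
fn outer_size() -> usize {
    mem::size_of::<Outer>()
}

/// Over-estimate: adds the size of `Inner` a second time.
fn outer_size_double_counted() -> usize {
    mem::size_of::<Outer>() + mem::size_of::<Inner>()
}
```

If `inner` were a `Box<Inner>` instead, `size_of::<Outer>()` would only cover the pointer and the heap allocation would have to be added separately.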
Marco Neumann 28d1dc4da1 chore: bump preserved catalog version 2021-06-10 16:01:13 +02:00
Marco Neumann 80ee36cd1a refactor: slightly streamline path parsing code in pres. catalog 2021-06-10 15:59:28 +02:00
Marco Neumann 7e7332c9ce refactor: make comparison a bit less confusing 2021-06-10 15:42:21 +02:00
Marco Neumann fd581e2ec9 docs: fix confusing wording in `CatalogState::files` 2021-06-10 15:42:21 +02:00
Marco Neumann be9b3a4853 fix: protobuf lint fixes 2021-06-10 15:42:21 +02:00
Marco Neumann 294c304491 feat: impl catalog checkpointing infrastructure
This implements a way to add checkpoints to the preserved catalog and
speed up replay.

Note: This leaves the "hook it up into the actual DB" for a future PR.

Issue: #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 188cacec54 refactor: use `Arc` to pass `ParquetFileMetaData`
This will be handy when the catalog state must be able to return
metadata objects so that we can create checkpoints, esp. when we use
multi-chunk parquet files in some midterm future.
2021-06-10 15:42:21 +02:00
Marco Neumann c7412740e4 refactor: prepare to read and write multiple file types for catalog
Prepares #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 33e364ed78 feat: add encoding info to transaction protobuf
This should help with #1381.
2021-06-10 15:42:21 +02:00
Marco Neumann 4fe2d7af9c chore: enforce `clippy::future_not_send` for `parquet_file` 2021-06-09 18:18:27 +02:00
Andrew Lamb ab0aed0f2e
refactor: Remove a layer of channels in parquet read stream (#1648)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-07 16:47:04 +00:00
Raphael Taylor-Davies 1e7ef193a6
refactor: use field metadata to store influx types (#1642)
* refactor: use field metadata to store influx types

make SchemaBuilder non-consuming

* chore: remove unused variants

* chore: fix lints
2021-06-07 13:26:39 +00:00
Marco Neumann c830542464 feat: add info log when cleanup limit is reached 2021-06-04 11:12:29 +02:00
Marco Neumann 91df8a30e7 feat: limit number of files during storage cleanup
Since the number of parquet files can potentially be unbound (aka very
very large) and we do not want to hold the transaction lock for too
long and also want to limit memory consumption of the cleanup routine,
let's limit the number of files that we collect for cleanup.
2021-06-03 17:43:11 +02:00
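One way to picture the limit from 91df8a30e7 is a bounded collection step. This is a sketch with a hypothetical function name and plain strings for paths. It stops after `max_files` candidates, so both memory use and the time spent in the planning step stay bounded.

```rust
use std::collections::HashSet;

/// Collects at most `max_files` paths that exist in the store but are not
/// referenced by the catalog. Hypothetical helper for illustration only.
fn collect_cleanup_candidates(
    all_files: &[String],
    referenced: &HashSet<String>,
    max_files: usize,
) -> Vec<String> {
    all_files
        .iter()
        .filter(|path| !referenced.contains(path.as_str()))
        .take(max_files) // stop early once the limit is reached
        .cloned()
        .collect()
}
```

Files beyond the limit are simply left for a later cleanup run.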
Marco Neumann 85139abbbb fix: use structured logging for cleanup logs 2021-06-03 11:23:29 +02:00
Andrew Lamb 32c6ed1f34
refactor: More cleanup related to multi-table chunks (#1604)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-02 17:00:23 +00:00
Marco Neumann e5b65e10ac test: ensure that `find_last_transaction_timestamp` indeed returns the last timestamp 2021-06-02 10:15:06 +02:00
Marco Neumann 98e413d5a9 fix: do not unwrap broken timestamps in serialized catalog 2021-06-02 10:15:06 +02:00
Marco Neumann fc0a74920f fix: use clearer error text 2021-06-02 09:41:19 +02:00
Marco Neumann 2a0b2698c6 fix: use structured logging
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-02 09:41:19 +02:00
Marco Neumann 64bf8c5182 docs: add code comment explaining why we parse transaction timestamps
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-06-02 09:41:19 +02:00
Marco Neumann 77aeb5ca5d refactor: use protobuf-native Timestamp instead of string 2021-06-02 09:41:19 +02:00
Marco Neumann 9b9400803b refactor!: bump transaction version to 2 2021-06-02 09:41:19 +02:00
Marco Neumann 5f77b7b92b feat: add `parquet_file::catalog::find_last_transaction_timestamp` 2021-06-02 09:41:19 +02:00
Marco Neumann 9aee961e2a test: test loading catalogs from broken protobufs 2021-06-02 09:41:19 +02:00
Marco Neumann 0a625b50e6 feat: store transaction timestamp in preserved catalog 2021-06-02 09:41:19 +02:00
Andrew Lamb d8fbb7b410
refactor: Remove last vestiges of multi-table chunks from PartitionChunk API (#1588)
* refactor: Remove last vestiges of multi-table chunks from PartitionChunk API

* fix: remove test that can no longer fail

* fix: update tests + code review comments

* fix: clippy

* fix: clippy

* fix: restore test_measurement_fields_error test
2021-06-01 16:12:33 +00:00
Andrew Lamb d3711a5591
refactor: Use ParquetExec from DataFusion to read parquet files (#1580)
* refactor: use ParquetExec to read parquet files

* fix: test

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-06-01 14:44:07 +00:00
Andrew Lamb 64328dcf1c
feat: cache schema on catalog chunks too (#1575) 2021-06-01 12:42:46 +00:00
Andrew Lamb 00e735ef0d
chore: remove unused dependencies (#1583) 2021-05-29 10:31:57 +00:00
Raphael Taylor-Davies db432de137
feat: add distinct count to StatValues (#1568) 2021-05-28 17:41:34 +00:00
kodiakhq[bot] 6098c7cd00
Merge branch 'main' into crepererum/issue1376 2021-05-28 07:13:15 +00:00
Andrew Lamb f3bec93ef1
feat: Cache TableSummary in Catalog rather than computing it on demand (#1569)
* feat: Cache `TableSummary` in catalog Chunks

* refactor: use consistent table summary
2021-05-27 16:03:05 +00:00
Marco Neumann dd2a976907 feat: add a flag to ignore metadata errors during catalog rebuild 2021-05-27 13:10:14 +02:00
Marco Neumann bc7389dc38 fix: fix typo
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-05-27 12:51:01 +02:00
Marco Neumann 48307e4ab2 docs: adjust error description to reflect internal errors
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-05-27 12:51:01 +02:00
Marco Neumann d6f0dc7059 feat: implement catalog rebuilding from files
Closes #1376.
2021-05-27 12:51:01 +02:00
Marco Neumann 024323912a docs: explain what `PreservedCatalog::wipe` offers 2021-05-27 12:48:41 +02:00
Raphael Taylor-Davies 4fcc04e6c9
chore: enable arrow prettyprint feature (#1566) 2021-05-27 10:28:14 +00:00
Marco Neumann 9f451423d5 feat: log files that are deleted 2021-05-26 12:49:44 +02:00
Marco Neumann 24ec1a472e fix: do NOT delete parquet files that are reachable by time travel 2021-05-26 12:38:54 +02:00
Marco Neumann 5983336366 refactor: rename `parquet_file::{utils => test_utils}` 2021-05-26 11:09:29 +02:00
Marco Neumann d7e3bc569e refactor: shorten time we hold the transaction lock during clean-up 2021-05-26 11:04:57 +02:00
Marco Neumann 18f5dd9ae1 test: ensure transaction lock exists during cleanup planning 2021-05-26 11:04:57 +02:00
Marco Neumann b55eae98da fix: do not delete non-parquet files during catalog-driven cleanup 2021-05-26 11:04:57 +02:00
Marco Neumann 5ed16ff294 refactor: improve error message in `parquet_file::cleanup` 2021-05-26 11:04:57 +02:00
Marco Neumann 14fdf3b7c7 feat: implement object store cleanup core routine 2021-05-26 11:02:40 +02:00
Marco Neumann cc78b5317d feat: add method to get all parquet files from catalog state 2021-05-26 11:02:40 +02:00
Marco Neumann 953114af2e feat: add method to abort catalog transaction 2021-05-26 11:02:40 +02:00
Marco Neumann 92fcd7e940 feat: add a way to get OS, server ID and DB name from catalog 2021-05-26 11:02:40 +02:00
Marco Neumann 9daa4d00d6 test: re-organize `parquet_file` test utils a bit 2021-05-26 11:02:39 +02:00
Marco Neumann 38183928c8 refactor: extract path generator for data location 2021-05-26 10:59:40 +02:00
Marco Neumann 19a2733d30 feat: preserve transaction metadata in parquets 2021-05-25 09:56:12 +02:00
Marco Neumann fe8e6301fe refactor: move `read_schema_from_parquet_metadata` back to `parquet_file::metadata`
Let us pool all metadata handling in a single module, which makes it
easier to review.
2021-05-25 09:37:53 +02:00
Marco Neumann ac83d99f66 feat: add a way to get current revision and UUID from transaction handle 2021-05-25 09:37:53 +02:00
Marco Neumann fdc553b257 refactor: replace unwrap with expect 2021-05-25 09:37:53 +02:00
Andrew Lamb c464ffadad
refactor: remove special case timestamp_range in parquet chunk (#1543)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-05-24 16:19:44 +00:00
Andrew Lamb 14ba25f86d
chore: Update datafusion and use released version of arrow crates (#1546)
* chore: Update datafusion and use released version of arrow crate

* fix: Update for change in API
2021-05-24 15:37:22 +00:00
Andrew Lamb 27e5b8fabf
refactor: Remove multiple table support from Parquet Chunk (#1541) 2021-05-24 08:40:31 -04:00
Marco Neumann 8bdddfd475 docs: mention that catalog wiping does not delete parquet files 2021-05-20 10:22:20 +02:00
Marco Neumann b1a06246d6 feat: implement function to wipe a preserved catalog 2021-05-20 10:22:20 +02:00
Marco Neumann 6c405aa6f9 feat: check if preserved catalog exists when creating an empty one 2021-05-20 10:22:20 +02:00
Marco Neumann c6a6005f65 feat: add `PreservedCatalog.exists` 2021-05-20 10:22:20 +02:00
Raphael Taylor-Davies 37880ee89a
refactor: store chunk IDs only in catalog (#1521)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-05-20 04:07:14 +00:00
Marco Neumann 8db26485a4 refactor: empty transaction during catalog creation
That involves some refactoring which we are going to need anyway for
hooking up the "read" path of the catalog into the DB startup, namely:

- make `Db::new` require a preserved catalog
- introduce a helper function that can provide that
- as a consequence, all test-creations of a Db are now async

This prepares for #1382.
2021-05-18 17:42:07 +02:00
Marco Neumann cdf0ada6a6 test: test preserved catalog <-> Db write wiring 2021-05-17 13:57:31 +02:00
Marco Neumann 68729dd5ee refactor: avoid string allocation 2021-05-17 12:32:34 +02:00
Marco Neumann adcd8132e7 docs: more comments regarding catalog transaction handling 2021-05-17 12:05:08 +02:00
Marco Neumann a99d53e771 docs: document `OpenTransaction::handle_action*` 2021-05-17 11:48:51 +02:00
Marco Neumann 4fb800c7a6 refactor: make PreservedCatalog easier to integrate 2021-05-17 11:33:22 +02:00
Marco Neumann f4d7154746 fix: table summaries must include timestamp as well 2021-05-17 11:33:22 +02:00
Marco Neumann 7cced3242f feat: add a way to parse infos from parquet paths 2021-05-17 11:33:22 +02:00
Marco Neumann 5969caccb0 feat: return parquet metadata from `write_to_object_store` 2021-05-17 11:33:22 +02:00
Raphael Taylor-Davies f9178dbb5f
feat: push metrics into catalog (#1488)
* feat: push metrics into catalog

* chore: minor cleanup

* fix: include db labels in chunk metric domains

* chore: fmt

* fix: don't allow dropping moving chunks

* chore: further tweaks

* chore: review feedback

* feat: use new_unregistered() for metric instruments instead of default

* chore: use &[KeyValue] instead of &Vec<KeyValue>

* refactor: make GauageValue non default constructible
2021-05-14 17:37:39 +00:00
Nga Tran 9583636748 feat: we now can read parquet files from all kinds of object stores 2021-05-12 18:05:34 -04:00
Marco Neumann 795f5bfcb7 refactor: make `StatValues::{min,max}` optional + handle NaNs
This will allow us to:

- handle all-NULL columns correctly
- be in-line with Parquet (where min/max are optional)
- handle NaNs at least somewhat sanely (they do not "poison" stats
  anymore)
2021-05-10 17:12:25 +02:00
Nga Tran c6b933eb63 chore: merge main to branch 2021-05-07 18:40:17 -04:00
Nga Tran f2c19ec080 refactor: further address Carol's comment 2021-05-07 17:40:40 -04:00
Nga Tran 971500681f refactor: address Andrew's and Carol's comment 2021-05-07 17:33:19 -04:00
Carol (Nichols || Goulding) e2cc4634bf fix: Use PathBuf rather than debug formatting and back to String
This is the same fix I made in 54c5f98, just found a few more spots :)
2021-05-07 15:58:11 -04:00
Nga Tran 31d49db0ed chore: a little more cleanup 2021-05-07 09:38:41 -04:00
Nga Tran ba015ee4df refactor: clean up and add comments 2021-05-07 09:31:41 -04:00
Marco Neumann 1a998d4116 feat: preserve parquet metadata in catalog
Closes #1380.
2021-05-07 09:51:44 +02:00
Marco Neumann c3d523fc4f refactor: add col prefixes to make_chunk & Co 2021-05-07 09:51:44 +02:00
Marco Neumann 5db504300d refactor: use parsed paths instead of raw strings for catalog paths 2021-05-07 09:51:44 +02:00
Nga Tran 55bf848bd2 feat: Now we can query directly from files in object store 2021-05-06 18:02:17 -04:00
Andrew Lamb 884baf7329
feat: add column_type and influxdb_column_type, remove row_count from system.columns (#1415)
* feat: add column_type and influxdb_column_type, remove row_count from system.columns

* fix: update tests

* fix: more test update

* fix: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

* fix: fmt

* fix: copy/paste type conversion to avoid cross dependency between data_types and internal_types

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
2021-05-06 12:59:30 +00:00
Andrew Lamb 86771ea629
chore: update arrow/datafusion deps (#1433)
* chore: update datafusion deps

* chore: update arrow deps

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-05-05 22:37:31 +00:00
Nga Tran a5c92fae8a chore: merge main to branch 2021-05-05 13:48:42 -04:00
Nga Tran 3bdb451529 chore: merge main to branch 2021-05-05 13:18:39 -04:00
Raphael Taylor-Davies 411cf134e9
refactor: explode arrow_deps (#1425)
* refactor: explode arrow_deps

* chore: workaround doctest bug
2021-05-05 16:59:12 +00:00
Nga Tran 2b46f51e5b chore: address Dom's comment 2021-05-05 12:55:41 -04:00
Nga Tran a1f3413c89 refactor: move private test helpers to utils module to be used by many modules 2021-05-05 11:41:46 -04:00
Nga Tran fcb37a0b1d feat: more testing scenarios for querying parquet files 2021-05-05 10:57:02 -04:00
Marco Neumann 1f42eb89cd feat: implement parquet metadata handling
Closes #1379 and contributes to #1380.
2021-05-05 13:29:16 +02:00
Marco Neumann 056c29aaa2 feat: add a way to retrieve timestamp range from parquet chunk 2021-05-05 13:29:16 +02:00
Marco Neumann c54109113e feat: add a way to retrieve storage path from parquet chunks 2021-05-05 13:29:16 +02:00
Marco Neumann 136c35cb88 feat: implement transaction handling for catalog
Closes #1253.
2021-05-03 10:04:35 +02:00
Nga Tran 34a3388a49 feat: unload chunks from read buffer but keep them in object store 2021-04-30 16:12:02 -04:00
Nga Tran e87973babe refactor: address review comments 2021-04-29 13:15:43 -04:00
Nga Tran 402d9c748c chore: cargo fmt 2021-04-28 16:52:52 -04:00
Nga Tran 2a2760bd18 feat: complete tests where data in both RUB and OS 2021-04-28 16:14:07 -04:00
Nga Tran 140d96dbea feat: tests for loading data to object store and make sure we still query read buffer 2021-04-28 15:59:17 -04:00
Marco Neumann eddc9319ff docs: deny broken intradoc links 2021-04-27 13:22:28 +02:00
Carol (Nichols || Goulding) 272cdb85ce fix: Use the ServerId type everywhere, for writing, querying, anything 2021-04-26 18:44:32 +00:00
Carol (Nichols || Goulding) b8face3335 refactor: Organize use statements 2021-04-26 18:44:32 +00:00
Jake Goulding 67f5ad841d refactor: Introduce ServerId and CurrentServerId types 2021-04-26 18:44:32 +00:00
Nga Tran 657bfa1b20 refactor: address Andrew's comments 2021-04-16 17:44:46 -04:00
Nga Tran b3e110a241 refactor: address Jake's comment 2021-04-16 17:27:40 -04:00
Nga Tran 4c23ca8888 feat: full implementation of parquet's read_filter for review 2021-04-16 16:03:24 -04:00
Andrew Lamb e226b5a820
feat: Use TimestampNanosecondArray for timestamps in IOx (#1230)
* refactor: Create Arrow arrays using iterators

* feat: use Timestamp64(TimeUnit::Nanosecond) for timestamps

* feat: add support for timestamp array

* fix: update more tests

* fix: remove unnecessary code

Co-authored-by: Edd Robinson <me@edd.io>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-04-16 15:55:33 +00:00
Nga Tran 231ebb54d4 chore: fix a format 2021-04-14 16:32:25 -04:00
Nga Tran 4e2d59d9a5 feat: implement a few more functions as part of supporting query from parquet files 2021-04-14 16:06:47 -04:00
Nga Tran 05bf28ce85 feat: Add 2 main functions table_schema and table_names for Parquet Chunk to lay a foundation for querying it 2021-04-13 18:23:55 -04:00
Nga Tran 4a6d6bd7ad feat: initial work for querying data from parquet file in object store 2021-04-13 13:57:46 -04:00
Raphael Taylor-Davies 1997324344
feat: mutable buffer snapshotting (#1179)
* feat: mutable buffer snapshotting

* chore: review feedback
2021-04-13 12:14:54 +00:00
Nga Tran 453aeaf1a0 feat: Add tests for writing RB chunks to Object Store 2021-04-09 17:39:23 -04:00
Nga Tran f501a74aea refactor: Address review comments 2021-04-07 21:28:03 -04:00
Nga Tran be6e1e48e4 feat: add writer_id and object_store in Db 2021-04-07 18:36:07 -04:00
Raphael Taylor-Davies c2355aca6d
feat: add basic memory tracking (#1125)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-04-07 15:38:24 +00:00
Nga Tran 6e01fbc382 feat: use TableSummary as metadata for parquet chunk's tables and read buffer's read_filter to get data 2021-04-05 15:37:34 -04:00
Nga Tran 4bdf8963e6 feat: continue building foundation for writing RB chunks to parquet files 2021-04-02 16:06:25 -04:00
Nga Tran 49267114d3 chore: merge main into branch and resolve conflicts 2021-04-01 13:22:49 -04:00
Nga Tran 1463c6645f feat: Add ChunkState::ObjectStore and rename ParquetChunk to Chunk 2021-04-01 11:53:03 -04:00
Nga Tran 19a453a483 feat: finally have some framework with clear todos for writing a chunk into parquet files 2021-03-31 16:21:53 -04:00
Nga Tran cd409b471f feat: continue the implementation 2021-03-30 21:31:51 -04:00
Nga Tran 0bcd52d5c9 feat: Add more changes 2021-03-30 18:31:09 -04:00