We need to hold the parquet metadata in memory so that we're able to
create catalog checkpoints. We used to do that by holding the decoded
structure (provided by the upstream `parquet` crate) in memory and
serializing that data on demand to Apache Thrift.
There are two drawbacks:
1. We did not account for the memory usage of the decoded structures (or
at least not fully).
2. We actually don't need the decoded data in-memory, since for the
checkpoint creation we only need to write the serialized data.
So this PR changes our wrapper so that it holds the serialized data,
which is only decoded when really necessary. Since the serialized data
is a simple byte vector, we can also easily account for its size.
Note that this makes the accounted size of parquet chunks larger.
However, this data was always there; we just ignored it up until now. If
the size of the parquet metadata ever becomes a real issue, we could trade
some CPU time for memory by compressing it.
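For illustration, here is a minimal sketch of the new shape of the wrapper; all names below (like `SerializedParquetMetadata`) are hypothetical placeholders, not the crate's actual API:

```rust
/// Minimal sketch: hold the thrift-serialized parquet metadata and decode
/// it only when needed. All names here are hypothetical.
struct SerializedParquetMetadata {
    /// Apache Thrift encoding of the parquet file metadata, exactly as it
    /// would be written into a catalog checkpoint.
    data: Vec<u8>,
}

impl SerializedParquetMetadata {
    /// Size accounting becomes trivial: the payload is a plain byte vector.
    fn size(&self) -> usize {
        self.data.len()
    }

    /// Checkpoint creation can write the bytes through without decoding.
    fn checkpoint_bytes(&self) -> &[u8] {
        &self.data
    }

    // A `decode()` method (using the upstream `parquet` crate's thrift
    // reader) would live here for the rare callers that need the decoded
    // structure; elided since the exact call depends on the crate version.
}
```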
* refactor: remove display methods, use fmt::Display instead.
Signed-off-by: Ning Sun <sunng@protonmail.com>
* refactor: update a few calls from .display to .to_string()
* fix: consistently use `Path` rather than occasionally `DirsAndFileName`
* fix: fixup for merge conflicts
* fix: update test
* fix: Catch another case or two
* fix: fmt
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: nocache feature code rot
The `MBChunk::snapshot` code no longer compiles when the "nocache"
feature is used; this commit updates it to match the not(nocache) code.
* build: use updated broken_intra_doc_links name
The broken_intra_doc_links lint was renamed to
rustdoc::broken_intra_doc_links:
https://doc.rust-lang.org/rustdoc/lints.html
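Concretely, this is just the lint path changing in the crate-level attribute:

```rust
// Before the rename (now warns on newer toolchains):
// #![deny(broken_intra_doc_links)]

// After: rustdoc lints live in the `rustdoc::` namespace.
#![deny(rustdoc::broken_intra_doc_links)]
```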
* chore: Update deps (including arrow 5.0.0 --> arrow 5.1.0)
* chore: update all the things
* refactor: Update serving readiness check due to change in Tonic API
* chore: update more deps
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
In one case where ParquetChunk::new was being called, the calling code
had just parsed the IoxMetadata too. In the other case, the calling code
had just *created* the IoxMetadata being parsed. In both cases, this
re-parsing wasn't actually needed; the two bits of info
ParquetChunk::new needs can easily be passed in instead.
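Schematically, the change looks like this; since the commit text doesn't name the two values, `A` and `B` below are pure placeholders:

```rust
/// Placeholder types standing in for the two pieces of information the
/// callers already had at hand.
struct A;
struct B;

struct ParquetChunk {
    a: A,
    b: B,
}

impl ParquetChunk {
    /// After the change: `new` takes the already-available values directly
    /// instead of re-parsing them out of the IoxMetadata.
    fn new(a: A, b: B) -> Self {
        Self { a, b }
    }
}
```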
First of all, using a partition checkpoint as some kind of intermediate
representation was kind of a hack, because partition checkpoints should
only be created for to-be-persisted partitions, not for the others.
API-wise it should only be possible to construct a partition checkpoint
from a flush handle.
Also, we were only able to construct partition checkpoints for partitions
that had unpersisted data; otherwise there was no sane way to fill the
`min_unpersisted_timestamp`. We must however scan all partitions, no
matter whether they have unpersisted data, so that we can determine the
maximum seen sequence numbers. This was caught by a replay test that
resulted in a catalog state where the last database checkpoint had lower
maximum seen sequence numbers than some partition checkpoint, bailing out
with an error.
So overall it turns out that passing the sequence numbers directly
instead of wrapping them in a partition checkpoint is the better
implementation.
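One way to express the flush-handle constraint in the API, as a hypothetical sketch (names and fields are illustrative only):

```rust
/// Hypothetical sketch: the flush handle is the only way to obtain a
/// partition checkpoint, so a checkpoint can exist only for a partition
/// that is actually being persisted.
pub struct FlushHandle {
    // bookkeeping for the to-be-persisted data would live here
}

pub struct PartitionCheckpoint {
    // sequence number ranges, min_unpersisted_timestamp, ...
}

impl FlushHandle {
    /// The sole way to create a `PartitionCheckpoint`; there is no public
    /// `PartitionCheckpoint::new`, so callers cannot build one for an
    /// arbitrary partition.
    pub fn checkpoint(&self) -> PartitionCheckpoint {
        PartitionCheckpoint {}
    }
}
```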
* fix: Properly record total_count and null_count in statistics
* fix: fix statistics calculation in mutable_buffer
* refactor: expose null counts in read_buffer
* refactor: expose null_count in parquet_file
* fix: update server crate tests
* fix: update query_tests tests
* docs: tweak comments
* refactor: Use storage_stats rather than adding `null_count`
* refactor: rename test data field for clarity
* fix: fixup merge conflicts
* refactor: rename initial_non_null_count to initial_total_count
* refactor: calculate null_count as row_count - to_add
The read_statistics, read_statistics_from_parquet_row_group,
load_parquet_from_store, and load_parquet_from_store_for_chunk functions
never actually used the table name; they just passed it around and passed
it back.
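Schematically, with simplified, hypothetical signatures:

```rust
struct Statistics; // placeholder for the actual statistics type

// Before: the table name was threaded through and returned unchanged.
fn read_statistics_old(table_name: String) -> (String, Statistics) {
    (table_name, Statistics) // pure pass-through, no actual use
}

// After: the parameter is gone; callers that need the name already have it.
fn read_statistics_new() -> Statistics {
    Statistics
}
```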
Since the consumers of ObjectStore always use the concrete type rather than the ObjectStoreApi trait, it makes more sense to change the concrete type to hold a pointer to the cache. This removes the cache from the ObjectStoreApi trait and changes ObjectStore to be a regular struct rather than a tuple struct around ObjectStoreIntegration. Future work will have the server configure the cache on the ObjectStore struct when its options are set.
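A sketch of the resulting shape, with hypothetical field and helper types:

```rust
struct Cache; // placeholder for the cache type

enum ObjectStoreIntegration {
    InMemory, // other backends (S3, GCS, file, ...) elided
}

/// Hypothetical sketch of the concrete `ObjectStore` after the change:
/// a regular struct instead of a tuple struct, with the cache hanging off
/// the concrete type rather than the `ObjectStoreApi` trait.
struct ObjectStore {
    integration: ObjectStoreIntegration,
    /// To be configured by the server when its options are set.
    cache: Option<Cache>,
}

impl ObjectStore {
    fn new(integration: ObjectStoreIntegration) -> Self {
        Self { integration, cache: None }
    }
}
```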
Now we can handle all these cases. In each case there are two partitions
w/ a single write each.

Case 1:
1. A reads sequence number 1
2. B reads sequence number 2
3. we persist A, which only knows the sequences up until 1
=> the DB checkpoint needs the global max, otherwise we forget sequences
during replay (2 in this case, so B would be gone)

Case 2:
1. B reads sequence number 1
2. A reads sequence number 2
3. we persist A, which (w/o this commit) would not track the sequencer at
all in this checkpoint (since there is nothing to replay)
=> we MUST also remember that we already read up until 2, otherwise we'll
re-read 2 after replay
=> the partition checkpoint needs the local seen max (no matter if there's
something to persist)
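A minimal sketch of the checkpoint state implied by these two cases; all names are hypothetical stand-ins, not the crate's actual types:

```rust
use std::collections::BTreeMap;

type SequencerId = u32;

/// Per-sequencer range as seen by one partition. `min` is `None` when there
/// is nothing left to replay, but `max` is always kept so we never re-read
/// already-seen data after replay (case 2 above).
#[derive(Debug, Clone, Copy)]
struct OptionalMinMaxSequence {
    min: Option<u64>, // first unpersisted sequence number, if any
    max: u64,         // highest sequence number ever seen
}

/// Database-wide view across all partitions (case 1 above: the maxima must
/// be global, not just those of the persisted partition).
#[derive(Debug, Default)]
struct DatabaseCheckpoint {
    sequencers: BTreeMap<SequencerId, OptionalMinMaxSequence>,
}

impl DatabaseCheckpoint {
    /// Fold one partition's range in: minima shrink toward the global
    /// minimum, maxima grow toward the global maximum.
    fn fold(&mut self, sequencer: SequencerId, range: OptionalMinMaxSequence) {
        self.sequencers
            .entry(sequencer)
            .and_modify(|existing| {
                existing.min = match (existing.min, range.min) {
                    (Some(a), Some(b)) => Some(a.min(b)),
                    (a, b) => a.or(b),
                };
                existing.max = existing.max.max(range.max);
            })
            .or_insert(range);
    }
}
```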
This is required to correctly handle the following case:
1. There are two partitions A and B w/ a single write each (from the same
sequencer).
2. We persist A:
- The partition checkpoint for A will be empty because after persistence
there will be nothing to replay (the single write is persisted and
we're ready).
- The database checkpoint that contains the global minimum of all ranges
recognizes that for the sequencer there is indeed something left (the
minimum sequence number from B).
3. DB restart happens, replay starts
4. We scan all persisted files and figure out that we have a DB checkpoint
with a sequence minimum, but (w/o the change in this commit) there is no
maximum. Only partition checkpoints contain maxima, and the only partition
checkpoint that was persisted was the one for partition A, and that one was
empty (see above).
5. So now how do we recover partition B?
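With the maxima recorded, the answer falls out directly; a condensed, self-contained sketch with hypothetical names and example sequence numbers:

```rust
/// With the change: checkpoints carry the seen maximum, so the replay
/// range for the sequencer is fully determined even though A's partition
/// checkpoint had nothing left to replay.
fn replay_range(global_min: Option<u64>, global_max: Option<u64>) -> Option<(u64, u64)> {
    // min: first unpersisted sequence number (here: B's single write)
    // max: highest sequence number ever seen (now recorded even for A)
    Some((global_min?, global_max?))
}

fn main() {
    // Example values: B's write has sequence number 1 (unpersisted), A's
    // has 2 (persisted): replay everything in [1, 2] and filter rows that
    // were already persisted.
    assert_eq!(replay_range(Some(1), Some(2)), Some((1, 2)));
}
```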