Ok, so... this needed lots of... channels. Channels everywhere.
The stream method on TestWriteBufferStreamHandler previously assumed it
would only be called once. In a test where reset_to_earliest is called,
stream might be called again to get the reset stream.
We want to be able to control which of the streams gets which
operations, so the macro now takes a vec of vecs of operations: one vec
of operations per expected call to stream, and each stream will send
all the operations in its vec.
The test thread needs to wait for the handler stream to consume the last
item from the last receiver stream, so when the
TestWriteBufferStreamHandler sets up the last expected call to stream,
it passes back the last transmitter so the test can wait until the
channel is back at full capacity (which means all operations have been
consumed by the receiver).
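A minimal sketch of that shape, using hypothetical names (`TestStreamHandler`, `wait_until_drained`) and `String` standing in for the real DML operation type:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

use tokio::sync::mpsc;

/// Hypothetical stand-in for the real test handler: one Vec of operations
/// per expected call to `stream()`.
struct TestStreamHandler {
    per_call_ops: Mutex<VecDeque<Vec<String>>>,
}

impl TestStreamHandler {
    /// Each call pops the next batch, pre-loads a channel with it, and hands
    /// back both ends so the test can observe when everything is consumed.
    fn stream(&self) -> (mpsc::Receiver<String>, mpsc::Sender<String>) {
        let ops = self
            .per_call_ops
            .lock()
            .unwrap()
            .pop_front()
            .expect("stream() called more times than configured");
        let (tx, rx) = mpsc::channel(ops.len().max(1));
        for op in ops {
            tx.try_send(op).expect("channel sized to hold all operations");
        }
        (rx, tx)
    }
}

/// The test thread can then wait until the sender reports full capacity,
/// i.e. the receiver side has consumed every queued operation.
async fn wait_until_drained(tx: &mpsc::Sender<String>) {
    while tx.capacity() < tx.max_capacity() {
        tokio::task::yield_now().await;
    }
}
```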
The default behavior of the ingester is to panic if the min unpersisted
sequence number in the catalog is unknown to the write buffer due to the
retention policies having evicted that sequence number.
Specifying `--skip-to-oldest-available` changes this behavior to skip to
the oldest sequence number the write buffer does have available and go
from there.
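For illustration only, the startup decision amounts to the following sketch (the enum and function names here are hypothetical, not the real ingester or write buffer API):

```rust
/// Hypothetical outcome of seeking the write buffer to the minimum
/// unpersisted sequence number from the catalog.
#[derive(Debug)]
enum SeekOutcome {
    /// The requested sequence number is still available in the write buffer.
    Found,
    /// Retention has evicted the requested sequence number.
    Evicted { oldest_available: u64 },
}

fn resolve_start_sequence(
    min_unpersisted: u64,
    outcome: SeekOutcome,
    skip_to_oldest_available: bool,
) -> u64 {
    match outcome {
        SeekOutcome::Found => min_unpersisted,
        SeekOutcome::Evicted { oldest_available } if skip_to_oldest_available => {
            // With the flag set, resume from whatever the write buffer still has.
            oldest_available
        }
        SeekOutcome::Evicted { .. } => {
            // Default behaviour: refuse to silently skip over data.
            panic!("sequence number {min_unpersisted} unknown to write buffer")
        }
    }
}
```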
Fixes #4624.
Enable more lints on the parquet_file crate to keep it a little cleaner.
Adds the following:

- clippy::clone_on_ref_ptr
- unreachable_pub
- missing_docs
- clippy::todo
- clippy::dbg_macro

This commit includes fixes for any new lint failures.
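The corresponding crate-level attributes would look roughly like this (whether each lint is set to warn or deny is an assumption):

```rust
// In parquet_file/src/lib.rs (exact lint levels are an assumption):
#![warn(
    missing_docs,
    unreachable_pub,
    clippy::clone_on_ref_ptr,
    clippy::todo,
    clippy::dbg_macro
)]
```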
Derive the ParquetFilePath from the IoxMetadata within the
ParquetStorage::read_filter() call.
This prevents the "put/get RecordBatches" abstraction from leaking out
the object store path generation concern - an implementation detail of
the ParquetStorage layer.
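A rough sketch of the idea with placeholder types (the real IoxMetadata fields and path layout differ):

```rust
/// Placeholder standing in for the real IoxMetadata.
struct IoxMetadata {
    namespace_id: i64,
    table_id: i64,
    partition_id: i64,
    object_store_id: String,
}

struct ParquetFilePath(String);

impl ParquetFilePath {
    /// Derive the object store path from the file's own metadata, so callers
    /// of read_filter() never construct paths themselves.
    fn from_metadata(m: &IoxMetadata) -> Self {
        Self(format!(
            "{}/{}/{}/{}.parquet",
            m.namespace_id, m.table_id, m.partition_id, m.object_store_id
        ))
    }
}
```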
Implements an upload() method on the ParquetStorage type, consuming a
stream of RecordBatch, serialising the Parquet file, and uploading the
result to object storage. Returns the IOx-specific file metadata.
Currently, while the upload() method accepts a stream of RecordBatch,
the resulting Parquet file is buffered in memory before being uploaded
to the object store, due to the lack of streaming upload functionality
in the ObjectStore abstraction. This isn't the end of the world, as the
files tend to be relatively small with our current usage.
This impl should be easily modified to be fully streaming once streaming
object store puts are implemented:
https://github.com/influxdata/object_store_rs/issues/9
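A hedged sketch of that flow, with placeholder types rather than the real ParquetStorage / ObjectStore / metadata APIs:

```rust
/// Placeholder return type standing in for IoxParquetMetaData.
struct FileMetadataStub;

/// Buffer the serialised file in memory, then perform a single put.
/// `batches` stands in for a stream of already-encoded RecordBatch bytes and
/// `put` for ObjectStore::put; both are simplifications.
async fn upload(
    batches: impl IntoIterator<Item = Vec<u8>>,
    put: impl FnOnce(Vec<u8>),
) -> (FileMetadataStub, usize) {
    // 1. Collect everything into one in-memory buffer, because the
    //    ObjectStore abstraction has no streaming put yet.
    let buffer: Vec<u8> = batches.into_iter().flatten().collect();
    let file_size = buffer.len();

    // 2. Upload the whole buffer in a single object store put.
    put(buffer);

    // 3. Return the IOx-specific file metadata and the file size.
    (FileMetadataStub, file_size)
}
```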
Construct an IoxParquetMetaData instance directly from the FileMetaData
instance returned by the ArrowWriter.
This change will allow us to avoid the inefficient impl currently in
use:
* Serialise batches into memory
* Wrap buffer in arrow cursor
* Read parquet metadata with arrow file reader
* Serialise schema with thrift
* Serialise each row group's metadata with thrift
* Construct our own FileMetaData instance
* Serialise FileMetaData with thrift
* zstd encode resulting thrift bytes
* Wrap in IoxParquetMetaData
Now we "only":
* Stream batches into opaque Write impl
* Serialise FileMetaData with thrift
* zstd encode resulting thrift bytes
* Wrap in IoxParquetMetaData
Then accessing any data within the IoxParquetMetaData (as before this
change) requires deserialising it first.
There are still a number of easy performance improvements to be had
w.r.t. the metadata handling.
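As a sketch of the tail end of the new path (using the zstd crate; the thrift encoding step is elided):

```rust
/// zstd-encode the thrift-serialised FileMetaData bytes, as in the "new"
/// list above (level 0 selects zstd's default compression level).
fn compress_metadata(thrift_bytes: &[u8]) -> std::io::Result<Vec<u8>> {
    zstd::encode_all(thrift_bytes, 0)
}

/// Accessing anything inside the stored metadata requires the reverse:
/// decompress first, then thrift-decode (not shown).
fn decompress_metadata(compressed: &[u8]) -> std::io::Result<Vec<u8>> {
    zstd::decode_all(compressed)
}
```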
Implements a streaming RecordBatch to Parquet file serialiser.
This impl automatically discovers the schema of the RecordBatch stream,
and accepts &mut destination types (internalising the handle
cloning/etc) to simplify caller usage.
This encoder returns the resulting FileMetaData to allow callers to
inspect the resulting metadata without reading back the file.
Currently unused / not yet plumbed in.
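A minimal, self-contained sketch of such an encoder using the arrow and parquet crates (recent crate versions assumed; the real IOx serialiser differs in its types and error handling):

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Pull batches from an iterator, discover the schema from the first batch,
/// and stream them through an ArrowWriter into any Write destination,
/// returning the parquet FileMetaData from close().
fn serialise_to_parquet(
    mut batches: impl Iterator<Item = RecordBatch>,
    sink: &mut Vec<u8>,
) -> Result<parquet::format::FileMetaData> {
    let first = batches.next().expect("stream must contain at least one batch");
    let schema = first.schema();

    let mut writer = ArrowWriter::try_new(sink, schema, None)?;
    writer.write(&first)?;
    for batch in batches {
        writer.write(&batch)?;
    }
    writer.close()
}

fn main() -> Result<()> {
    // Build a single-column batch as example input.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(Int64Array::from(vec![1, 2, 3]))])
        .expect("valid batch");

    let mut buf = Vec::new();
    let file_meta = serialise_to_parquet(std::iter::once(batch), &mut buf)?;
    println!("rows written: {}", file_meta.num_rows);
    Ok(())
}
```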
* fix: ensure that query tokio background tasks are canceled
While I am not entirely sure if this explains some of the memory leaks I
am seeing in prod, not canceling the tasks correctly certainly makes
debugging way harder and also renders certain forms of throttling (e.g.
max. concurrent queries) somewhat ineffective.
Note that parquet file downloads are currently NOT canceled because
tokio's `spawn_blocking` cannot be canceled.
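One generic way to get this behaviour (a sketch of the abort-on-drop pattern, not necessarily how the IOx code does it):

```rust
use std::time::Duration;

use tokio::task::JoinHandle;

/// Abort the spawned task when its handle is dropped, so background work
/// cannot outlive the query that started it.
struct AbortOnDrop<T>(JoinHandle<T>);

impl<T> Drop for AbortOnDrop<T> {
    fn drop(&mut self) {
        // Cancels the task at its next await point. Note this does not help
        // for `spawn_blocking`, which cannot be canceled.
        self.0.abort();
    }
}

#[tokio::main]
async fn main() {
    let task = AbortOnDrop(tokio::spawn(async {
        tokio::time::sleep(Duration::from_secs(3600)).await;
    }));
    drop(task); // the background task is aborted rather than leaked
}
```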
* refactor: `Vec` -> `Option`
* refactor: `spawn_blocking` creates a join handle, even though it is useless
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes the code paths that interact with Parquet files in the object
store to reference the ParquetStorage directly (DRY refactor).
This change takes us from a dependency graph of:
        ┌─────────────────┐
        │                 │
        ▼                 │
Parquet Consumer          │
        │         ┌──────────────┐
        ├────────▶│ParquetStorage│
        ▼         └──────────────┘
┌──────────────┐
│ ObjectStore  │
└──────────────┘
        │
   ┌────┴────┐
   ▼         ▼
  File       s3
 System    (etc)

to:

Parquet Consumer
        │
        ▼
┌──────────────┐
│ParquetStorage│
└──────────────┘
        │
        ▼
┌──────────────┐
│ ObjectStore  │
└──────────────┘
        │
   ┌────┴────┐
   ▼         ▼
  File       s3
 System    (etc)
ParquetStorage is now solely responsible for managing interactions with
the object store when dealing with Parquet files.
Renames the Storage type so the context is clear in usage (i.e. fn
args), rather than having to rely on knowing the fully-qualified import
path to know what the type stores.
Removes two unused constructors for a ParquetChunk, and moves the bare
fn constructor that is actually used to be an associated method (a
conventional constructor).
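Roughly the shape of that last change (the field list and argument names here are hypothetical):

```rust
/// Placeholder for the real chunk type; fields elided.
pub struct ParquetChunk;

impl ParquetChunk {
    /// Conventional constructor, replacing the previously bare free function.
    pub fn new(/* decoded parquet metadata, schema, ParquetStorage handle, ... */) -> Self {
        Self
    }
}
```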
* refactor: require `Resource`s to be convertible to `u64`
* refactor: require `Resource`s to have a unit name
* refactor: make LRU cache IDs static
* feat: add LRU cache metrics
* docs: improve type names in LRU doctest
* docs: explain `MeasuredT`
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: explain `test_metrics`
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
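A hypothetical sketch of what the `Resource` requirements from the refactor bullets above could look like (not the actual IOx trait):

```rust
/// A cache resource must be convertible to u64 and carry a unit name, so an
/// LRU cache metric can always be exported as a number plus a unit.
trait Resource: Into<u64> + Copy {
    fn unit() -> &'static str;
}

#[derive(Clone, Copy)]
struct Bytes(u64);

impl From<Bytes> for u64 {
    fn from(b: Bytes) -> u64 {
        b.0
    }
}

impl Resource for Bytes {
    fn unit() -> &'static str {
        "bytes"
    }
}

fn report<R: Resource>(used: R) {
    // e.g. "cache usage: 1024 bytes"
    println!("cache usage: {} {}", Into::<u64>::into(used), R::unit());
}
```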