* feat: add an option to compact all overlapped files no matter how large they are
* chore: Apply suggestions from code review
* feat: always compact overlapped files no matter how large they are
* chore: cleanup
Use a constructor to initialise a ParquetFileWithTombstone struct,
rather than making the fields pub.
This allows IDEs to "go to" the places where the struct is constructed
when browsing the code, and also keeps the type closed to modification
of its internals (the open/closed principle in SOLID).
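A minimal sketch of the pattern (field names are assumptions for
illustration; ParquetFile and Tombstone stand in for the IOx catalog
types):

    use std::sync::Arc;

    // Sketch only: the real struct lives in the compactor; these
    // fields are illustrative.
    pub struct ParquetFileWithTombstone {
        data: Arc<ParquetFile>,
        tombstones: Vec<Tombstone>,
    }

    impl ParquetFileWithTombstone {
        /// Conventional constructor: callers can no longer poke at the
        /// fields, and "go to definition" on `new` finds every place
        /// the type is built.
        pub fn new(data: Arc<ParquetFile>, tombstones: Vec<Tombstone>) -> Self {
            Self { data, tombstones }
        }
    }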
* fix: do not compact when there is no data
* fix: split time must be greater than min_time, too
* fix: resolve merge conflict
* chore: increase size of a compactor job and level of concurrency
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Assert consistent metadata when evaluating candidate Parquet files for
compaction.
Asserts all files have the same:
* Sequencer ID
* Namespace ID
* Table ID
* Partition ID
* Sort key
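A sketch of the invariant check, assuming the catalog's ParquetFile
record carries these IDs (the sort key lives on the Partition and is
checked analogously):

    // Sketch: panic if the candidate files disagree on any shared metadata.
    fn assert_consistent_metadata(files: &[ParquetFile]) {
        let first = files.first().expect("no candidate files");
        for f in files {
            assert_eq!(f.sequencer_id, first.sequencer_id);
            assert_eq!(f.namespace_id, first.namespace_id);
            assert_eq!(f.table_id, first.table_id);
            assert_eq!(f.partition_id, first.partition_id);
        }
    }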
Changes the compaction logic to always reference the same SortKey
instance, rather than repeatedly querying the catalog for it.
The Partition metadata is always read from the catalog as part of
compact_partition(), which previously threw away all metadata except
the sort key and passed that key into compact(). compact() would then
re-query the catalog to look up the sort key again, and mixed the two
instances during use - one passed into the fn, one freshly queried
within it.
Now the Partition metadata is resolved in compact_partition() as
before, but the entire Partition reference is passed to compact(),
which uses it consistently to access the sort key. This also removes
a catalog query per compaction call.
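Roughly, the signature change looks like this (a sketch; the real fn
takes more arguments):

    // Before (sketch): compact() received only the sort key, then
    // queried the catalog for a second copy of it:
    //   async fn compact(&self, files: Vec<ParquetFileWithTombstone>, sort_key: SortKey)
    //
    // After (sketch): the whole Partition is passed in and is the
    // single source of truth for the sort key.
    async fn compact(
        &self,
        files: Vec<ParquetFileWithTombstone>,
        partition: Arc<Partition>, // resolved once in compact_partition()
    ) -> Result<()> {
        let sort_key = partition.sort_key(); // no extra catalog query
        // ... compaction proper ...
        Ok(())
    }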
* refactor: split compact_partition into two functions to handle concurrency better
* feat: limit number of files to compact
* test: add test for limit num files
* chore: fix clippy
* feat: split group if over max size
* fix: split the overlapped group to limit total size or file count (see the sketch after this list)
* chore: reduce config values
* test: add tests and clearer comments for split_overlapped_groups and test_limit_size_and_num_files
* chore: more comments
* chore: cleanup
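The splitting described above might look roughly like this (a sketch
under assumed names; the real code also has to keep truly overlapping
files together):

    // Sketch: greedily chunk a group of files (assumed sorted by min
    // time) so that no subgroup exceeds either limit.
    fn split_overlapped_groups(
        files: Vec<ParquetFile>,
        max_bytes: i64,
        max_files: usize,
    ) -> Vec<Vec<ParquetFile>> {
        let mut groups = Vec::new();
        let mut current: Vec<ParquetFile> = Vec::new();
        let mut size = 0;
        for f in files {
            // Close the current subgroup once adding this file would
            // exceed the size or the file-count limit.
            if !current.is_empty()
                && (size + f.file_size_bytes > max_bytes || current.len() >= max_files)
            {
                groups.push(std::mem::take(&mut current));
                size = 0;
            }
            size += f.file_size_bytes;
            current.push(f);
        }
        if !current.is_empty() {
            groups.push(current);
        }
        groups
    }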
Changes the compactor to consume both StreamSplitExec output partitions
concurrently.
In practice this means the two Parquet files are generated and uploaded
to object store in parallel.
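A sketch of the concurrent consumption; compact_one_partition is a
hypothetical helper that executes one output partition of the plan and
uploads the result, and Error stands in for the compactor's error type:

    use std::sync::Arc;

    use datafusion::physical_plan::ExecutionPlan;
    use futures::future;

    // Sketch: drive both StreamSplitExec output partitions at once so
    // the two Parquet files are serialised and uploaded in parallel.
    async fn compact_both(plan: Arc<dyn ExecutionPlan>) -> Result<(), Error> {
        let (_first, _second) = future::try_join(
            compact_one_partition(Arc::clone(&plan), 0),
            compact_one_partition(Arc::clone(&plan), 1),
        )
        .await?;
        Ok(())
    }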
This commit changes the Compactor::compact() method to stream the
RecordBatch instances directly to the parquet serialiser, before being
uploaded directly to object storage.
Removes the min/max timestamp fields from the IoxMetadata proto
structure embedded within a Parquet file's metadata.
These values are redundant, as they already exist within the Parquet
column statistics. They also precluded streaming serialisation, because
the min/max values had to be known before the file could be serialised.
Remove the redundant row_count from the IoxMetadata structure that is
serialised into the Parquet file.
The reasoning is twofold:
* The Parquet file's native metadata already contains a row count
* Needing to know the number of rows up-front precludes streaming
Implements an upload() method on the ParquetStorage type, consuming a
stream of RecordBatch, serialising the Parquet file, and uploading the
result to object storage. Returns the IOx-specific file metadata.
Currently, while the upload() method accepts a stream of RecordBatch,
the resulting Parquet file is buffered in memory before being uploaded
to the object store, due to the lack of streaming upload functionality
in the ObjectStore abstraction. This isn't the end of the world, as the
files tend to be relatively small with our current usage.
This impl should be easily modified to be fully streaming once streaming
object store puts are implemented:
https://github.com/influxdata/object_store_rs/issues/9
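The shape of the method, sketched with simplified types
(IoxParquetMetaData, UploadError, and the exact bounds are assumptions
here):

    use arrow::{error::ArrowError, record_batch::RecordBatch};
    use futures::Stream;

    impl ParquetStorage {
        /// Serialise `batches` into a Parquet file and PUT it to the
        /// object store, returning the IOx-specific file metadata and
        /// the file size in bytes. (Sketch only.)
        pub async fn upload<S>(
            &self,
            batches: S,
            meta: &IoxMetadata,
        ) -> Result<(IoxParquetMetaData, usize), UploadError>
        where
            S: Stream<Item = Result<RecordBatch, ArrowError>> + Send,
        {
            // 1. Drain the stream into an in-memory Parquet buffer
            //    (streaming puts are blocked on object_store_rs#9).
            // 2. Upload the buffer to the ObjectStore.
            // 3. Return the metadata extracted from the written file.
            unimplemented!("sketch only")
        }
    }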
Changes the code paths that interact with Parquet files in the object
store to reference the ParquetStorage directly (DRY refactor).
This change takes us from a dependency graph of:
        ┌─────────────────┐
        │                 │
        ▼                 │
Parquet Consumer          │
        │         ┌──────────────┐
        ├────────▶│ParquetStorage│
        ▼         └──────────────┘
┌──────────────┐
│ ObjectStore  │
└──────────────┘
        │
   ┌────┴────┐
   ▼         ▼
 File        s3
System      (etc)
to:
Parquet Consumer
        │
        ▼
┌──────────────┐
│ParquetStorage│
└──────────────┘
        │
        ▼
┌──────────────┐
│ ObjectStore  │
└──────────────┘
        │
   ┌────┴────┐
   ▼         ▼
 File        s3
System      (etc)
With this change, ParquetStorage is solely responsible for managing
interactions with the object store when dealing with Parquet files.
Renames the Storage type to ParquetStorage so the context is clear at
the point of use (e.g. in fn args), rather than relying on knowledge of
the fully-qualified import path to know what the type stores.
Removes two unused constructors for ParquetChunk, and converts the
bare fn constructor that is actually used into an associated method (a
conventional constructor).
* refactor: require `Resource`s to be convertible to `u64`
* refactor: require `Resource`s to have a unit name (see the trait sketch after this commit list)
* refactor: make LRU cache IDs static
* feat: add LRU cache metrics
* docs: improve type names in LRU doctest
* docs: explain `MeasuredT`
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: explain `test_metrics`
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* ci: fix cargo deny
* chore: downgrade `socket2`, version 0.4.5 was yanked
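A hypothetical shape for the tightened `Resource` trait mentioned in
the refactor bullets above (names assumed for illustration):

    trait Resource: Into<u64> {
        /// Unit name used when reporting metrics, e.g. "bytes".
        fn unit() -> &'static str;
    }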
* chore: rename `query` to `iox_query`
`query` is already taken on crates.io (and yanked), and I am getting
tired of working around that.
* feat: `SortKey::size`
* feat: `FunctionEstimator`
* feat: querier RAM pool
Let's put all the caches into a single RAM pool, so we can at least
somewhat control RAM usage. Note that this does NOT limit peak memory
during query execution, but it should at least stop unbounded cache
growth; a sketch of the idea follows this commit list. A follow-up PR
will add metrics.
* refactor: improve some size calculations
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
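A minimal sketch of the pooling idea, assuming member caches report
size changes to a shared budget (names are illustrative, not the real
API):

    struct RamPool {
        limit_bytes: u64,
        used_bytes: u64,
    }

    impl RamPool {
        /// Account for a size change in one member cache and evict
        /// when the shared budget is exceeded.
        fn update(&mut self, delta_bytes: i64) {
            self.used_bytes = self.used_bytes.saturating_add_signed(delta_bytes);
            if self.used_bytes > self.limit_bytes {
                // ask member caches to shed entries (elided)
            }
        }
    }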
* chore: move noisy debug to trace and fix some comments
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: fix format
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
These unused dependencies were found by iterating over the dependencies
of each Cargo.toml, then grepping that crate for each dependency's
name. If it didn't show up, I attempted to remove the dependency.
I left a few dependencies that this process flagged:
* generated_types
- `pbjson`, `serde`. Apparently used by the generated code.
* grpc-router-test-gen
- `prost`. Apparently used by the generated code.
* influxdb_iox
- `heappy`. Doesn't appear to be used, but it is behind enough feature
flags that I don't care to reason about them, and it's already optional.
- `tikv_jemalloc_sys`. Appears to be setting a feature flag of an
indirect dependency.
* iox_gitops_adapter
- `k8s_openapi`. Appears to be setting a feature flag of an indirect
dependency.