* fix: `Tombstone::size` must include serialized predicate
* fix: `CachedPartition::size` must include `Arc` heap allocation
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: remove `DecodedParquetFile` from `iox_tests`
* refactor: remove `DecodedParquetFile` from querier
Also pull out all the chunk schema and sort key handling into a function
so that RB chunks and parquet chunks mostly use the same code path.
* refactor: remove `DecodedParquetFile`
* refactor: remove `ParquetFileWithMetadata` usage
* fix: test data consistency
Previously the column data type was exposed using an internal i32 value.
This commit changes the Schema API to use a self-descriptive proto enum
for the column data type.
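A minimal sketch of the idea, assuming hypothetical variant names and i32 values (the real definitions live in the IOx proto files): callers now see a self-descriptive type instead of a bare `i32`.

```rust
/// Hypothetical sketch of a self-descriptive column data type enum replacing a raw
/// i32; variant names and numeric values are illustrative, not the IOx proto.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnDataType {
    I64,
    U64,
    F64,
    Bool,
    String,
    Time,
    Tag,
}

impl ColumnDataType {
    /// Convert from the legacy internal i32 representation, rejecting unknown
    /// values instead of silently passing them through.
    pub fn from_i32(value: i32) -> Option<Self> {
        match value {
            1 => Some(Self::I64),
            2 => Some(Self::U64),
            3 => Some(Self::F64),
            4 => Some(Self::Bool),
            5 => Some(Self::String),
            6 => Some(Self::Time),
            7 => Some(Self::Tag),
            _ => None,
        }
    }
}

fn main() {
    // A caller now sees `Some(Tag)` instead of an opaque `7`.
    assert_eq!(ColumnDataType::from_i32(7), Some(ColumnDataType::Tag));
    assert_eq!(ColumnDataType::from_i32(42), None);
}
```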
* refactor: simplify sort key calculation
* refactor: use schema from catalog instead from file
* refactor: do not request parquet file MD in compactor
* test: ensure that `QueryableParquetChunk` works correctly
* refactor: avoid feeding sort key from struct into same struct
* feat: allow namespace schema query by ID
* refactor: do not use binary parquet file MD in compactor tests
* refactor: do not use in-parquet IOx metadata
* refactor: reduce number of catalog queries
* fix: avoid using `min_time`, which can be negative, for `ChunkId`; use the object store ID (a UUID) instead
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: run fmt
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: compute split times of compaction results based on the max file size
* feat: consider max file size while computing split time
* test: tests for `compute_split_time`
* feat: first step to teach the function `split_the_stream` how to split data into n streams using n-1 input PhysicalExprs
* feat: make StreamSplitNode support a list of expressions
* docs: explain how StreamSplitNode works
* feat: Teach compute_split_time to split a time range into many contiguous ranges and split the compacted result into multiple non-overlapping files based on the config compaction_max_size_bytes (see the sketch below)
* chore: cleanup
* chore: clean up doc
* chore: address review comments
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
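A rough sketch of the split-time computation referenced above; the function name, signature, and rounding are assumptions, not the actual compactor code.

```rust
/// Illustrative sketch only: given the time range and estimated size of a
/// compacted output, return the split times dividing the range into contiguous,
/// non-overlapping chunks so each output file stays below `max_file_size_bytes`.
fn compute_split_times(
    min_time: i64,
    max_time: i64,
    total_size_bytes: u64,
    max_file_size_bytes: u64,
) -> Vec<i64> {
    // If everything fits into one file, no split is needed.
    if total_size_bytes <= max_file_size_bytes || max_time <= min_time {
        return vec![];
    }

    // Number of output files, rounded up.
    let n = (total_size_bytes + max_file_size_bytes - 1) / max_file_size_bytes;

    // Emit n - 1 split points; the ranges between them are contiguous.
    let width = (max_time - min_time) as u64;
    (1..n).map(|i| min_time + (width * i / n) as i64).collect()
}

fn main() {
    // ~3 files worth of data over [0, 300) => two split points.
    assert_eq!(compute_split_times(0, 300, 300, 100), vec![100, 200]);
    // Fits into one file => no splits.
    assert!(compute_split_times(0, 300, 80, 100).is_empty());
}
```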
The ingester maintains a rough "total memory in use" counter it uses to
try and limit the amount of memory the ingester is using overall.
When a partition is persisted, this total memory usage value is adjusted
to account for releasing the partition memory. Prior to this commit, the
ordering was:
* Writes increase the memory counter
* maybe_persist() is called to trigger persistence
* A partition is identified for persistence
* Partition memory usage is released back to the total memory counter
* Persistence starts
This meant that the partitions in the process of being persisted were
not accounted for in the ingester's total memory counter, and therefore
we could significantly overrun the configured memory limit.
After this commit, the ordering is:
* Writes increase the memory counter
* maybe_persist() is called to trigger persistence
* A partition is identified for persistence
* Persistence starts
* Persistence completes
* Partition memory usage is released back to the total memory counter
This ensures persisting partitions are still tracked in the total memory
counter, causing pauses to fire correctly.
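A minimal sketch of the new ordering, using hypothetical names for the memory counter and the persist step (not the actual ingester types):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the ingester's "total memory in use" counter.
struct MemoryTracker {
    total_bytes: AtomicUsize,
}

impl MemoryTracker {
    fn release(&self, bytes: usize) {
        self.total_bytes.fetch_sub(bytes, Ordering::Relaxed);
    }
    fn in_use(&self) -> usize {
        self.total_bytes.load(Ordering::Relaxed)
    }
}

/// Stand-in for writing the partition's buffered data to object storage.
fn do_persist(_partition_bytes: usize) {}

fn persist_partition(partition_bytes: usize, tracker: &MemoryTracker) {
    // Previously the counter was decremented *here*, before persistence started,
    // so partitions in flight were invisible to the memory limit.

    // Now persistence runs while the bytes are still accounted for...
    do_persist(partition_bytes);

    // ...and the memory is only released once persistence has completed, so
    // ingest pauses keep firing while the persist job still holds the data.
    tracker.release(partition_bytes);
}

fn main() {
    let tracker = MemoryTracker {
        total_bytes: AtomicUsize::new(1024),
    };
    persist_partition(1024, &tracker);
    assert_eq!(tracker.in_use(), 0);
}
```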
* feat: conversion from `ParquetFile` to `ParquetFilePath`
* refactor: slim down parquet chunk
- ensure it works without binary parquet metadata
- timestamp range is no longer optional (ensured by the NG type system)
- remove table summary: this is only needed for SOME API users. The
compactor works perfectly well without statistics since it has the
timestamp range, which is sufficient for the current overlap check (we
don't use any other primary key stats at the moment). The querier
currently does NOT use parquet chunks (they were replaced by the read
buffer), but if it does so again in the future it will likely need a
way to fetch and cache the statistics.
- the schema is now provided by the API user since it can be
reconstructed using the NG catalog only (and "wrong" column orders are
tolerated as of #4921); see the sketch below
Ref #4124
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
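A hedged sketch of what the slimmed-down chunk looks like after this refactor; all names and fields are illustrative, not the real `parquet_file` types.

```rust
use std::sync::Arc;

/// Stand-in for the IOx schema type; in the real code the caller supplies it.
struct Schema;

/// No longer optional: the NG type system guarantees a range exists.
struct TimestampRange {
    min: i64,
    max: i64,
}

/// Hypothetical slimmed-down chunk: no decoded parquet metadata, no table
/// summary, schema provided by the API user.
struct ParquetChunk {
    schema: Arc<Schema>,
    timestamp_range: TimestampRange,
}

impl ParquetChunk {
    fn new(schema: Arc<Schema>, timestamp_range: TimestampRange) -> Self {
        Self {
            schema,
            timestamp_range,
        }
    }
}

fn main() {
    let chunk = ParquetChunk::new(Arc::new(Schema), TimestampRange { min: 0, max: 100 });
    // The current overlap check only needs the timestamp range, not full statistics.
    assert!(chunk.timestamp_range.min <= chunk.timestamp_range.max);
    let _ = Arc::clone(&chunk.schema);
}
```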
* fix: use proper sort key in tests
* feat: do not rely on encoded parquet metadata for RB chunks
Ref #4124.
* refactor: allocate fewer strings
* refactor: use upstream PK calculation
* fix: cache expiration w/o a good reason
* refactor: make namespace cache safer to use
* refactor: make partition cache safer to use
* fix: column handling when reading parquet files
This improves/fixes/tests a few aspects when reading parquet files:
- fix usage of `Selection::Some(...)`. This was broken since #4912 but
apparently no test caught that.
- ensure that the order of `Selection::Some(...)` is preserved
- ensure that schema metadata is attached to output batches
- ignore parquet columns that we don't care about (i.e. do not select)
- allow the parquet file to have a different column order than our
internal bookkeeping; this makes it way simpler to read parquet files
w/o scanning the metadata first (see the sketch below)
- extend the test coverage
Ref #4124.
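A hedged sketch of the order-preserving, tolerant projection described above, written against the `arrow` crate; the helper name is made up and this is not the IOx implementation.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::record_batch::RecordBatch;

/// Project a record batch down to the requested columns, preserving the
/// *selection* order and silently skipping columns the file does not contain.
fn project_in_selection_order(batch: &RecordBatch, selection: &[&str]) -> Result<RecordBatch> {
    let schema = batch.schema();
    let mut fields = Vec::new();
    let mut columns: Vec<ArrayRef> = Vec::new();

    for name in selection {
        // Ignore requested columns missing from this particular file.
        if let Ok(idx) = schema.index_of(name) {
            fields.push(schema.field(idx).clone());
            columns.push(Arc::clone(batch.column(idx)));
        }
    }

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}

fn main() -> Result<()> {
    // The file stores columns in its own order: time, tag.
    let file_schema = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Int64, false),
        Field::new("tag", DataType::Utf8, true),
    ]));
    let batch = RecordBatch::try_new(
        file_schema,
        vec![
            Arc::new(Int64Array::from(vec![1, 2])) as ArrayRef,
            Arc::new(StringArray::from(vec!["a", "b"])) as ArrayRef,
        ],
    )?;

    // The selection asks for tag first, then time, plus a column the file lacks.
    let projected = project_in_selection_order(&batch, &["tag", "time", "missing"])?;
    assert_eq!(projected.schema().field(0).name(), "tag");
    assert_eq!(projected.num_columns(), 2);
    Ok(())
}
```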
* test: even more tests for parquet reader
Query internals are not meant to be used by other crates. Only a
handful of selected interfaces should be used by IOxD and the query
tests. The compactor only used a very small subset, just to read
parquet files back into memory; it should use the official
`parquet_file` interface instead.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Fixes interaction of `maybe_skip_kafka_integration!` and `should_panic`
by ensuring that `maybe_skip_kafka_integration!` panics to skip
`should_panic` tests.
Without that it is not possible to just run `cargo test -p write_buffer`.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
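A hedged sketch of the mechanism (the env var and macro arms are assumptions, not the exact write_buffer helper): when the integration environment is missing, the macro panics with the message the `#[should_panic]` test expects, so the skipped test still passes.

```rust
/// Illustrative sketch only; the real `maybe_skip_kafka_integration!` lives in
/// the IOx test helpers and its exact behaviour may differ.
macro_rules! maybe_skip_kafka_integration {
    // Plain form: return early from the test if no broker address is configured.
    () => {{
        match std::env::var("KAFKA_CONNECT") {
            Ok(addr) => addr,
            Err(_) => {
                eprintln!("skipping Kafka integration test");
                return;
            }
        }
    }};
    // `should_panic` form: panic with the expected message instead of returning,
    // because an early `return` would make a `#[should_panic]` test fail.
    ($panic_msg:expr) => {{
        match std::env::var("KAFKA_CONNECT") {
            Ok(addr) => addr,
            Err(_) => panic!("{}", $panic_msg),
        }
    }};
}

#[test]
fn writes_are_acknowledged() {
    let _addr = maybe_skip_kafka_integration!();
    // real test body would go here
}

#[test]
#[should_panic(expected = "broker unavailable")]
fn writes_panic_when_broker_is_down() {
    // When KAFKA_CONNECT is unset this panics with the expected message and the
    // test still passes; when it is set, the real test body runs.
    let _addr = maybe_skip_kafka_integration!("broker unavailable");
    panic!("broker unavailable"); // stand-in for the real test body
}
```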
* refactor: `TestPartition::update_sort_key` should return an `Arc`
The whole test framework is built around `Arc`s, so let's fix this
consistency issue.
* fix: actually calculate correct column set in test framework
* feat: check expected parquet file schema
While working on the querier I made some mistakes regarding schemas,
and such a check would have greatly improved the debugging experience.
* feat: namespace cache expiration
* fix: improve parquet schema check
* fix: remove clone
Changes the ingester to use the partition key derived in the router and
transmitted over the kafka API boundary.
This should result in no observable behavioural change, but it makes
the system more resilient, as we no longer assume the partitioning
algorithm produces the same value in both the router (where data is
partitioned) and the ingester (where data is persisted, segregated by
partition key).
This is a pre-requisite to allowing the user to specify partitioning
schemes.
The low-level chunk storage shouldn't care about the table name (this is
also true for parquet chunks btw). In fact, the table name is already
only partial information, since it lacks the namespace.
If we need a table name, then the high-level chunk/data management is
responsible for that.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: store per-file column set in catalog
Together with the table-wide schema and the partition-wide sort key, this should
be everything we need to read a parquet file directly into memory
without peeking at any file-level metadata.
The querier will use this to directly load parquet files into the read
buffer.
**WARNING: This requires a catalog wipe!**
Ref #4124.
* refactor: use proper `ColumnSet` type
Changes the kafka message wire format to include the partition key for
serialised DML writes.
After this commit, the kafka messages will contain the partition key
for each op, but this information will go unused in the ingester; this
enables us to roll out the producer side before making the value's
presence necessary on the consumer side.
A follow-up PR will change the ingester to utilise this embedded
partition key.
This has the unfortunate side effect of making the partition key part of
the public gRPC write API:
https://github.com/influxdata/influxdb_iox/issues/4866
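A minimal sketch of the rollout shape described above; the struct and field names are illustrative, not the actual DML/wire types.

```rust
/// Illustrative sketch: the serialised DML write now carries the router-derived
/// partition key alongside the payload.
struct DmlWrite {
    namespace: String,
    /// Present on the wire once the producer side is rolled out; the consumer
    /// ignores it until a follow-up change, so it is modelled as optional here.
    partition_key: Option<String>,
    payload: Vec<u8>,
}

fn main() {
    let op = DmlWrite {
        namespace: "org_bucket".to_string(),
        partition_key: Some("2022-06-21".to_string()),
        payload: vec![],
    };

    // During the rollout the ingester simply does not look at the key yet.
    let _ignored_for_now = op.partition_key.as_deref();
    let _ = (op.namespace, op.payload);
}
```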
* chore: reduce proptest features
* chore: remove `grpc-router`
This crate is currently unused and we don't have immediate plans to use
it. And it's still in Git history, so it can always be restored.
* chore: `cargo update`
* refactor(querier): split ingester partitions into chunks
With the new wire protocol the ingester can now transmit multiple
snapshots per partition with different schemas. This changes the
querier to reflect this: it now uses the individual snapshots as chunks
for the query engine instead of a single partition.
The schema handling was changed so that instead of enforcing a
table-wide schema, we now use the snapshot-specific projections. This
means we
do not need to create all-NULL columns any longer because the batches
within the chunks now always have the correct schema.
* refactor: "disassembler" -> "decoder"