* feat: introduce a new way of max_sequence_number for ingester, compactor and querier
* chore: cleanup
* feat: new column max_l0_created_at to order files for deduplication
* chore: cleanup
* chore: debug info for chnaging cpu.parquet
* fix: update test parquet file
Co-authored-by: Marco Neumann <marco@crepererum.net>
We basically assume everywhere that a column falls into one of the three
known categories (time, tag, field), so lets encode this in our type
system instead of defining "unknown" as "undefined behavior, may or may
not crash".
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Use a proper typed stream instead of peeking the first element. This is
more in line with our remaining stack and shall also improve error
handling.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Previously when attempting to serialise a stream of one or more
RecordBatch containing no rows (resulting in an empty file), the parquet
serialisation code would panic.
This changes the code path to raise an error instead, to support the
compactor making multiple splits at once, which may overlap a single
chunk:
────────────── Time ────────────▶
│ │
┌█████──────────────────────█████┐
│█████ │ Chunk 1 │ █████│
└█████──────────────────────█████┘
│ │
│ │
Split T1 Split T2
In the example above, the chunk has an unusual distribution of write
timestamps over the time range it covers, with all data having a
timestamp before T1, or after T2. When a running a SplitExec to slice
this chunk at T1 and T2, the middle of the resulting 3 subsets will
contain no rows. Because we store only the min/max timestamps in the
chunk statistics, it is unfortunately impossible to prune one of these
split points from the plan ahead of time.
* refactor: remove min_sequnce_number
* fix: typos
* fix: remove min_sequencer_number from new files from merging main
* fix: add back throwing error if the compactor compacts files persisted by the ingester after the ingester sends max seq_num back to querier
* test: add test_compactor_collision back but modify the input to make it work woth new changes
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: change level 1 to level 2 preparing for next design changes
* fix: make level-2 consistent everywhere
* chore: remove unused comments
* refactor: change all the name level_1 to level_2 to completely replace 1 with 2 to amke everything consistent
* chore: add correspinding constants for the comapction levels in the comments
Co-authored-by: Dom <dom@itsallbroken.com>
* test: test that hits panic becasue of no column meta data
* chore: Apply suggestions from code review
* chore: run format after applying changes
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* chore: run clippy
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* test: reproducer for 4695
* chore: some debug info
* test: test with many columns and rows
* chore: cleanup and add debug info
* chore: cleanup
* chore: cleanup
* chore: more debug info
Removes the min/max timestamp fields from the IoxMetadata proto
structure embedded within a Parquet file's metadata.
These values are redundant as they already exist within the Parquet
column statistics, and precluded streaming serialisation as these
removed min/max values were needed before serialising the file.
Remove the redundant row_count from the IoxMetadata structure that is
serialised into the Parquet file.
The reasoning is twofold:
* The Parquet file's native metadata already contains a row count
* Needing to know the number of rows up-front precludes streaming
Adds two integration tests covering validation of the embedded IOx
metadata within the Parquet file metadata, and validation of the derived
ParquetFileParams metadata used to populate the catalog.