This allows us to remove the table name from the low-level chunk
representations (like `ParquetFile`, RUB, ...) since table names are
already tracked by the higher-level data structures (e.g. catalog,
catalog chunk) that manage the low-level chunk representations.
This is similar to #4167.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The parquet chunk is always wrapped into some higher-level data
structure (e.g. a catalog chunk, a partition, ...) that knows exactly
"where" the chunk is located. There is no need for the parquet chunk to
back-reference container-level attributes. On the contrary:
double-bookkeeping makes the code more complex and costs additional
memory.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Min/max values and distinct counts are already optional, so let's make
the null counts optional as well. This will be helpful for NG to deal w/
partial statistics (e.g. we only populate stats for the time column).
Note that the total count is still mandatory, but we normally have the
chunk/file-level row count at hand.
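To illustrate the shape this leaves us with, here is a minimal sketch (the field and type names are illustrative, not the actual IOx definitions):
```rust
/// Hypothetical column statistics, not the real IOx type.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats<T> {
    min: Option<T>,              // already optional
    max: Option<T>,              // already optional
    distinct_count: Option<u64>, // already optional
    null_count: Option<u64>,     // now optional as well
    total_count: u64,            // still mandatory: the row count is known
}

fn main() {
    // Partial statistics: only the time column is fully populated ...
    let time = ColumnStats {
        min: Some(1_650_000_000_000_000_000_i64),
        max: Some(1_650_000_001_000_000_000_i64),
        distinct_count: None,
        null_count: Some(0),
        total_count: 1_000,
    };

    // ... while other columns only carry the mandatory total count.
    let other: ColumnStats<i64> = ColumnStats {
        min: None,
        max: None,
        distinct_count: None,
        null_count: None,
        total_count: 1_000,
    };

    println!("{time:?}\n{other:?}");
}
```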
The sort key is optional and currently only produced by `iox_tests`.
Writing it within the ingester/compactor is tracked by #3968. The sort
key is read by the querier (and this will be verified by the query tests
and is required to merge #4103).
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
When created in the catalog, parquet files should always have compaction
level 0. Updating the compaction level should always happen in the
compactor.
Only the catalog should need to know about the initial compaction level
value.
This has the advantages of:
- Not needing to create fake parquet file IDs or fake deleted_at
values that `create` doesn't actually use before insertion
- Not needing too many arguments for create
- Naming the arguments so it's easier to see what value is what
argument, especially in tests
- Easier to reuse arguments or parts of arguments by using copies of
params, which makes it easier to see differences, especially in tests
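Roughly, the params-struct pattern described above could look like this (the names are hypothetical, not the actual catalog API):
```rust
/// Hypothetical parameter struct for creating a parquet file record;
/// the real catalog API differs.
#[derive(Debug, Clone)]
struct ParquetFileParams {
    namespace_id: i64,
    table_id: i64,
    partition_id: i64,
    min_time: i64,
    max_time: i64,
    row_count: u64,
    // No `id`, `deleted_at`, or `compaction_level` here: the catalog
    // assigns the ID and always starts new files at compaction level 0.
}

fn main() {
    let base = ParquetFileParams {
        namespace_id: 1,
        table_id: 2,
        partition_id: 3,
        min_time: 10,
        max_time: 20,
        row_count: 1_000,
    };

    // Struct-update syntax makes it easy to reuse parts of the params in
    // tests and see exactly what differs.
    let other_partition = ParquetFileParams {
        partition_id: 4,
        ..base.clone()
    };

    println!("{base:?}\n{other_partition:?}");
}
```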
It's a bit of a duck-type hack, but if we wanna just use `ParquetFileChunk`
in the new architecture, we somehow need it to accept new-gen paths.
Also, path handling should be somewhat centralized since
ingester/compactor/querier all need to construct them. So having a
`ParquetFilePath` that supports both path styles seems to be a
not-too-bad solution. This should obviously be cleaned up in some
not-too-distant future.
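A hedged sketch of the dual-style idea (the real `ParquetFilePath` and its path layouts differ in detail):
```rust
use std::path::PathBuf;

/// Hypothetical path type that can render either the old-gen or the
/// new-gen object-store layout; illustrative only.
enum ParquetFilePath {
    Old {
        server_id: u32,
        db_name: String,
        table_name: String,
        partition_key: String,
        chunk_id: u32,
    },
    New {
        namespace_id: i64,
        table_id: i64,
        partition_id: i64,
        object_store_id: String, // a UUID in practice
    },
}

impl ParquetFilePath {
    fn relative_path(&self) -> PathBuf {
        match self {
            Self::Old { server_id, db_name, table_name, partition_key, chunk_id } => {
                format!("{server_id}/{db_name}/data/{table_name}/{partition_key}/{chunk_id}.parquet").into()
            }
            Self::New { namespace_id, table_id, partition_id, object_store_id } => {
                format!("{namespace_id}/{table_id}/{partition_id}/{object_store_id}.parquet").into()
            }
        }
    }
}

fn main() {
    let p = ParquetFilePath::New {
        namespace_id: 1,
        table_id: 2,
        partition_id: 3,
        object_store_id: "0000-aaaa".to_string(),
    };
    println!("{}", p.relative_path().display());
}
```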
* feat: Add a way to run ingester with an in-memory catalog from the CLI
If you set the --catalog-dsn string to "mem", an in-memory catalog is
created rather than using that value as a Postgres connection URL. (A
rough sketch of this dispatch appears after this commit list.)
Planning on using this in tests, so not documenting.
* fix: Set default topic to the same value as SHARED_KAFKA_TOPIC
Namely, both should use an underscore. I don't think there's a way to
directly share these values between a constant and an annotation.
* feat: Add a flight API (handshake only) to ingester
* fix: Create partitions if using file-based write buffer
* fix: Change the server fixture to handle ingester server type
For now, the ingester doesn't implement the deployment API. Not sure if
it should or not.
* feat: Start implementing ingester do_get, namely decoding the query
Skip serialization of the predicate for the moment.
* refactor: Rename ingest protos to ingester to match crate name
* refactor: Rename QueryResults to QueryData
* feat: Move ingester flight client to new querier crate
* fix: Off-by-one error, different starting indexes in sequencers
* fix: Create new CLI argument to pick the catalog type
* fix: Create a CLI option to set the number of topics to auto-create in the write buffer
* fix: Check the arrow flight service's health to tell that the ingester gRPC is up
* fix: Set postgres as the default catalog type
* fix: Return an error rather than panicking if CLI args aren't right
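As mentioned above, the "mem" dispatch for --catalog-dsn could look roughly like this (types and function names are invented for illustration, not the actual IOx CLI code):
```rust
/// Hypothetical sketch of how a `--catalog-dsn` value could be dispatched.
enum CatalogConfig {
    Memory,
    Postgres { dsn: String },
}

fn parse_catalog_dsn(dsn: &str) -> CatalogConfig {
    if dsn == "mem" {
        CatalogConfig::Memory
    } else {
        CatalogConfig::Postgres { dsn: dsn.to_string() }
    }
}

fn main() {
    match parse_catalog_dsn("mem") {
        CatalogConfig::Memory => println!("using in-memory catalog"),
        CatalogConfig::Postgres { dsn } => println!("connecting to {dsn}"),
    }
}
```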
* feat: add ProcessedTombstoneRepo
* feat: add function add_parquet_file_with_tombstones
* fix: remove unnecessary use
* feat: handling transaction when adding parquet file and its processed tombstones
* feat: tests update catalog for parquet file and processed tombstones
* fix: make add parquet file & its processed tombstones fully transactional
* chore: cleanup
* test: add integration tests for new catalog update functions
* chore: remove catalog_update.rs
* chore: cleanup
* fix: assert the right values
* fix: create unique namespace
* fix: support non transaction create_many
* test: remove tests that do not work in a transaction
* fix: one more case with unique namespace
* chore: more verification around for better understanding why certain tests fail
* fix: compare difference rather than absolute because the DB already has data
* fix: fix the argument provided to SQL
* fix: return non-empty processed tombstones
* fix: insert the right parquet file
* chore: remove unused file
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: remove InfluxColumnType::IOx
Remove unused column variant - see #3554 for context.
* refactor: reserve SEMANTIC_TYPE_IOX name in proto
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
1. Remove `new_empty` logic. It's a leftover from the time when the
`PreservedCatalog` owned the in-memory catalog.
2. Make `db_name` a part of the `PreservedCatalogConfig`.
Store the "maximum persisted timestamp" instead of the "minimum
unpersisted timestamp". This avoids the need to calculate the next
timestamp from the current one (which was done via "max TS + 1ns").
The old calculation was prone to overflow panics. Since the
timestamps in this calculation originate from user-provided data (and
not the wall clock), this was an easy DoS vector that could be triggered
via the following line protocol:
```text
table_1 foo=1 <i64::MAX>
```
which is
```text
table_1 foo=1 9223372036854775807
```
Bonus points: the timestamp persisted in the partition
checkpoints is now the very same one that was used by the split query
during persistence. Consistency FTW!
Fixes #2225.
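A toy illustration of why the old "+ 1ns" step was dangerous and how storing the maximum persisted timestamp sidesteps it (not the actual IOx code):
```rust
/// The old scheme effectively needed this computation, which has no valid
/// answer when the user writes a point at i64::MAX.
fn next_unpersisted(max_persisted_ns: i64) -> Option<i64> {
    max_persisted_ns.checked_add(1)
}

fn main() {
    assert_eq!(next_unpersisted(42), Some(43));
    // A single line of line protocol (`table_1 foo=1 9223372036854775807`)
    // pushes the maximum to i64::MAX, where "+ 1ns" cannot be represented.
    assert_eq!(next_unpersisted(i64::MAX), None);
    println!("storing the max persisted timestamp avoids the +1ns step");
}
```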
We no longer need hacky pointer tricks to de-duplicate delete predicates
when collecting them for catalog checkpoints. This was once required
when the delete predicates didn't implement `Eq` and `Hash` but now it's
all way easier.
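For illustration, with `Eq` and `Hash` in place the de-duplication boils down to a plain set insert (toy types, not the real ones):
```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Toy stand-in for a delete predicate; the real type lives in IOx.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct DeletePredicate {
    range_start: i64,
    range_end: i64,
    exprs: Vec<String>,
}

fn main() {
    let p1 = Arc::new(DeletePredicate { range_start: 0, range_end: 10, exprs: vec!["tag=a".into()] });
    let p2 = Arc::new(DeletePredicate { range_start: 0, range_end: 10, exprs: vec!["tag=a".into()] });

    // Structural equality makes de-duplication a set insert; no
    // pointer-identity tricks required.
    let mut unique: HashSet<Arc<DeletePredicate>> = HashSet::new();
    unique.insert(p1);
    unique.insert(p2); // structural duplicate, ignored
    assert_eq!(unique.len(), 1);
}
```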
`DeletePredicate` is a simpler version of `Predicate` that is based on
IOx `DeleteExpr` instead of the full-blown DataFusion `Expr`. This will
allow us to do a couple of things (in follow-up changes):
- Order and de-duplicate delete predicates
- Normalize predicates
- Infallible serialization
- Smaller memory footprint
Note that this change only affects delete expressions. Query expressions
that are supported via the API are not changed. The query subsystem also
still uses the full-featured expressions/predicates (delete
expressions/predicates are converted to the more powerful DataFusion
version on-the-fly).
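Sketched very roughly (the actual `DeletePredicate`/`DeleteExpr` types in IOx are richer), the restricted shape is what buys us ordering, de-duplication, and infallible serialization:
```rust
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
enum Scalar {
    Bool(bool),
    I64(i64),
    String(String),
}

#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
enum Op {
    Eq,
    Ne,
}

#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct DeleteExpr {
    column: String,
    op: Op,
    scalar: Scalar,
}

#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct DeletePredicate {
    range: (i64, i64), // [start, end] timestamp range
    exprs: Vec<DeleteExpr>,
}

fn main() {
    // Because the predicate is a small closed type (no arbitrary DataFusion
    // `Expr`), ordering and de-duplication are straightforward.
    let expr = DeleteExpr {
        column: "region".to_string(),
        op: Op::Eq,
        scalar: Scalar::String("west".to_string()),
    };
    let mut preds = vec![
        DeletePredicate { range: (0, 10), exprs: vec![expr.clone()] },
        DeletePredicate { range: (0, 10), exprs: vec![expr] },
    ];
    preds.sort();
    preds.dedup();
    assert_eq!(preds.len(), 1);
}
```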
Due to the timing of the "persist" lifecycle action, the fact that delete
predicates might arrive at any time, and the fact that we don't wanna hold
transaction locks for too long, we should accept delete predicates for
chunks that are currently "persisting" even though that lifecycle action
might fail.
First step towards #2518. Creates the Rust API to communicate delete
predicates between the preserved catalog and the in-memory catalog and
adds tests ensuring that the in-mem catalog produces the wanted errors
as well as correct checkpoints (similar to how this is done for the
parquet file tracking already).
**This does NOT contain the actual preservation!**
We changed from Google timestamps (which use variable-sized integers) to
our own fixed-sized integer timestamps so that the size of the parquet
metadata does not depend on the timestamp. However, with the introduction
of compression the size varies anyway (since slightly different
timestamps lead to different compression results) and we now need
deterministic timestamps for tests. So there is no point in using our own
timestamp type. Switching back to the variable-sized type also shrinks
the post-compression results a bit.
This makes it clearer which traits and functions users of the preserved
catalog must implement. This also splits the error types into smaller
enums that are easier to understand.
This change should make it easier to implement new functionality (like
capturing delete predicates).
Two reasons:
1. I wanna decouple `parquet_file` from `query` (nearly done, needs a
small follow-up PR).
2. `predicate` will have more and more features (like serialization),
which justifies a new home.
We need to hold the parquet metadata in memory so that we're able to
create catalog checkpoints. We used to do that by holding the decoded
structure (provided by the upstream `parquet` crate) in memory and
serializing that data on demand to Apache Thrift.
There are two drawbacks:
1. We did not account for the memory usage of the decoded structures (or
at least not fully).
2. We actually don't need the decoded data in-memory, since for the
checkpoint creation we only need to write the serialized data.
So this PR changes our wrapper so it holds the serialized data which is
then only decoded when it's really necessary. Since the serialized data
is a simple byte vector, we can also easily account for the size.
Note that this makes the accounted size of parquet chunks larger.
However this data was always there, we just ignored it up until now. If
the size of the parquet metadata really becomes an issue, we could trade
some CPU time for memory by compressing it.
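In sketch form (placeholder types; the real `IoxParquetMetaData` wraps the actual Thrift-encoded parquet metadata), the wrapper now looks roughly like this:
```rust
/// Sketch only: hold the serialized metadata as bytes, decode on demand.
struct IoxParquetMetaData {
    /// Thrift-encoded metadata, exactly as it would be written to the
    /// catalog checkpoint.
    data: Vec<u8>,
}

struct DecodedMetadata {
    row_count: usize,
}

impl IoxParquetMetaData {
    fn from_thrift(data: Vec<u8>) -> Self {
        Self { data }
    }

    /// Memory accounting is trivial for a byte vector.
    fn size(&self) -> usize {
        std::mem::size_of::<Self>() + self.data.capacity()
    }

    /// Decode only when actually needed (placeholder implementation).
    fn decode(&self) -> DecodedMetadata {
        DecodedMetadata { row_count: self.data.len() } // stand-in logic
    }
}

fn main() {
    let md = IoxParquetMetaData::from_thrift(vec![0u8; 128]);
    println!("accounted size: {} bytes", md.size());
    let _decoded = md.decode(); // only pay the CPU cost when required
}
```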
* refactor: remove display methods, use fmt::Display instead.
Signed-off-by: Ning Sun <sunng@protonmail.com>
* refactor: update a few calls from .display to .to_string()
* fix: consistently use `Path` rather than occasionally `DirsAndFileName`
* fix: fixup for merge conflicts
* fix: update test
* fix: Catch another case or two
* fix: fmt
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: nocache feature code rot
The MBChunk::snapshot code when using the "nocache" option no longer
compiles - this commit updates it to match the not(nocache) code.
* build: use updated broken_intra_doc_links name
The broken_intra_doc_links lint was renamed
rustdoc::broken_intra_doc_links
https://doc.rust-lang.org/rustdoc/lints.html
* chore: Update deps (including arrow 5.0.0 --> arrow 5.1.0)
* chore: update all the things
* refactor: Update serving readiness check due to change in Tonic API
* chore: update more deps
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
In one case where ParquetChunk::new was being called, the calling code
had just parsed the IoxMetadata too. In the other case, the calling code
had just *created* the IoxMetadata being parsed. In both cases, this
re-parsing wasn't actually needed; the two bits of info that
ParquetChunk::new needs can easily be passed in.
First of all, using a partition checkpoint as some kind of intermediate
representation was kind of a hack, because partition checkpoints should
only be created for to-be-persisted partitions, not for the others.
API-wise it should only be possible to construct a partition checkpoint
from a flush handle.
Also we were only able to construct partition checkpoints for partitions
that had unpersisted data, otherwise there was no sane way to fill the
`min_unpersisted_timestamp`. We must however scan all partitions no
matter if there is unpersisted data so that we can determine the maximum
seen sequence numbers. This was caught by a replay test resulting in a
catalog state where the last database checkpoint had lower maximum seen
sequence numbers than some partition checkpoint, bailing out with an
error.
So overall it turns out that passing the sequencer numbers directly
instead of wrapping them into a partition checkpoint is the better
implementation.
* fix: Properly record total_count and null_count in statistics
* fix: fix statistics calculation in mutable_buffer
* refactor: expose null counts in read_buffer
* refactor: expose null_count in parquet_file
* fix: update server crate tests
* fix: update query_tests tests
* docs: tweak comments
* refactor: Use storage_stats rather than adding `null_count`
* refactor: rename test data field for clarity
* fix: fixup merge conflicts
* refactor: rename initial_non_null_count to initial_total_count
* refactor: calculate null_count as row_count - to_add
The read_statistics, read_statistics_from_parquet_row_group,
load_parquet_from_store, and load_parquet_from_store_for_chunk functions
weren't ever using the table name; they just passed it around and passed
it back.
Since the consumers of ObjectStore always use the concrete type rather than the ObjectStoreApi trait, it makes more sense to just change the concrete type to have a pointer to the cache. This removes the cache from the ObjectStoreApi trait and changes the ObjectStore to be a regular struct rather than a tuple around the ObjectStoreIntegration. Future work will have the server configure the cache on the ObjectStore struct when its options are set.
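A minimal sketch of the resulting shape, with illustrative names rather than the exact IOx definitions:
```rust
use std::sync::Arc;

struct ObjectStoreCache {
    root: std::path::PathBuf,
}

enum ObjectStoreIntegration {
    InMemory,
    File { root: std::path::PathBuf },
    // AmazonS3, GoogleCloudStorage, ... in the real crate
}

/// A regular struct instead of a tuple wrapper, carrying an optional cache.
struct ObjectStore {
    integration: ObjectStoreIntegration,
    cache: Option<Arc<ObjectStoreCache>>,
}

impl ObjectStore {
    fn new_in_memory() -> Self {
        Self { integration: ObjectStoreIntegration::InMemory, cache: None }
    }

    /// The server can configure the cache later, when its options are set.
    fn set_cache(&mut self, cache: Arc<ObjectStoreCache>) {
        self.cache = Some(cache);
    }
}

fn main() {
    let mut store = ObjectStore::new_in_memory();
    store.set_cache(Arc::new(ObjectStoreCache { root: "/tmp/iox-cache".into() }));
    println!("cache configured: {}", store.cache.is_some());
}
```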
Now we can handle all these cases:
There are two partitions w/ a single write each:
1. A reads sequence number 1
2. B reads sequence number 2
3. we persist A which only knows the sequences up until 1
=> the DB checkpoint needs the global max, otherwise we forget sequences
during replay (2 in this case, so B would be gone)
1. B reads sequence number 1
2. A reads sequence number 2
3. we persist A which (w/o this commit) would not track the sequencer at
all in this checkpoint (since there is nothing to replay)
=> we MUST also remember that we already read up until 2, otherwise we'll
re-read 2 after replay
=> the partition checkpoint needs the local seen max (no matter if there's
something to persist)
This is required to correctly handle the following case:
1. There are two partitions A and B w/ a single write each (from the same
sequencer).
2. We persist A:
- The partition checkpoint for A will be empty because after persistence
there will be nothing to replay (the single write is persisted and
we're ready).
- The database checkpoint that contains the global minimum of all ranges
recognizes that for the sequencer there is indeed something left (the
minimum sequence number from B).
3. DB restart happens, replay starts
4. We scan all persisted files, figure out that we have a DB checkpoint
with a sequence minimum but (w/o the change in this commit) there is no
maximum. Only partition checkpoints contain maxima, and the only partition
checkpoint that was persisted was the one for partition A and that one was
empty (see above).
5. So now how do we recover partition B?
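A toy model of the bookkeeping described in the last two notes (the names are invented; the real checkpoint structures carry more information):
```rust
use std::collections::BTreeMap;

/// Hypothetical per-sequencer replay range: the minimum still-unpersisted
/// sequence number (if any) plus the maximum sequence number seen so far.
#[derive(Debug, Clone)]
struct SequenceRange {
    min_unpersisted: Option<u64>,
    max_seen: u64,
}

fn main() {
    // Partition A is fully persisted, but its checkpoint still records that
    // sequencer 0 has been read up to 2; otherwise replay would re-read 2.
    let mut partition_a: BTreeMap<u32, SequenceRange> = BTreeMap::new();
    partition_a.insert(0, SequenceRange { min_unpersisted: None, max_seen: 2 });

    // The database checkpoint keeps the global picture: partition B still
    // has sequence number 1 unpersisted, and the global maximum seen is 2,
    // so replay knows both where to start and where it may stop.
    let mut db_checkpoint: BTreeMap<u32, SequenceRange> = BTreeMap::new();
    db_checkpoint.insert(0, SequenceRange { min_unpersisted: Some(1), max_seen: 2 });

    println!("partition A: {partition_a:?}");
    println!("database:    {db_checkpoint:?}");
}
```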
This has the (somewhat nice) side effect that it shrinks the in-mem
catalog a bit as well because now `ParquetChunk` is a bit smaller, making
the chunk stage enum smaller as well.
The entire persistence windows data structures (including the
checkpoints) have nothing to do with the mutable buffer per se. So lets
move them into their own crate. This also makes `parquet_file` no
longer depend on `mutable_buffer`.
Instead of (ab)using the transaction lock to prevent the cleanup job
from removing just-written parquet files, use a dedicated lock. This
will later allow us to write parquet files before starting a transaction
(i.e. w/o holding the transaction lock).
This will help with #1821.
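A minimal sketch of the locking pattern, assuming a plain read-write lock (the actual lock types in IOx differ):
```rust
use std::sync::{Arc, RwLock};

/// Writers take a shared guard on a dedicated cleanup lock while writing
/// parquet files; the cleanup job takes the exclusive guard, so it can never
/// observe (and delete) a file that is written but not yet referenced by
/// the catalog.
fn main() {
    let cleanup_lock = Arc::new(RwLock::new(()));

    // Writer: shared guard, independent of any transaction lock.
    {
        let _guard = cleanup_lock.read().unwrap();
        // ... write parquet file to object store, then add it to the catalog ...
    }

    // Cleanup job: exclusive guard, runs only when no writer holds the lock.
    {
        let _guard = cleanup_lock.write().unwrap();
        // ... list unreferenced files and delete them ...
    }

    println!("writers and cleanup are serialized by the dedicated lock");
}
```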
Stop relying on in-parquet transaction information during catalog
rebuilds. This has some downsides (no fork detection, only a single
transaction hence no time travel) but will allow that we remove
transaction information from parquet files, so that we can finally move
the actual parquet file storage out of the transaction lock.
This will help with #1821.
* refactor: pull Channel --> Stream adapter into its own module
* docs: Update query/src/exec/stream.rs
Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
Change schema from
```text
<server_id>/<db_name>/data/<part_key>/<chunk_id>/<table_name>.parquet
```
to
```text
<server_id>/<db_name>/data/<table_name>/<part_key>/<chunk_id>.<uuid>.parquet
```
So parquet files will NEVER be overwritten. This is especially helpful
when dealing with old catalog leftovers (i.e. a parquet file that
belonged to an old but wiped catalog). It also simplifies the reasoning
about file references in the future and follows what other dataset
formats are usually doing (i.e. never replace files).
Also use `ChunkAddr` where it makes sense.
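For illustration, assembling the new-style location could look like this (a sketch assuming the `uuid` crate; the real path-building code in IOx is different):
```rust
use uuid::Uuid; // external `uuid` crate, "v4" feature

// Sketch: the UUID suffix guarantees that two writes of the "same" chunk
// never map to the same object-store key, so files are never overwritten.
fn parquet_location(
    server_id: u32,
    db_name: &str,
    table_name: &str,
    part_key: &str,
    chunk_id: u32,
) -> String {
    let uuid = Uuid::new_v4();
    format!("{server_id}/{db_name}/data/{table_name}/{part_key}/{chunk_id}.{uuid}.parquet")
}

fn main() {
    let a = parquet_location(1, "my_db", "cpu", "2022-01-01", 42);
    let b = parquet_location(1, "my_db", "cpu", "2022-01-01", 42);
    assert_ne!(a, b); // distinct keys for repeated writes of the same chunk
    println!("{a}\n{b}");
}
```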
Don't mix commit+checkpoint in a single call so that the caller has to
reason about the error type and which of the two operations has failed.
Splitting it also makes it easier to create the correct checkpoint data.
* refactor: add ChunkPath to describe location of chunk in catalog
* refactor: rename ChunkPath to ChunkAddr
* chore: further renames
* chore: even more renames
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This now allows not only for copy-based transaction handling but also
for eager exec and rollbacks. This will be useful to properly implement
transaction aborts for the "real" catalog.
This should have been done in #1714. Also add a note so that future devs
might hopefully not forget. In any case, the code also works w/o
this bump; it's just that the error message is a bit less nice ("cannot
parse IOxMetadata" instead of "unsupported catalog version").
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This prevents users of `parquet_file::metadata` from also depending on
`parquet` directly. Furthermore, they don't need to import dozens of
functions and can instead just use `IoxParquetMetaData` directly.
This will be useful for #1381.
At the moment we parse schema and stats eagerly and store them alongside
the parquet metadata in memory. Technically this is not required since
this is basically duplicate data. In the future we might trade off some
of this memory against CPU consumption by parsing schema and stats on
demand.
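If we ever go down that route, a lazy variant could look roughly like this (placeholder types; just a sketch of the trade-off, not the current implementation):
```rust
use std::cell::OnceCell;

#[derive(Debug)]
struct Schema(Vec<String>); // placeholder, not the real IOx schema type

struct CachedMetadata {
    serialized: Vec<u8>,      // Thrift-encoded parquet metadata
    schema: OnceCell<Schema>, // decoded lazily, at most once
}

impl CachedMetadata {
    fn new(serialized: Vec<u8>) -> Self {
        Self { serialized, schema: OnceCell::new() }
    }

    fn schema(&self) -> &Schema {
        self.schema.get_or_init(|| {
            // Placeholder for "decode the metadata and derive the schema".
            Schema(vec![format!("decoded {} bytes", self.serialized.len())])
        })
    }
}

fn main() {
    let md = CachedMetadata::new(vec![0u8; 64]);
    println!("{:?}", md.schema()); // first call pays the CPU cost
    println!("{:?}", md.schema()); // later calls reuse the cached value
}
```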