This is mostly wiring that builds on top of the other PRs linked to #8089.
I think we eventually could make the batching code nicer by adding
better wrappers / helpers, but lets do that if we have other batched
caches and this patterns proofs to be useful.
Closes#8089.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
These places are sorting by `PartitionId` currently, which implements
`Copy`, but are about to be changed to be sorted on `PartitionHashId`,
which does not implement `Copy`.
The ingester can project arbitrary columns at query time, and has no
special requirement that the "time" column be part of that projection.
Because the timestamp summary generation explicitly requires the time
column to exist, it panics when there's no "time" column in the
projection - this is a bit of a modelling mismatch more than anything.
Similar to #8109.
This was once implemented by the RUB but as it stands right now, no
chunk implements this anymore.
If we ever want to bring this back, we should use the output of
`QueryChunk::data` instead (i.e. use a data-based implementation instead
of a per-chunk one).
Closes#8096.
* refactor: convert projection mask earlier
* refactor: bundle projection schema calculation
Same as #8102 but for the projected schema. This now has a nice side
effect:
1. there is no longer a per chunk cache lookup
2. there is no longer ANY per chunk async computation
3. we no longer need an early pruning stage for the chunks (we've used
to do that so we can throw away chunks before doing the more
expensive part of the chunk creation)
This nicely streamlines and simplifies the code.
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This interface was once specially implemented by the RUB. The only
actual implementation of it is within the querier that just forwards it
to a simple schema scan. Lift this semantic to `iox_query_influxrpc`
instead so all the chunks can use it.
If we ever want to optimize this again, we should use `QueryChunk::data`
instead (i.e. instead of implementing it within the chunk it should use
the data method and do something smart based on that).
First half of #8096.
Do not (ab)use per-chunk delete predicates for the retention policy.
Instead use a per-table predicate.
This makes the code way cleaner, since the scoping is correct (i.e.
delete predicates are a table-wide attribute, not a chunk-based one) and
it is consistent time predicates that the user providers (e.g. via
`WHERE time > x`).
It also allows us to remove delete predicates (in their current,
non-scalable form) from the query path. A potential future version would
likely not use per chunk predicates (and "is processed" markers) but use
the timestamp / chunk order to determine to which data the predicate
should be applied.
Note that the lowering of the retention policy changed slightly from
```text
(time > (now() - retention)) AND (time < MAX)
```
to
```text
time > (now() - retention)
```
Since the `MAX` cut is just an artifact of the lowering and was unnecessary.
Closes#7409.
Closes#7410.
* test: add regression test for high number of partition cache accesses
* refactor: bundle partition cache requests
Instead of accessing the partition cache for every single ingester
partition and parquet file, just collect all the partitions first and
request every partition only ones. Since the cache system needs to do
some locking and some bookkeeping (e.g. for LRU), this alone should be a
minimal perf win (the cache is quite efficient, so this might not be
measurable). However it also enables batching for catalog requests in
the future, see #8089.
* fix: typo
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
For #8089 I would like to request each partition only once. Since
internally we store both the sort key and the column ranges in one cache
value anyways, there is no reason to offer two different methods to look
them up.
This only changes the `PartitionCache` interface. The actual lookups are
still separate, but will be changed in a follow-up.
We need to decode the ingester data in a serial fashion (since it is a
data stream). Cache access during that phase is costly since we cannot
parallize that. To avoid that, we gather the column ranges AFTER
decoding and calculate the chunk statistics accordingly.
This refactoring also removes the partition sort key from ingester
partitions since they are not required anymore. They are a leftover of
the old physical query planning. They were not marked as "unused" since
they were used by some test code.
Required for #8089.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The circuit breaker needs to act on concurrent requests to the same
ingester. To do that, it performs the following steps per request:
1. check current circuit state (if open, then exit here)
2. perform request (if closed or as a half-open test request)
3. change circuit state based on results
Now only step 1 and step 3 hold locks to allow concurrency. This means
that in the meantime, the circuit state might change. To check that, the
circuit state has a generation counter.
The bug now was an overly strong assumption on the generation counter /
state change. Namely that if we are in step 3 and the state is
"half-open", then nobody else could have changed the state in the
meantime because for a single ingester, there can only be one test
request for the half-open state. While the latter part of this is
correct, the former is wrong. Namely we could have started in step 1
with a closed circuit and ended in a half-open one. Namely if the
following sequence happen:
1. request, blocks on upstream
2. circuit breaks
3. some time passes
4. a half-open requests starts, blocks on upstream
5. request from step 1 returns, finds itself confused
This now fixes the assertion (both in case that the request from step 1
succeeds and fails).
Includes tests for the two scenarios (`test_late_failure_after_half_open`,
`test_late_ok_after_half_open`) and an additional one that I came up with
while thinking about the issue (`test_late_failure_after_recovery`, was
passing on `main` but still good to have).
Fixes#8065.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This will hold the deterministic ID for partitions.
Until all existing partitions have this value, this is optional/nullable.
The row ID still exists and is used as the main foreign key in the
parquet_file and skipped_compaction tables.
The hash_id has a unique index so that we can look up records based on
it (if it's available).
If the parquet file record has a partition_hash_id value, use that to
generate the object storage path instead of the partition_id.
The reconciler is a leftover from the Kafka-based write path. It doesn't
do anything anymore.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Move all the gRPC assembly into one single place: `ioxd_querier`. This
way `querier` no longer depends on `service_*` (except for
`service_common` which doesn't really implement gRPC but only the
namespace/database entry point).
Immutable `Box<Vec<T>>`/`Arc<Vec<T>>` are better stored as
`Box<[T]>`/`Arc<[T]>` because:
- allocation always exact (no need for `shrink_to_fit`)
- smaller (the fat pointer is just the memory address and the length, no
capacity required)
- less allocation (`Box`/`Arc` -> slice instead of `Box`/`Arc` -> `Vec`
-> buffer); in fact the vector itself was offen missing in the
accounting code
Found while I was working on #7987.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Avoid that the querier accesses files that were flagged for deletion a
long time ago. This would happen if the following conditions hold:
- we have very long-running querier pods (e.g. over holidays)
- the table doesn't receive any writes (or the partition if we ever
change the cache granularity), hence the querier is never informed
that its state is out-of-date
- a compactor runs a cold compaction, and by doing so flags a file for
deletion
- the GC finally wants to delete it
This is mostly a safety measure to prevent weird internal server errors
that should nearly never happen. On the other hand I do not want to hunt
Heisenbugs.
This is the major part of #7470. Additional clean ups (e.g. to remove
the actual types from `data_types`) will follow.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This commit fixes loads of crates (47!) had unused dependencies, or
mis-configured dependencies (test deps as normal deps).
I added the "unused_crate_dependencies" to all crates to help prevent
this mess from growing again!
https://doc.rust-lang.org/beta/nightly-rustc/rustc_lint_defs/builtin/static.UNUSED_CRATE_DEPENDENCIES.html
This has the minor downside of false-positives when specifying
dev-dependencies for test/bench binaries - these are files in /test or
/benches (not normal tests). This commit includes a workaround,
importing them in lib.rs (gated by a feature flag). I think the
trade-off of better dependency management is worth it!
Introduce a new header called `iox-debug` which when set enables certain
debug features. The first one will be the `system.queries` table which
is a process-local, namespace-scoped query log. In most prod setups this
is only useful for debugging and will confuse the user a lot because
when multiple queries are deployed then the K8s routing decides which
pod/process the users hits. This leads to an inconsistent view. However
the log is still useful for debugging.
This also wires the "debug header set" flag through the Flight ticket,
because JDBC proved (integration tests FTW!) that headers are only
passed to `GetFlightInfo` but not to `DoGet` and the ticket must encode
all the relevant information.
Closes#7119.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>