* refactor: concurrent table scan in "field columns"
Similar to #5647 and #5649.
* docs: improve
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Instead of passing the ShardId into each function for child nodes of the
Shard, store it. This avoids the possibility of mistakenly passing the
wrong value.
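To illustrate the idea, here is a minimal sketch of a child node that keeps its `ShardId` rather than taking it as an argument. `TableData`, its fields, and `describe` are hypothetical names chosen for the example and are not taken from the actual change; only `ShardId` appears in the description above.

```rust
/// Hypothetical identifier type standing in for the real `ShardId`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ShardId(pub i64);

/// Hypothetical child node of a shard. It stores the `ShardId` it was
/// created with instead of receiving one on every method call.
#[derive(Debug)]
pub struct TableData {
    shard_id: ShardId,
    table_name: String,
}

impl TableData {
    /// The `ShardId` is supplied exactly once, at construction time.
    pub fn new(shard_id: ShardId, table_name: impl Into<String>) -> Self {
        Self {
            shard_id,
            table_name: table_name.into(),
        }
    }

    /// Previously this would have been `describe(&self, shard_id: ShardId)`.
    /// With the parameter gone, a caller cannot pass a mismatched value.
    pub fn describe(&self) -> String {
        format!("table {} on shard {}", self.table_name, self.shard_id.0)
    }

    /// Read access to the stored ID.
    pub fn shard_id(&self) -> ShardId {
        self.shard_id
    }
}
```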
* refactor: concurrent table planning in InfluxRPC
Some InfluxRPC requests can scan multiple tables. Prior to this PR we were
always scanning the tables in sequence, adding up potential latencies
(catalog, ingester, object store). There is no reason we need to do this;
"ordinary" SQL queries would not serialize this way either.
So let's scan tables concurrently. This adds concurrency to:
- read filter
- read group
- read window aggregate
There are other query types that could benefit from a similar treatment.
They will be changed in a follow-up.
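The shape of the change can be sketched roughly as follows. This is not the code from the PR: it uses scoped OS threads from the standard library to stay dependency-free, whereas the real planner is async. `scan_table` and `scan_tables_concurrently` are hypothetical names; `CONCURRENT_TABLE_JOBS` is the limit named in this PR.

```rust
use std::sync::Mutex;
use std::thread;

/// Upper bound on the number of tables scanned at the same time.
const CONCURRENT_TABLE_JOBS: usize = 10;

/// Hypothetical stand-in for scanning one table (catalog, ingester and
/// object store round trips). Here it just returns the name length.
fn scan_table(table: &str) -> usize {
    table.len()
}

/// Scans all tables with at most `CONCURRENT_TABLE_JOBS` scans in flight.
/// Results are sorted by table name so the output is deterministic.
pub fn scan_tables_concurrently(tables: &[&str]) -> Vec<(String, usize)> {
    // Shared work queue: each worker pulls the next table name from it.
    let next = Mutex::new(tables.iter());
    let results = Mutex::new(Vec::with_capacity(tables.len()));
    let workers = CONCURRENT_TABLE_JOBS.min(tables.len());

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // The queue lock is released at the end of this statement,
                // so the scan itself runs without holding it.
                let table = match next.lock().unwrap().next() {
                    Some(t) => *t,
                    None => break,
                };
                let rows = scan_table(table);
                results.lock().unwrap().push((table.to_string(), rows));
            });
        }
    });

    let mut out = results.into_inner().unwrap();
    out.sort();
    out
}
```

With sequential scanning the total latency is the sum over all tables; with a bounded pool like this it is closer to the slowest scan per group of `CONCURRENT_TABLE_JOBS` tables.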
* docs: improve
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* test: explain `Send` assertion
* refactor: change `CONCURRENT_TABLE_JOBS` to 10
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: read querier parquet files from cache
* refactor: only use parquet files in querier (no RB)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* ci: use same feature set in `build_dev` and `build_release`
* ci: also enable unstable tokio for `build_dev`
* chore: update tokio to 1.21 (to fix console-subscriber 0.1.8)
* fix: "must use"
This limit restricts a single partition to containing at most N rows
before it is marked for persistence (note: being marked for persistence
does not currently prevent further ingest for that partition.)
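A minimal sketch of the described behaviour, assuming the mark is set once the row count exceeds the limit (whether the real check is `>` or `>=` is not stated here). `PartitionBuffer` and its methods are hypothetical names used only for illustration.

```rust
/// Hypothetical in-memory buffer for a single partition.
#[derive(Debug)]
pub struct PartitionBuffer {
    /// The configured limit "N".
    max_rows: usize,
    row_count: usize,
    marked_for_persist: bool,
}

impl PartitionBuffer {
    pub fn new(max_rows: usize) -> Self {
        Self {
            max_rows,
            row_count: 0,
            marked_for_persist: false,
        }
    }

    /// Buffers `rows` more rows. The write is always accepted: being
    /// marked for persistence does not block further ingest.
    pub fn write(&mut self, rows: usize) {
        self.row_count += rows;
        // Assumption: the mark is set when the limit is exceeded.
        if self.row_count > self.max_rows {
            self.marked_for_persist = true;
        }
    }

    pub fn row_count(&self) -> usize {
        self.row_count
    }

    pub fn is_marked_for_persist(&self) -> bool {
        self.marked_for_persist
    }
}
```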
1. Cache converted schema instead of catalog schema. This saves a bunch
of memcopies during conversion.
2. Simplify creation of new chunks: we now only need a `CachedTable`
instead of a namespace and a table schema.
In an artificial benchmark, this removed around 10ms from the query
(although that was prior to #5467 which moved schema conversion one
level up). Still, I think it is the cleaner cache design.
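As a rough sketch of the design, assume the cache converts the catalog schema once on a miss and afterwards hands out a shared, already-converted schema. Only `CachedTable` is a name from this change; `CatalogSchema`, `Schema`, `TableCache` and `get_or_load` are hypothetical and heavily simplified.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical catalog-side schema: (column name, catalog type code).
pub struct CatalogSchema {
    pub columns: Vec<(String, i16)>,
}

/// Hypothetical query-side schema obtained by converting a `CatalogSchema`.
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub column_names: Vec<String>,
}

/// Cache entry holding the already-converted schema.
pub struct CachedTable {
    pub schema: Arc<Schema>,
}

#[derive(Default)]
pub struct TableCache {
    tables: HashMap<String, Arc<CachedTable>>,
    /// Counts how often a conversion actually ran.
    conversions: usize,
}

impl TableCache {
    /// Returns the cached entry, converting the catalog schema only on a miss.
    pub fn get_or_load(
        &mut self,
        table: &str,
        load: impl FnOnce() -> CatalogSchema,
    ) -> Arc<CachedTable> {
        if let Some(cached) = self.tables.get(table) {
            // Hit: hand out a cheap `Arc` clone, no conversion and no copy.
            return Arc::clone(cached);
        }
        let catalog = load();
        self.conversions += 1;
        let schema = Schema {
            column_names: catalog.columns.into_iter().map(|(name, _)| name).collect(),
        };
        let cached = Arc::new(CachedTable {
            schema: Arc::new(schema),
        });
        self.tables.insert(table.to_string(), Arc::clone(&cached));
        cached
    }

    pub fn conversions(&self) -> usize {
        self.conversions
    }
}
```

A chunk can then be built from the `Arc<CachedTable>` alone, which is the simplification item 2 refers to.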
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: Update datafusion pin
* chore: Update now that user is a reserved word
* chore: Update cargo.lock
* fix: update query for user function
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* test: workaround for time > a number
* chore: cargo update
* chore: Revert "chore: cargo update"
This reverts commit 0798e4e14674267ddd2308b12a25031fc35de8b6.
This doesn't really need to be fallible, and making it fallible forces
propagation of a ton of error handling. Having no shards is always a sign
of something being very wrong, and can be caught in the caller if it is
for some reason an acceptable state that can be recovered from.
* test: add tests for regex_match_on_field
* feat: more general `_field` predicate handling
* fix: remove old comment
* fix: update tests
* fix: improve test a little more
* fix: fmt
* fix: Update predicate/src/rpc_predicate/field_rewrite.rs
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
* fix: Handle predicates that can not be evaluated
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: make querier RAM pool split a proper feature
- use proper pool names
- expose sizing via CLI/env
Closes https://github.com/influxdata/conductor/issues/1102.
* refactor: improve naming and docs
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This is what DataFusion uses by default and I don't see a reason why we
should use such small batch sizes.
The effect is probably only visible in certain filter-aggregate queries
with coarse-grained filters that don't focus on a single series (because
for a single series we likely end up with only 1 or 2 batches, esp. after
#5250), esp. when the filter key is not the first sort key.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
InfluxQL queries can send (technically incorrect) ranges like this, meaning all time
but excluding the max nanosecond time.
Since this is an important case, we should handle it specially and use the optimized
'all time' handling for meta queries even though this is technically wrong in that
it does not filter out column names / measurement names at MAX_NANO_TIME exactly.
Closes: https://github.com/influxdata/conductor/issues/1072
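The special case might look roughly like the sketch below. The constant values are assumptions (the usual InfluxDB nanosecond bounds), and `TimestampRange` with a half-open `[start, end)` interpretation, `contains_all_strict` and `contains_all` are hypothetical names for illustration only.

```rust
/// Assumed bounds of the valid nanosecond timestamp range.
pub const MIN_NANO_TIME: i64 = i64::MIN + 2;
pub const MAX_NANO_TIME: i64 = i64::MAX - 1;

/// Hypothetical half-open time range `[start, end)`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimestampRange {
    pub start: i64,
    pub end: i64,
}

impl TimestampRange {
    /// Strictly correct check: the range must include `MAX_NANO_TIME` itself,
    /// so the exclusive end has to lie beyond it.
    pub fn contains_all_strict(&self) -> bool {
        self.start <= MIN_NANO_TIME && self.end > MAX_NANO_TIME
    }

    /// Relaxed check: an exclusive end of exactly `MAX_NANO_TIME` is also
    /// treated as "all time", even though it technically leaves out rows
    /// at that one timestamp.
    pub fn contains_all(&self) -> bool {
        self.start <= MIN_NANO_TIME && self.end >= MAX_NANO_TIME
    }
}
```

A meta query would then take the optimized "all time" path whenever `contains_all` returns true.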
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: `QueryChunk::as_any`
* feat: allow `ChunkPruner::prune_chunks` to fail
* feat: limit per-table chunk data for every query
Closes #5211.
* fix: address review comments
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* fix: Fix SeriesKey sort order for special _measurement and _field
* fix: Update expected test output
* fix: Update more tests
* fix: Re-sort tag key when using binary encoding
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
There were some instances where we forgot to pass context (and therefore
tracing) information to `InfluxRpcPlanner`. This removes the `Default`
implementation and requires a context to always be passed when creating
`InfluxRpcPlanner`, to prevent this type of bug.
Ref #5129.
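The pattern can be sketched as follows. `ExecutionContext` and `Planner` are simplified, hypothetical stand-ins for the real context type and `InfluxRpcPlanner`.

```rust
/// Hypothetical stand-in for the context that carries tracing information.
#[derive(Debug, Clone)]
pub struct ExecutionContext {
    pub span_name: String,
}

/// Deliberately has no `#[derive(Default)]` and no `impl Default`, so
/// `Planner::default()` does not compile and the context cannot be
/// silently omitted.
#[derive(Debug)]
pub struct Planner {
    ctx: ExecutionContext,
}

impl Planner {
    /// The only constructor; the caller must supply a context.
    pub fn new(ctx: ExecutionContext) -> Self {
        Self { ctx }
    }

    /// Name of the tracing span this planner reports under.
    pub fn span_name(&self) -> &str {
        &self.ctx.span_name
    }
}
```

The compiler then rejects any call site that forgets the context, instead of the omission only showing up later as missing traces.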
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>