It can be useful to call the IOx LP parser from other processes, for example from Go.
I used it to run an online comparison of the IOx and influxdb Go LP parsers in order to identify compatibility
issues.
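A minimal sketch of what the Rust side of such a bridge could look like, assuming a
C-ABI wrapper (the function name and error codes are hypothetical) that a Go process
can then call via cgo:

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical C-ABI entry point so non-Rust processes (e.g. Go via cgo)
/// can drive the IOx LP parser. Returns 0 on success, non-zero on error.
#[no_mangle]
pub extern "C" fn iox_lp_parse(input: *const c_char) -> i32 {
    // SAFETY: the caller must pass a valid, NUL-terminated C string.
    let bytes = unsafe { CStr::from_ptr(input) }.to_bytes();
    let text = match std::str::from_utf8(bytes) {
        Ok(t) => t,
        Err(_) => return 1, // invalid UTF-8
    };
    // `parse_lines` is the parser entry point in the `influxdb_line_protocol` crate.
    match influxdb_line_protocol::parse_lines(text).collect::<Result<Vec<_>, _>>() {
        Ok(_) => 0,  // all lines parsed
        Err(_) => 2, // parse error
    }
}
```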
Quick & dirty implementation of a RAM-pool split to see if this has any
effect. I expect the querier performance to improve due to this because
large read buffers can no longer evict precious metadata.
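A rough sketch of the idea, with hypothetical names, just to illustrate the split:

```rust
/// Hypothetical split of a single RAM budget into two independent pools, so
/// large read buffers can no longer push metadata out of memory.
struct QuerierMemory {
    /// Budget for cached catalog / parquet metadata, in bytes.
    metadata_pool: MemoryPool,
    /// Budget for read buffers; exhausting it cannot touch the metadata pool.
    read_buffer_pool: MemoryPool,
}

struct MemoryPool {
    limit_bytes: usize,
    used_bytes: usize,
}

impl MemoryPool {
    /// Reserve `bytes` from this pool only; never borrow from the other pool.
    fn try_reserve(&mut self, bytes: usize) -> bool {
        if self.used_bytes + bytes <= self.limit_bytes {
            self.used_bytes += bytes;
            true
        } else {
            false
        }
    }
}
```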
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This is what DataFusion uses by default and I don't see a reason why we
should use such small batch sizes.
The effect is probably only visible in certain filter-aggregate queries
that don't focus on a single series (because there we likely end up with
only 1 or 2 batches, esp. after #5250) for coarse-grained filters, esp.
when the filter key is not the first sort key.
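For reference, a sketch of how the batch size is configured on a DataFusion session
(8192 rows is DataFusion's default; the exact API surface may differ between versions):

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn make_context() -> SessionContext {
    // 8192 rows per batch is DataFusion's default; smaller batches only add
    // per-batch overhead to filter-aggregate plans that span many series.
    let config = SessionConfig::new().with_batch_size(8192);
    SessionContext::with_config(config)
}
```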
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Timestamp ranges come from "untrusted" inputs (via gRPC) and must not
lead to panics. The only case where this could happen is at `start >
end`. Let's just set `start = end` in this case. Reasoning:
- Semantically this is a sound range, since it is only a somewhat
degenerate case of "empty".
- We already allow `start = end` to represent "empty" ranges.
- We already clamp (and therefore modify) `start` to the valid range.
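A sketch of the resulting constructor (the `MIN`/`MAX_NANO_TIME` bounds below are
placeholders, not the real IOx constants):

```rust
// Placeholder bounds; the real IOx constants differ.
const MIN_NANO_TIME: i64 = i64::MIN;
const MAX_NANO_TIME: i64 = i64::MAX;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TimestampRange {
    start: i64,
    end: i64,
}

impl TimestampRange {
    fn new(start: i64, end: i64) -> Self {
        // Clamp both ends to the valid range first (we already modify `start`
        // this way anyway) ...
        let start = start.clamp(MIN_NANO_TIME, MAX_NANO_TIME);
        let end = end.clamp(MIN_NANO_TIME, MAX_NANO_TIME);
        // ... then fold the degenerate `start > end` case into the already
        // supported "empty" representation `start == end` instead of panicking.
        let start = start.min(end);
        Self { start, end }
    }
}
```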
Fixes https://github.com/influxdata/conductor/issues/1080.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: increase batch size when reading parquet
This reduced our overhead when reading parquet files quite a lot.
In an internal benchmark, this reduces the time to perform a single-series
aggregation of a rather large series with cold caches from 58s to 48s. No
real difference could be measured for warm caches (~21ms for both).
This should also help the compactor since the record batches should be
larger.
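A sketch of the reading side using the arrow-rs parquet reader (the concrete batch
size is an assumption; the point is simply "larger than before"):

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use std::fs::File;

fn read_with_large_batches(file: File) -> Result<(), Box<dyn std::error::Error>> {
    // Larger batches amortize the per-batch overhead when scanning parquet files.
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(8192) // assumed value
        .build()?;
    for batch in reader {
        let batch = batch?;
        // ... hand `batch` to the query / compaction pipeline ...
        let _ = batch.num_rows();
    }
    Ok(())
}
```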
* refactor: ensure that parquet row group size is in-sync
Ensure that we use the same row group size for reading and writing
parquet files. This is the same value as upstream currently uses as a
default, but let's make sure we don't diverge from that:
3032a521c9/parquet/src/file/properties.rs (L65)
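A sketch of pinning the writer side to that same value (1024 * 1024 rows is the
upstream default linked above):

```rust
use parquet::file::properties::WriterProperties;

// Keep reader and writer in sync: pin the row group size to the upstream
// default instead of silently inheriting whatever the crate ships with.
const ROW_GROUP_SIZE: usize = 1024 * 1024;

fn writer_props() -> WriterProperties {
    WriterProperties::builder()
        .set_max_row_group_size(ROW_GROUP_SIZE)
        .build()
}
```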
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
InfluxQL queries can send (technically incorrect) ranges like this, meaning 'all time'
but excluding the max nanosecond time.
Since this is an important case, we should handle it specially and use the optimized
'all time' handling for meta queries, even though this is technically wrong in that
it does not filter out column names / measurement names at MAX_NANO_TIME exactly.
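A sketch of the special case (the constants are placeholders for the real
`MIN`/`MAX_NANO_TIME` values):

```rust
// Placeholder bounds; the real IOx constants differ.
const MIN_NANO_TIME: i64 = i64::MIN;
const MAX_NANO_TIME: i64 = i64::MAX;

/// Hypothetical check: treat a half-open range that covers everything except
/// the very last nanosecond as 'all time', so meta queries can take the
/// optimized path described above.
fn covers_all_time(start: i64, end_exclusive: i64) -> bool {
    // `end_exclusive == MAX_NANO_TIME` misses only MAX_NANO_TIME itself; we
    // deliberately accept that imprecision for column / measurement names.
    start <= MIN_NANO_TIME && end_exclusive >= MAX_NANO_TIME
}
```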
Closes: https://github.com/influxdata/conductor/issues/1072
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: run many compact partitions in parallel
* refactor: Use Rust futures fu to run compactor jobs in parallel
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* fix: reduce log verbosity
* refactor: sleep for a sec if no work, print debug
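The parallelism boils down to the usual bounded-concurrency stream pattern; a sketch
with hypothetical names:

```rust
use futures::stream::{self, StreamExt};

/// Hypothetical driver: compact many partitions with bounded parallelism.
async fn compact_all(partition_ids: Vec<u64>, max_concurrent: usize) {
    stream::iter(partition_ids)
        .for_each_concurrent(max_concurrent, |id| async move {
            // `compact_partition` stands in for the real compactor job.
            if let Err(e) = compact_partition(id).await {
                // Keep logs terse; a failed partition shouldn't kill the loop.
                eprintln!("compaction of partition {id} failed: {e}");
            }
        })
        .await;
}

async fn compact_partition(_id: u64) -> Result<(), String> {
    Ok(()) // placeholder
}
```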
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: `QueryChunk::as_any` (see the sketch after this list)
* feat: allow `ChunkPruner::prune_chunks` to fail
* feat: limit per-table chunk data for every query
Closes #5211.
* fix: address review comments
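A sketch of the `as_any` pattern from the first bullet (the types are illustrative,
not the real IOx ones):

```rust
use std::any::Any;

/// The usual `as_any` pattern: expose the concrete type behind a trait object
/// so callers can downcast when they need implementation-specific data.
trait QueryChunk: Any {
    fn as_any(&self) -> &dyn Any;
}

struct ParquetChunk;

impl QueryChunk for ParquetChunk {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

fn inspect(chunk: &dyn QueryChunk) {
    if let Some(parquet) = chunk.as_any().downcast_ref::<ParquetChunk>() {
        // use parquet-specific details here
        let _ = parquet;
    }
}
```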
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: extend profiling guide
More tools.
* chore: fix docs lint for `localhost` links
* docs: do not duplicate tracing docs
* refactor: clean up `lint_docs` and strip anchors from relative links
* chore: always pass `ROARING_ARCH`
Always pass the `ROARING_ARCH` that we would use for our prod builds.
Otherwise this can easily be missed during testing, profiling or build
system changes (e.g. should we ever move away from our `Dockerfile`).
This feature was introduced with Rust/Cargo 1.56.
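A sketch of what that looks like via Cargo's `[env]` table (the Cargo 1.56 feature
mentioned above); the value is a placeholder, not necessarily what prod uses:

```toml
# .cargo/config.toml -- the `[env]` table requires Cargo 1.56+.
[env]
# Placeholder value; use whatever our prod builds pass.
ROARING_ARCH = "haswell"
```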
* docs: explain env passing
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
While I could not find evidence that these allocations are a problem,
the metadata and links of spans are rarely used, so we shouldn't pay for
them even for heavily traced applications.
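One possible shape, as a sketch (not the actual IOx span layout): keep the rarely
used fields behind an `Option<Box<_>>` so an empty span pays a single null pointer
instead of carrying the fields inline.

```rust
/// Hypothetical span layout: metadata and links are rarely populated, so they
/// live behind an `Option<Box<_>>`. For the common empty case this costs one
/// null pointer; the boxed fields are only allocated when actually used.
struct Span {
    name: &'static str,
    rare: Option<Box<SpanExtras>>,
}

struct SpanExtras {
    metadata: Vec<(String, String)>,
    links: Vec<SpanId>,
}

struct SpanId(u64);
```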
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: Split large compactions into multiple compacted files
Connects to #5121
* refactor: Extract update catalog function and error type
* refactor: Share physical plan to object store streaming
And only differ in the logical plan building based on split times in
different compaction cases.
* fix: Test for a split time equal to the max time and don't split then
* chore: cherry pick the first 3 commits of branch cn/connect-new-compaction
* fix: modify the test to work correctly with compactor running
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* test: Document the behavior of compute_split_time when min_time = max_time
* fix: compute_split_time returns one value when min_time = max_time
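A sketch of the fixed behavior (the signature and percentage handling are
assumptions, not the real compactor code):

```rust
/// Hypothetical `compute_split_time`: split `[min_time, max_time]` at
/// `split_percentage` of its span. When `min_time == max_time` there is
/// nothing to split, so return exactly one value instead of two equal ones.
fn compute_split_time(min_time: i64, max_time: i64, split_percentage: i64) -> Vec<i64> {
    if min_time == max_time {
        return vec![max_time];
    }
    let split = min_time + (max_time - min_time) * split_percentage / 100;
    if split == max_time {
        // A split time equal to max_time would yield an empty second file;
        // don't split then (see the fix above).
        vec![max_time]
    } else {
        vec![split, max_time]
    }
}
```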
Co-authored-by: NGA-TRAN <nga-tran@live.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>