This commit removes tombstone support from the ingester, and deletes
associated code/helpers/tests. This commit does NOT remove tombstone
support from any other service, but MAY include removing overlapping
test coverage.
This also removes the tombstone support from the Ingester -> Querier RPC
response message.
This has the nice side effect of removing a whole lot of thread spawning
in the ingester tests for the Executor, speeding everything up!
* feat: initial step to identify where the projection should be provided
* feat: start getting columns of all expressions
* chore: format
* test: test for the table_chunk_stream
* fix: fix a compile error. Thanks @alamb
* test: full tests for table_chunk_stream
* chore: cleanup
* fix: do not cut any columns in case all fields are needed
* test: add one more test case of reading all columns
* refactor: move code that identify columbs ot push down to a function. Add the use of field_columns
* chore: cleanup
* refactor: make sream_from_batch support empty batches
* chore: cleanup
* chore: fix clippy after auto merge
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
- treat OOM protection as "resource exhausted"
- use `DataFusionError` in more places instead of opaque `Box<dyn Error>`
- improve conversion from/into `DataFusionError` to preserve more
semantics
Overall, this improves our error handling. DF can now return errors like
"resource exhausted" and gRPC should now automatically generate a
sensible status code for it.
Fixes#5799.
* feat: send only needed projection columns from querier to ingester in case of normal SQL queries
* refactor: push column index down until we need to convert them strings
* fix: make the test deterministic
* test: test for the projection pushdown
* test: add asserts for the proj pushdown test
* test: implement projection pushdown for partitions of MockIngesterConnection
* chore: cleanup
* chore: address review comments
* chore: Apply suggestions from code review
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: address review comments
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore: Upgrade to Rust 1.64
* fix: Use iter find instead of a for loop, thanks clippy
* fix: Remove some needless borrows, thanks clippy
* fix: Use then_some rather than then with a closure, thanks clippy
* fix: Use iter retain rather than filter collect, thanks clippy
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: do not run de-dup in ingester for querier requests
This removes the entire de-dup logic from the inegster for querier
requests. Furthermore, it even removes the entire datafusion execution
from the querier and just dumps the in-memory record batches as quickly
as possible. No filters are applied. Note that even prior to this PR,
we've never applied projections (tracked by #5624).
**Pros:**
- speed up query planning within the querier (since we need the ingester
response for state reconciling)
- lowered ingester CPU load
**Cons:**
- more querier<>ingester network traffic
Closes#5602.
* test: extend query test case
* fix: ingester tests
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: arc the cached table
* refactor: use cheaper hash keys for projected schemas
Instead of using the column names to address projected schemas, let's
use the column IDs.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix: retry ingester requests faster
The retries introduced in #5695 are too slow and block the entire
querier for minutes (until the very long gRPC timeout kicks in).
* fix: add error details on why the query planning failed
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: retry querier->ingester requests
Esp. for InfluxRPC requests that scan multiple tables, it may be that
one ingester requests fails. We shall retry that request instead of
failing the entire query.
* refactor: improve docs
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* fix: less foo
* docs: remove outdated TODO
* test: assert that panic happened
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: improve consistent access under "remove if"
With all the concurrency introduced in #5668, we should be a bit more
careful with our "remove if" handling, esp. if a removal is triggered
while a load is running concurrently. This change introduces as
`remove_if_and_get` helper that ensures this and the querier over to use
it. The parquet file and tombstone caches required a bit of a larger
change because there the invalidation and the actual GET were kinda
separate. We had this separation for the other caches as well at some
point and decided that this easily leads to API misuse, so I took this
opportunity to "fix" the parquet file and tombstone cache as well.
* docs: improve
* feat: split "pruned" metric into "early" and "late"
* docs: improve
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: explain `PruningMetrics`
* test: try to test pruning
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Create chunks in querier concurrently after we've pre-filtered them.
Chunk creation still may require a bit of cached information (e.g. the
partition sort key) and we can easily fetch these concurrently instead
of in order.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This should lower catalog load and eliminate a few costly cache misses.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
In our data model, a chunk always belongs to a partition[^1], so let's
not make this attribute optional. The optional value only leads to
-- mostly surprising -- conditional behavior, ranging from "do not equalize
the partition sort key" (querier) to "always consider the chunk overlapping"
(iox_query when dealing with ingester chunks).
[^1]: This is even true when the chunk belongs to a parquet file that is not
yet added to the catalog, contrary to what a comment in the ingester
stated. The catalog and data model used by the querier are two totally
different things.
* refactor: read querier parquet files from cache
* refactor: only use parquet files in querier (no RB)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* ci: use same feature set in `build_dev` and `build_release`
* ci: also enable unstable tokio for `build_dev`
* chore: update tokio to 1.21 (to fix console-subscriber 0.1.8
* fix: "must use"
Remove our own hand-rolled logic and let DataFusion read the parquet
files.
As a bonus, this now supports predicate pushdown to the deserialization
step, so we can use parquets as in in-mem buffer.
Note that this currently uses some "nested" DataFusion hack due to the
way the `QueryChunk` interface works. Midterm I'll change the interface
so that the `ParquetExec` nodes are directly visible to DataFusion
instead of some opaque `SendableRecordBatchStream`.
* refactor: do not override parquet file size in querier
This is going to be an issue when we actually rely on the size for
reading, see #5531.
* refactor: use selected file size mocking in compactor
Do not blindly override parquet file sizes for all subsystems.
This is going to be an issue when we actually rely on the size for
reading, see #5531.
* refactor: remove ability to override file sizes in catalog
Blindly overriding data for all subsystems is dangerous, because some
parts of our stack actually rely on the actual file size. See #5531.
* docs: explain `size_overrides`
The API user may still use a `Box<dyn ...>` if they want, but they
technically don't have to.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The API user still CAN use dynamic dispatch but doesn't have to. This
also simplifies the generics a bit.
This is similar to #5520.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This removes some `Box<dyn ...>` indirection when the user doesn't want
it (you still can, but don't have to) and makes the whole type handling
easier to understand.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
1. Cache converted schema instead of catalog schema. This safes a buch
of memcopies during conversion.
2. Simplify creation of new chunks, we now only need a `CachedTable`
instead of a namespace and a table schema.
In an artificial benchmark, this removed around 10ms from the query
(although that was prior to #5467 which moved schema conversion one
level up). Still I think it is the cleaner cache design.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: use a single timestamp in policy backend
Prior to this PR we had at least 1 `TimeProvider::now` calls per GET
request (for caches that only used LRU) and up to 3 calls (caches with
LRU + refresh + TTL). Let's instead use a single timestamp that is
created by the policy backend itself (instead of the policies). This has
the following consequences:
- **efficiency:** `SystemProvider::now` is not free, even though under Linux
this doesn't result in a syscall, it uses the stdlib time system which
also checks for monotonicity
- **consistency:** All changes for a single trigger (e.g. a
GET cache call) now use a single timestamp instead of slightly
increasing ones. I argue this is the better semantic, simpler to
understand and better to debug.
For some (slightly artificial) local performance experiment, this shaves
off around 2ms per single-table SQL query. However I expect that there might
be more degenerated cases (e.g. multi-table SQL queries or some
InfluxRPC requests that hit multiple tables).
The majority of this patch is moving the `TimeProvider` from the
policies into the policy backend.
* docs: explain `now` parameter
* fix: hoist repeated computation out of chunk creation
We have hundreds of chunks per table, so it is beneficial to only
do common work once.
* chore: remove TableCache as it is no longer used
* fix: prune chunks both before and after metadata fetch
Fetching the metadata for all the chunks in a table is expensive,
especially when we have a narrow time range query that only
needs a few chunks.
* chore: fix clippy
* fix: fix up some last tests
* fix: review comments
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This doesn't really need to be fallible but forces propagation of a ton
of error handling - no shards is always a sign of something being very
wrong, and can be caught in the caller if it's for some reason an
acceptable state / can be recovered from.