This allows us to remove the table name from the low-level chunk
representations (like `ParquetFile`, RUB, ...) since table names are
already tracked by the higher-level data structures (e.g. catalog,
catalog chunk) that manage the low-level chunk representations.
This is similar to #4167.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The parquet chunk is always wrapped into some higher-level data
structure (e.g. a catalog chunk, a partition, ...) that knows exactly
"where" the chunk is located. There is no need for the parquet chunk to
back-reference container-level attributes. On the contrary:
double-bookkeeping makes the code more complex and costs additional
memory.
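A minimal sketch of the resulting layering, assuming hypothetical, heavily reduced types (`ParquetChunk`, `CatalogChunk` and their fields are illustrative and not the actual IOx definitions). Only the container knows where the chunk lives:

```rust
use std::sync::Arc;

/// Low-level chunk (hypothetical): holds data only, no table or partition name.
struct ParquetChunk {
    rows: usize,
}

/// Higher-level container (hypothetical): the single place that tracks the location.
struct CatalogChunk {
    table_name: Arc<str>,
    partition_key: Arc<str>,
    chunk: ParquetChunk,
}

impl CatalogChunk {
    /// Location info comes from the container, payload info from the wrapped chunk.
    fn describe(&self) -> String {
        format!(
            "{}/{}: {} rows",
            self.table_name, self.partition_key, self.chunk.rows
        )
    }
}
```

Since the name is stored once in the container, the chunk can be moved or re-wrapped without updating a second copy.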
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Namespaces are now created on demand and contain their full schema.
Tombstones/chunks are created on demand during the query.
Closes #4123.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: remove fully processed tombstones
* test: first few tests
* fix: delete SQL
* fix: test how IN (...) works in PG
* fix: test how IN (?) works in PG
* fix: test how IN (?) works in PG
* fix: dynamically add IN (?, ?, ...)
* fix: dynamically add IN (?, ?, ...) & its dynamic values
* fix: add argument directly in the SQL
* test: more tests for catalog read and update functions
* chore: move a subfunction to make it easier to read
* test: first test for find_can_compact but disabled due to bug
* test: integration tests and a bug fix for find_and_compact
* chore: cleanup
* refactor: address review comments
* fix: put the 2 deletes (processed tombstones and tombstones) in a transaction
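For illustration, a minimal sketch of how a dynamic `IN (...)` list can be generated with one placeholder per value. This shows the placeholder variant only, not necessarily what the final commit does. The function name and the `tombstone`/`id` names are assumptions made for this example; Postgres-style numbered placeholders (`$1`, `$2`, ...) are used:

```rust
/// Builds a `DELETE` statement with one positional placeholder per ID.
///
/// Table and column names (`tombstone`, `id`) are hypothetical.
/// Returns `None` for an empty list because `IN ()` is a syntax error in Postgres.
fn delete_by_ids_sql(ids: &[i64]) -> Option<String> {
    if ids.is_empty() {
        return None;
    }
    // Postgres numbers its placeholders starting at 1: $1, $2, ...
    let placeholders: Vec<String> = (1..=ids.len()).map(|i| format!("${}", i)).collect();
    Some(format!(
        "DELETE FROM tombstone WHERE id IN ({})",
        placeholders.join(", ")
    ))
}
```

The caller would then bind the IDs in the same order as the generated placeholders.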
The sort key is optional and currently only produced by `iox_tests`.
Writing it within the ingester/compactor is tracked by #3968. The sort
key is read by the querier (and this will be verified by the query tests
and is required to merge #4103).
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This includes some type changes to dispatch between OG and NG and allows
some tests to be run against the NG querier. This only covers parquet
files though, so the scope is somewhat limited.
For #3934.
* refactor: dyn-dispatch database in query subsystem
This is similar to #4080 but concerns the database itself.
For #3934.
* docs: improve wording
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: `TombstoneRepo::list_by_table`
* feat: `ParquetFileRepo::list_by_table_not_to_delete`
* refactor: `querier` w/o `db`
Get the `querier` to work w/o relying on `db`. A few notes:
- Testing is kinda shallow, we really need to get `query_tests` working
w/ `querier` (see #3934).
- We still run a sync loop for namespaces, tables and schemas. This will
be replaced by "update namespace incl. tables and schemas on demand".
Note however that we cannot fetch single tables and schemas on demand
at the moment, because DataFusion doesn't implement async schema
inspection (only `scan` / "give me all the chunks" is async). I think
that's OK for now and we can address this later.
- There is NO cache for parquet files and tombstones at the moment. For
correctness, they need to be fetched in a single transaction (or we
need a kinda tricky sequence number / logical clock tracking) and I am
not sure yet how this makes sense when we have the ingester data wired
up and predicates pushed down to the catalog (see next point). So
let's measure first and then decide on a caching strategy for this.
- Predicates are currently NOT pushed down to the catalog. I'll need to
figure out how to extract time range from generic DataFusion
expressions to make that work (it's easier for InfluxRPC queries, but
they are not tested at the moment, see first point).
Sorry that this commit is kinda huge. I initially planned to only
migrate the chunks away from `db` and leave the tables and schemas for a
follow-up PR, but the DataFusion trait structure (chunks are bound to
their tables) makes this kinda pointless.
Closes #3974.
* docs: explain what we're doing
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: mention tracking issues
* docs: explain what we're doing
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: `TableRepo::get_by_namespace_and_name`
* refactor: rework `TableCache`
- dual cache that can also map table names to IDs
- deal w/ missing tables w/o panics
- set proper timeouts for missing data
For #3974.
* test: extend table cache tests
- this is what DataFusion is doing as well; it's also fast enough
because the number of chunks in a query is not THAT massive (it's not
like we are doing row-level dyn dispatching)
- it simplifies abstracting over different databases
- it allows us to drop our enum-based dispatching that we have for
`DbChunk` and that we would also need for the querier (e.g. depending
on if a chunk is backed by a parquet file or ingester data)
- it likely speeds up compile times because `query` no longer
contains massive amounts of generic code
For #3934.
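A minimal sketch of the idea, assuming a hypothetical, heavily reduced `QueryChunk` trait (the real trait has far more methods). Consumers take trait objects, so there is neither an enum to match on nor a generic parameter to monomorphize:

```rust
use std::sync::Arc;

/// Hypothetical, reduced chunk interface.
trait QueryChunk {
    fn id(&self) -> u32;
    fn source(&self) -> &'static str;
}

/// Chunk backed by a parquet file (illustrative).
struct ParquetChunk {
    id: u32,
}

/// Chunk backed by ingester data (illustrative).
struct IngesterChunk {
    id: u32,
}

impl QueryChunk for ParquetChunk {
    fn id(&self) -> u32 {
        self.id
    }
    fn source(&self) -> &'static str {
        "parquet"
    }
}

impl QueryChunk for IngesterChunk {
    fn id(&self) -> u32 {
        self.id
    }
    fn source(&self) -> &'static str {
        "ingester"
    }
}

/// Non-generic consumer: compiled once, accepts any mix of chunk types.
fn describe(chunks: &[Arc<dyn QueryChunk>]) -> Vec<String> {
    chunks
        .iter()
        .map(|c| format!("{}:{}", c.source(), c.id()))
        .collect()
}
```

The cost is one virtual call per chunk-level method, which is paid per chunk and not per row.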
* feat: add "dual" cache pattern
This will be useful for certain parts that are addressed internally via
ID but where the user-facing APIs use names.
For #3985.
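A minimal sketch of the pattern, assuming a hypothetical `DualIndex` type with plain in-memory maps (no loading, locking or eviction). The point is that both directions are updated together so they never disagree:

```rust
use std::collections::HashMap;

/// Hypothetical two-way index between internal IDs and user-facing names.
#[derive(Debug, Default)]
struct DualIndex {
    by_id: HashMap<i64, String>,
    by_name: HashMap<String, i64>,
}

impl DualIndex {
    /// Inserts a mapping and drops stale entries in both directions.
    fn insert(&mut self, id: i64, name: &str) {
        // The ID may have had another name before.
        if let Some(old_name) = self.by_id.insert(id, name.to_string()) {
            self.by_name.remove(&old_name);
        }
        // The name may have pointed to another ID before.
        if let Some(old_id) = self.by_name.insert(name.to_string(), id) {
            if old_id != id {
                self.by_id.remove(&old_id);
            }
        }
    }

    /// Lookup for internal callers that address by ID.
    fn name(&self, id: i64) -> Option<&str> {
        self.by_id.get(&id).map(String::as_str)
    }

    /// Lookup for user-facing APIs that address by name.
    fn id(&self, name: &str) -> Option<i64> {
        self.by_name.get(name).copied()
    }
}
```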
* refactor: rework "dual" cache construct to be backend based
Pros:
- easier to reason about the locking and consistency, esp. in
concurrent applications
Cons:
- we are not canceling running queries for the dual cache any longer
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
For OG we can determine the chunks w/o any IO, for NG however this might
require a few catalog queries.
This is likely not the last change of this sort, i.e. the whole schema
handling is currently sync as well.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Quite a few caches will request data from the catalog w/o knowing if it
exists (e.g. a table by name). We should have different TTLs for "exists"
and "unknown" w/o writing much boilerplate code.
For #3985.
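One way this could look, as a sketch: a small policy trait that derives the TTL from the cached value, with a ready-made implementation for `Option<T>`. The names `TtlProvider` and `OptionalValueTtl` are assumptions made for this example:

```rust
use std::time::Duration;

/// Hypothetical TTL policy: picks a time-to-live based on the cached value.
trait TtlProvider<V> {
    /// `None` means "never expires".
    fn expires_in(&self, value: &V) -> Option<Duration>;
}

/// Policy for `Option<T>` values:
/// `Some(_)` = the entity exists, `None` = it was looked up but not found.
struct OptionalValueTtl {
    ttl_some: Option<Duration>,
    ttl_none: Option<Duration>,
}

impl<T> TtlProvider<Option<T>> for OptionalValueTtl {
    fn expires_in(&self, value: &Option<T>) -> Option<Duration> {
        match value {
            Some(_) => self.ttl_some,
            None => self.ttl_none,
        }
    }
}
```

A cache for "table by name" could then keep existing tables indefinitely while re-checking missing ones after a short interval, without per-cache boilerplate.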
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
In theory on a multi-threaded tokio executor, the following could have
happened:
| Thread 1 | Thread 2 |
| --------------------- | ----------------------------------- |
| | Running query begin |
| | ... |
| | `loader.await` finished |
| `Cache::set` begin | |
| state locked | |
| | try state lock, blocking |
| running query removed | |
| ... | |
| state unlocked | |
| `Cache::set` end | |
| | state locked |
| | panic because running query is gone |
Another issue that could happen is if we:
1. issue a get request, loader takes a while, this results in task1
2. side-load data into the running query (task1 still running)
3. the underlying cache backend drops the result very quickly (task1
still running)
4. we request the same data again, resulting in yet another query task
(task2), task1 is still running at this point
In this case the original not-yet-finalized query task (task1) would
remove the new query task (task2) from the active query set, even
though task2 is actually not done.
We fix this by the following measures:
- **task tagging:** tasks are tagged so if two tasks for the same key
are running, we can tell them apart
- **task->backend propagation:** let the query task only write to the
underlying backend if it is actually sure that it is running
- **prefer side-loaded results:** restructure the query task to strongly
prefer side-loaded data over whatever comes from the loader
- **async `Cache::set`:** Let `Cache::set` wait until a running query
task completes. This has NO correctness implications, it's probably
just nicer for resource management.
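The task tagging measure can be sketched as follows. This is a synchronous, single-threaded model with hypothetical names (`RunningQueries`, `start`, `finish`, `side_load`), not the actual implementation. It only shows why a stale task can no longer remove or overwrite a newer one:

```rust
use std::collections::HashMap;

/// Hypothetical bookkeeping for in-flight loader tasks.
#[derive(Default)]
struct RunningQueries {
    next_tag: u64,
    /// key -> tag of the task currently registered for that key
    running: HashMap<String, u64>,
}

impl RunningQueries {
    /// Registers a new task for `key` and returns its unique tag.
    fn start(&mut self, key: &str) -> u64 {
        let tag = self.next_tag;
        self.next_tag += 1;
        self.running.insert(key.to_string(), tag);
        tag
    }

    /// Called by a task when its loader finished.
    ///
    /// Returns `true` only if this exact task is still registered. Only then
    /// is it removed and allowed to write its result to the backend.
    /// A stale task gets `false` instead of a panic.
    fn finish(&mut self, key: &str, tag: u64) -> bool {
        if self.running.get(key) == Some(&tag) {
            self.running.remove(key);
            true
        } else {
            false
        }
    }

    /// Side-loading a value unregisters whatever task is running for `key`.
    fn side_load(&mut self, key: &str) {
        self.running.remove(key);
    }

    fn is_running(&self, key: &str) -> bool {
        self.running.contains_key(key)
    }
}
```

Replaying the second scenario: task1 starts, data is side-loaded, task2 starts, then task1 finishes. `finish` returns `false` for task1 and task2 stays registered.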
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes all consumers of the object store to use the dynamically
dispatched DynObjectStore type, instead of using a hardcoded concrete
implementation type.
* feat: `Cache::set`
This will be helpful to fill caches if we got the information from
somewhere else.
For #3985.
* docs: improve
Co-authored-by: Edd Robinson <me@edd.io>
* docs: explain lock gap
* feat: add debug log to `Cache`
Co-authored-by: Edd Robinson <me@edd.io>
* feat: `CacheBackend::as_any`
* refactor: add TTL cache backend
This is based on the new `AddressableHeap`, which simplifies the
implementation quite a lot.
For #3985.
* refactor: `TtlBackend::{update->evict_expired}`
* docs: explain ttl cache eviction
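A simplified sketch of TTL eviction, assuming a hypothetical `TtlStore` type. For brevity it scans all entries instead of using a heap, and the current time is passed in explicitly to keep the behavior deterministic:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL store with a single, fixed TTL.
struct TtlStore<V> {
    ttl: Duration,
    /// key -> (value, expiry time)
    entries: HashMap<String, (V, Instant)>,
}

impl<V> TtlStore<V> {
    fn new(ttl: Duration) -> Self {
        Self {
            ttl,
            entries: HashMap::new(),
        }
    }

    /// Stores `value` and records when it expires.
    fn set(&mut self, key: &str, value: V, now: Instant) {
        self.entries
            .insert(key.to_string(), (value, now + self.ttl));
    }

    /// Drops every entry whose expiry time has been reached.
    /// (Linear scan; a heap ordered by expiry time would avoid that.)
    fn evict_expired(&mut self, now: Instant) {
        self.entries.retain(|_, (_, expires_at)| *expires_at > now);
    }

    /// Evicts first, so an expired value is never returned.
    fn get(&mut self, key: &str, now: Instant) -> Option<&V> {
        self.evict_expired(now);
        self.entries.get(key).map(|(v, _)| v)
    }
}
```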
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: add addressable heap for query cache
This will be used as a helper data structure for TTL and LRU. It's
probably not the most performant implementation but it's good enough for
now.
This is for #3985.
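A minimal sketch of such a structure, built from a `BTreeSet` plus a `HashMap`. The method names and the tie-breaking rule (insertion order) are assumptions made for this example, not necessarily what the real `AddressableHeap` does:

```rust
use std::collections::{BTreeSet, HashMap};
use std::hash::Hash;

/// Min-heap ordered by `O` whose entries can also be found and removed by key `K`.
struct AddressableHeap<K, O> {
    /// (order, insertion counter, key); the counter breaks ties in insertion order.
    queue: BTreeSet<(O, u64, K)>,
    /// key -> (order, insertion counter), used to address entries in `queue`.
    index: HashMap<K, (O, u64)>,
    counter: u64,
}

impl<K, O> AddressableHeap<K, O>
where
    K: Clone + Eq + Hash + Ord,
    O: Clone + Ord,
{
    fn new() -> Self {
        Self {
            queue: BTreeSet::new(),
            index: HashMap::new(),
            counter: 0,
        }
    }

    /// Inserts the entry for `key`, replacing an existing one.
    fn insert(&mut self, key: K, order: O) {
        self.remove(&key);
        let n = self.counter;
        self.counter += 1;
        self.queue.insert((order.clone(), n, key.clone()));
        self.index.insert(key, (order, n));
    }

    /// Removes the entry for `key` and returns its order if it was present.
    fn remove(&mut self, key: &K) -> Option<O> {
        let (order, n) = self.index.remove(key)?;
        self.queue.remove(&(order.clone(), n, key.clone()));
        Some(order)
    }

    /// Returns the smallest entry without removing it.
    fn peek(&self) -> Option<(&K, &O)> {
        self.queue.iter().next().map(|(o, _, k)| (k, o))
    }

    /// Removes and returns the smallest entry.
    fn pop(&mut self) -> Option<(K, O)> {
        let first = self.queue.iter().next()?.clone();
        self.queue.remove(&first);
        self.index.remove(&first.2);
        Some((first.2, first.0))
    }
}
```

For TTL the order would be the expiry time, for LRU the last access time. In both cases "remove by key" is what a plain binary heap lacks.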
* fix: test + explain tie breaking in `AddressableHeap`
* feat: extract "backend" from querier cache
The backend will implement pruning policies like LRU and TTL as well as
where/how the data is stored. Having a proper interface for that
simplifies the implementation since we don't need to have one massive
`Cache` object with a super complex mechanism.
This is for #3985.
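A minimal sketch of what such an interface could look like. The trait below is a reduced, hypothetical version (the method set and associated type names are assumptions), with a plain `HashMap` as the simplest backend that never prunes:

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::hash::Hash;

/// Hypothetical, reduced storage interface for a cache.
trait CacheBackend: Debug {
    type Key;
    type Value;

    /// Returns the value if the backend still holds it.
    fn get(&mut self, k: &Self::Key) -> Option<Self::Value>;

    /// Stores a value; a pruning backend may evict other entries here.
    fn set(&mut self, k: Self::Key, v: Self::Value);

    /// Removes the value if present.
    fn remove(&mut self, k: &Self::Key);
}

/// Simplest backend: keeps everything forever.
impl<K, V> CacheBackend for HashMap<K, V>
where
    K: Debug + Eq + Hash,
    V: Clone + Debug,
{
    type Key = K;
    type Value = V;

    fn get(&mut self, k: &K) -> Option<V> {
        // Call the inherent `HashMap` method explicitly to avoid recursion.
        HashMap::get(self, k).cloned()
    }

    fn set(&mut self, k: K, v: V) {
        self.insert(k, v);
    }

    fn remove(&mut self, k: &K) {
        HashMap::remove(self, k);
    }
}
```

A TTL or LRU policy can then be written as another implementation that wraps an inner backend, leaving the request handling in `Cache` unchanged.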
* refactor: `Backend` -> `CacheBackend`
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: querier test system, ground work
See #3985 for the motivation.
This introduces a cache system for the querier which can later be
extended to support the remaining features listed in #3985 (e.g.
metrics, LRU/TTL).
All current caches are wired up to go through the new cache system. Once
we move away from (ab)using `db`, the set of caches will be different
but the system will remain.
* test: explain it
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
* refactor: simplify cache result broadcast
* refactor: introduce `Loader` crate
* fix: docs
* docs: explain why we manually drop removed hashmap entries
* docs: fix intra-doc link
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
- This is not used by the query engine at all.
- The query engine should not care about ALL chunks but only about the
chunks it gets via `QueryDatabase::chunks` (which includes a table
name and a predicate).
- All other users of that API are NOT really query-related.
- This was not actually used by the query engine.
- The query engine doesn't have a concept of a "partition", it only
cares about chunks.
- Unbound access to all partitions in the database is quite expensive
(esp. on NG).
* refactor: wire execution context to Deduplicator
* feat: example trace to chunk read_filter
* refactor: make execution context required
* refactor: expose metadata API
* refactor: more span context for chunk read_filter
* refactor: fix build
* refactor: push context into result stream
* refactor: make executor optional
* feat: add `success` column to system.queries
* refactor: Remove lifetime from QueryCompletedToken and thread through flight
* test: update test to make incomplete query clearer
* refactor: use better pattern to set complete
* fix: logical merge conflict
Before adding more and more features, here is a bit of a clean up and
prep work:
- Pull out caching into its own module and add proper tests for it.
- Start to build a test infrastructure so tests are shorter and easier
to read. This doesn't fully pay off just yet but gets more and more
important when we actually sync tables and chunks.
* feat: skeleton of querier CLI
* chore: wrap metrics in opt&arc in querier to satisfy new api
* chore: derive debug in querier handler
* chore: add join handles and their shutdown to nascent querier server
* chore: querier server http unimpl -> 404
* fix: join/shutdown fix in querier; removed unused delegates
* feat: Add a way to run ingester with an in-memory catalog from the CLI
If you set the --catalog-dsn string to "mem", rather than using that as
a Postgres connection URL, create an in-memory catalog.
Planning on using this in tests, so not documenting.
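The dispatch described above can be sketched as follows. `CatalogConfig` and `parse_catalog_dsn` are hypothetical names for this example, and the actual catalog construction is left out:

```rust
/// Hypothetical result of interpreting the `--catalog-dsn` value.
#[derive(Debug, PartialEq)]
enum CatalogConfig {
    /// In-memory catalog, selected by the special value "mem".
    Memory,
    /// Anything else is treated as a Postgres connection URL.
    Postgres { dsn: String },
}

/// Interprets the DSN string: "mem" selects the in-memory catalog.
fn parse_catalog_dsn(dsn: &str) -> CatalogConfig {
    if dsn == "mem" {
        CatalogConfig::Memory
    } else {
        CatalogConfig::Postgres {
            dsn: dsn.to_string(),
        }
    }
}
```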
* fix: Set default topic to the same value as SHARED_KAFKA_TOPIC
Namely, both should use an underscore. I don't think there's a way to
directly share these values between a constant and an annotation.
* feat: Add a flight API (handshake only) to ingester
* fix: Create partitions if using file-based write buffer
* fix: Change the server fixture to handle ingester server type
For now, the ingester doesn't implement the deployment API. Not sure if
it should or not.
* feat: Start implementing ingester do_get, namely decoding the query
Skip serialization of the predicate for the moment.
* refactor: Rename ingest protos to ingester to match crate name
* refactor: Rename QueryResults to QueryData
* feat: Move ingester flight client to new querier crate
* fix: Off by one error, different starting indexes in sequencers
* fix: Create new CLI argument to pick the catalog type
* fix: Create a CLI option to set the number of topics to auto-create in the write buffer
* fix: Check the arrow flight service's health to tell that the ingester gRPC is up
* fix: Set postgres as the default catalog type
* fix: Return an error rather than panicking if CLI args aren't right