Abstract
========
We need to be careful w/ tombstone that fall exactly in sequence number range of a parquet file.
Current Bug
===========
Imagine the following order of events:
1. Router creates write at sequence number 1:
- `table,selector=1 payload=1 1`
- `table,selector=2 payload=2 2`
2. Ingester pulls write, waits a bit and persists it to parquet file 1:
- `table,selector=1 payload=1 1`
- `table,selector=2 payload=2 2`
4. Router creates write at sequence number 2:
- `table,selector=1 payload=3 3`
- `table,selector=2 payload=4 4`
5. Ingester pulls write
6. Router create delete at sequencer number 3: full time range, `selector=1`
7. Ingeser pulls delete and creates tombstone 1
8. Router creates write at sequence number 4:
- `table,selector=1 payload=5 5`
- `table,selector=2 payload=6 6`
9. Ingester pulls write
10. Ingester persists parquet file 2:
- `table,selector=2 payload=4 4`
- `table,selector=1 payload=5 5`
- `table,selector=2 payload=6 6`
When reading parquet file 2, the tombstone MUST NOT be applied. Otherwise `table,selector=1 payload=5 5` will be
deleted.
Notes
=====
Technically this issue also applies to files created by the compactor, however the compactor marks tombstones as
processed that fall into the sequence number range. It even does that in a single transaction:
fc4635a334/compactor/src/compact.rs (L821-L861)
Alternative
===========
An alternative solution would be if the ingester would mark tombstones that it materialized during persistence as
"processed" (tombstone 1 for parquet file 2 in the example above). However "processed" markers are currently a mere
optimization and don't affect correctness, which is nice for caching on the querier side as well as reasoning.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: Support `SHOW NAMESPACES` in sql repl
* feat: add basic support to clients
* fix: add get_namespaces service test
* fix: proper error handling
* test: end to end test for namespace client
* refactor: Use QuerierDatabase rather than Catalog
* refactor: remove unused function
This allows us to remove the table name from the low-level chunk
representations (like `ParquetFile`, RUB, ...) since table names are
already tracked by the higher-level data structures (e.g. catalog,
catalog chunk) that manage the low-level chunk representations.
This is similar to #4167.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The parquet chunk is always wrapped into some higher-level data
structure (e.g. a catalog chunk, a partition, ...) that knows exactly
"where" the chunk is located. There is no need for the parquet chunk to
back-reference container-level attributes. In the contrary:
double-bookkeeping makes the code more complex and costs additional
memory.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Namespaces are now created on demand and contain their full schema.
Tombstones/chunks are created on demand during the query.
Closes#4123.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: remove fully processed tombstones
* test: first few tests
* fix: delete SQL
* fix: test how IN (...) works in PG
* fix: test how IN (?) works in PG
* fix: test how IN (?) works in PG
* fix: dynamically add IN (?, ?, ...)
* fix: dynamically add IN (?, ?, ...) & its dynamic values
* fix: add argument directly in the SQL
* test: more tests for catalog read and update functions
* chore: move a subfunction to make it easier to read)
* test: first test for find_can_compact but disabled due to bug
* test: integration tests and a bug fix for find_and_compact
* chore: cleanup
* refactor: address review comments
* fix: put 2 delete processed tombstones and tombstones in a transaction
The sort key is optional and currently only produced by `iox_tests`.
Writing it within the ingester/compactor is tracked by #3968. The sort
key is read by the querier (and this will be verified by the query tests
and is required to merge #4103).
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This includes some type changes to dispatch between OG and NG and allows
some tests to be run against the NG querier. This only contains parquet
files though, so it's somewhat a limited scope.
For #3934.
* refactor: dyn-dispatch database in query subsystem
This is similar to #4080 but concerns the database itself.
For #3934.
* docs: improve wording
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: `TombstoneRepo::list_by_table`
* feat: `ParquetFileRepo::list_by_table_not_to_delete`
* refactor: `querier` w/o `db`
Get the `querier` to work w/o relying on `db`. A few notes:
- Testing is kinda shallow, we really need to get `query_tests` working
w/ `querier` (see #3934).
- We still run a sync loop for namespaces, tables and schemas. This will
be a replaced by "update namespace incl. tables and schemas on demand".
Note however that we cannot fetch single tables and schemas on demand
at the moment, because DataFusion doesn't implement async schema
inspection (only `scan` / "give me all the chunks" is async). I think
that's OK for now and we can address this later.
- There is NO cache for parquet files and tombstones at the moment. For
correctness, they need to be fetched in a single transaction (or we
need a kinda tricky sequence number / logical clock tracking) and I am
not sure yet how this makes sense when we have the ingester data wired
up and predicates pushed down to the catalog (see next point). So
let's measure first and then decide on a caching strategy for this.
- Predicates are currently NOT pushed down to the catalog. I'll need to
figure out how to extract time range from generic DataFusion
expressions to make that work (it's easier for InfluxRPC queries, but
they are not tested at the moment, see first point).
Sorry that this commit is kinda huge. I initially planned to only
migrate the chunks away from `db` and leave the tables and schemas for a
follow-up PR, but the DataFusion trait structure (chunks are bound to
their tables) makes this kinda pointless.
Closes#3974.
* docs: explain what we're doing
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* docs: mention tracking issues
* docs: explain what we're doing
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* feat: `TableRepo::get_by_namespace_and_name`
* refactor: rework `TableCache`
- dual cache that can also map table names to IDs
- deal w/ missing tables w/o panics
- set proper timeouts to missing data
For #3974.
* test: extend table cache tests
- this is what DataFusion is doing as well; it's also fast enough
because the number of chunks in a query is not THAT massive (it's not
like we are doing row-level dyn dispatching)
- it simplifies abstracting over different databases
- it allows us to drop our enum-based dispatching that we have for
`DbChunk` and that we would also need for the querier (e.g. depending
on if a chunk is backed by a parquet file or ingester data)
- it likely speeds up compile times because the `query` is no longer
contains massive amounts of generic code
For #3934.
* feat: add "dual" cache pattern
This will be useful for certain parts that are addressed internally via
ID but where the user-facing APIs use names.
For #3985.
* refactor: rework "dual" cache construct to be backend based
Pros:
- easiser to reason about the locking and consistency, esp. in
concurrent applications
Cons:
- we are not canceling running queries for the dual cache any longer
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
For OG we can determine the chunks w/o any IO, for NG however this might
require a few catalog queries.
This is likely not the last change of this sort, i.e. the whole schema
handling is currently sync as well.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Quite a few caches will request data from the catalog w/o knowing if it
exists (e.g. a table by name). We should have different TTLs for "exists"
and "unknown" w/o writing much boilerplate code.
For #3985.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
In theory on a multi-threaded tokio executor, the following could have
happened:
| Thread 1 | Thread 2 |
| --------------------- | ----------------------------------- |
| | Running query begin |
| | ... |
| | `loader.await` finished |
| `Cache::set` begin | |
| state locked | |
| | try state lock, blocking |
| running query removed | |
| ... | |
| state unlocked | |
| `Cache::set` end | |
| | state locked |
| | panic because running query is gone |
Another issue that could happen is if we:
1. issue a get request, loader takes a while, this results in task1
2. side-load data into the running query (task1 still running)
3. the underlying cache backend drops the result very quickly (task1
still running)
4. we request the same data again, resulting in yet another query task
(task2), task1 is still running at this point
In this case the original not-yet-finalized query task (task1) would
remove the new query task (task2) from the active query set, even
though task2 is actually not done.
We fix this by the following measures:
- **task tagging:** tasks are tagged so if two tasks for the same key
are running, we can tell them apart
- **task->backend propagation:** let the query task only write to the
underlying backend if it is actually sure that it is running
- **prefer side-loaded results:** restructure the query task to strongly
prefer side-loaded data over whatever comes from the loader
- **async `Cache::set`:** Let `Cache::set` wait until a running query
task completes. This has NO correctness implications, it's probably
just nicer for resource management.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Changes all consumers of the object store to use the dynamically
dispatched DynObjectStore type, instead of using a hardcoded concrete
implementation type.
* feat: `Cache::set`
This will be helpful to fill caches if we got the information from
somewhere else.
For #3985.
* docs: improve
Co-authored-by: Edd Robinson <me@edd.io>
* docs: explain lock gap
* feat: add debug log to `Cache`
Co-authored-by: Edd Robinson <me@edd.io>
* feat: `CacheBackend::as_any`
* refactor: add TTL cache backend
This is based on the new `AddressableHeap`, which simplifies the
implementation quite a lot.
For #3985.
* refactor: `TtlBackend::{update->evict_expired}`
* docs: exlain ttl cache eviction
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: add addressable heap for query cache
This will be used as a helper data structure for TTL and LRU. It's
probably not the most performant implementation but it's good enough for
now.
This is for #3985.
* fix: test + explain tie breaking in `AddressableHeap`
* feat: extract "backend" from querier cache
The backend will implement pruning policies like LRU and TTL as well as
where/how the data is stored. Having a proper interface for that
simplifies the implementation since we don't need to have one massive
`Cache` object with a super complex mechanism.
This is for #3985.
* refactor: `Backend` -> `CacheBackend`
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: querier test system, ground work
See #3985 for the motivation.
This introduces a cache system for the querier which can later be
extended to support the remaining features listed in #3985 (e.g.
metrics, LRU/TTL).
All current caches are wired up to go throw the new cache system. Once
we move away from (ab)using `db`, the set of caches will be different
but the system will remain.
* test: explain it
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
* refactor: simplify cache result broadcast
* refactor: introduce `Loader` crate
* fix: docs
* docs: explain why we manually drop removed hashmap entries
* docs: fix intra-doc link
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
- This is not used by the query engine at all.
- The query engine should not care about ALL chunks but only about the
chunks it gets via `QueryDatabase::chunks` (which includes a table
name and a predicate).
- All other users of that API are NOT really query-related.
- This was not actually used by the query engine.
- The query engine doesn't have a concept of a "partition", it only
cares about chunks.
- Unbound access to all partitions in the database is quite expensive
(esp. on NG).