The old construct uses a single assert-statement for both:
- "bubble-up" scenario, were a panic should fire
- a check, were a panic should not fire
That makes it easy to add new tests. However we need two rather
questionable things to make that work:
- catch panic: to convert an assertion to a check
- a custom panic hook: to make tests not overly verbose (aka caught
panics should not show up on stdout)
Esp. the custom panic hook doesn't work too well w/ multi-threaded tests
since it might swallow error messages from unrelated tests and makes
debugging of CI failures hard.
So instead of using assertions for checks, we now implement a proper
assertion and a check for each test. That's a bit more code per check
but easier probably more stable.
First step towards #2518. Creates the Rust API to communicate delete
predicates between the preserved catalog and the in-memory catalog and
adds tests ensuring that the in-mem catalog produces the wanted errors
as well as correct checkpoints (similar to how this is done for the
parquet file tracking already).
**This does NOT contain the actual preservation!**
The motivations are:
1. The API uses a SINGLE predicate and adds that to many chunks. With
`Arc<Vec<...>>` you gain nothing, with `Vec<Arc<...>>` the predicate
is only stored once (in many vectors)
2. While we currently add predicates blindly to all chunks, we can be way
smarter in the future and prune out tables, partitions or even single
chunks (based on statistics). With that, it will be rare that many
chunks share the exact same set of predicates.
3. It would be nice if we could de-duplicate predicates when writing them
to the preserved catalog without needing to repeat the pruning
discussed in point 2. This is way easier to implement whan chunks
exists in `Arc`s.
4. As a side-note: the `Arc<Vec<...>>` wasn't really cloned around but
instead was created many time. So the new version should be more
memory efficient out of the box.
* fix: Capture query execution traces for storage gRPC queries as well
* refactor: remove debugging droppings
* refactor: do not Box::pin within TracedStream
* refactor: Use Futures::TryStreamExt rather than custom collect function
* fix: remove wild println
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This makes it clearer which traits and functions users of the preserved
catalog must implement. This also splits the error types into smaller
enums that are easier to understand.
This change should make it easier to implement new functionality (like
capturing delete predicates).
Instead of depending on the `chrono` clock implicitly via `Utc::now()`
we should make it an explicit parameter. This is essential for testing:
1. We have many tests guessing around this value by taking `Utc::now()`
directly before or after the write (this commit doesn't fix that, but
allows us to fix it in a follow-up).
2. We have some tests that ignore the `time_of_write` values in some
comparisons because they cannot control that value (fix not included
here but left as a follow-up).
3. The upcoming compression (#2528) needs to control timestamps within
the compressed payload (and `time_of_write` is embedded in the
parquet metadata) because the compressed size depends on it (even if
the uncompressed size is stable).
In general I argue that a "clock" is always data and should be passed
(either as a value or as a "now"-function) from the API layer. Hidden
clock checks just make mocking and tests a nightmare (we've seen this w/
replay tests as well).
Two reasons:
1. I wanna decouple `parquet_file` from `query` (nearly done, needs a
small follow-up PR).
2. `predicate` will have more and more features (like serialization)
which justifies a new home