Store the "maximum persisted timestamp" instead of the "minimum
unpersisted timestamp". This avoids the need to calculate the next
timestamp from the current one (which was done via "max TS + 1ns").
The old calculation was prone to overflow panics. Since the
timestamps in this calculation originate from user-provided data (and
not the wall clock), this was an easy DoS vector that could be triggered
via the following line protocol:
```text
table_1 foo=1 <i64::MAX>
```
which is
```text
table_1 foo=1 9223372036854775807
```
Bonus points: the timestamp persisted in the partition
checkpoints is now the very same that was used by the split query during
persistence. Consistence FTW!
Fixes#2225.
* fix: nocache feature code rot
The MBChunk::snapshot code when using the "nocache" option no longer
compiles - this commit updates it to match the not(nocache) code.
* build: use updated broken_intra_doc_links name
The broken_intra_doc_links lint was renamed
rustdoc::broken_intra_doc_links
https://doc.rust-lang.org/rustdoc/lints.html
This fixes an edge case where the speculated sequence ranges that can be
obtained from flush handles do not account for overlapping windows. The
symptom being that the resulting partition checkpoint marked sequence
numbers as unpersisted that where already persisted.
Fixes#2206.
We need a partition that is partially persisted for this.
This requires some rework for the time handling in `Db` to make it
mockable.
The remaining bits are test framework extensions.
First of all using a partition checkpoint as some kind of intermediate
representation was kinda a hack because partition checkpoints should
only created for to-be-persisted partitions, not for the others.
API-wise it should only be possible to construct a partition checkpoint
from a flush handle.
Also we were only able to construct partition checkpoints for partitions
that had unpersisted data, otherwise there was no sane way to fill the
`min_unpersisted_timestamp`. We must however scan all partitions no
matter if there is unpersisted data so that we can determine the maximum
seen sequence numbers. This was caught by a replay test resulting in a
catalog state where the last database checkpoint had lower maximum seen
sequence numbers than some partition checkpoint, bailing out with an
error.
So overall it turns out that passing the sequencer numbers directly
instead of wrapping them into a partition checkpoint is the better
implementation.
Now we can handle all these cases:
There are two partitions w/ a single write each:
1. A reads sequence number 1
2. B reads sequence number 2
3. we persist A which only knows the sequences up until 1
=> the DB checkpoint needs the global max, otherwise we forget sequences
during replay (2 in this case, so B would be gone)
1. B reads sequence number 1
2. A reads sequence number 2
3. we persist A which (w/o this commit) would not track the sequencer at
all in this checkpoint (since there is nothing to replay)
=> we MUST also remember that we already read up until 2, otherwise we'll
re-read 2 after replay
=> the partition checkpoint needs the local seen max (no matter if there's
something to to persist)
This is required to correctly handle the following case:
1. There are two partitions A and B w/ a single write each (from the same
sequencer).
2. We persist A:
- The partition checkpoint for A will be empty because after persistence
there will be nothing to replay (the single write is persisted and
we're ready).
- The database checkpoint that contains the global minimum of all ranges
recognizes that for the sequencer there is indeed something left (the
minimum sequence number from B).
3. DB restart happens, replay starts
4. We scan all persisted files, figure out that we have a DB checkpoint
with a sequence minimum but (w/o the change in this commit) there is no
maximum. Only partition checkpoints contain maxima, and the only partition
checkpoint that was persisted was the one for partition A and that one was
empty (see above).
5. So now how do we recover partition B?