This now creates a checkpoint every 10 transactions. To make the benchmark a bit fairer, the chunk count is increased to 109 so that there are some transactions after the last checkpoint. With that, performance improves from 10.5s to 1.2s (or even 0.3s if we kept the chunk count at 100).
This will be useful for #1381.
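As a rough illustration of why 109 chunks (assuming one transaction per chunk in this benchmark) leaves work after the last checkpoint, here is a minimal sketch of an every-N-transactions checkpoint rule; the function and the `checkpoint_interval` parameter are hypothetical, not the actual catalog API:

```rust
/// Minimal sketch: decide whether to write a checkpoint after a transaction.
/// `revision` is the number of transactions committed so far.
fn should_checkpoint(revision: u64, checkpoint_interval: u64) -> bool {
    revision % checkpoint_interval == 0
}

fn main() {
    let interval = 10;
    // With 109 transactions the last checkpoint is written at revision 100,
    // so replay still has to apply transactions 101..=109 on top of it.
    let last_checkpoint = (1u64..=109)
        .filter(|revision| should_checkpoint(*revision, interval))
        .last()
        .unwrap();
    assert_eq!(last_checkpoint, 100);
}
```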
At the moment we parse schema and stats eagerly and store them alongside the parquet metadata in memory. Technically this is not required, since it is essentially duplicate data. In the future we might trade some of this memory for CPU by parsing schema and stats on demand.
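A minimal sketch of what on-demand parsing could look like, using the `once_cell` crate; the types and the parse step are hypothetical and only illustrate the memory-for-CPU trade-off:

```rust
use once_cell::sync::OnceCell;

/// Stand-in for the raw (serialized) parquet metadata we already keep in memory.
struct RawParquetMetadata(Vec<u8>);

/// Stand-in for the parsed schema; the real code would use an Arrow schema.
struct Schema(Vec<String>);

struct ChunkMetadata {
    raw: RawParquetMetadata,
    // Parsed lazily on first access instead of eagerly at load time, so we
    // only pay the CPU cost (and the extra memory) when it is actually needed.
    schema: OnceCell<Schema>,
}

impl ChunkMetadata {
    fn schema(&self) -> &Schema {
        self.schema.get_or_init(|| parse_schema(&self.raw))
    }
}

/// Hypothetical parse step; the real code would decode the parquet footer.
fn parse_schema(_raw: &RawParquetMetadata) -> Schema {
    Schema(vec!["time".to_string(), "value".to_string()])
}
```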
Note that the resulting size estimations differ because we were double-counting `Table`: `mem::size_of::<Self>()` already includes all non-boxed children, since they are stored inline as part of the parent structure.
Issue: #1295.
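A small standalone illustration of that behavior (the types are hypothetical, not the actual catalog structs):

```rust
use std::mem::size_of;

struct Child {
    data: [u8; 64],
}

struct InlineParent {
    child: Child, // stored inline: its 64 bytes are part of the parent's size
    flag: bool,
}

struct BoxedParent {
    child: Box<Child>, // only the pointer counts toward the parent's size
    flag: bool,
}

fn main() {
    // The inline child is already included, so adding size_of::<Child>() on
    // top of size_of::<InlineParent>() would double-count it.
    assert!(size_of::<InlineParent>() >= size_of::<Child>());
    assert!(size_of::<BoxedParent>() < size_of::<InlineParent>());
    println!(
        "inline parent: {} bytes, boxed parent: {} bytes",
        size_of::<InlineParent>(),
        size_of::<BoxedParent>()
    );
}
```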
__Rationale__
We currently use the `tracing` framework to output to both log outputs (e.g. stdout for k8s) and distributed tracing collectors (e.g. opentelemetry jaeger).
However, due to a limitation in the `tracing` SDK, we can only have one "filter" level that applies to both log and tracing outputs. This is impractical because tracing collectors are designed to receive high-verbosity data (which will then be sampled within the opentelemetry library), while logs are generally limited to the DEBUG level in production.
This PR adds a `FilteredLayer` tracing subscriber layer that wraps a subscriber layer with an independent filter, which can filter events going to the wrapped subscriber layer more aggressively than the global filter. This will allow us to emit logs at INFO or DEBUG level while passing all events to opentelemetry at TRACE level (the opentelemetry SDK will then sample the events so that only a small fraction is sent to the ot collector).
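As a rough illustration of the wrapping idea (not the actual `FilteredLayer` implementation in this PR, and covering only events, while the real layer also has to forward span lifecycle callbacks), a sketch against the `tracing-subscriber` `Layer` trait could look like this:

```rust
use tracing::{Event, Metadata, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

/// Sketch: wraps an inner layer and only forwards events that pass an
/// additional, independent filter layer (e.g. an `EnvFilter`).
struct FilteredLayer<F, L> {
    filter: F,
    inner: L,
}

impl<S, F, L> Layer<S> for FilteredLayer<F, L>
where
    S: Subscriber,
    F: Layer<S>,
    L: Layer<S>,
{
    fn enabled(&self, _metadata: &Metadata<'_>, _ctx: Context<'_, S>) -> bool {
        // Stay enabled at the global level so that verbose events are not
        // discarded before other layers (e.g. the tracing exporter) see them.
        true
    }

    fn on_event(&self, event: &Event<'_>, ctx: Context<'_, S>) {
        // Forward the event to the wrapped layer only if the independent
        // filter would enable its metadata; otherwise drop it silently.
        if self.filter.enabled(event.metadata(), ctx.clone()) {
            self.inner.on_event(event, ctx);
        }
    }
}
```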
__Note__
This PR just implements the `FilteredLayer` and a test. Another PR will integrate this with
our log/tracing setup code.
`--traces-exporter-jaeger-max-packet-size` is also important when you run the Jaeger collector on "localhost" via `docker run jaegertracing/all-in-one ....`, which on macOS doesn't really run on the real localhost but has a few hops between tunneling interfaces, so you'd get mysteriously dropped packets that can easily drive you to doubt your own sanity on an otherwise calm Thursday evening.
This implements a way to add checkpoints to the preserved catalog and
speed up replay.
Note: This leaves the "hook it up into the actual DB" for a future PR.
Issue: #1381.
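A minimal sketch of the replay idea, assuming a checkpoint is a full snapshot of the catalog state at a given transaction revision (the types and function are hypothetical, not the preserved-catalog API):

```rust
/// Stand-ins for the persisted catalog artifacts; the real types differ.
struct Checkpoint {
    revision: u64, // plus a full snapshot of the catalog state
}

struct Transaction {
    revision: u64, // plus the recorded catalog actions
}

/// Replay by loading the newest checkpoint (if any) and then applying only
/// the transactions that come after it, instead of replaying everything.
fn replay(checkpoints: &[Checkpoint], transactions: &[Transaction]) -> u64 {
    let start = checkpoints.iter().map(|c| c.revision).max().unwrap_or(0);
    let mut revision = start;
    for tx in transactions.iter().filter(|t| t.revision > start) {
        // apply_transaction(tx) would go here
        revision = tx.revision;
    }
    revision
}

fn main() {
    let checkpoints = vec![Checkpoint { revision: 100 }];
    let transactions: Vec<Transaction> =
        (1u64..=109).map(|revision| Transaction { revision }).collect();
    // Only transactions 101..=109 are applied on top of the checkpoint.
    assert_eq!(replay(&checkpoints, &transactions), 109);
}
```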
This will be handy when the catalog state must be able to return metadata objects so that we can create checkpoints, especially once we use multi-chunk parquet files at some point in the medium-term future.
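A hypothetical sketch of what such an interface could look like; the trait, method, and types are illustrative and not the actual IOx API:

```rust
use std::collections::HashMap;

/// Stand-in for the parsed parquet metadata (schema, statistics, ...).
struct ParquetFileMetadata {
    row_count: usize,
}

/// Hypothetical interface: the catalog state hands back the metadata of every
/// parquet file it tracks, keyed by path, so a checkpoint can be assembled
/// purely from in-memory state. With multi-chunk parquet files one file may
/// back several chunks, which is why metadata is returned per file rather
/// than per chunk.
trait CatalogState {
    fn files(&self) -> HashMap<String, ParquetFileMetadata>;
}
```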