influxdb/garbage_collector
Phil Bracikowski e34ec77e8d
feat(garbage-collector): batch parquet existence checks to catalog (#7964)

The core feature of this PR is batching the existence checks of parquet
files in object store against the catalog. Previously, there was one
catalog query for each parquet file in object store, which can add up to
a lot of requests.

With this PR, a single catalog query checks at most 100 parquet file
UUIDs. A hundred seems like a decent starting place.

A batch may contain fewer than 100 entries because there is also a
timeout on receiving object store meta objects from the object store
lister thread. That timeout is set to 100 milliseconds. If more than 100
are received, they are split into batches of 100 for the catalog.
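
The batching described above can be sketched roughly as follows. This is
a minimal illustration, not the code from this PR: it uses a standard
library channel in place of whatever channel the lister thread really
uses, the names `next_batch`, `MAX_BATCH`, and `RECV_TIMEOUT` are
hypothetical, and it assumes the 100 ms timeout applies to each
individual receive.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical constants mirroring the values named in the commit message.
const MAX_BATCH: usize = 100;
const RECV_TIMEOUT: Duration = Duration::from_millis(100);

/// Collect up to `max` items from `rx`.
///
/// Returns early with a partial (possibly empty) batch if no item arrives
/// within `timeout`, or if the sending side has shut down. The boolean is
/// `true` when the channel is closed and no further batches will arrive.
fn next_batch<T>(rx: &Receiver<T>, max: usize, timeout: Duration) -> (Vec<T>, bool) {
    let mut batch = Vec::with_capacity(max);
    while batch.len() < max {
        match rx.recv_timeout(timeout) {
            Ok(item) => batch.push(item),
            // Nothing arrived in time: hand back what we have so far.
            Err(RecvTimeoutError::Timeout) => return (batch, false),
            // The lister side is gone: flush the remainder and stop.
            Err(RecvTimeoutError::Disconnected) => return (batch, true),
        }
    }
    (batch, false)
}
```

A caller would loop on `next_batch(&rx, MAX_BATCH, RECV_TIMEOUT)`, issue
one catalog query per non-empty batch, and exit once the returned flag
reports that the channel is closed.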

Additionally, this PR includes surrounding code changes to make it more
idiomatic (but not perfect). It also follows up on work suggested in
#7652 to watch for shutdown on the threads.

* fixes #7784

* use hashset instead of vec to test for contains
* chore: add test for db failure path
* remove ParquetFileExistsByOSID and other single field structs that are
  just for sql deserialization; map to uuid explicitly
* fix the sqlite query by using a blob literal X'<hex>' for uuids
* comment clarifications
* adjust logging from debug to warn for expected rare events
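
Two of the items above (the hashset membership test and the SQLite blob
literal) can be illustrated with a short sketch. It is an assumption-laden
example rather than the PR's code: UUIDs are modelled as raw `[u8; 16]`
values instead of a UUID type, and `blob_literal`, `in_list`, and
`missing_from_catalog` are hypothetical helper names.

```rust
use std::collections::HashSet;

/// Render 16 raw UUID bytes as a SQLite blob literal of the form X'<hex>'.
fn blob_literal(bytes: &[u8; 16]) -> String {
    let hex: String = bytes.iter().map(|b| format!("{:02x}", b)).collect();
    format!("X'{}'", hex)
}

/// Join a batch of ids into a comma-separated list of blob literals,
/// suitable for the body of an `IN (...)` clause.
fn in_list(ids: &[[u8; 16]]) -> String {
    ids.iter().map(blob_literal).collect::<Vec<_>>().join(",")
}

/// Return the candidate ids that the catalog did not report as existing.
/// A `HashSet` keeps each membership test O(1) on average, where scanning
/// a `Vec` would be O(n) per lookup.
fn missing_from_catalog(
    candidates: &[[u8; 16]],
    existing: &HashSet<[u8; 16]>,
) -> Vec<[u8; 16]> {
    candidates
        .iter()
        .filter(|id| !existing.contains(*id))
        .copied()
        .collect()
}
```

In production code, bound query parameters are generally preferable to
string-built literals; the literal form is shown only because the commit
message mentions it.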

Many thanks to Carol for help implementing this!
2023-06-14 07:59:00 -07:00
src feat(garbage-collector): batch parquet existence checks to catalog (#7964) 2023-06-14 07:59:00 -07:00
Cargo.toml feat(garbage-collector): batch parquet existence checks to catalog (#7964) 2023-06-14 07:59:00 -07:00