Job of Garbage Collector
The garbage_collector service is responsible for cleaning up the
parquet_file table in the catalog, as well as eventually removing
old, unneeded parquet files from the object store.
Background
As IOx ingests data, it stores the data as parquet files in the object
store. Each parquet file created is recorded as a row in the
parquet_file table in the catalog.
Over time, the compactor creates new, more optimized
parquet files by combining several pre-existing files. When a new file
is successfully created on object_store, a new row is added to
parquet_file and the previous files containing the same data are "soft
deleted" by setting parquet_file.to_delete to the current time.
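As a rough illustration of the soft-delete step, the sketch below models the parquet_file table as an in-memory list. The ParquetFileRow class, its fields, and the record_compaction helper are hypothetical names chosen for this example; they are not the actual catalog schema or compactor API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ParquetFileRow:
    # Hypothetical, simplified stand-in for one row of the parquet_file table.
    object_store_id: str
    to_delete: Optional[datetime] = None  # None means the file is still live


def record_compaction(table, input_ids, output_id, now):
    """Soft delete the compacted inputs, then add a row for the new file."""
    for row in table:
        if row.object_store_id in input_ids and row.to_delete is None:
            # Soft delete: only the timestamp is set. The row and the
            # underlying object both remain until the garbage collector runs.
            row.to_delete = now
    table.append(ParquetFileRow(output_id))


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
table = [ParquetFileRow("a"), ParquetFileRow("b"), ParquetFileRow("c")]
record_compaction(table, {"a", "b"}, "ab", now)

live = [row.object_store_id for row in table if row.to_delete is None]
print(live)        # ['c', 'ab']
print(len(table))  # 4: the soft-deleted rows are still present
```

Note that the table grows on every compaction in this model, which is the unbounded growth the garbage collector exists to prevent.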
Without the garbage collector, the size of the parquet_file table
and the number of files in object store would grow without bound.
Interaction with Querier
The Querier caches entries from the parquet_file table to answer
queries and periodically refreshes this cache. This cache means that
even after a row in parquet_file has been marked as to_delete or
actually deleted, the querier may still attempt to read the underlying
parquet file from object store until its cache is refreshed.
Configuration
There are two key configuration knobs that control the behavior of the garbage collector:
- INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF: this setting controls when rows are deleted from the parquet_file table. Any row whose to_delete time is further in the past than this setting will be deleted from the catalog.
- INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF: this setting controls when objects are actually deleted from the object store. Any object that was created (according to the object store timestamp) longer than this interval ago and is not referenced in the catalog's parquet_file table will be deleted.
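The two rules above can be sketched as a pair of filters. This is a minimal model, not the garbage collector's implementation: the function names, the dictionary-based inputs, and the 14-day and 30-day cutoff values are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def catalog_rows_to_delete(rows, now, parquetfile_cutoff):
    """rows maps object id -> to_delete time (None if the file is live).

    Returns ids of rows that were soft deleted longer ago than the cutoff.
    """
    return sorted(
        object_id
        for object_id, to_delete in rows.items()
        if to_delete is not None and now - to_delete > parquetfile_cutoff
    )


def objects_to_delete(objects, referenced_ids, now, objectstore_cutoff):
    """objects maps object id -> creation time reported by the object store.

    Returns ids that are older than the cutoff AND no longer in the catalog.
    """
    return sorted(
        object_id
        for object_id, created in objects.items()
        if object_id not in referenced_ids and now - created > objectstore_cutoff
    )


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
parquetfile_cutoff = timedelta(days=14)  # hypothetical value
objectstore_cutoff = timedelta(days=30)  # hypothetical value

rows = {
    "old": now - timedelta(days=20),    # soft deleted 20 days ago
    "recent": now - timedelta(days=1),  # soft deleted yesterday
    "live": None,                       # never soft deleted
}
removed_rows = catalog_rows_to_delete(rows, now, parquetfile_cutoff)
print(removed_rows)  # ['old']

# Whatever is left in the catalog still protects its object.
referenced = set(rows) - set(removed_rows)
objects = {
    "old": now - timedelta(days=60),
    "recent": now - timedelta(days=60),
    "live": now - timedelta(days=10),
    "orphan_new": now - timedelta(days=5),  # unreferenced but too young
}
removed_objects = objects_to_delete(objects, referenced, now, objectstore_cutoff)
print(removed_objects)  # ['old']
```

In this model an object is removed only when both conditions hold: its catalog row is gone and its age exceeds the object store cutoff.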
Frequently Asked Questions
Q: Why do we need two cutoffs?
A: The querier relies on objects not being deleted until its caches
are refreshed. For example, if INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF is
set to 90 days but a parquet file is a year old, then as soon as the row
is removed from parquet_file the object may be deleted from the
object store, even though it is still referred to in the Querier
cache. Thus INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF must be set
sufficiently high to ensure the querier cache is refreshed before
objects become candidates for deletion.
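The year-old file in this answer can be worked through with simple date arithmetic. The sketch below assumes a hypothetical querier cache refresh interval of one hour and a hypothetical INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF of three hours; neither value comes from IOx itself, and the helper name is invented for the example. It also ignores how often the garbage collector actually runs, so it gives only the earliest possible deletion time.

```python
from datetime import datetime, timedelta, timezone


def earliest_object_deletion(
    created_at, soft_deleted_at, parquetfile_cutoff, objectstore_cutoff
):
    """Earliest time at which the object satisfies both deletion conditions."""
    row_removed_at = soft_deleted_at + parquetfile_cutoff  # catalog row can go
    old_enough_at = created_at + objectstore_cutoff        # object is old enough
    return max(row_removed_at, old_enough_at)


soft_deleted_at = datetime(2024, 6, 1, tzinfo=timezone.utc)
created_at = soft_deleted_at - timedelta(days=365)  # the file is a year old

eligible_at = earliest_object_deletion(
    created_at,
    soft_deleted_at,
    parquetfile_cutoff=timedelta(hours=3),  # hypothetical setting
    objectstore_cutoff=timedelta(days=90),
)

# The 90-day object store cutoff passed long ago, so only the parquet file
# cutoff stands between the soft delete and removal of the object.
window = eligible_at - soft_deleted_at
cache_refresh_interval = timedelta(hours=1)  # hypothetical querier setting
print(window)                           # 3:00:00
print(window > cache_refresh_interval)  # True: the cache refreshes in time
```

With a parquet file cutoff shorter than the refresh interval, the last check would come out False, which is the unsafe configuration this answer warns about.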
Q: Why not delete objects immediately when INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF expires?
A: Database backups contain references to parquet files. To ensure
that no file referred to by these backups is deleted,
INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF can be set sufficiently high.