docs(clustered): data ingest lifecycle (#5369)

* docs: object versioning, lifecycle rule best practices

* docs: trim whitespace

* docs: include appinstance configuration

* docs: update gc env var wording

Co-authored-by: Fraser Savage <fraser@savage.engineer>

* chore: use suggested wording

Co-authored-by: Dom <dom@itsallbroken.com>

* docs: dedicated mention of versioning recommendation

The versioning really depends on an organisation's backups/DR strategy.
I've pulled this more in line with how the lifecycle of the ingested data
can be managed through the garbage-collector.

* docs: wording

* docs: use scenario wording suggestion

Co-authored-by: Dom <dom@itsallbroken.com>

* docs: use suggested extension for gc service

Co-authored-by: Dom <dom@itsallbroken.com>

* docs: no versioning suggestion

Co-authored-by: Dom <dom@itsallbroken.com>

* docs: hint requirements for retention

Co-authored-by: Dom <dom@itsallbroken.com>

* docs: mention object store sizing with lower cutoff

* docs: add warning for 3h cutoff

* suggested edits to the clustered ingest lifecycle guide (#5427)

* docs: use suggestion for configurable cutoff floor

Co-authored-by: Scott Anderson <sanderson@users.noreply.github.com>

---------

Co-authored-by: Fraser Savage <fraser@savage.engineer>
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: Scott Anderson <sanderson@users.noreply.github.com>
Jack 2024-04-16 16:00:46 +01:00 committed by GitHub
parent b6d77062fc
commit caca0a71b6
1 changed files with 187 additions and 0 deletions


@ -0,0 +1,187 @@
---
title: Data ingest lifecycle best practices
description: >
  Best practices for managing the lifecycle of data ingested into InfluxDB.
menu:
  influxdb_clustered:
    name: Data ingest lifecycle
    parent: write-best-practices
weight: 204
---
Data ingested into InfluxDB must conform to the retention period of the database
in which it is stored.
Points with timestamps outside of the retention period are no longer queryable,
but may still have references maintained in
[Object storage](/influxdb/clustered/reference/internals/storage-engine/#object-store)
or the [Catalog](/influxdb/clustered/reference/internals/storage-engine/#catalog),
resulting in an increase in operational overhead and cost.
To reduce this overhead and cost, it is important to manage the lifecycle of ingested data.
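
As a rough illustration of the retention rule described above, the following Python sketch checks whether a point's timestamp still falls inside a database's retention period. The function name `is_within_retention` and the boundary handling are assumptions made for this example; InfluxDB enforces retention itself, and this is not how the cluster implements it.

```python
from datetime import datetime, timedelta, timezone


def is_within_retention(point_time: datetime, retention: timedelta, now: datetime) -> bool:
    """Return True if the point is no older than the retention period.

    Hypothetical helper for illustration only; treating a point exactly at
    the boundary as still retained is an assumption.
    """
    return now - point_time <= retention


# Example: a 30-day retention period evaluated at a fixed "now".
now = datetime(2024, 4, 16, 12, 0, tzinfo=timezone.utc)
retention = timedelta(days=30)

recent = now - timedelta(days=7)    # inside the retention period
expired = now - timedelta(days=45)  # outside the retention period

print(is_within_retention(recent, retention, now))
print(is_within_retention(expired, retention, now))
```

Points like `expired` are the ones whose Parquet files and Catalog references the practices below help clean up.
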
Use the following best practices to manage the lifecycle of data in your
InfluxDB cluster:
- [Use appropriate retention periods](#use-appropriate-retention-periods)
- [Tune garbage collection](#tune-garbage-collection)
## Use appropriate retention periods
When [creating or updating a database](/influxdb/clustered/admin/databases/#create-a-database),
use a retention period that is appropriate for your requirements.
Storing data longer than is required adds unnecessary operational cost to your
InfluxDB cluster.
## Tune garbage collection
Once data falls outside of a database's retention period, the garbage collection
service can remove all artifacts associated with the data from the Catalog and Object store.
Tune the garbage collector cutoff period to ensure that data is removed in a timely manner.
Use the following environment variables to tune the garbage collector:
- `INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF`: the age at which Parquet files not
referenced in the Catalog become eligible for deletion from Object storage.
The default is `30d`.
- `INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF`: how long to retain rows in the Catalog
that reference Parquet files marked for deletion. The default is `30d`.
{{% warn %}}
To ensure there is a grace period before files and references are removed, the
minimum garbage collector (GC) object store and Parquet file cutoff time is
three hours (`3h`).
{{% /warn %}}
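
Before applying a cutoff value, it can help to check it against the three-hour minimum. The following Python sketch is one way to do that under simplifying assumptions: it only understands whole-number values with an `h` or `d` suffix (such as `6h` or `30d`), and the helper names `parse_cutoff` and `check_cutoff` are hypothetical. The garbage collector may accept other duration formats that this sketch rejects.

```python
import re
from datetime import timedelta

# Minimum cutoff stated in this guide.
MIN_CUTOFF = timedelta(hours=3)

# Assumption: only hour and day suffixes are handled in this sketch.
_UNITS = {"h": "hours", "d": "days"}


def parse_cutoff(value: str) -> timedelta:
    """Parse a value such as '6h' or '30d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported cutoff format: {value!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def check_cutoff(value: str) -> timedelta:
    """Return the parsed cutoff, or raise if it is below the 3h minimum."""
    cutoff = parse_cutoff(value)
    if cutoff < MIN_CUTOFF:
        raise ValueError(f"cutoff {value!r} is below the 3h minimum")
    return cutoff


print(check_cutoff("6h"))
print(check_cutoff("30d"))
```

A check like this could run in a CI step that validates your `AppInstance` configuration before it is applied.
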
We recommend setting these options to a value aligned to your organization's
backup and recovery strategy.
For example, a value of `6h` (6 hours) would be appropriate for running a lean
Catalog that only maintains references to recent data and does not require backups.
### Use case examples
Use the following scenarios as a guide for different use cases:
{{< expand-wrapper >}}
{{% expand "Leading edge data with no backups" %}}
When only the most recent data is important and backups are not required, use a
very low cutoff point for the garbage collector.
Using a low value means that the garbage collection service will promptly delete
files from the Object store and remove associated rows from the Catalog.
This results in a lean Catalog with lower operational overhead and fewer files
in the Object store.
```yaml
apiVersion: kubecfg.dev/v1alpha1
kind: AppInstance
metadata:
  name: influxdb
  namespace: influxdb
spec:
  package:
    # ...
    spec:
      components:
        garbage-collector:
          template:
            containers:
              iox:
                env:
                  INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF: '6h'
                  INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF: '6h'
```
{{% /expand %}}
{{% expand "Custom backup window _with_ object storage versioning" %}}
When backups are required and you are leveraging the versioning capability of your
Object store (provided by your object store provider), use a low cutoff point
for the garbage collector service. Your object versioning policy ensures expired
files are kept for the specified backup window time.
Object versioning maintains Parquet files in Object storage after data expires,
but allows the Catalog to remove references to the Parquet files.
Configure non-current objects to expire as soon as possible, while retaining
them long enough to satisfy your organization's backup policy.
The following illustrates an [AWS S3 lifecycle rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
that expires non-current objects after 90 days:
```json
{
  "Rules": [
    {
      "ID": "my-lifecycle-rule",
      "Filter": {
        "Prefix": ""
      },
      "Status": "Enabled",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 90
      }
    }
  ]
}
```
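
If your backup window differs from 90 days, you could generate the rule rather than edit it by hand. The following Python sketch builds the same structure as the rule above for a given number of days. The function name `noncurrent_expiration_rule` and its parameters are hypothetical, and the sketch only produces JSON; it does not call AWS.

```python
import json


def noncurrent_expiration_rule(backup_days: int, rule_id: str = "my-lifecycle-rule") -> dict:
    """Build a lifecycle configuration that expires non-current object versions.

    Hypothetical helper: mirrors the JSON rule shown above, with the number
    of days taken from your backup window.
    """
    if backup_days < 1:
        raise ValueError("backup_days must be at least 1")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": ""},  # empty prefix: apply to all objects
                "Status": "Enabled",
                "NoncurrentVersionExpiration": {"NoncurrentDays": backup_days},
            }
        ]
    }


# Print a rule for a 90-day backup window.
print(json.dumps(noncurrent_expiration_rule(90), indent=2))
```

You can save the printed JSON to a file and apply it with your usual tooling, for example `aws s3api put-bucket-lifecycle-configuration`. Versioning must already be enabled on the bucket for non-current versions to exist.
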
Set the `garbage-collector` to use low cutoff points.
The following example uses `6h`:
```yaml
apiVersion: kubecfg.dev/v1alpha1
kind: AppInstance
metadata:
  name: influxdb
  namespace: influxdb
spec:
  package:
    # ...
    spec:
      components:
        garbage-collector:
          template:
            containers:
              iox:
                env:
                  INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF: '6h'
                  INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF: '6h'
```
{{% /expand %}}
{{% expand "Custom backup window _without_ object storage versioning" %}}
If you cannot make use of object versioning policies but still require a backup
window, configure the garbage collector to retain Parquet files for as long as
your backup period requires.
This will likely result in higher operational costs as the Catalog maintains
more references to associated Parquet files and the Parquet files persist for
longer in the Object store.
{{% note %}}
If possible, we recommend using object versioning.
{{% /note %}}
The following example sets the garbage collector cutoffs to `100d`:
```yaml
apiVersion: kubecfg.dev/v1alpha1
kind: AppInstance
metadata:
  name: influxdb
  namespace: influxdb
spec:
  package:
    # ...
    spec:
      components:
        garbage-collector:
          template:
            containers:
              iox:
                env:
                  INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF: '100d'
                  INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF: '100d'
```
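
To derive these values from a backup window rather than hard-coding them, a small helper along the following lines may be useful. This Python sketch is illustrative only: `format_cutoff` and `gc_env` are hypothetical names, the result is rounded up to whole hours, and setting both variables to the same value simply mirrors the examples in this guide. It also raises any window shorter than three hours to the `3h` minimum.

```python
from datetime import timedelta

# Minimum cutoff stated in this guide.
MIN_CUTOFF = timedelta(hours=3)


def format_cutoff(window: timedelta) -> str:
    """Format a backup window as a cutoff string such as '6h' or '100d'."""
    window = max(window, MIN_CUTOFF)
    # Ceiling division, so a partial hour rounds up rather than down.
    hours = -(-window // timedelta(hours=1))
    if hours % 24 == 0:
        return f"{hours // 24}d"
    return f"{hours}h"


def gc_env(backup_window: timedelta) -> dict:
    """Return the garbage collector env entries for a backup window."""
    cutoff = format_cutoff(backup_window)
    return {
        "INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF": cutoff,
        "INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF": cutoff,
    }


print(gc_env(timedelta(days=100)))
```

The returned mapping corresponds to the `env` block in the `AppInstance` example above.
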
{{% /expand %}}
{{< /expand-wrapper >}}