docs(clustered): data ingest lifecycle (#5369)

* docs: object versioning, lifecycle rule best practices * docs: trim whitespace * docs: include appinstance configuration * docs: update gc env var wording Co-authored-by: Fraser Savage <fraser@savage.engineer> * chore: use suggested wording Co-authored-by: Dom <dom@itsallbroken.com> * docs: dedicated mention of versioning recommendation The versioning really depends on an organisation's backups/DR strategy. I've pulled this more inline with how the lifecylce of the ingested data can be managed through the garbage-collector. * docs: wording * docs: use scenario wording suggestion Co-authored-by: Dom <dom@itsallbroken.com> * docs: use suggested extension for gc service Co-authored-by: Dom <dom@itsallbroken.com> * docs: no versioning suggestion Co-authored-by: Dom <dom@itsallbroken.com> * docs: hint requirements for retention Co-authored-by: Dom <dom@itsallbroken.com> * docs: mention object store sizing with lower cutoff * docs: add warning for 3h cutoff * suggested edits to the clustered ingest lifecycle guide (#5427) * docs: use suggestion for configurable cutoff floor Co-authored-by: Scott Anderson <sanderson@users.noreply.github.com> --------- Co-authored-by: Fraser Savage <fraser@savage.engineer> Co-authored-by: Dom <dom@itsallbroken.com> Co-authored-by: Scott Anderson <sanderson@users.noreply.github.com>
2024-04-16 16:00:46 +01:00 · 2024-04-16 16:00:46 +01:00 · caca0a71b6
parent b6d77062fc
commit caca0a71b6
1 changed files with 187 additions and 0 deletions
--- a/content/influxdb/clustered/write-data/best-practices/data-lifecycle.md
+++ b/content/influxdb/clustered/write-data/best-practices/data-lifecycle.md
@ -0,0 +1,187 @@
+---
+title: Data ingest lifecycle best practices
+description: >
+  Best practices for managing the lifecycle of data ingested into InfluxDB.
+menu:
+  influxdb_clustered:
+    name: Data ingest lifecycle
+    parent: write-best-practices
+weight: 204
+---
+
+Data ingested into InfluxDB must conform to the retention period of the database
+in which it is stored.
+Points with timestamps outside of the retention period are no longer queryable,
+but may still have references maintained in
+[Object storage](/influxdb/clustered/reference/internals/storage-engine/#object-store)
+or the [Catalog](/influxdb/clustered/reference/internals/storage-engine/#catalog),
+resulting in an increase in operational overhead and cost.
+To reduce these factors, it is important to manage the lifecycle of ingested data.
+
+Use the following best practices to manage the lifecycle of data in your
+InfluxDB cluster:
+
+- [Use appropriate retention periods](#use-appropriate-retention-periods)
+- [Tune garbage collection](#tune-garbage-collection)
+
+## Use appropriate retention periods
+
+When [creating or updating a database](/influxdb/clustered/admin/databases/#create-a-database),
+use a retention period that is appropriate for your requirements.
+Storing data longer than is required adds unnecessary operational cost to your
+InfluxDB cluster.
+
+## Tune garbage collection
+
+Once data falls outside of a database's retention period, the garbage collection
+service can remove all artifacts associated with the data from the Catalog and Object store.
+Tune the garbage collector cutoff period to ensure that data is removed in a timely manner.
+
+Use the following environment variables to tune the garbage collector:
+
+- `INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF`: the age at which Parquet files not
+  referenced in the Catalog become eligible for deletion from Object storage.
+  The default is `30d`.
+- `INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF`: how long to retain rows in the Catalog
+  that reference Parquet files marked for deletion. The default is `30d`.
+
+{{% warn %}}
+To ensure there is a grace period before files and references are removed, the
+minimum garbage collector (GC) object store and Parquet file cutoff time is
+three hours (`3h`).
+{{% /warn %}}
+
+We recommend setting these options to a value aligned to your organization's
+backup and recovery strategy.
+For example, a value of `6h` (6 hours) would be appropriate for running a lean
+Catalog that only maintains references to recent data and does not require backups.
+
+### Use case examples
+
+Use the following scenarios as a guide for different use cases:
+
+{{< expand-wrapper >}}
+{{% expand "Leading edge data with no backups" %}}
+
+When only the most recent data is important and backups are not required, use a
+very low cutoff point for the garbage collector.
+Using a low value means that the garbage collection service will promptly delete
+files from the Object store and remove rows associated rows from the Catalog.
+This results in a lean Catalog with lower operational overhead and less files
+in the Object store.
+
+```yaml
+apiVersion: kubecfg.dev/v1alpha1
+kind: AppInstance
+metadata:
+  name: influxdb
+  namespace: influxdb
+spec:
+  package:
+    # ...
+    spec:
+      components:
+       garbage-collector:
+         template:
+           containers:
+             iox:
+               env:
+                 INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF: '6h'
+                 INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF: '6h'
+```
+
+{{% /expand %}}
+
+{{% expand "Custom backup window _with_ object storage versioning" %}}
+
+When backups are required and you are leveraging the versioning capability of your
+Object store (provided by your object store provider), use a low cutoff point
+for the garbage collector service. Your object versioning policy ensures expired
+files are kept for the specified backup window time.
+
+Object versioning maintains Parquet files in Objects storage after data expires,
+but allows the Catalog to remove references to the Parquet files.
+Non-current objects should be configured to be expired as soon as possible, but
+retained long enough to satisfy your organization's backup policy.
+
+The following illustrates an [AWS S3 lifecycle rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html)
+that expires non-current objects after 90 days:
+
+```json
+{
+    "Rules": [
+        {
+            "ID": "my-lifecycle-rule",
+            "Filter": {
+                "Prefix": ""
+            },
+            "Status": "Enabled",
+            "NoncurrentVersionExpiration": {
+                "NoncurrentDays": 90
+            }
+        }
+    ]
+}
+```
+
+Set the `garbage-collector` to use low cutoff points.
+The following example uses `6h`:
+
+```yaml
+apiVersion: kubecfg.dev/v1alpha1
+kind: AppInstance
+metadata:
+  name: influxdb
+  namespace: influxdb
+spec:
+  package:
+    # ...
+    spec:
+      components:
+       garbage-collector:
+         template:
+           containers:
+             iox:
+               env:
+                 INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF: '6h'
+                 INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF: '6h'
+```
+{{% /expand %}}
+
+{{% expand "Custom backup window _without_ object storage versioning" %}}
+
+If you cannot make use of object versioning policies but still requires a backup
+window, configure the garbage collector to retain Parquet files for as long as
+your backup period requires.
+
+This will likely result in higher operational costs as the Catalog maintains
+more references to associated Parquet files and the Parquet files persist for
+longer in the Object store.
+
+{{% note %}}
+If possible, we recommend using object versioning.
+{{% /note %}}
+
+The following example sets the garbage collector cutoffs to `100d`:
+
+```yaml
+apiVersion: kubecfg.dev/v1alpha1
+kind: AppInstance
+metadata:
+  name: influxdb
+  namespace: influxdb
+spec:
+  package:
+    # ...
+    spec:
+      components:
+       garbage-collector:
+         template:
+           containers:
+             iox:
+               env:
+                 INFLUXDB_IOX_GC_OBJECTSTORE_CUTOFF: '100d'
+                 INFLUXDB_IOX_GC_PARQUETFILE_CUTOFF: '100d'
+```
+{{% /expand %}}
+{{< /expand-wrapper >}}