---
title: Troubleshoot high disk usage
list_title: High disk usage
description: Identify and troubleshoot high disk usage when using the InfluxData 1.x TICK stack.
menu:
  platform:
    name: Disk usage
    parent: Troubleshoot
weight: 4
---
It's important that the components of your TICK stack do not run out of disk space.
A machine at 100% disk usage will not function properly.

In a [monitoring dashboard](/platform/monitoring/influxdata-platform/monitoring-dashboards/), high disk usage
appears in the **Disk Utilization %** metric and looks similar to the following:

![High disk usage](/img/platform/troubleshooting-disk-usage.png)

## Potential causes

### Old data not being downsampled
InfluxDB uses retention policies and continuous queries to downsample older data and preserve disk space.
If you use an infinite retention policy or one with a very long duration, high-resolution
data consumes more and more disk space over time.
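
To check whether any of your retention policies are infinite, list them with InfluxQL.
A `duration` of `0s` means the policy keeps data forever.
The database name below (`telegraf`) is only an example; substitute your own:

```sql
-- List retention policies for a database.
-- A duration of 0s means the policy keeps data forever.
SHOW RETENTION POLICIES ON "telegraf"
```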

### Log data not being dropped
Log data is incredibly useful in a monitoring solution, but it often requires
more disk space than other types of time series data.
Log data is commonly stored in a retention policy with an infinite duration (the default),
meaning it is never dropped.
This inevitably leads to high disk utilization.

## Solutions

### Remove unnecessary data
The simplest solution to high disk utilization is to remove old or unnecessary data.
You can do this by brute force (deleting or dropping data) or, more gracefully,
by tuning the duration of your retention policies and adjusting the downsampling
rates in your continuous queries.
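
For example, the following InfluxQL statements sketch both approaches.
The database, retention policy, and measurement names are placeholders; adjust them to match your schema:

```sql
-- Brute force: drop an entire measurement, or delete points older than 90 days.
DROP MEASUREMENT "unused_measurement"
DELETE FROM "syslog" WHERE time < now() - 90d

-- More graceful: shorten an existing retention policy so older data
-- expires automatically.
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 30d
```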

#### Log data retention policies
Log data should only be stored in a finite
[retention policy](/influxdb/v1/query_language/database_management/#retention-policy-management).
The duration of your retention policy is determined by how long you want to keep
log data around.
Whether or not you use a [continuous query](/influxdb/v1/query_language/continuous_queries/)
to downsample log data at the end of its retention period is up to you, but old log
data should either be downsampled or dropped altogether.
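
As an illustration, the statements below create a 14-day retention policy for raw log data
and a continuous query that downsamples hourly log counts into a longer-lived policy.
The database, policy, measurement, and field names are examples only; adapt them to your schema:

```sql
-- Keep raw log data for 14 days, then let it expire automatically.
CREATE RETENTION POLICY "logs_14d" ON "telegraf" DURATION 14d REPLICATION 1

-- Keep downsampled, hourly log counts for one year.
CREATE RETENTION POLICY "logs_1y" ON "telegraf" DURATION 52w REPLICATION 1

-- Downsample raw logs into hourly counts.
CREATE CONTINUOUS QUERY "cq_downsample_logs" ON "telegraf"
BEGIN
  SELECT count("message") AS "log_count"
  INTO "telegraf"."logs_1y"."syslog_hourly"
  FROM "telegraf"."logs_14d"."syslog"
  GROUP BY time(1h), *
END
```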

### Scale your machine's disk capacity
If removing or downsampling data isn't an option, you can always scale your machine's
disk capacity. How this is done depends on your hardware or virtualization configuration
and is not covered in this documentation.

## Recommendations

### Set up a disk usage alert
To preempt disk utilization issues, create a task that alerts you if disk usage
crosses certain thresholds. The example TICKscript [below](#example-tickscript-alert-for-disk-usage)
sets warning and critical disk usage thresholds and sends a message to Slack
whenever those thresholds are crossed.

_For information about Kapacitor tasks and alerts, see the [Kapacitor alerts](/kapacitor/v1/working/alerts/) documentation._

#### Example TICKscript alert for disk usage
```js
// Disk usage alerts
// Alert when disks are this % full
var warn_threshold = 80
var crit_threshold = 90

// Use a larger period here, as the telegraf data can be a little late
// if the server is under load.
var period = 10m

// How often to query for the period.
var every = 20m

var data = batch
  |query('''
    SELECT last(used_percent) FROM "telegraf"."default".disk
    WHERE ("path" = '/influxdb/conf' or "path" = '/')
  ''')
    .period(period)
    .every(every)
    .groupBy('host', 'path')

data
  |alert()
    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
    .warn(lambda: "last" > warn_threshold)
    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
    .details('')
    .stateChangesOnly()
    .slack()

data
  |alert()
    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
    .crit(lambda: "last" > crit_threshold)
    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
    .details('')
    .slack()
```
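
To activate the alert, save the script to a file (for example, `disk_alert.tick`)
and register it as a batch task with the Kapacitor CLI using `kapacitor define`
(pointing at the script and at the `telegraf`.`default` database and retention policy it queries),
then start it with `kapacitor enable`.
The task and file names here are examples; use whatever fits your environment.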