roll over Platform troubleshooting docs PR
parent 5511693391
commit 9e2401dc0f
@@ -0,0 +1,34 @@
---
title: Troubleshooting issues using InfluxData Platform monitoring
description: placeholder
menu:
  platform:
    name: Troubleshooting issues
    weight: 25
---

With a [monitored TICK stack](/platform/monitoring), identifying, diagnosing, and resolving problems is much easier.
This section walks through recognizing and resolving common issues that appear in the recommended monitoring metrics.

## [Out-of-memory loops](/platform/troubleshooting/oom-loops)

How to identify and resolve out-of-memory (OOM) loops in your TICK stack.

<!-- ## [Runaway series cardinality](/platform/troubleshooting/runaway-series-cardinality)
How to identify and resolve runaway series cardinality in your TICK stack.

## [Hinted Handoff Queue buildup](/platform/troubleshooting/hhq-buildup)
How to identify and resolve hinted handoff queue (HHQ) buildup in InfluxDB Enterprise. -->

## [Disk usage](/platform/troubleshooting/disk-usage)

How to identify and resolve high disk usage in your TICK stack.

<!-- ## [IOPS](/platform/troubleshooting/iops)
How to identify and resolve high input/output operations per second (IOPS) in your TICK stack.

## [Log analysis](/platform/troubleshooting/log-analysis)
How to identify and resolve issues by analyzing log output in your TICK stack.

## [Volume of reads/writes](/platform/troubleshooting/read-write-volume)
How to identify and resolve issues with read and write volumes in your TICK stack. -->
@@ -0,0 +1,112 @@
---
title: Troubleshooting disk usage
description: How to identify and troubleshoot high disk usage when using InfluxData's TICK stack.
menu:
  platform:
    name: Disk usage
    parent: Troubleshooting issues
    weight: 4
---

It's important that components of your TICK stack do not run out of disk space.
A machine at 100% disk usage will not function properly.

In a [monitoring dashboard](/platform/monitoring/monitoring-dashboards), high disk usage
appears in the **Disk Utilization %** metric and looks similar to the following:

![High disk usage](/img/platform/troubleshooting-disk-usage.png)

## Potential causes

### Old data not being downsampled

InfluxDB uses retention policies and continuous queries to downsample older data and preserve disk space.
If you use an infinite retention policy, or one with a lengthy duration, high-resolution
data consumes more and more disk space.

### Log data not being dropped

Log data is incredibly useful in your monitoring solution, but it can also require
more disk space than other types of time series data.
Log data is often stored in an infinite retention policy (the default retention
policy duration), meaning it is never dropped.
This inevitably leads to high disk utilization.

## Solutions

### Remove unnecessary data

The simplest solution to high disk utilization is removing old or unnecessary data.
You can do this by brute force (deleting or dropping data) or, more gracefully,
by tuning the duration of your retention policies and adjusting the downsampling
rates in your continuous queries.
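
The following InfluxQL statements sketch both approaches.
The database, measurement, and retention policy names are hypothetical, so adjust them to match your schema, and remember that `DELETE` and `DROP` remove data permanently:

```sql
-- Brute force: permanently remove old or unneeded data.
DELETE FROM "cpu" WHERE time < '2019-01-01'
DROP MEASUREMENT "unused_measurement"

-- More graceful: keep high-resolution data for a shorter time...
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 30d

-- ...and downsample it into a longer-lived retention policy.
CREATE RETENTION POLICY "one_year" ON "telegraf" DURATION 52w REPLICATION 1
CREATE CONTINUOUS QUERY "cq_cpu_30m" ON "telegraf"
BEGIN
  SELECT mean("usage_idle") AS "usage_idle"
  INTO "telegraf"."one_year"."downsampled_cpu"
  FROM "cpu"
  GROUP BY time(30m), *
END
```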

#### Log data retention policies

Store log data only in a finite
[retention policy](/influxdb/latest/query_language/database_management/#retention-policy-management).
The duration of the retention policy is determined by how long you need to keep
the log data around.

Whether or not you use a [continuous query](/influxdb/latest/query_language/continuous_queries/)
to downsample log data at the end of its retention period is up to you, but old log
data should either be downsampled or dropped altogether.
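
As a minimal sketch, assuming a hypothetical `logs` database, a finite default retention policy might look like the following; the policy name and duration are illustrative only:

```sql
-- Keep raw log data for two weeks, then let InfluxDB expire it automatically.
CREATE RETENTION POLICY "two_weeks" ON "logs" DURATION 2w REPLICATION 1 DEFAULT
```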

### Scale your machine's disk capacity

If removing or downsampling data isn't an option, you can always scale your machine's
disk capacity. How this is done depends on your hardware or virtualization configuration
and is not covered in this documentation.

## Recommendations

### Set up a disk usage alert

To preempt disk utilization issues, create a task that alerts you if disk usage
crosses certain thresholds. The example TICKscript [below](#example-tickscript-alert-for-disk-usage)
sets warning and critical disk usage thresholds and sends a message to Slack
whenever those thresholds are crossed.

_For information about Kapacitor tasks and alerts, see the [Kapacitor alerts](/kapacitor/latest/working/alerts/) documentation._

#### Example TICKscript alert for disk usage
```
// Disk usage alerts
// Alert when disks are this % full
var warn_threshold = 80
var crit_threshold = 90

// Use a larger period here, as the telegraf data can be a little late
// if the server is under load.
var period = 10m

// How often to query for the period.
var every = 20m

var data = batch
  |query('''
    SELECT last(used_percent) FROM "telegraf"."default".disk
    WHERE ("path" = '/influxdb/conf' or "path" = '/')
  ''')
    .period(period)
    .every(every)
    .groupBy('host', 'path')

data
  |alert()
    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
    .warn(lambda: "last" > warn_threshold)
    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
    .details('')
    .stateChangesOnly()
    .slack()

data
  |alert()
    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
    .crit(lambda: "last" > crit_threshold)
    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
    .details('')
    .slack()
```
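
To run this alert, you could save the script to a file and register it as a Kapacitor batch task.
The file name and task name below are hypothetical; the database and retention policy match the query above:

```
kapacitor define disk_usage_alert -type batch -tick disk_usage_alert.tick -dbrp telegraf.default
kapacitor enable disk_usage_alert
```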
@@ -0,0 +1,12 @@
---
title: Troubleshooting Hinted Handoff Queue buildup
description: placeholder
draft: true
menu:
  platform:
    name: Hinted Handoff Queue buildup
    parent: Troubleshooting issues
    weight: 3
---

_PLACEHOLDER_
@@ -0,0 +1,12 @@
---
title: Troubleshooting IOPS
description: placeholder
draft: true
menu:
  platform:
    name: IOPS
    parent: Troubleshooting issues
    weight: 5
---

_PLACEHOLDER_
@@ -0,0 +1,12 @@
---
title: Troubleshooting with log analysis
description: placeholder
draft: true
menu:
  platform:
    name: Log analysis
    parent: Troubleshooting issues
    weight: 6
---

_PLACEHOLDER_
@@ -0,0 +1,133 @@
---
title: Troubleshooting out-of-memory loops
description: How to identify and troubleshoot out-of-memory (OOM) loops when using InfluxData's TICK stack.
menu:
  platform:
    name: Out-of-memory loops
    parent: Troubleshooting issues
    weight: 1
---

Out-of-memory (OOM) loops occur when a running process consumes an increasing amount
of memory until the operating system is forced to kill and restart the process.
When the process is killed, the memory allocated to it is released, but after
restarting, the process again uses more and more RAM until the cycle repeats.

In a [monitoring dashboard](/platform/monitoring/monitoring-dashboards), an OOM loop
will appear in the **Memory Usage %** metric and look similar to the following:

![OOM Loop](/img/platform/troubleshooting-oom-loop.png)

## Potential causes

The causes of OOM loops can vary widely and depend on your specific use of
the TICK stack, but the following is the most common:

### Unoptimized queries

What is queried and how it's queried can drastically affect the memory usage and performance of InfluxDB.
An OOM loop occurs when a query that exhausts memory is issued repeatedly,
for example, by a dashboard cell that refreshes every 30 seconds.

#### Selecting a measurement without specifying a time range

When selecting from a measurement without specifying a time range, InfluxDB attempts
to pull data points from the beginning of UNIX epoch time (00:00:00 UTC on 1 January 1970),
storing the returned data in memory until it's ready for output.
The operating system will eventually kill the process due to high memory usage.

###### Example of selecting a measurement without a time range

```sql
SELECT * FROM "telegraf"."autogen"."cpu"
```
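
By contrast, bounding the same query to a recent time range (and, where possible, to specific fields) limits how much data InfluxDB must hold in memory.
The one-hour window and field names below are only an illustration; use values that fit your schema and dashboards:

```sql
SELECT "usage_idle", "usage_system"
FROM "telegraf"."autogen"."cpu"
WHERE time > now() - 1h
```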

## Solutions

### Identify and update unoptimized queries

The most common cause of OOM loops in InfluxDB is unoptimized queries, but it can
be challenging to identify which queries could be better optimized.
InfluxQL includes tools that help estimate the "cost" of queries and provide insight
into which queries have room for optimization.

#### View your InfluxDB logs

If a query is killed, InfluxDB logs it.
View your [InfluxDB logs](/influxdb/latest/administration/logs/) for hints about which queries are being killed.
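
As a minimal sketch, assuming a systemd-based Linux install where InfluxDB logs to the journal (the service name, log destination, and exact log messages vary by version and platform), you might filter recent log output for query-related entries:

```bash
# Show the last hour of InfluxDB log output and keep only query-related lines.
journalctl -u influxdb.service --since "1 hour ago" | grep -i "query"
```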

#### Estimate query cost

InfluxQL's [`EXPLAIN` statement](/influxdb/latest/query_language/spec#explain)
parses and plans a query, then outputs a summary of estimated costs.
This allows you to estimate how resource-intensive a query may be before
running the actual query.

###### Example EXPLAIN statement

```
> EXPLAIN SELECT * FROM "telegraf"."autogen"."cpu"

QUERY PLAN
----------
EXPRESSION: <nil>
AUXILIARY FIELDS: cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float
NUMBER OF SHARDS: 12
NUMBER OF SERIES: 108
CACHED VALUES: 38250
NUMBER OF FILES: 1080
NUMBER OF BLOCKS: 10440
SIZE OF BLOCKS: 23252999
```

> `EXPLAIN` only outputs the iterators created by the query engine.
> It does not capture any other information within the query engine, such as how many points will actually be processed.

#### Analyze actual query cost

InfluxQL's [`EXPLAIN ANALYZE` statement](/influxdb/latest/query_language/spec/#explain-analyze)
actually executes a query and counts the costs during runtime.

###### Example EXPLAIN ANALYZE statement

```
> EXPLAIN ANALYZE SELECT * FROM "telegraf"."autogen"."cpu" WHERE time > now() - 1d

EXPLAIN ANALYZE
---------------
.
└── select
    ├── execution_time: 104.608549ms
    ├── planning_time: 5.08487ms
    ├── total_time: 109.693419ms
    └── build_cursor
        ├── labels
        │   └── statement: SELECT cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float FROM telegraf.autogen.cpu
        └── iterator_scanner
            ├── labels
            │   └── auxiliary_fields: cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float
            └── create_iterator
                ├── labels
                │   ├── measurement: cpu
                │   └── shard_id: 317
                ├── cursors_ref: 0
                ├── cursors_aux: 90
                ├── cursors_cond: 0
                ├── float_blocks_decoded: 450
                ├── float_blocks_size_bytes: 960943
                ├── integer_blocks_decoded: 0
                ├── integer_blocks_size_bytes: 0
                ├── unsigned_blocks_decoded: 0
                ├── unsigned_blocks_size_bytes: 0
                ├── string_blocks_decoded: 0
                ├── string_blocks_size_bytes: 0
                ├── boolean_blocks_decoded: 0
                ├── boolean_blocks_size_bytes: 0
                └── planning_time: 4.523978ms
```

### Scale available memory

If possible, increase the amount of memory available to InfluxDB.
This is easier if running in a virtualized or cloud environment where resources can be scaled on the fly.
In environments with a fixed set of resources, this can be a very difficult challenge to overcome.
@@ -0,0 +1,12 @@
---
title: Troubleshooting the volume of reads and writes
description: placeholder
draft: true
menu:
  platform:
    name: Volume of reads and writes
    parent: Troubleshooting issues
    weight: 7
---

_PLACEHOLDER_
@@ -0,0 +1,12 @@
---
title: Troubleshooting runaway series cardinality
description: placeholder
draft: true
menu:
  platform:
    name: Runaway series cardinality
    parent: Troubleshooting issues
    weight: 2
---

_PLACEHOLDER_
Binary image files added (not shown): 17 KiB and 26 KiB.