roll over Platform troubleshooting docs PR

pull/1387/head
Kelly 2020-08-21 14:38:04 -07:00 committed by Scott Anderson
parent 5511693391
commit 9e2401dc0f
10 changed files with 339 additions and 0 deletions

View File

@ -0,0 +1,34 @@
---
title: Troubleshooting issues using InfluxData Platform monitoring
description: How to identify, diagnose, and resolve common issues using InfluxData Platform monitoring.
menu:
  platform:
    name: Troubleshooting issues
    weight: 25
---
With a [monitored TICK stack](/platform/monitoring), identifying, diagnosing, and resolving problems is much easier.
This section walks through recognizing and resolving important issues that commonly appear in the recommended monitoring metrics.
## [Out-of-memory loops](/platform/troubleshooting/oom-loops)
How to identify and resolve out-of-memory (OOM) loops in your TICK stack.
<!-- ## [Runaway series cardinality](/platform/troubleshooting/runaway-series-cardinality)
How to identify and resolve runaway series cardinality in your TICK stack.
## [Hinted Handoff Queue buildup](/platform/troubleshooting/hhq-buildup)
How to identify and resolve hinted handoff queue (HHQ) buildup in InfluxDB Enterprise. -->
## [Disk usage](/platform/troubleshooting/disk-usage)
How to identify and resolve high disk usage in your TICK stack.
<!-- ## [IOPS](/platform/troubleshooting/iops)
How to identify and resolve high input/output operations per second (IOPS) in your TICK stack.
## [Log analysis](/platform/troubleshooting/log-analysis)
How to identify and resolve issues by analyzing log output in your TICK stack.
## [Volume of reads/writes](/platform/troubleshooting/read-write-volume)
How to identify and resolve issues with read and write volumes in your TICK stack. -->

View File

@ -0,0 +1,112 @@
---
title: Troubleshooting disk usage
description: How to identify and troubleshoot high disk usage when using InfluxData's TICK stack.
menu:
  platform:
    name: Disk usage
    parent: Troubleshooting issues
    weight: 4
---
It's very important that components of your TICK stack do not run out of disk space.
A machine at 100% disk usage will not function properly.
In a [monitoring dashboard](/platform/monitoring/monitoring-dashboards), high disk usage
will appear in the **Disk Utilization %** metric and look similar to the following:
![High disk usage](/img/platform/troubleshooting-disk-usage.png)
## Potential causes
### Old data not being downsampled
InfluxDB uses retention policies and continuous queries to downsample older data and preserve disk space.
If using an infinite retention policy or one with a lengthy duration, high resolution
data will use more and more disk space.
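To check whether this is happening, you can inspect the retention policies and continuous queries defined on a database.
A minimal sketch, assuming a database named `telegraf`:

```sql
-- List retention policies and their durations for this database
SHOW RETENTION POLICIES ON "telegraf"

-- List continuous queries to see which measurements (if any) are downsampled
SHOW CONTINUOUS QUERIES
```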
### Log data not being dropped
Log data is incredibly useful in your monitoring solution, but can also require
more disk space than other types of time series data.
Log data is often stored in a retention policy with an infinite duration (the default), meaning it is never dropped.
This inevitably leads to high disk utilization.
## Solutions
### Remove unnecessary data
The simplest solution to high disk utilization is removing old or unnecessary data.
This can be done by brute force (deleting/dropping data) or in a more graceful
manner by tuning the duration of your retention policies and adjusting the downsampling
rates in your continuous queries.
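As a rough sketch of both approaches (the database, retention policy, and measurement names below are hypothetical):

```sql
-- Brute force: drop a measurement outright, or delete only its older points
DROP MEASUREMENT "noisy_measurement"
DELETE FROM "noisy_measurement" WHERE time < now() - 90d

-- Graceful: shorten the retention policy so older data expires automatically
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 90d SHARD DURATION 1d
```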
#### Log data retention policies
Log data should only be stored in a finite
[retention policy](/influxdb/latest/query_language/database_management/#retention-policy-management).
The duration of your retention policy is determined by how long you want to keep
log data around.
Whether or not you use a [continuous query](/influxdb/latest/query_language/continuous_queries/)
to downsample log data at the end of its retention period is up to you, but old log
data should either be downsampled or dropped altogether.
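A minimal sketch of this pattern, assuming a `telegraf` database and hypothetical `syslog` measurement and `message` field names, keeps raw logs for 30 days and rolls hourly counts into a longer-lived retention policy:

```sql
-- Keep raw log data for 30 days only
CREATE RETENTION POLICY "logs_30d" ON "telegraf" DURATION 30d REPLICATION 1

-- Downsample to hourly message counts before the raw data expires
CREATE CONTINUOUS QUERY "cq_syslog_counts" ON "telegraf"
BEGIN
  SELECT count("message") AS "message_count"
  INTO "telegraf"."autogen"."syslog_counts"
  FROM "telegraf"."logs_30d"."syslog"
  GROUP BY time(1h), *
END
```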
### Scale your machine's disk capacity
If removing or downsampling data isn't an option, you can always scale your machine's
disk capacity. How this is done depends on your hardware or virtualization configuration
and is not covered in this documentation.
## Recommendations
### Set up a disk usage alert
To preempt disk utilization issues, create a task that alerts you if disk usage
crosses certain thresholds. The example TICKscript [below](#example-tickscript-alert-for-disk-usage)
sets warning and critical disk usage thresholds and sends a message to Slack
whenever those thresholds are crossed.
_For information about Kapacitor tasks and alerts, see the [Kapacitor alerts](/kapacitor/latest/working/alerts/) documentation._
#### Example TICKscript alert for disk usage
```
// Disk usage alerts

// Alert when disks are this % full
var warn_threshold = 80
var crit_threshold = 90

// Use a larger period here, as the telegraf data can be a little late
// if the server is under load.
var period = 10m

// How often to query for the period.
var every = 20m

var data = batch
  |query('''
    SELECT last(used_percent) FROM "telegraf"."default".disk
    WHERE ("path" = '/influxdb/conf' or "path" = '/')
  ''')
    .period(period)
    .every(every)
    .groupBy('host', 'path')

data
  |alert()
    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
    .warn(lambda: "last" > warn_threshold)
    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
    .details('')
    .stateChangesOnly()
    .slack()

data
  |alert()
    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
    .crit(lambda: "last" > crit_threshold)
    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
    .details('')
    .slack()
```

View File

@ -0,0 +1,12 @@
---
title: Troubleshooting Hinted Handoff Queue buildup
description: placeholder
draft: true
menu:
  platform:
    name: Hinted Handoff Queue buildup
    parent: Troubleshooting issues
    weight: 3
---
_PLACEHOLDER_

View File

@ -0,0 +1,12 @@
---
title: Troubleshooting IOPS
description: placeholder
draft: true
menu:
  platform:
    name: IOPS
    parent: Troubleshooting issues
    weight: 5
---
_PLACEHOLDER_

View File

@ -0,0 +1,12 @@
---
title: Troubleshooting with log analysis
description: placeholder
draft: true
menu:
  platform:
    name: Log analysis
    parent: Troubleshooting issues
    weight: 6
---
_PLACEHOLDER_

View File

@ -0,0 +1,133 @@
---
title: Troubleshooting out-of-memory loops
description: How to identify and troubleshoot out-of-memory (OOM) loops when using InfluxData's TICK stack.
menu:
  platform:
    name: Out-of-memory loops
    parent: Troubleshooting issues
    weight: 1
---
Out-of-memory (OOM) loops occur when a running process consumes an increasing amount
of memory until the operating system is forced to kill and restart the process.
When the process is killed, memory allocated to the process is released, but after
restarting, it continues to use more and more RAM until the cycle repeats.
In a [monitoring dashboard](/platform/monitoring/monitoring-dashboards), an OOM loop
will appear in the **Memory Usage %** metric and look similar to the following:
![OOM Loop](/img/platform/troubleshooting-oom-loop.png)
## Potential causes
The causes of OOM loops can vary widely and depend on your specific use case of
the TICK stack, but the following is the most common:
### Unoptimized queries
What you query and how you query it can drastically affect InfluxDB's memory usage and performance.
An OOM loop can occur when a query that exhausts memory is issued repeatedly,
for example, by a dashboard cell set to refresh every 30 seconds.
#### Selecting a measurement without specifying a time range
When selecting from a measurement without specifying a time range, InfluxDB attempts
to pull data points from the beginning of UNIX epoch time (00:00:00 UTC on 1 January 1970),
storing the returned data in memory until it's ready for output.
The operating system will eventually kill the process due to high memory usage.
###### Example of selecting a measurement without a time range
```sql
SELECT * FROM "telegraf"."autogen"."cpu"
```
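By contrast, bounding the query by time limits how much data InfluxDB must read and hold in memory, for example:

```sql
SELECT * FROM "telegraf"."autogen"."cpu" WHERE time > now() - 1h
```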
## Solutions
### Identify and update unoptimized queries
The most common cause of OOM loops in InfluxDB is unoptimized queries, but it can
be challenging to identify which queries could be better optimized.
InfluxQL includes tools to help identify the "cost" of queries and gain insight
into which queries have room for optimization.
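One quick way to spot an expensive query while it is running is the `SHOW QUERIES` statement, which lists currently running queries along with their IDs and durations; a runaway query can then be stopped with `KILL QUERY`. The query ID below is only an example:

```sql
-- List currently running queries, their IDs, and how long they have been running
SHOW QUERIES

-- Stop a specific long-running query by its ID
KILL QUERY 36
```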
#### View your InfluxDB logs
If a query is killed, it is logged by InfluxDB.
View your [InfluxDB logs](/influxdb/latest/administration/logs/) for hints as to what queries are being killed.
#### Estimate query cost
InfluxQL's [`EXPLAIN` statement](/influxdb/latest/query_language/spec#explain)
parses and plans a query, then outputs a summary of estimated costs.
This allows you to estimate how resource-intensive a query may be before having to
run the actual query.
###### Example EXPLAIN statement
```
> EXPLAIN SELECT * FROM "telegraf"."autogen"."cpu"
QUERY PLAN
----------
EXPRESSION: <nil>
AUXILIARY FIELDS: cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float
NUMBER OF SHARDS: 12
NUMBER OF SERIES: 108
CACHED VALUES: 38250
NUMBER OF FILES: 1080
NUMBER OF BLOCKS: 10440
SIZE OF BLOCKS: 23252999
```
> `EXPLAIN` will only output what iterators are created by the query engine.
> It does not capture any other information within the query engine such as how many points will actually be processed.
#### Analyze actual query cost
InfluxQL's [`EXPLAIN ANALYZE` statement](/influxdb/latest/query_language/spec/#explain-analyze)
actually executes a query and counts the costs during runtime.
###### Example EXPLAIN ANALYZE statement
```
> EXPLAIN ANALYZE SELECT * FROM "telegraf"."autogen"."cpu" WHERE time > now() - 1d
EXPLAIN ANALYZE
---------------
.
└── select
    ├── execution_time: 104.608549ms
    ├── planning_time: 5.08487ms
    ├── total_time: 109.693419ms
    └── build_cursor
        ├── labels
        │   └── statement: SELECT cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float FROM telegraf.autogen.cpu
        └── iterator_scanner
            ├── labels
            │   └── auxiliary_fields: cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float
            └── create_iterator
                ├── labels
                │   ├── measurement: cpu
                │   └── shard_id: 317
                ├── cursors_ref: 0
                ├── cursors_aux: 90
                ├── cursors_cond: 0
                ├── float_blocks_decoded: 450
                ├── float_blocks_size_bytes: 960943
                ├── integer_blocks_decoded: 0
                ├── integer_blocks_size_bytes: 0
                ├── unsigned_blocks_decoded: 0
                ├── unsigned_blocks_size_bytes: 0
                ├── string_blocks_decoded: 0
                ├── string_blocks_size_bytes: 0
                ├── boolean_blocks_decoded: 0
                ├── boolean_blocks_size_bytes: 0
                └── planning_time: 4.523978ms
```
### Scale available memory
If possible, increase the amount of memory available to InfluxDB.
This is easier if running in a virtualized or cloud environment where resources can be scaled on the fly.
In environments with a fixed set of resources, this can be a very difficult challenge to overcome.

View File

@ -0,0 +1,12 @@
---
title: Troubleshooting the volume of reads and writes
description: placeholder
draft: true
menu:
  platform:
    name: Volume of reads and writes
    parent: Troubleshooting issues
    weight: 7
---
_PLACEHOLDER_

View File

@ -0,0 +1,12 @@
---
title: Troubleshooting runaway series cardinality
description: placeholder
draft: true
menu:
  platform:
    name: Runaway series cardinality
    parent: Troubleshooting issues
    weight: 2
---
_PLACEHOLDER_

Binary file not shown.

After

Width:  |  Height:  |  Size: 17 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 26 KiB