roll over Platform troubleshooting docs PR
parent 5511693391
commit 9e2401dc0f
@@ -0,0 +1,34 @@
---
title: Troubleshooting issues using InfluxData Platform monitoring
description: placeholder
menu:
  platform:
    name: Troubleshooting issues
    weight: 25
---

With a [monitored TICK stack](/platform/monitoring), identifying, diagnosing, and resolving problems is much easier.
This section walks through recognizing and resolving common issues that appear in the recommended monitoring metrics.

## [Out-of-memory loops](/platform/troubleshooting/oom-loops)

How to identify and resolve out-of-memory (OOM) loops in your TICK stack.

<!-- ## [Runaway series cardinality](/platform/troubleshooting/runaway-series-cardinality)
How to identify and resolve runaway series cardinality in your TICK stack.

## [Hinted Handoff Queue buildup](/platform/troubleshooting/hhq-buildup)
How to identify and resolve hinted handoff queue (HHQ) buildup in InfluxDB Enterprise. -->

## [Disk usage](/platform/troubleshooting/disk-usage)

How to identify and resolve high disk usage in your TICK stack.

<!-- ## [IOPS](/platform/troubleshooting/iops)
How to identify and resolve high input/output operations per second (IOPS) in your TICK stack.

## [Log analysis](/platform/troubleshooting/log-analysis)
How to identify and resolve issues by analyzing log output in your TICK stack.

## [Volume of reads/writes](/platform/troubleshooting/read-write-volume)
How to identify and resolve issues with read and write volumes in your TICK stack. -->
@@ -0,0 +1,112 @@
---
title: Troubleshooting disk usage
description: How to identify and troubleshoot high disk usage when using InfluxData's TICK stack.
menu:
  platform:
    name: Disk usage
    parent: Troubleshooting issues
    weight: 4
---

It's important that components of your TICK stack do not run out of disk space.
A machine at 100% disk usage will not function properly.

In a [monitoring dashboard](/platform/monitoring/monitoring-dashboards), high disk usage
appears in the **Disk Utilization %** metric and looks similar to the following:

![High disk usage](/img/platform/troubleshooting-disk-usage.png)

## Potential causes

### Old data not being downsampled

InfluxDB uses retention policies and continuous queries to downsample older data and preserve disk space.
If you use an infinite retention policy, or one with a lengthy duration, high-resolution
data consumes more and more disk space.

### Log data not being dropped

Log data is incredibly useful in your monitoring solution, but it can also require
more disk space than other types of time series data.
Log data is often stored in an infinite retention policy (the default retention
policy duration), meaning it is never dropped.
This inevitably leads to high disk utilization.

## Solutions

### Remove unnecessary data

The simplest solution to high disk utilization is removing old or unnecessary data.
You can do this by brute force (deleting or dropping data) or, more gracefully,
by tuning the duration of your retention policies and adjusting the downsampling
rates in your continuous queries.
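
The following InfluxQL statements sketch both approaches.
The database, measurement, and retention policy names are hypothetical, so adjust them to match your schema, and remember that `DELETE` and `DROP` remove data permanently:

```sql
-- Brute force: permanently remove old or unneeded data.
DELETE FROM "cpu" WHERE time < '2019-01-01'
DROP MEASUREMENT "unused_measurement"

-- More graceful: keep high-resolution data for a shorter time...
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 30d

-- ...and downsample it into a longer-lived retention policy.
CREATE RETENTION POLICY "one_year" ON "telegraf" DURATION 52w REPLICATION 1
CREATE CONTINUOUS QUERY "cq_cpu_30m" ON "telegraf"
BEGIN
  SELECT mean("usage_idle") AS "usage_idle"
  INTO "telegraf"."one_year"."downsampled_cpu"
  FROM "cpu"
  GROUP BY time(30m), *
END
```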

#### Log data retention policies

Store log data only in a finite
[retention policy](/influxdb/latest/query_language/database_management/#retention-policy-management).
The duration of the retention policy is determined by how long you need to keep
the log data around.

Whether or not you use a [continuous query](/influxdb/latest/query_language/continuous_queries/)
to downsample log data at the end of its retention period is up to you, but old log
data should either be downsampled or dropped altogether.
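
As a minimal sketch, assuming a hypothetical `logs` database, a finite default retention policy might look like the following; the policy name and duration are illustrative only:

```sql
-- Keep raw log data for two weeks, then let InfluxDB expire it automatically.
CREATE RETENTION POLICY "two_weeks" ON "logs" DURATION 2w REPLICATION 1 DEFAULT
```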

### Scale your machine's disk capacity

If removing or downsampling data isn't an option, you can always scale your machine's
disk capacity. How this is done depends on your hardware or virtualization configuration
and is not covered in this documentation.

## Recommendations

### Set up a disk usage alert

To preempt disk utilization issues, create a task that alerts you if disk usage
crosses certain thresholds. The example TICKscript [below](#example-tickscript-alert-for-disk-usage)
sets warning and critical disk usage thresholds and sends a message to Slack
whenever those thresholds are crossed.

_For information about Kapacitor tasks and alerts, see the [Kapacitor alerts](/kapacitor/latest/working/alerts/) documentation._

#### Example TICKscript alert for disk usage
```
// Disk usage alerts
// Alert when disks are this % full
var warn_threshold = 80
var crit_threshold = 90

// Use a larger period here, as the telegraf data can be a little late
// if the server is under load.
var period = 10m

// How often to query for the period.
var every = 20m

var data = batch
  |query('''
    SELECT last(used_percent) FROM "telegraf"."default".disk
    WHERE ("path" = '/influxdb/conf' or "path" = '/')
  ''')
    .period(period)
    .every(every)
    .groupBy('host', 'path')

data
  |alert()
    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
    .warn(lambda: "last" > warn_threshold)
    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
    .details('')
    .stateChangesOnly()
    .slack()

data
  |alert()
    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
    .crit(lambda: "last" > crit_threshold)
    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
    .details('')
    .slack()
```
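
To run this alert, you could save the script to a file and register it as a Kapacitor batch task.
The file name and task name below are hypothetical; the database and retention policy match the query above:

```
kapacitor define disk_usage_alert -type batch -tick disk_usage_alert.tick -dbrp telegraf.default
kapacitor enable disk_usage_alert
```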
@@ -0,0 +1,12 @@
---
title: Troubleshooting Hinted Handoff Queue buildup
description: placeholder
draft: true
menu:
  platform:
    name: Hinted Handoff Queue buildup
    parent: Troubleshooting issues
    weight: 3
---

_PLACEHOLDER_
@@ -0,0 +1,12 @@
---
title: Troubleshooting IOPS
description: placeholder
draft: true
menu:
  platform:
    name: IOPS
    parent: Troubleshooting issues
    weight: 5
---

_PLACEHOLDER_
@@ -0,0 +1,12 @@
---
title: Troubleshooting with log analysis
description: placeholder
draft: true
menu:
  platform:
    name: Log analysis
    parent: Troubleshooting issues
    weight: 6
---

_PLACEHOLDER_
@@ -0,0 +1,133 @@
---
title: Troubleshooting out-of-memory loops
description: How to identify and troubleshoot out-of-memory (OOM) loops when using InfluxData's TICK stack.
menu:
  platform:
    name: Out-of-memory loops
    parent: Troubleshooting issues
    weight: 1
---

Out-of-memory (OOM) loops occur when a running process consumes an increasing amount
of memory until the operating system is forced to kill and restart the process.
When the process is killed, the memory allocated to it is released, but after
restarting, the process again uses more and more RAM until the cycle repeats.

In a [monitoring dashboard](/platform/monitoring/monitoring-dashboards), an OOM loop
will appear in the **Memory Usage %** metric and look similar to the following:

![OOM Loop](/img/platform/troubleshooting-oom-loop.png)

## Potential causes

The causes of OOM loops can vary widely and depend on your specific use of
the TICK stack, but the following is the most common:

### Unoptimized queries

What is queried and how it's queried can drastically affect the memory usage and performance of InfluxDB.
An OOM loop occurs when a query that exhausts memory is issued repeatedly,
for example, by a dashboard cell that refreshes every 30 seconds.

#### Selecting a measurement without specifying a time range

When selecting from a measurement without specifying a time range, InfluxDB attempts
to pull data points from the beginning of UNIX epoch time (00:00:00 UTC on 1 January 1970),
storing the returned data in memory until it's ready for output.
The operating system will eventually kill the process due to high memory usage.

###### Example of selecting a measurement without a time range

```sql
SELECT * FROM "telegraf"."autogen"."cpu"
```
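
By contrast, bounding the same query to a recent time range (and, where possible, to specific fields) limits how much data InfluxDB must hold in memory.
The one-hour window and field names below are only an illustration; use values that fit your schema and dashboards:

```sql
SELECT "usage_idle", "usage_system"
FROM "telegraf"."autogen"."cpu"
WHERE time > now() - 1h
```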

## Solutions

### Identify and update unoptimized queries

The most common cause of OOM loops in InfluxDB is unoptimized queries, but it can
be challenging to identify which queries could be better optimized.
InfluxQL includes tools that help estimate the "cost" of queries and provide insight
into which queries have room for optimization.

#### View your InfluxDB logs

If a query is killed, InfluxDB logs it.
View your [InfluxDB logs](/influxdb/latest/administration/logs/) for hints about which queries are being killed.
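
As a minimal sketch, assuming a systemd-based Linux install where InfluxDB logs to the journal (the service name, log destination, and exact log messages vary by version and platform), you might filter recent log output for query-related entries:

```bash
# Show the last hour of InfluxDB log output and keep only query-related lines.
journalctl -u influxdb.service --since "1 hour ago" | grep -i "query"
```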

#### Estimate query cost

InfluxQL's [`EXPLAIN` statement](/influxdb/latest/query_language/spec#explain)
parses and plans a query, then outputs a summary of estimated costs.
This allows you to estimate how resource-intensive a query may be before
running the actual query.

###### Example EXPLAIN statement

```
> EXPLAIN SELECT * FROM "telegraf"."autogen"."cpu"

QUERY PLAN
----------
EXPRESSION: <nil>
AUXILIARY FIELDS: cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float
NUMBER OF SHARDS: 12
NUMBER OF SERIES: 108
CACHED VALUES: 38250
NUMBER OF FILES: 1080
NUMBER OF BLOCKS: 10440
SIZE OF BLOCKS: 23252999
```

> `EXPLAIN` only outputs the iterators created by the query engine.
> It does not capture any other information within the query engine, such as how many points will actually be processed.

#### Analyze actual query cost

InfluxQL's [`EXPLAIN ANALYZE` statement](/influxdb/latest/query_language/spec/#explain-analyze)
actually executes a query and counts the costs during runtime.

###### Example EXPLAIN ANALYZE statement

```
> EXPLAIN ANALYZE SELECT * FROM "telegraf"."autogen"."cpu" WHERE time > now() - 1d

EXPLAIN ANALYZE
---------------
.
└── select
    ├── execution_time: 104.608549ms
    ├── planning_time: 5.08487ms
    ├── total_time: 109.693419ms
    └── build_cursor
        ├── labels
        │   └── statement: SELECT cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float FROM telegraf.autogen.cpu
        └── iterator_scanner
            ├── labels
            │   └── auxiliary_fields: cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float
            └── create_iterator
                ├── labels
                │   ├── measurement: cpu
                │   └── shard_id: 317
                ├── cursors_ref: 0
                ├── cursors_aux: 90
                ├── cursors_cond: 0
                ├── float_blocks_decoded: 450
                ├── float_blocks_size_bytes: 960943
                ├── integer_blocks_decoded: 0
                ├── integer_blocks_size_bytes: 0
                ├── unsigned_blocks_decoded: 0
                ├── unsigned_blocks_size_bytes: 0
                ├── string_blocks_decoded: 0
                ├── string_blocks_size_bytes: 0
                ├── boolean_blocks_decoded: 0
                ├── boolean_blocks_size_bytes: 0
                └── planning_time: 4.523978ms
```

### Scale available memory

If possible, increase the amount of memory available to InfluxDB.
This is easier if running in a virtualized or cloud environment where resources can be scaled on the fly.
In environments with a fixed set of resources, this can be a very difficult challenge to overcome.
@@ -0,0 +1,12 @@
---
title: Troubleshooting the volume of reads and writes
description: placeholder
draft: true
menu:
  platform:
    name: Volume of reads and writes
    parent: Troubleshooting issues
    weight: 7
---

_PLACEHOLDER_
@@ -0,0 +1,12 @@
---
title: Troubleshooting runaway series cardinality
description: placeholder
draft: true
menu:
  platform:
    name: Runaway series cardinality
    parent: Troubleshooting issues
    weight: 2
---

_PLACEHOLDER_
Binary image files added (not shown): 17 KiB and 26 KiB.