diff --git a/content/platform/troubleshooting/_index.md b/content/platform/troubleshooting/_index.md
new file mode 100644
index 000000000..681114d44
--- /dev/null
+++ b/content/platform/troubleshooting/_index.md
@@ -0,0 +1,34 @@
+---
+title: Troubleshooting issues using InfluxData Platform monitoring
+description: Identify, diagnose, and resolve common issues in your InfluxData TICK stack.
+menu:
+  platform:
+    name: Troubleshooting issues
+    weight: 25
+---
+
+With a [monitored TICK stack](/platform/monitoring), identifying, diagnosing, and resolving problems is much easier.
+This section walks through recognizing and resolving issues that commonly surface in the recommended monitoring metrics.
+
+## [Out-of-memory loops](/platform/troubleshooting/oom-loops)
+
+How to identify and resolve out-of-memory (OOM) loops in your TICK stack.
+
+## [Disk usage](/platform/troubleshooting/disk-usage)
+
+How to identify and resolve high disk usage in your TICK stack.
+
diff --git a/content/platform/troubleshooting/disk-usage.md b/content/platform/troubleshooting/disk-usage.md
new file mode 100644
index 000000000..680fe3083
--- /dev/null
+++ b/content/platform/troubleshooting/disk-usage.md
@@ -0,0 +1,112 @@
+---
+title: Troubleshooting disk usage
+description: How to identify and troubleshoot high disk usage when using InfluxData's TICK stack.
+menu:
+  platform:
+    name: Disk usage
+    parent: Troubleshooting issues
+    weight: 4
+---
+
+It's very important that components of your TICK stack do not run out of disk space.
+A machine at 100% disk usage will not function properly.
+
+In a [monitoring dashboard](/platform/monitoring/monitoring-dashboards), high disk usage
+appears in the **Disk Utilization %** metric and looks similar to the following:
+
+![High disk usage](/img/platform/troubleshooting-disk-usage.png)
+
+## Potential causes
+
+### Old data not being downsampled
+
+InfluxDB uses retention policies and continuous queries to downsample older data and preserve disk space.
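+
+As a sketch of that pattern in InfluxQL (the database, measurement, and policy names below are assumptions; adjust them to your schema):
+
+```sql
+-- Keep raw data for one week, and make this the default policy.
+CREATE RETENTION POLICY "one_week" ON "telegraf" DURATION 7d REPLICATION 1 DEFAULT
+
+-- Keep downsampled data for a year.
+CREATE RETENTION POLICY "one_year" ON "telegraf" DURATION 52w REPLICATION 1
+
+-- Downsample raw CPU data into 5-minute means before the raw data expires.
+CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "telegraf"
+BEGIN
+  SELECT mean("usage_idle") AS "usage_idle"
+  INTO "telegraf"."one_year"."cpu_5m"
+  FROM "cpu"
+  GROUP BY time(5m), *
+END
+```
+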
+If using an infinite retention policy or one with a lengthy duration, high-resolution
+data will use more and more disk space.
+
+### Log data not being dropped
+
+Log data is incredibly useful in your monitoring solution, but it can also require
+more disk space than other types of time series data.
+Log data is often stored in an infinite retention policy (the default retention
+policy duration), meaning it is never dropped.
+This inevitably leads to high disk utilization.
+
+## Solutions
+
+### Remove unnecessary data
+
+The simplest solution to high disk utilization is removing old or unnecessary data.
+This can be done by brute force (deleting or dropping data) or more gracefully
+by tuning the duration of your retention policies and adjusting the downsampling
+rates in your continuous queries.
+
+#### Log data retention policies
+
+Log data should only be stored in a finite
+[retention policy](/influxdb/latest/query_language/database_management/#retention-policy-management).
+The duration of your retention policy is determined by how long you want to keep
+log data.
+
+Whether or not you use a [continuous query](/influxdb/latest/query_language/continuous_queries/)
+to downsample log data at the end of its retention period is up to you, but old log
+data should either be downsampled or dropped altogether.
+
+### Scale your machine's disk capacity
+
+If removing or downsampling data isn't an option, you can always scale your machine's
+disk capacity. How this is done depends on your hardware or virtualization configuration
+and is not covered in this documentation.
+
+## Recommendations
+
+### Set up a disk usage alert
+
+To preempt disk utilization issues, create a task that alerts you if disk usage
+crosses certain thresholds. The example TICKscript [below](#example-tickscript-alert-for-disk-usage)
+sets warning and critical disk usage thresholds and sends a message to Slack
+whenever those thresholds are crossed.
+
+_For information about Kapacitor tasks and alerts, see the [Kapacitor alerts](/kapacitor/latest/working/alerts/) documentation._
+
+#### Example TICKscript alert for disk usage
+```
+// Disk usage alerts
+// Alert when disks are this % full
+var warn_threshold = 80
+var crit_threshold = 90
+
+// Use a larger period here, as the telegraf data can be a little late
+// if the server is under load.
+var period = 10m
+
+// How often to run the query.
+var every = 20m
+
+var data = batch
+  |query('''
+    SELECT last(used_percent) FROM "telegraf"."default".disk
+    WHERE ("path" = '/influxdb/conf' or "path" = '/')
+  ''')
+    .period(period)
+    .every(every)
+    .groupBy('host', 'path')
+
+data
+  |alert()
+    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
+    .warn(lambda: "last" > warn_threshold)
+    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
+    .details('')
+    .stateChangesOnly()
+    .slack()
+
+data
+  |alert()
+    .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
+    .crit(lambda: "last" > crit_threshold)
+    .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
+    .details('')
+    .slack()
+```
diff --git a/content/platform/troubleshooting/hhq-buildup.md b/content/platform/troubleshooting/hhq-buildup.md
new file mode 100644
index 000000000..fa81e0992
--- /dev/null
+++ b/content/platform/troubleshooting/hhq-buildup.md
@@ -0,0 +1,12 @@
+---
+title: Troubleshooting Hinted Handoff Queue buildup
+description: placeholder
+draft: true
+menu:
+  platform:
+    name: Hinted Handoff Queue buildup
+    parent: Troubleshooting issues
+    weight: 3
+---
+
+_PLACEHOLDER_
diff --git a/content/platform/troubleshooting/iops.md b/content/platform/troubleshooting/iops.md
new file mode 100644
index 000000000..4708bdf7f
--- /dev/null
+++ b/content/platform/troubleshooting/iops.md
@@ -0,0 +1,12 @@
+---
+title: Troubleshooting IOPS
+description: placeholder
+draft: true
+menu:
+  platform:
+    name: IOPS
+    parent: Troubleshooting issues
+    weight: 5
+---
+
+_PLACEHOLDER_
diff --git a/content/platform/troubleshooting/log-analysis.md b/content/platform/troubleshooting/log-analysis.md
new file mode 100644
index 000000000..61c9ec056
--- /dev/null
+++ b/content/platform/troubleshooting/log-analysis.md
@@ -0,0 +1,12 @@
+---
+title: Troubleshooting with log analysis
+description: placeholder
+draft: true
+menu:
+  platform:
+    name: Log analysis
+    parent: Troubleshooting issues
+    weight: 6
+---
+
+_PLACEHOLDER_
diff --git a/content/platform/troubleshooting/oom-loops.md b/content/platform/troubleshooting/oom-loops.md
new file mode 100644
index 000000000..08f44f34e
--- /dev/null
+++ b/content/platform/troubleshooting/oom-loops.md
@@ -0,0 +1,133 @@
+---
+title: Troubleshooting out-of-memory loops
+description: How to identify and troubleshoot out-of-memory (OOM) loops when using InfluxData's TICK stack.
+menu:
+  platform:
+    name: Out-of-memory loops
+    parent: Troubleshooting issues
+    weight: 1
+---
+
+Out-of-memory (OOM) loops occur when a running process consumes an increasing amount
+of memory until the operating system is forced to kill and restart the process.
+When the process is killed, the memory allocated to it is released, but after
+restarting, it again uses more and more RAM until the cycle repeats.
+
+In a [monitoring dashboard](/platform/monitoring/monitoring-dashboards), an OOM loop
+appears in the **Memory Usage %** metric and looks similar to the following:
+
+![OOM Loop](/img/platform/troubleshooting-oom-loop.png)
+
+## Potential causes
+
+The causes of OOM loops can vary widely and depend on your specific use of
+the TICK stack, but the following is the most common:
+
+### Unoptimized queries
+
+What is queried and how it's queried can drastically affect the memory usage and performance of InfluxDB.
+An OOM loop typically occurs when a query that exhausts available memory is issued repeatedly.
+For example, a dashboard cell might be configured to rerun such a query every 30 seconds.
+
+#### Selecting a measurement without specifying a time range
+
+When selecting from a measurement without specifying a time range, InfluxDB attempts
+to pull data points from the beginning of UNIX epoch time (00:00:00 UTC on 1 January 1970),
+storing the returned data in memory until it's ready for output.
+The operating system will eventually kill the process due to high memory usage.
+
+###### Example of selecting a measurement without a time range
+
+```sql
+SELECT * FROM "telegraf"."autogen"."cpu"
+```
+
+## Solutions
+
+### Identify and update unoptimized queries
+
+The most common cause of OOM loops in InfluxDB is unoptimized queries, but it can
+be challenging to identify which queries could be better optimized.
+InfluxQL includes tools to help estimate the "cost" of queries and gain insight
+into which queries have room for optimization.
+
+#### View your InfluxDB logs
+
+When a query is killed, it is logged by InfluxDB.
+View your [InfluxDB logs](/influxdb/latest/administration/logs/) for hints as to which queries are being killed.
+
+#### Estimate query cost
+
+InfluxQL's [`EXPLAIN` statement](/influxdb/latest/query_language/spec#explain)
+parses and plans a query, then outputs a summary of estimated costs.
+This lets you estimate how resource-intensive a query may be before you
+run the actual query.
+
+###### Example EXPLAIN statement
+
+```
+> EXPLAIN SELECT * FROM "telegraf"."autogen"."cpu"
+
+QUERY PLAN
+----------
+EXPRESSION:
+AUXILIARY FIELDS: cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float
+NUMBER OF SHARDS: 12
+NUMBER OF SERIES: 108
+CACHED VALUES: 38250
+NUMBER OF FILES: 1080
+NUMBER OF BLOCKS: 10440
+SIZE OF BLOCKS: 23252999
+```
+
+> `EXPLAIN` only outputs the iterators created by the query engine.
+> It does not capture other information within the query engine, such as how many points will actually be processed.
+
+#### Analyze actual query cost
+
+InfluxQL's [`EXPLAIN ANALYZE` statement](/influxdb/latest/query_language/spec/#explain-analyze)
+actually executes a query and counts the costs during runtime.
+
+###### Example EXPLAIN ANALYZE statement
+
+```
+> EXPLAIN ANALYZE SELECT * FROM "telegraf"."autogen"."cpu" WHERE time > now() - 1d
+
+EXPLAIN ANALYZE
+---------------
+.
+└── select
+    ├── execution_time: 104.608549ms
+    ├── planning_time: 5.08487ms
+    ├── total_time: 109.693419ms
+    └── build_cursor
+        ├── labels
+        │   └── statement: SELECT cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float FROM telegraf.autogen.cpu
+        └── iterator_scanner
+            ├── labels
+            │   └── auxiliary_fields: cpu::tag, host::tag, usage_guest::float, usage_guest_nice::float, usage_idle::float, usage_iowait::float, usage_irq::float, usage_nice::float, usage_softirq::float, usage_steal::float, usage_system::float, usage_user::float
+            └── create_iterator
+                ├── labels
+                │   ├── measurement: cpu
+                │   └── shard_id: 317
+                ├── cursors_ref: 0
+                ├── cursors_aux: 90
+                ├── cursors_cond: 0
+                ├── float_blocks_decoded: 450
+                ├── float_blocks_size_bytes: 960943
+                ├── integer_blocks_decoded: 0
+                ├── integer_blocks_size_bytes: 0
+                ├── unsigned_blocks_decoded: 0
+                ├── unsigned_blocks_size_bytes: 0
+                ├── string_blocks_decoded: 0
+                ├── string_blocks_size_bytes: 0
+                ├── boolean_blocks_decoded: 0
+                ├── boolean_blocks_size_bytes: 0
+                └── planning_time: 4.523978ms
+```
+
+### Scale available memory
+
+If possible, increase the amount of memory available to InfluxDB.
+This is easier when running in a virtualized or cloud environment where resources can be scaled on the fly.
+In environments with a fixed set of resources, this can be a difficult challenge to overcome.
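+
+If adding memory isn't feasible, you can also cap how expensive any single query is allowed to be using the `[coordinator]` settings in `influxdb.conf`. The fragment below is a sketch; the specific values are assumptions to tune for your workload:
+
+```toml
+[coordinator]
+  # Abort queries that run longer than this.
+  query-timeout = "30s"
+  # Limit how many queries may run concurrently.
+  max-concurrent-queries = 20
+  # Abort queries that would process more than this many points.
+  max-select-point = 50000000
+  # Limit how many series a single SELECT can read.
+  max-select-series = 1000000
+```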
diff --git a/content/platform/troubleshooting/read-write-volume.md b/content/platform/troubleshooting/read-write-volume.md
new file mode 100644
index 000000000..29d6e133a
--- /dev/null
+++ b/content/platform/troubleshooting/read-write-volume.md
@@ -0,0 +1,12 @@
+---
+title: Troubleshooting the volume of reads and writes
+description: placeholder
+draft: true
+menu:
+  platform:
+    name: Volume of reads and writes
+    parent: Troubleshooting issues
+    weight: 7
+---
+
+_PLACEHOLDER_
diff --git a/content/platform/troubleshooting/runaway-series-cardinality.md b/content/platform/troubleshooting/runaway-series-cardinality.md
new file mode 100644
index 000000000..8277f43a4
--- /dev/null
+++ b/content/platform/troubleshooting/runaway-series-cardinality.md
@@ -0,0 +1,12 @@
+---
+title: Troubleshooting runaway series cardinality
+description: placeholder
+draft: true
+menu:
+  platform:
+    name: Runaway series cardinality
+    parent: Troubleshooting issues
+    weight: 2
+---
+
+_PLACEHOLDER_
diff --git a/static/img/platform/troubleshooting-disk-usage.png b/static/img/platform/troubleshooting-disk-usage.png
new file mode 100644
index 000000000..6dde0a528
Binary files /dev/null and b/static/img/platform/troubleshooting-disk-usage.png differ
diff --git a/static/img/platform/troubleshooting-oom-loop.png b/static/img/platform/troubleshooting-oom-loop.png
new file mode 100644
index 000000000..693bac546
Binary files /dev/null and b/static/img/platform/troubleshooting-oom-loop.png differ