fix(clustered): Optimize and troubleshoot queries (Closes #5143 and https://github.com/influxdata/DAR/issues/453)

2024-12-05 15:07:49 -06:00 · a20f203361
parent 1f64204df4
commit a20f203361
4 changed files with 133 additions and 100 deletions
--- a/content/influxdb/clustered/install/optimize-cluster/optimize-querying.md
+++ b/content/influxdb/clustered/install/optimize-cluster/optimize-querying.md
@@ -85,10 +85,9 @@ is in storage. For more information, see

 ## Report query performance issues

-If you have a query that isn't meeting your performance requirements despite
-implementing query optimizations, please following the process described in
-[Report query performance issues](/influxdb/clustered/query-data/troubleshoot-and-optimize/report-query-performance-issues/)
-to gather information for InfluxData engineers so they can help identify any
-potential solutions.
+If you've followed steps to [optimize and
+troubleshoot a query](/influxdb/clustered/query-data/troubleshoot-and-optimize/optimize-queries/),
+but it still doesn't meet performance requirements,
+see how to [report query performance issues](/influxdb/clustered/query-data/troubleshoot-and-optimize/report-query-performance-issues/).

 {{< page-nav prev="/influxdb/clustered/install/optimize-cluster/simulate-load/" prevText="Simulate load" next="/influxdb/clustered/install/secure-cluster/" nextText="Phase 4: Secure your cluster" >}}
--- a/content/influxdb/clustered/query-data/troubleshoot-and-optimize/optimize-queries.md
+++ b/content/influxdb/clustered/query-data/troubleshoot-and-optimize/optimize-queries.md
@@ -25,38 +25,26 @@ Learn how to use observability tools to analyze query execution and view metrics
 - [Why is my query slow?](#why-is-my-query-slow)
 - [Strategies for improving query performance](#strategies-for-improving-query-performance)
  - [Query only the data you need](#query-only-the-data-you-need)
- [Analyze and troubleshoot queries](#analyze-and-troubleshoot-queries)
+- [Recognize and address bottlenecks](#recognize-and-address-bottlenecks)
+

 ## Why is my query slow?

-Query performance depends on time range and complexity.
-If a query is slower than you expect, it might be due to the following reasons:
+Query performance depends on factors like the time range and query complexity.
+If a query is slower than expected, consider the following potential causes:

- It queries data from a large time range.
- It includes intensive operations, such as querying many string values or `ORDER BY` sorting or re-sorting large amounts of data.
+- The query spans a large time range, which increases the amount of data being processed.
+- The query performs intensive operations, such as:
+  - Sorting or re-sorting large datasets with `ORDER BY`.
+  - Querying many string values, which can be computationally expensive.

 ## Strategies for improving query performance

-The following design strategies generally improve query performance and resource use:
+The following design strategies generally improve query performance and resource usage:

- Follow [schema design best practices](/influxdb/clustered/write-data/best-practices/schema-design/) to make querying easier and more performant.
- [Query only the data you need](#query-only-the-data-you-need).
- [Downsample data](/influxdb/clustered/process-data/downsample/) to reduce the amount of data you need to query.
-
-Some bottlenecks may be out of your control and are the result of a suboptimal execution plan, such as:
-
- Applying the same sort (`ORDER BY`) to already sorted data.
- Retrieving many Parquet files from the Object store--the same query performs better if it retrieves fewer - though, larger - files.
- Querying many overlapped Parquet files.
- Performing a large number of table scans.
-
-{{% note %}}
-#### Analyze query plans to view metrics and recognize bottlenecks
-
-To view runtime metrics for a query, such as the number of files scanned, use
-the [`EXPLAIN ANALYZE` keywords](/influxdb/clustered/reference/sql/explain/#explain-analyze)
-and learn how to [analyze a query plan](/influxdb/clustered/query-data/troubleshoot-and-optimize/analyze-query-plan/).
-{{% /note %}}
+- Follow [schema design best practices](/influxdb/clustered/write-data/best-practices/schema-design/) to simplify and improve queries.
+- [Query only the data you need](#query-only-the-data-you-need) to reduce unnecessary processing.
+- [Downsample data](/influxdb/clustered/process-data/downsample/) to decrease the volume of data queried.

 ### Query only the data you need

@@ -88,10 +76,30 @@ two queries is minimal.
 In a table with over 1000 columns, the `SELECT *` query is slower and
 less efficient.
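
Review note: the "query only the data you need" guidance could be illustrated with a minimal sketch like the following, which builds a SQL statement that names specific columns and bounds the time range instead of using `SELECT *`. The `build_query` and `quote_ident` helpers and the `home`, `temp`, and `hum` names are hypothetical; they aren't part of the docs or of any InfluxDB client library.

```python
def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_query(table: str, columns: list[str], lookback: str) -> str:
    """Return a SQL query that selects only `columns` from a recent time window.

    `lookback` is an interval literal such as "1 hour". It is inserted as-is,
    so pass only trusted values.
    """
    if not columns:
        raise ValueError("name at least one column instead of using SELECT *")
    column_list = ", ".join(quote_ident(c) for c in columns)
    return (
        f"SELECT {column_list} FROM {quote_ident(table)} "
        f"WHERE time >= now() - INTERVAL '{lookback}'"
    )


# Hypothetical table and column names, for illustration only.
print(build_query("home", ["time", "temp", "hum"], "1 hour"))
```

Naming columns explicitly and restricting the time range keeps the amount of data the query has to read proportional to what the caller actually uses.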

-## Analyze and troubleshoot queries
+## Recognize and address bottlenecks

-Learn how to [analyze a query plan](/influxdb/clustered/query-data/troubleshoot-and-optimize/analyze-query-plan/)
-to troubleshoot queries and find performance bottlenecks.
+To identify performance bottlenecks, learn how to [analyze a query plan](/influxdb/clustered/query-data/troubleshoot-and-optimize/analyze-query-plan/).
+Query plans provide runtime metrics, such as the number of files scanned, that may reveal inefficiencies in query execution.

-If you need help troubleshooting, follow the guidelines to
-[report query performance issues](/influxdb/clustered/query-data/troubleshoot-and-optimize/report-query-performance-issues/).
+> [!Note]
+>
+> #### Request help to troubleshoot queries
+>
+> Some bottlenecks may result from suboptimal query [execution plans](/influxdb/clustered/reference/internals/query-plan/#physical-plan) and are outside your control--for example:
+>
+> - Sorting (`ORDER BY`) data that is already sorted.
+> - Retrieving numerous small Parquet files from the object store instead of fewer, larger files.
+> - Querying many overlapped Parquet files.
+> - Performing a high number of table scans.
+>
+> If you've followed steps to [optimize](#why-is-my-query-slow) and
+> [troubleshoot a query](/influxdb/clustered/query-data/troubleshoot-and-optimize/troubleshoot/),
+> but it still doesn't meet performance requirements,
+> see how to [report query performance issues](/influxdb/clustered/query-data/troubleshoot-and-optimize/report-query-performance-issues/).
+>
+> #### Query trace logging
+>
+> Currently, customers cannot enable trace logging for {{% product-name omit="Clustered" %}} clusters.
+> InfluxData engineers can use query plans and trace logging to help pinpoint performance bottlenecks in a query.
+>
+> See how to [report query performance issues](/influxdb/clustered/query-data/troubleshoot-and-optimize/report-query-performance-issues/).
--- a/content/influxdb/clustered/query-data/troubleshoot-and-optimize/report-query-performance-issues.md
+++ b/content/influxdb/clustered/query-data/troubleshoot-and-optimize/report-query-performance-issues.md
@@ -1,7 +1,7 @@
 ---
 title: Report query performance issues
 description: >
-  A comprehensive guide on ensuring a quick turnaround when troubleshooting query performance.
+  Follow guidelines to help InfluxData engineers troubleshoot and resolve query performance issues.
 menu:
  influxdb_clustered:
    name: Report query performance issues
@@ -9,13 +9,17 @@ menu:
 weight: 402
 related:
  - /influxdb/clustered/admin/query-system-data/
+  - /influxdb/clustered/query-data/troubleshoot-and-optimize/analyze-query-plan/
 ---

-These guidelines are intended to faciliate collaboration between InfluxData
-engineers and you. They allow engineers to conduct timely analyses of any performance
-issues that you have not been able to resolve following our [guide on
-troubleshooting and optimizing
-queries](/influxdb/clustered/query-data//troubleshoot-and-optimize).
+Use these guidelines to work with InfluxData engineers to troubleshoot and resolve query performance issues.
+
+> [!Note]
+> #### Optimize your query
+>
+> Before reporting a query performance problem,
+> see the [troubleshooting and optimization guide](/influxdb/clustered/query-data/troubleshoot-and-optimize)
+> to learn how to optimize your query and reduce compute and memory requirements.

 1. [Send InfluxData output artifacts](#send-influxdata-output-artifacts)
 2. [Document your test process](#document-your-test-process)
@@ -54,12 +58,13 @@ Send InfluxData engineers all produced artifacts for analysis.

 ### Document your test process

-There currently is no standardized performance test suite that you can run in
-your environment, so please document your process so it can be replicated.
-Include the following:
+Currently, {{% product-name %}} doesn't provide a standardized performance test
+suite that you can run in your cluster.
+Please document your test process so that InfluxData engineers can replicate
+it--include the following:

 - The steps you take when performance testing.
- Timestamps of the tests you perform so they can be correlated with associated logs.
+- Timestamps of your test runs, to correlate tests with logs.

 ### Document your environment

@@ -81,7 +86,7 @@ including the following:
 {{% note %}}
 #### If possible, provide a synthetic dataset

-If you can reproduce the performance issue with a synthetic dataset and your
+If you can reproduce the performance issue with a synthetic dataset, and your
 process and environment are well-documented, InfluxData engineers _may_
 be able to reproduce the issue, shorten the feedback cycle, and resolve the
 issue sooner.
@@ -95,8 +100,8 @@ conditions that reproduce your issue.
 ### Establish query performance degradation conditions

 The most effective way to investigate query performance is to have a good understanding of
-the conditions in which you don't see the expected performance. Things to think about
-and provide:
+the conditions in which you don't see the expected performance.
+Consider the following:

 - Does this always happen, or only sometimes?
  - If only sometimes, is it at a consistent time of day or over a consistent period?
@@ -109,37 +114,25 @@ and provide:

 ### Reduce query noise

-To get a sense of the baseline performance of your system without the
-noise of additional queries, test in an environment that doesn't have periodic
-or intermittent queries running concurrently.
+Test in an environment without periodic or intermittent queries to measure baseline system performance without additional query noise.

-Additionally, when running multiple tests with different queries, let the system
-recover between tests by waiting at least a minute after receiving a query result
-before executing the next query.
+When running multiple tests with different queries, allow the system to recover between tests.
+Wait at least one minute after receiving a query result before executing the next query.

 ### Establish baseline single-query performance

-To get a sense of the baseline performance of your system without the
-noise of additional queries, perform at least some of your testing with
-single queries in isolation from one another.
-
-This is may be useful for the purposes of analysis by InfluxData engineers even if a
-single query in isolation isn't enough to reproduce the issue you are having.
+Perform some tests with single queries in isolation to measure baseline performance.
+This approach may not always reproduce your issue but can provide useful data for analysis by InfluxData engineers.

 ### Run queries at multiple load scales

-Once you've established baseline performance with a single query and your
-performance issue can't be replicated with a single query, use a systematic
-approach to identify the scale at which it does become a problem.
-This involves systematic incremental increases to your query
-concurrency until you identify a threshold at which the issue can be
-reproduced.
+If the issue isn't replicated after [reducing query noise](#reduce-query-noise)
+and [establishing baseline single-query performance](#establish-baseline-single-query-performance),
+systematically increase query concurrency to reproduce the problem and identify
+the scale at which it occurs--for example, run the following test plan.

-This, along with information about your Kubernetes environment, can provide 
-insight necessary to recommend changes to your configuration to improve
-query performance characteristis as your usage scales.
-
-As an example, consider the following test plan outline:
+> [!Note]
+> You might need to scale the example plan up or down, as necessary, to reproduce the problem.

 1. Turn off intermittent or periodic InfluxDB queries and allow the cluster to recover.
 2. Run Query A and allow the cluster to recover for 1 minute.
@@ -147,12 +140,12 @@ As an example, consider the following test plan outline:
 4. Run 10 concurrent instances of Query A and allow the cluster to recover for 1 minute.
 5. Run 20 concurrent instances of Query A and allow the cluster to recover for 1 minute.
 6. Run 40 concurrent instances of Query A and allow the cluster to recover for 1 minute.
-7. Provide InfluxData the debug information [described below](#gather-debug-information).
+7. Provide InfluxData with the [debug information](#gather-debug-information) associated
+   with each test run.

-{{% note %}}
-This is just an example. You don't have to go beyond the scale where queries get slower
-but you may also need to go further than what's outlined here.
-{{% /note %}}
+Your test findings and the associated debug information
+from your Kubernetes environment can help InfluxData engineers recommend
+configuration changes to improve query performance as your usage scales.
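
Review note: the test plan in this hunk could be automated along the lines of the sketch below. It runs a query at increasing concurrency, waits for the cluster to recover between steps, and records UTC timestamps so each run can be correlated with logs. Everything here is an assumption for illustration: `run_query` stands in for whatever client call executes Query A, and the default levels only mirror the steps visible in this diff, so scale them up or down as needed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone


def run_load_steps(run_query, levels=(1, 10, 20, 40), recovery_seconds=60,
                   sleep=time.sleep):
    """Run `run_query` at each concurrency level and record timing per step.

    `run_query` is a hypothetical zero-argument callable that executes one
    instance of the query under test and raises on failure.
    """
    results = []
    for level in levels:
        started = datetime.now(timezone.utc)
        t0 = time.perf_counter()
        # Start `level` instances of the query at (roughly) the same time.
        with ThreadPoolExecutor(max_workers=level) as pool:
            futures = [pool.submit(run_query) for _ in range(level)]
            for future in futures:
                future.result()  # re-raise any query error
        elapsed = time.perf_counter() - t0
        results.append({
            "concurrency": level,
            "started_utc": started.isoformat(),
            "ended_utc": datetime.now(timezone.utc).isoformat(),
            "wall_seconds": elapsed,
        })
        # Let the cluster recover before the next step.
        sleep(recovery_seconds)
    return results
```

A caller would pass its own query function, for example `run_load_steps(lambda: client_query("..."))` where `client_query` is whatever wrapper you already use, and then attach the returned timestamps to the debug information for each run.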

 <!-- Don't mention dashboards until they're working working in a future Clustered release --

@@ -165,8 +158,8 @@ screenshots of dashboard events for Queriers, Compactors, and Ingesters.

 ### Gather debug information

-The following debug information should be collected shortly _after_ a
- problematic query has been tried against your InfluxDB cluster.
+Shortly after testing a problematic query against your InfluxDB cluster,
+collect the following debug information.

 #### Kubernetes-specific information

@@ -188,6 +181,9 @@ tar -czf "${DATETIME}-cluster-info.tar.gz" "${DATETIME}-cluster-info/"

 #### Query analysis

+[Use `EXPLAIN` commands](/influxdb/clustered/query-data/troubleshoot-and-optimize/analyze-query-plan/#use-explain-keywords-to-view-a-query-plan)
+to output query plan information for a long-running query.
+
 **Outputs (InfluxQL):**

 - `explain.csv`
@@ -200,9 +196,6 @@ tar -czf "${DATETIME}-cluster-info.tar.gz" "${DATETIME}-cluster-info/"
 - `explain-verbose.txt`
 - `explain-analyze.txt`
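
Review note: the `EXPLAIN` variations that this section asks for can be derived from a single query string, as in the minimal sketch below. The `explain_variants` helper and the example query are hypothetical. Executing the statements and saving the results is left to whichever client you already use.

```python
def explain_variants(query: str) -> dict[str, str]:
    """Return the query prefixed with each EXPLAIN variation."""
    # Drop surrounding whitespace and a trailing semicolon before prefixing.
    base = query.strip().rstrip(";")
    return {
        "explain": f"EXPLAIN {base}",
        "explain-verbose": f"EXPLAIN VERBOSE {base}",
        "explain-analyze": f"EXPLAIN ANALYZE {base}",
    }


# Hypothetical query, for illustration only.
for name, statement in explain_variants("SELECT temp FROM home;").items():
    print(f"{name}: {statement}")
```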

-For any known long-running queries, it may be helpful to execute variations of
-the `EXPLAIN` command on them.
-
 In the examples below, replace the following:

 - {{% code-placeholder-key %}}`DATABASE_NAME`{{% /code-placeholder-key %}}:
@@ -328,26 +321,27 @@ curl --get "https://{{< influxdb/host >}}/query" \

 ### Gather system information

-{{% warn %}}
-#### May impact cluster performance
+> [!Warn]
+>
+> #### May impact cluster performance
+>
+> Querying InfluxDB v3 system tables may impact write and query
+> performance of your {{< product-name omit=" Clustered" >}} cluster.
+> Use filters to [optimize queries to reduce impact to your cluster](/influxdb/clustered/admin/query-system-data/#optimize-queries-to-reduce-impact-to-your-cluster).
+>
+> <!--------------- UPDATE THE DATE BELOW AS EXAMPLES ARE UPDATED --------------->
+>
+> #### System tables are subject to change
+> 
+> System tables are not part of InfluxDB's stable API and may change with new releases.
+> The provided schema information and query examples are valid as of **September 20, 2024**.
+> If you detect a schema change or a non-functioning query example, please
+> [submit an issue](https://github.com/influxdata/docs-v2/issues/new/choose).
+> 
+> <!--------------- UPDATE THE DATE ABOVE AS EXAMPLES ARE UPDATED --------------->

-Querying InfluxDB v3 system tables may impact write and query
-performance of your {{< product-name omit=" Clustered" >}} cluster.
-Use filters to [optimize queries to reduce impact to your cluster](/influxdb/clustered/admin/query-system-data/#optimize-queries-to-reduce-impact-to-your-cluster).
-
-<!--------------- UPDATE THE DATE BELOW AS EXAMPLES ARE UPDATED --------------->
-
-#### System tables are subject to change
-
-System tables are not part of InfluxDB's stable API and may change with new releases.
-The provided schema information and query examples are valid as of **September 20, 2024**.
-If you detect a schema change or a non-functioning query example, please
-[submit an issue](https://github.com/influxdata/docs-v2/issues/new/choose).
-
-<!--------------- UPDATE THE DATE ABOVE AS EXAMPLES ARE UPDATED --------------->
-{{% /warn %}}
-
-If queries are slow for a specific table, run the following system queries to collect information for troubleshooting.
+If queries are slow for a specific table, run the following system queries to
+collect information for troubleshooting:

 - [Collect table information](#collect-table-information)
 - [Collect compaction information for the table](#collect-compaction-information-for-the-table)
--- a/content/influxdb/clustered/query-data/troubleshoot-and-optimize/troubleshoot.md
+++ b/content/influxdb/clustered/query-data/troubleshoot-and-optimize/troubleshoot.md
@@ -20,6 +20,8 @@ Troubleshoot SQL and InfluxQL queries that return unexpected results.

 - [Why doesn't my query return data?](#why-doesnt-my-query-return-data)
 - [Optimize slow or expensive queries](#optimize-slow-or-expensive-queries)
+- [Analyze your queries](#analyze-your-queries)
+- [Request help to troubleshoot queries](#request-help-to-troubleshoot-queries)

 ## Why doesn't my query return data?

@@ -48,4 +50,34 @@ If a query times out or returns an error, it might be due to the following:

 If a query is slow or uses too many compute resources, limit the amount of data that it queries.

-See how to [optimize queries](/influxdb/clustered/query-data/troubleshoot-and-optimize/optimize-queries/) and use tools to view runtime metrics, identify bottlenecks, and debug queries.
+See how to [optimize queries](/influxdb/clustered/query-data/troubleshoot-and-optimize/optimize-queries/).
+
+## Analyze your queries 
+
+Use the following tools to retrieve system query information, analyze query execution,
+and find performance bottlenecks:
+
+- [Analyze a query plan](/influxdb/clustered/query-data/troubleshoot-and-optimize/analyze-query-plan/)
+- [Retrieve `system.queries` information for a query](/influxdb/clustered/query-data/troubleshoot-and-optimize/system-information/)
+
+## Request help to troubleshoot queries
+
+Some bottlenecks may result from suboptimal query [execution plans](/influxdb/clustered/reference/internals/query-plan/#physical-plan) and are outside your control--for example:
+
+- Sorting (`ORDER BY`) data that is already sorted
+- Retrieving numerous small Parquet files from the object store, instead of fewer, larger files
+- Querying many overlapped Parquet files
+- Performing a high number of table scans
+
+If you've followed steps to [optimize](/influxdb/clustered/query-data/troubleshoot-and-optimize/optimize-queries/) and [troubleshoot a query](#why-doesnt-my-query-return-data),
+but it still doesn't meet your performance requirements,
+see how to [report query performance issues](/influxdb/clustered/query-data/troubleshoot-and-optimize/report-query-performance-issues/).
+
+> [!Note]
+>
+> #### Query trace logging
+>
+> Currently, customers cannot enable trace logging for {{% product-name omit="Clustered" %}} clusters.
+> InfluxData engineers can use query plans and trace logging to help pinpoint performance bottlenecks in a query.
+>
+> See how to [report query performance issues](/influxdb/clustered/query-data/troubleshoot-and-optimize/report-query-performance-issues/).