docs: part 2 of compactor documents with best practice and guidelines (#5880)

* docs: part 2 of compactor

* fix: typos

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <alamb@influxdata.com>

* docs: address review comments

Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Nga Tran 2022-10-18 09:07:48 -04:00 committed by GitHub
parent 9b7a7b5e40
commit 6672404711


@@ -14,7 +14,8 @@ There are 3 kinds of `compaction_level` files in IOx: level-0, level-1 and level-2
- Level-0 files are small files ingested by the Ingesters
- Level-1 files are files created by a Compactor as a result of compacting one or many level-0 files with their overlapped level-1 files.
- Level-2 files are files created by a Compactor as a result of compacting one or many level-1 files with their overlapped level-2 files.
Regardless of level, a file in IOx must belong to a partition which represents data of a time range which is usually a day. Two files of different partitions never overlap in time and hence the Compactor only needs to compact files that belong to the same partition.
A level-0 file may overlap with other level-0, level-1 and level-2 files. A level-1 file does not overlap with any other level-1 files but may overlap with level-2 files. A level-2 file does not overlap with any other level-2 files.
@@ -38,12 +39,182 @@ If increasing memory a lot does not help, consider changing one or a combination
- INFLUXDB_IOX_COMPACTION_MAX_COMPACTING_FILES: reduce this value by half or more. This puts a hard cap on the maximum number of files of a partition the compactor can compact, even if its memory budget estimate would allow for more.
- INFLUXDB_IOX_COMPACTION_MEMORY_BUDGET_BYTES: reduce this value by half or more. This tells the compactor its total budget is smaller, so it will reduce the number of partitions it compacts concurrently or reduce the number of files compacted for a partition.
# Compactor Config Parameters
These are [up-to-date configurable parameters](https://github.com/influxdata/influxdb_iox/blob/main/clap_blocks/src/compactor.rs). Here are a few key parameters you may want to tune for your needs:
- **Size of the files:** The compactor cannot control the sizes of level-0 files, but they are usually small and can be adjusted via config params of the Ingesters. The compactor decides the max desired size of level-1 and level-2 files, which is around `INFLUXDB_IOX_COMPACTION_MAX_DESIRED_FILE_SIZE_BYTES * (100 + INFLUXDB_IOX_COMPACTION_PERCENTAGE_MAX_FILE_SIZE) / 100`.
- **Map a compactor to several shards:** Depending on your Ingester setup, there may be several shards. A compactor can be set up to compact all or a fraction of the shards. Use the range `[INFLUXDB_IOX_SHARD_INDEX_RANGE_START, INFLUXDB_IOX_SHARD_INDEX_RANGE_END]` to map them.
- **Number of partitions considered for compaction per shard:** If there is enough memory, which is usually the case, the compactor will compact many partitions of the same or different shards concurrently. Depending on how many shards a compactor handles and how much memory it is configured with, you can increase or reduce the compaction concurrency by adjusting the number of partitions per shard via `INFLUXDB_IOX_COMPACTION_MAX_NUMBER_PARTITIONS_PER_SHARD`.
- **Concurrency capacity:** To configure this based on your available memory, you need to understand how IOx estimates the memory needed to compact files, described in the next section.
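As a worked example of the file-size formula above, here is a small sketch (in Python, using the recommended values from this doc; the variable names are illustrative, not IOx identifiers):

```python
# Recommended values from this doc (assumptions for the example).
max_desired = 100 * 1024 * 1024  # INFLUXDB_IOX_COMPACTION_MAX_DESIRED_FILE_SIZE_BYTES
pct_max = 5                      # INFLUXDB_IOX_COMPACTION_PERCENTAGE_MAX_FILE_SIZE

# Max size of a level-1/level-2 output file per the formula above:
# MAX_DESIRED_FILE_SIZE_BYTES * (100 + PERCENTAGE_MAX_FILE_SIZE) / 100
max_output_size = max_desired * (100 + pct_max) // 100
print(max_output_size)  # 110100480 bytes, i.e. 105 MiB
```

So with the defaults, output files top out at roughly 5% above the desired 100 MiB.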
# Memory Estimation
The idea of a single compaction is to compact as many small input files as possible into one or a few larger output files, as follows:
```text
┌────────┐ ┌────────┐
│ │ │ │
│ Output │ ..... │ Output │
│ File 1 │ │ File m │
│ │ │ │
└────────┘ └────────┘
▲ ▲
│ │
│ │
.─────────────────────────.
_.──' `───.
( Compact )
`────. _.───'
`───────────────────────'
▲ ▲ ▲
│ │ │
│ │ │
┌───────┐┌──────┐ ┌──────┐
│ ││ │ │ │
│ Input ││Input │ │Input │
│File 1 ││File 2│ ..... │File n│
│ ││ │ │ │
└───────┘└──────┘ └──────┘
Figure 1: Compact a Partition
```
Currently, in order to avoid over-committing memory (and OOMing), the compactor computes an estimate of the memory needed for loading full input files into memory, for streaming input files in parallel, and for output streams. The boxes in the diagram below illustrate the memory needed to run the query plan above. IOx picks the number of input files to compact based on their sizes, the number of columns and column types of the file's table, the max desired output file size, and the memory budget the compactor is given. Details:
```text
┌───────────┐ ┌────────────┐
│Memory for │ │ Memory for │
│ Output │ ..... │ Output │
│ Stream 1 │ │ Stream m │
│ │ │ │
└───────────┘ └────────────┘
▲ ▲
│ │
│ │
│ │
.─────────────────────.
_.────' `─────.
,─' '─.
Run Compaction Plan ╲
( )
`. ,'
'─. ,─'
`─────. _.────'
`───────────────────'
▲ ▲ ▲
│ │ │
│ │ │
│ │ │
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Memory for │ │ Memory for │ │ Memory for │
│ Streaming │ │ Streaming │ │ Streaming │
│ File 1 │ │ File 2 │ ..... │ File n │
│ │ │ │ │ │
└────────────┘ └────────────┘ └────────────┘
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Memory for │ │ Memory for │ │ Memory for │
│Loading Full│ │Loading Full│ ..... │Loading Full│
│ File 1 │ │ File 2 │ │ File n │
│ │ │ │ │ │
└────────────┘ └────────────┘ └────────────┘
Figure 2: Memory Estimation for a Compaction Plan
```
- Memory for loading a full file: twice the file size.
- Memory for streaming an input file: see [estimate_arrow_bytes_for_file](https://github.com/influxdata/influxdb_iox/blob/bb7df22aa1783e040ea165153876f1fe36838d4e/compactor/src/compact.rs#L504) for the details but, briefly, each column of the file needs `size_of_a_row_of_a_column * INFLUXDB_IOX_COMPACTION_MIN_ROWS_PER_RECORD_BATCH_TO_PLAN`, in which `size_of_a_row_of_a_column` depends on the column type and the file cardinality.
- Memory for an output stream is similar to the memory for streaming an input file. The number of output streams is estimated based on the sizes of the input files and the max desired file size `INFLUXDB_IOX_COMPACTION_MAX_DESIRED_FILE_SIZE_BYTES`.
IOx limits the number of input files using the budget provided in `INFLUXDB_IOX_COMPACTION_MEMORY_BUDGET_BYTES`. However, IOx also caps the number of input files at `INFLUXDB_IOX_COMPACTION_MAX_COMPACTING_FILES`, even if more could fit under the memory budget.
Since the compactor keeps running to look for new files to compact, the number of input files is usually small (fewer than 10) and, thus, the memory needed to run such a plan is usually small enough for a compactor to run many of them concurrently.
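The estimate sketched in Figure 2 can be condensed into a few lines of Python. This is a simplified illustration, not the exact logic of `estimate_arrow_bytes_for_file`; the per-column row sizes passed in are assumptions:

```python
MIN_ROWS_PER_BATCH = 32 * 1024  # INFLUXDB_IOX_COMPACTION_MIN_ROWS_PER_RECORD_BATCH_TO_PLAN

def stream_bytes(column_row_sizes):
    # Each column needs size_of_a_row_of_a_column * MIN_ROWS_PER_BATCH.
    # In IOx the per-column row size depends on column type and cardinality;
    # here it is simply given as a list of byte counts.
    return sum(s * MIN_ROWS_PER_BATCH for s in column_row_sizes)

def plan_bytes(input_file_sizes, column_row_sizes, num_output_streams):
    # Loading a full file costs twice its size (bottom row of Figure 2).
    load_full = sum(2 * s for s in input_file_sizes)
    # One streaming allocation per input file, one per output stream.
    input_streams = len(input_file_sizes) * stream_bytes(column_row_sizes)
    output_streams = num_output_streams * stream_bytes(column_row_sizes)
    return load_full + input_streams + output_streams

# Example: three 10 MiB inputs, five 8-byte-per-row columns, two outputs.
print(plan_bytes([10 * 1024 * 1024] * 3, [8] * 5, 2))  # 69468160 (~66 MiB)
```

The plan is admitted only if this total fits under the remaining memory budget.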
**Best Practice Recommendations for your configuration:**
- INFLUXDB_IOX_COMPACTION_MEMORY_BUDGET_BYTES: 1/3 (or at most 1/2) your total actual memory.
- INFLUXDB_IOX_COMPACTION_MAX_COMPACTING_FILES: 20
- INFLUXDB_IOX_COMPACTION_MIN_ROWS_PER_RECORD_BATCH_TO_PLAN: 32 * 1024
- INFLUXDB_IOX_COMPACTION_MAX_DESIRED_FILE_SIZE_BYTES: 100 * 1024 * 1024
- INFLUXDB_IOX_COMPACTION_PERCENTAGE_MAX_FILE_SIZE: 5
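Applying the recommendations above to a concrete machine might look like the following sketch (the 16 GiB machine size is an assumption for the example; the 1/3 rule and the constants come straight from this doc):

```python
total_memory_bytes = 16 * 1024**3  # example: a 16 GiB machine (assumption)

config = {
    # 1/3 (or at most 1/2) of the total actual memory.
    "INFLUXDB_IOX_COMPACTION_MEMORY_BUDGET_BYTES": total_memory_bytes // 3,
    "INFLUXDB_IOX_COMPACTION_MAX_COMPACTING_FILES": 20,
    "INFLUXDB_IOX_COMPACTION_MIN_ROWS_PER_RECORD_BATCH_TO_PLAN": 32 * 1024,
    "INFLUXDB_IOX_COMPACTION_MAX_DESIRED_FILE_SIZE_BYTES": 100 * 1024 * 1024,
    "INFLUXDB_IOX_COMPACTION_PERCENTAGE_MAX_FILE_SIZE": 5,
}

# Emit env-var assignments for the compactor's environment.
for name, value in config.items():
    print(f"{name}={value}")
```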
# Avoid and deal with files in skipped_compactions
To deduplicate data correctly, the Compactor must compact level-0 files in ascending order of their sequence numbers, together with their overlapped level-1 files. If the first level-0 file and its overlapped level-1 files are too large and their memory estimation in Figure 2 is over the budget defined in INFLUXDB_IOX_COMPACTION_MEMORY_BUDGET_BYTES, the compactor won't be able to compact that partition. To avoid considering that same partition again and again, the compactor will put that partition into the catalog table `skipped_compactions`.
If you find your queries on data of a partition that is in `skipped_compactions` are slow, you may want to force the Compactor to compact that partition by increasing your memory budget and then removing that partition from `skipped_compactions`. If you remove the partition without adjusting your config params, the Compactor will put it back in again without compacting it. We are working on a gRPC API to let you see the content of `skipped_compactions` and remove a specific partition from it. In the meantime, you can do this using a DELETE SQL statement directly against your Catalog (see the SQL section for this statement), but we do not recommend modifying the Catalog unless you know your partitions and their data files very well.
Increasing the memory budget can be as simple as increasing your actual memory size and then increasing the value of INFLUXDB_IOX_COMPACTION_MEMORY_BUDGET_BYTES accordingly. You can also try to reduce the value of INFLUXDB_IOX_COMPACTION_MIN_ROWS_PER_RECORD_BATCH_TO_PLAN to a minimum of 8 * 1024, but you may hit OOMs if you do so. This depends on the column types and the number of columns of your file.
If your partition is put into `skipped_compactions` with the reason `over limit of num files`, you have to increase INFLUXDB_IOX_COMPACTION_MAX_COMPACTING_FILES accordingly, but you may hit OOMs if you do not also increase your actual memory.
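The two ways a partition ends up in `skipped_compactions`, as described above, can be summarized in a small decision sketch (condensed illustration of the behavior, not IOx's actual code; the reason strings mirror the ones mentioned in this doc):

```python
def skip_reason(estimated_bytes, num_files, budget_bytes, max_files):
    """Return why a partition would be skipped, or None if it can compact.

    estimated_bytes: memory estimate from Figure 2 for the partition's plan
    num_files:       input files that must be compacted together
    budget_bytes:    INFLUXDB_IOX_COMPACTION_MEMORY_BUDGET_BYTES
    max_files:       INFLUXDB_IOX_COMPACTION_MAX_COMPACTING_FILES
    """
    if num_files > max_files:
        return "over limit of num files"
    if estimated_bytes > budget_bytes:
        return "over memory budget"
    return None

# A partition with 25 files when the cap is 20 gets skipped:
print(skip_reason(100, 25, 1000, 20))  # over limit of num files
```

Note that removing the row from `skipped_compactions` does not change either input to this decision; only raising the budget, the file cap, or shrinking the plan does.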
# Avoid Deduplication in Querier
Deduplication is known to be expensive. To avoid deduplication work at query time in the Queriers, your files should not overlap in time range. This can be achieved by having all files of a partition in either level-1 or level-2. With the current design, if your compactor keeps up well, partitions with recent level-0 files within the last 4 hours should have at most two level-2 files. Partitions without new level-0 files in the last 8 hours should have only level-2 files. Depending on the performance of the Querier, we can adjust the Compactor (a future feature) to have all files in level-1 or level-2.
# Common SQL to verify compaction status
If your Compactors catch up well with your Ingesters and do not hit memory issues, you should see:
1. Table `skipped_compactions` is empty
2. Most partitions have at most 2 level-0 files. If a partition has a lot of level-0 files, it signals either that your compactor is behind and has not compacted it, or that the partition has been put in `skipped_compactions`.
3. Most partitions without new level-0 files in the last 8 hours have only level-2 files.
4. Most unused files (files with `to_delete is not null`) are removed by the garbage collector.
Here are SQL statements to verify these:
```sql
-- Content of skipped_compactions
SELECT * FROM skipped_compactions;
-- remove partitions from the skipped_compactions
DELETE FROM skipped_compactions WHERE partition_id in ([your_ids]);
-- Content of skipped_compactions with their shard index, partition key and table id
SELECT shard_index, table_id, partition_id, partition_key, left(reason, 25),
num_files, limit_num_files, estimated_bytes, limit_bytes, to_timestamp(skipped_at) skipped_at
FROM skipped_compactions, partition, shard
WHERE partition.id = skipped_compactions.partition_id and partition.shard_id = shard.id
ORDER BY shard_index, table_id, partition_key, skipped_at;
-- Number of files per level for top 50 partitions with most files of a specified day
SELECT s.shard_index, pf.table_id, partition_id, partition_key,
count(case when to_delete is null then 1 end) total_not_deleted,
count(case when compaction_level=0 and to_delete is null then 1 end) num_l0,
count(case when compaction_level=1 and to_delete is null then 1 end) num_l1,
       count(case when compaction_level=2 and to_delete is null then 1 end) num_l2,
count(case when compaction_level=0 and to_delete is not null then 1 end) deleted_num_l0,
count(case when compaction_level=1 and to_delete is not null then 1 end) deleted_num_l1,
count(case when compaction_level=2 and to_delete is not null then 1 end) deleted_num_l2
FROM parquet_file pf, partition p, shard s
WHERE pf.partition_id = p.id AND pf.shard_id = s.id
AND partition_key = '2022-10-11'
GROUP BY s.shard_index, pf.table_id, partition_id, partition_key
ORDER BY count(case when to_delete is null then 1 end) DESC
LIMIT 50;
-- Partitions with level-0 files ingested within the last 4 hours
SELECT partition_key, id as partition_id
FROM partition p, (
SELECT partition_id, max(created_at)
FROM parquet_file
WHERE compaction_level = 0 AND to_delete IS NULL
GROUP BY partition_id
  HAVING to_timestamp(max(created_at)/1000000000) > now() - '4 hours'::interval
) sq
WHERE sq.partition_id = p.id;
```
Here are other SQL statements you may find useful:
```sql
-- Number of columns in a table
SELECT left(t.name, 50), c.table_id, count(1) num_cols
FROM table_name t, column_name c
WHERE t.id = c.table_id
GROUP BY 1, 2
ORDER BY 3 DESC;
```