--- title: Rebalancing InfluxDB Enterprise clusters aliases: - /enterprise/v1.5/guides/rebalance/ menu: enterprise_influxdb_1_5: name: Rebalancing clusters weight: 19 parent: Guides --- ## Introduction This guide describes how to manually rebalance an InfluxDB Enterprise cluster. Rebalancing a cluster involves two primary goals: * Evenly distribute [shards](/influxdb/v1.5/concepts/glossary/#shard) across all data nodes in the cluster * Ensure that every shard is on N number of nodes, where N is determined by the retention policy's [replication factor](/influxdb/v1.5/concepts/glossary/#replication-factor) Rebalancing a cluster is essential for cluster health. Perform a rebalance if you add a new data node to your cluster. The proper rebalance path depends on the purpose of the new data node. If you added a data node to expand the disk size of the cluster or increase write throughput, follow the steps in [Rebalance Procedure 1](#rebalance-procedure-1-rebalance-a-cluster-to-create-space). If you added a data node to increase data availability for queries and query throughput, follow the steps in [Rebalance Procedure 2](#rebalance-procedure-2-rebalance-a-cluster-to-increase-availability). ### Requirements The following sections assume that you already added a new data node to the cluster, and they use the [`influx-ctl` tool](/enterprise_influxdb/v1.5/administration/cluster-commands/) available on all meta nodes. Before you begin, stop writing historical data to InfluxDB. Historical data has a timestamp that occurs in the past (not real-time data). Performing a rebalance while writing historical data can lead to data loss. ## Rebalance Procedure 1: Rebalance a cluster to create space For demonstration purposes, the next steps assume that you added a third data node to a previously two-data-node cluster that has a [replication factor](/influxdb/v1.5/concepts/glossary/#replication-factor) of two. This rebalance procedure is applicable for different cluster sizes and replication factors, but some of the specific, user-provided values will depend on that cluster size. Rebalance Procedure 1 focuses on how to rebalance a cluster after adding a data node to expand the total disk capacity of the cluster. In the next steps, you will safely move shards from one of the two original data nodes to the new data node. ### Step 1: Truncate Hot Shards Hot shards are shards that are currently receiving writes. Performing any action on a hot shard can lead to data inconsistency within the cluster which requires manual intervention from the user. To prevent data inconsistency, truncate hot shards before moving any shards across data nodes. The command below creates a new hot shard which is automatically distributed across all data nodes in the cluster, and the system writes all new points to that shard. All previous writes are now stored in cold shards. ``` influxd-ctl truncate-shards ``` The expected ouput of this command is: ``` Truncated shards. ``` Once you truncate the shards, you can work on redistributing the cold shards without the threat of data inconsistency in the cluster. Any hot or new shards are now evenly distributed across the cluster and require no further intervention. ### Step 2: Identify Cold Shards In this step, you identify the cold shards that you will copy to the new data node and remove from one of the original two data nodes. The following command lists every shard in our cluster: ``` influxd-ctl show-shards ``` The expected output is similar to the items in the codeblock below: ``` Shards ========== ID Database Retention Policy Desired Replicas [...] End Owners 21 telegraf autogen 2 [...] 2017-01-26T18:00:00Z [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088}] 22 telegraf autogen 2 [...] 2017-01-26T18:05:36.418734949Z* [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088}] 24 telegraf autogen 2 [...] 2017-01-26T19:00:00Z [{5 enterprise-data-02:8088} {6 enterprise-data-03:8088}] ``` The sample output includes three shards. The first two shards are cold shards. The timestamp in the `End` column occurs in the past (assume that the current time is just after `2017-01-26T18:05:36.418734949Z`), and the shards' owners are the two original data nodes: `enterprise-data-01:8088` and `enterprise-data-02:8088`. The second shard is the truncated shard; truncated shards have an asterix (`*`) on the timestamp in the `End` column. The third shard is the newly-created hot shard; the timestamp in the `End` column is in the future (again, assume that the current time is just after `2017-01-26T18:05:36.418734949Z`), and the shard's owners include one of the original data nodes (`enterprise-data-02:8088`) and the new data node (`enterprise-data-03:8088`). That hot shard and any subsequent shards require no attention during the rebalance process. Identify the cold shards that you'd like to move from one of the original two data nodes to the new data node. Take note of the cold shard's `ID` (for example: `22`) and the TCP address of one of its owners in the `Owners` column (for example: `enterprise-data-01:8088`). > **Note:** > Use the following command string to determine the size of the shards in your cluster: > find /var/lib/influxdb/data/ -mindepth 3 -type d -exec du -h {} \; > In general, we recommend moving larger shards to the new data node to increase the available disk space on the original data nodes. Users should note that moving shards will impact network traffic. ### Step 3: Copy Cold Shards Next, copy the relevant cold shards to the new data node with the syntax below. Repeat this command for every cold shard that you'd like to move to the new data node. ``` influxd-ctl copy-shard ``` Where `source_TCP_address` is the address that you noted in step 2, `destination_TCP_address` is the TCP address of the new data node, and `shard_ID` is the ID of the shard that you noted in step 2. The expected output of the command is: ``` Copied shard from to ``` ### Step 4: Confirm the Copied Shards Confirm that the TCP address of the new data node appears in the `Owners` column for every copied shard: ``` influxd-ctl show-shards ``` The expected output shows that the copied shard now has three owners: ``` Shards ========== ID Database Retention Policy Desired Replicas [...] End Owners 22 telegraf autogen 2 [...] 2017-01-26T18:05:36.418734949Z* [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088} {6 enterprise-data-03:8088}] ``` In addition, verify that the copied shards appear in the new data node's shard directory and match the shards in the source data node's shard directory. Shards are located in `/var/lib/influxdb/data///`. Here's an example of the correct output for shard `22`: ``` # On the source data node (enterprise-data-01) ~# ls /var/lib/influxdb/data/telegraf/autogen/22 000000001-000000001.tsm # 👍 # On the new data node (enterprise-data-03) ~# ls /var/lib/influxdb/data/telegraf/autogen/22 000000001-000000001.tsm # 👍 ``` It is essential that every copied shard appears on the new data node both in the `influxd-ctl show-shards` output and in the shard directory. If a shard does not pass both of the tests above, please repeat step 3. ### Step 5: Remove Unnecessary Cold Shards Next, remove the copied shard from the original data node with the command below. Repeat this command for every cold shard that you'd like to remove from one of the original data nodes. **Removing a shard is an irrecoverable, destructive action; please be cautious with this command.** ``` influxd-ctl remove-shard ``` Where `source_TCP_address` is the TCP address of the original data node and `shard_ID` is the ID of the shard that you noted in step 2. The expected output of the command is: ``` Removed shard from ``` ### Step 6: Confirm the Rebalance For every relevant shard, confirm that the TCP address of the new data node and only one of the original data nodes appears in the `Owners` column: ``` influxd-ctl show-shards ``` The expected output shows that the copied shard now has only two owners: ``` Shards ========== ID Database Retention Policy Desired Replicas [...] End Owners 22 telegraf autogen 2 [...] 2017-01-26T18:05:36.418734949Z* [{5 enterprise-data-02:8088} {6 enterprise-data-03:8088}] ``` That's it. You've successfully rebalanced your cluster; you expanded the available disk size on the original data nodes and increased the cluster's write throughput. ## Rebalance Procedure 2: Rebalance a cluster to increase availability For demonstration purposes, the next steps assume that you added a third data node to a previously two-data-node cluster that has a [replication factor](/influxdb/v1.5/concepts/glossary/#replication-factor) of two. This rebalance procedure is applicable for different cluster sizes and replication factors, but some of the specific, user-provided values will depend on that cluster size. Rebalance Procedure 2 focuses on how to rebalance a cluster to improve availability and query throughput. In the next steps, you will increase the retention policy's replication factor and safely copy shards from one of the two original data nodes to the new data node. ### Step 1: Update the Retention Policy [Update](/influxdb/v1.5/query_language/database_management/#modify-retention-policies-with-alter-retention-policy) every retention policy to have a replication factor of three. This step ensures that the system automatically distributes all newly-created shards across the three data nodes in the cluster. The following query increases the replication factor to three. Run the query on any data node for each retention policy and database. Here, we use InfluxDB's [CLI](/influxdb/v1.5/tools/shell/) to execute the query: ``` > ALTER RETENTION POLICY "" ON "" REPLICATION 3 > ``` A successful `ALTER RETENTION POLICY` query returns no results. Use the [`SHOW RETENTION POLICIES` query](/influxdb/v1.5/query_language/schema_exploration/#show-retention-policies) to verify the new replication factor. Example: ``` > SHOW RETENTION POLICIES ON "telegraf" name duration shardGroupDuration replicaN default ---- -------- ------------------ -------- ------- autogen 0s 1h0m0s 3 #👍 true ``` ### Step 2: Truncate Hot Shards Hot shards are shards that are currently receiving writes. Performing any action on a hot shard can lead to data inconsistency within the cluster which requires manual intervention from the user. To prevent data inconsistency, truncate hot shards before copying any shards to the new data node. The command below creates a new hot shard which is automatically distributed across the three data nodes in the cluster, and the system writes all new points to that shard. All previous writes are now stored in cold shards. ``` influxd-ctl truncate-shards ``` The expected ouput of this command is: ``` Truncated shards. ``` Once you truncate the shards, you can work on distributing the cold shards without the threat of data inconsistency in the cluster. Any hot or new shards are now automatically distributed across the cluster and require no further intervention. ### Step 3: Identify Cold Shards In this step, you identify the cold shards that you will copy to the new data node. The following command lists every shard in your cluster: ``` influxd-ctl show-shards ``` The expected output is similar to the items in the codeblock below: ``` Shards ========== ID Database Retention Policy Desired Replicas [...] End Owners 21 telegraf autogen 3 [...] 2017-01-26T18:00:00Z [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088}] 22 telegraf autogen 3 [...] 2017-01-26T18:05:36.418734949Z* [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088}] 24 telegraf autogen 3 [...] 2017-01-26T19:00:00Z [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088} {6 enterprise-data-03:8088}] ``` The sample output includes three shards. The first two shards are cold shards. The timestamp in the `End` column occurs in the past (assume that the current time is just after `2017-01-26T18:05:36.418734949Z`), and the shards' owners are the two original data nodes: `enterprise-data-01:8088` and `enterprise-data-02:8088`. The second shard is the truncated shard; truncated shards have an asterix (`*`) on the timestamp in the `End` column. The third shard is the newly-created hot shard; the timestamp in the `End` column is in the future (again, assume that the current time is just after `2017-01-26T18:05:36.418734949Z`), and the shard's owners include all three data nodes: `enterprise-data-01:8088`, `enterprise-data-02:8088`, and `enterprise-data-03:8088`. That hot shard and any subsequent shards require no attention during the rebalance process. Identify the cold shards that you'd like to copy from one of the original two data nodes to the new data node. Take note of the cold shard's `ID` (for example: `22`) and the TCP address of one of its owners in the `Owners` column (for example: `enterprise-data-01:8088`). ### Step 4: Copy Cold Shards Next, copy the relevant cold shards to the new data node with the syntax below. Repeat this command for every cold shard that you'd like to move to the new data node. ``` influxd-ctl copy-shard ``` Where `source_TCP_address` is the address that you noted in step 3, `destination_TCP_address` is the TCP address of the new data node, and `shard_ID` is the ID of the shard that you noted in step 3. The expected output of the command is: ``` Copied shard from to ``` ### Step 5: Confirm the Rebalance Confirm that the TCP address of the new data node appears in the `Owners` column for every copied shard: ``` influxd-ctl show-shards ``` The expected output shows that the copied shard now has three owners: ``` Shards ========== ID Database Retention Policy Desired Replicas [...] End Owners 22 telegraf autogen 3 [...] 2017-01-26T18:05:36.418734949Z* [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088} {6 enterprise-data-03:8088}] ``` In addition, verify that the copied shards appear in the new data node's shard directory and match the shards in the source data node's shard directory. Shards are located in `/var/lib/influxdb/data///`. Here's an example of the correct output for shard `22`: ``` # On the source data node (enterprise-data-01) ~# ls /var/lib/influxdb/data/telegraf/autogen/22 000000001-000000001.tsm # 👍 # On the new data node (enterprise-data-03) ~# ls /var/lib/influxdb/data/telegraf/autogen/22 000000001-000000001.tsm # 👍 ``` That's it. You've successfully rebalanced your cluster and increased data availability for queries and query throughput.