# Rebalancing InfluxDB Enterprise clusters
Manually rebalance an InfluxDB Enterprise cluster to ensure:
- Shards are evenly distributed across all data nodes in the cluster
- Every shard exists on *n* nodes, where *n* is the replication factor.
## Why rebalance?
Rebalance a cluster any time you do one of the following:

- Expand capacity and increase write throughput by adding a data node.
- Increase availability and query throughput by either:
  - Adding data nodes and increasing the replication factor.
  - Adjusting fully replicated data to partially replicated data. See Full versus partial replication below.
- Add or remove a data node for any reason.
- Adjust your replication factor for any reason.
### Full versus partial replication
When the replication factor equals the number of data nodes, data is fully replicated to all data nodes in the cluster. Full replication means each node can respond to queries without communicating with other data nodes. A more typical configuration is partial replication; for example, a replication factor of 2 with 4, 6, or 8 data nodes. Partial replication allows more series to be stored in a database.
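As a concrete, purely illustrative example, the following creates a partially replicated database from the command line. The database name `sensor_data`, the 30-day duration, and the retention policy name `two_copies` are assumptions for this sketch, not part of the procedure below; on a four-node cluster this keeps two copies of every shard.

```
# Hypothetical example: a database whose default retention policy keeps
# two copies of each shard (partial replication on a 4-node cluster).
influx -execute 'CREATE DATABASE "sensor_data" WITH DURATION 30d REPLICATION 2 NAME "two_copies"'
```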
## Rebalance your cluster
{{% warn %}}
Before you begin, do the following:

- Stop writing data to InfluxDB. Rebalancing while writing data can lead to data loss.
- Enable anti-entropy (AE) to ensure all shard data is successfully copied. To learn more about AE, see Use Anti-Entropy service in InfluxDB Enterprise.
{{% /warn %}}
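As an optional pre-check (not part of the original procedure), you can ask the anti-entropy service whether any shards are currently inconsistent before you start moving data. This assumes `influxd-ctl` is available on a meta node and that the AE service is enabled on your data nodes:

```
# Optional: list shards the anti-entropy service reports as inconsistent.
# An empty result means AE has no outstanding repairs before you rebalance.
influxd-ctl entropy show
```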
After adding or removing data nodes from your cluster, complete the following steps to rebalance:

1. If applicable, update the replication factor.
2. Truncate hot shards (shards actively receiving writes).
3. Identify cold shards.
4. Copy cold shards to the new data node.
5. If you're expanding capacity to increase write throughput, remove cold shards from an original data node.
6. Confirm the rebalance.
Important: The number of data nodes in a cluster must be evenly divisible by the replication factor. For example, a replication factor of 2 works with 2, 4, 6, or 8 data nodes. A replication factor of 3 works with 3, 6, or 9 data nodes.
### Update the replication factor
The following example shows how to increase the replication factor to 3. Adjust the replication factor as needed for your cluster, ensuring the number of data nodes is evenly divisible by the replication factor.

1. In the InfluxDB CLI, run the following query on any data node for each retention policy and database:

   ```
   > ALTER RETENTION POLICY "<retention_policy_name>" ON "<database_name>" REPLICATION 3
   ```

   If successful, no query results are returned. At this point, InfluxDB automatically distributes all new shards across the data nodes in the cluster with the correct number of replicas.

2. To verify the new replication factor, run the `SHOW RETENTION POLICIES` query:

   ```
   > SHOW RETENTION POLICIES ON "telegraf"
   name    duration shardGroupDuration replicaN default
   ----    -------- ------------------ -------- -------
   autogen 0s       1h0m0s             3        true
   ```
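If you have many databases and retention policies to update, you might script the `ALTER RETENTION POLICY` calls. The following is a minimal sketch only: the database and retention policy names are hypothetical placeholders, and it assumes the `influx` CLI can reach a data node with its default connection settings.

```
#!/usr/bin/env bash
# Sketch: raise the replication factor to 3 for a known list of
# database:retention-policy pairs. The pairs below are placeholders.
set -euo pipefail

TARGETS=(
  "telegraf:autogen"
  "sensor_data:two_copies"   # hypothetical database and retention policy
)

for target in "${TARGETS[@]}"; do
  db="${target%%:*}"
  rp="${target##*:}"
  echo "Setting REPLICATION 3 on ${db}.${rp}"
  influx -execute "ALTER RETENTION POLICY \"${rp}\" ON \"${db}\" REPLICATION 3"
done
```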
### Truncate hot shards
1. In the InfluxDB CLI, run the following command:

   ```
   influxd-ctl truncate-shards
   ```

   A message confirms shards have been truncated. Previous writes are stored in cold shards, and InfluxDB starts writing new points to a new hot shard. New hot shards are automatically distributed across the cluster.
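Depending on your version, `influxd-ctl truncate-shards` may also accept a `-delay` flag that schedules the truncation slightly in the future; treat the example below as a sketch and confirm the available options with `influxd-ctl truncate-shards -h` on your meta node.

```
# Assumed example: truncate hot shards one minute from now so that all
# writes before that timestamp land in cold shards.
influxd-ctl truncate-shards -delay 1m
```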
### Identify cold shards
1. In the InfluxDB CLI, run the following command to view every shard in the cluster:

   ```
   influxd-ctl show-shards

   Shards
   ==========
   ID  Database  Retention Policy  Desired Replicas  [...]  End                              Owners
   21  telegraf  autogen           2                 [...]  2020-01-26T18:00:00Z             [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088}]
   22  telegraf  autogen           2                 [...]  2020-01-26T18:05:36.418734949Z*  [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088}]
   24  telegraf  autogen           2                 [...]  2020-01-26T19:00:00Z             [{5 enterprise-data-02:8088} {6 enterprise-data-03:8088}]
   ```
   This example shows three shards:

   - Shards are cold if their `End` timestamp occurs in the past. In this example, the current time is `2020-01-26T18:05:36.418734949Z`, so the first two shards are cold.
   - Shard owners identify the data nodes that store the shard. In this example, the cold shards are on two data nodes: `enterprise-data-01:8088` and `enterprise-data-02:8088`.
   - Truncated shards have an asterisk (`*`) after the timestamp. In this example, the second shard is truncated.
   - Shards with an `End` timestamp in the future are hot shards. In this example, the third shard is the newly created hot shard. The hot shard's owners include one of the original data nodes (`enterprise-data-02:8088`) and the new data node (`enterprise-data-03:8088`).
2. For each cold shard, note the shard `ID` and its `Owners` (the data node TCP names, for example, `enterprise-data-01:8088`).
3. Run the following command to determine the size of the shards in your cluster:

   ```
   find /var/lib/influxdb/data/ -mindepth 3 -type d -exec du -h {} \;
   ```

   To increase capacity on the original data nodes, move larger shards to the new data node. Note that moving shards impacts network traffic.
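If it helps to see the largest shards first when choosing what to move, you can sort the same listing by size. This is just a variation on the command above and assumes GNU `sort`:

```
# List shard directories largest-first to make candidates for the new node easy to spot.
find /var/lib/influxdb/data/ -mindepth 3 -type d -exec du -h {} \; | sort -hr | head
```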
### Copy cold shards to new data node
1. In the InfluxDB CLI, run the following command, specifying the shard `ID` to copy, the `source_TCP_name` of the original data node, and the `destination_TCP_name` of the new data node:

   ```
   influxd-ctl copy-shard <source_TCP_name> <destination_TCP_name> <shard_ID>
   ```

   A message appears confirming the shard was copied.
2. Repeat step 1 for every cold shard you want to move to the new data node (see the loop sketched after these steps).
3. Confirm copied shards display the TCP name of the new data node as an owner:

   ```
   influxd-ctl show-shards
   ```

   Expected output shows the copied shard now has three owners:

   ```
   Shards
   ==========
   ID  Database  Retention Policy  Desired Replicas  [...]  End                              Owners
   22  telegraf  autogen           2                 [...]  2020-01-26T18:05:36.418734949Z*  [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088} {6 enterprise-data-03:8088}]
   ```

   Copied shards also appear in the new data node's shard directory: `/var/lib/influxdb/data/<database>/<retention_policy>/<shard_ID>`.
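If you have several cold shards to move, a small shell loop can drive `influxd-ctl copy-shard` for each one. The shard IDs and TCP names below are the placeholder values used in the examples above; substitute the ones you noted earlier:

```
#!/usr/bin/env bash
# Sketch: copy a list of cold shards from an original data node to the new one.
# Shard IDs and TCP names are placeholders from the examples in this guide.
set -euo pipefail

SOURCE="enterprise-data-01:8088"
DEST="enterprise-data-03:8088"
SHARDS="21 22"   # cold shard IDs noted from `influxd-ctl show-shards`

for id in $SHARDS; do
  echo "Copying shard ${id} from ${SOURCE} to ${DEST}"
  influxd-ctl copy-shard "${SOURCE}" "${DEST}" "${id}"
done
```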
### Remove cold shards from original data node
Removing a shard is an irrecoverable, destructive action; use this command with caution.
1. Run the following command, specifying the cold shard `ID` to remove and the `source_TCP_name` of one of the original data nodes:

   ```
   influxd-ctl remove-shard <source_TCP_name> <shard_ID>
   ```

   Expected output:

   ```
   Removed shard <shard_ID> from <source_TCP_name>
   ```
2. Repeat step 1 to remove all cold shards from one of the original data nodes (see the sketch after these steps).
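Because `remove-shard` is destructive, any scripting around it should force an explicit confirmation for each shard. Here is one cautious sketch, again using the placeholder shard IDs and host name from the examples above:

```
#!/usr/bin/env bash
# Cautious sketch: remove cold shards from one original data node, prompting
# before each destructive call. Shard IDs and the TCP name are placeholders.
set -euo pipefail

SOURCE="enterprise-data-01:8088"
SHARDS="21 22"   # cold shards already copied to the new data node

for id in $SHARDS; do
  read -r -p "Remove shard ${id} from ${SOURCE}? [y/N] " answer
  if [ "${answer}" = "y" ]; then
    influxd-ctl remove-shard "${SOURCE}" "${id}"
  else
    echo "Skipping shard ${id}"
  fi
done
```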
### Confirm the rebalance
1. Run the following command and, for each relevant shard, confirm the TCP names listed in the `Owners` column are correct:

   ```
   influxd-ctl show-shards
   ```

   Then check the output against the scenario that applies to your rebalance:
   - If you're rebalancing to expand capacity and increase write throughput, verify that the new data node and only one of the original data nodes appear in the `Owners` column. Expected output shows that the copied shard now has only two owners:

     ```
     Shards
     ==========
     ID  Database  Retention Policy  Desired Replicas  [...]  End                              Owners
     22  telegraf  autogen           2                 [...]  2020-01-26T18:05:36.418734949Z*  [{5 enterprise-data-02:8088} {6 enterprise-data-03:8088}]
     ```
   - If you're rebalancing to increase data availability and query throughput, verify that the TCP name of the new data node appears in the `Owners` column. Expected output shows that the copied shard now has three owners:

     ```
     Shards
     ==========
     ID  Database  Retention Policy  Desired Replicas  [...]  End                              Owners
     22  telegraf  autogen           3                 [...]  2020-01-26T18:05:36.418734949Z*  [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088} {6 enterprise-data-03:8088}]
     ```

2. Verify that the copied shards appear in the new data node's shard directory and match the shards in the source data node's shard directory. Shards are located in `/var/lib/influxdb/data/<database>/<retention_policy>/<shard_ID>`. Here's an example of the correct output for shard 22:

   ```
   # On the source data node (enterprise-data-01)
   ~# ls /var/lib/influxdb/data/telegraf/autogen/22
   000000001-000000001.tsm

   # On the new data node (enterprise-data-03)
   ~# ls /var/lib/influxdb/data/telegraf/autogen/22
   000000001-000000001.tsm
   ```
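To compare the two shard directories without logging in to each node by hand, a quick loop over SSH can help. This sketch assumes SSH access to both data nodes and reuses the example host names and shard path from above:

```
# Sketch: list the shard 22 files on the source and new data nodes side by side.
# Host names and the shard path are the placeholder values used in this guide.
for host in enterprise-data-01 enterprise-data-03; do
  echo "== ${host} =="
  ssh "${host}" 'ls -l /var/lib/influxdb/data/telegraf/autogen/22'
done
```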