hotfix: update replacing nodes doc, closes #1309, closes influxdata/DAR#91

pull/5186/head
Scott Anderson 2023-10-18 17:22:21 -06:00
parent be793c7758
commit 244b37d917
1 changed files with 101 additions and 50 deletions


@ -10,33 +10,51 @@ menu:
weight: 20
---
## Introduction
You may need to replace a node in your InfluxDB Enterprise cluster, for example, to update hardware. This guide describes how to replace both meta nodes and data nodes in a cluster:
Nodes in an InfluxDB Enterprise cluster may need to be replaced at some point due to hardware needs, hardware issues, or something else entirely.
This guide outlines processes for replacing both meta nodes and data nodes in an InfluxDB Enterprise cluster.
- [Concepts](#concepts)
- [Scenarios](#scenarios)
- [Replace a node in a cluster with security enable](#replace-a-node-in-a-cluster-with-security-enable)
- [Replace a meta node in a functional cluster](#replace-a-meta-node-in-a-functional-cluster)
- [Replace an unresponsive meta node](#replace-an-unresponsive-meta-node)
- [Replace responsive and unresponsive data nodes in a cluster](#replace-responsive-and-unresponsive-data-nodes-in-a-cluster)
- [Reconnect a data node with a failed disk](#reconnect-a-data-node-with-a-failed-disk)
- [Replace meta nodes in an InfluxDB Enterprise cluster](#replace-meta-nodes-in-an-influxdb-enterprise-cluster)
- [Replace data nodes in an InfluxDB Enterprise cluster](#replace-data-nodes-in-an-influxdb-enterprise-cluster)
- [Troubleshoot](#troubleshoot)
- [Cluster commands result in timeout without error](#cluster-commands-result-in-timeout-without-error)
## Concepts
Meta nodes manage and monitor both the uptime of nodes in the cluster as well as distribution of [shards](/enterprise_influxdb/v1/concepts/glossary/#shard) among nodes in the cluster.
They hold information about which data nodes own which shards; information on which the
[anti-entropy](/enterprise_influxdb/v1/administration/anti-entropy/) (AE) process depends.
Data nodes hold raw time-series data and metadata. Data shards are both distributed and replicated across data nodes in the cluster. The AE process runs on data nodes and references the shard information stored in the meta nodes to ensure each data node has the shards it needs.
**Meta nodes** manage and monitor both the uptime of nodes in the cluster and
distribution of [shards](/influxdb/v2/reference/glossary/#shard) among nodes in
the cluster. Meta nodes hold information about which data nodes own which shards;
information that the [anti-entropy](/enterprise_influxdb/v1/administration/configure/anti-entropy/)
(AE) process depends on.
**Data nodes** hold raw time-series data and metadata. Data shards are both distributed and replicated across data nodes in the cluster. The AE process runs on data nodes and references the shard information stored in the meta nodes to ensure each data node has the shards it needs.
`influxd-ctl` is a CLI included in each meta node and is used to manage your InfluxDB Enterprise cluster.
## Scenarios
### Replace nodes in clusters with security enabled
Many InfluxDB Enterprise clusters are configured with security enabled, forcing secure TLS encryption between all nodes in the cluster.
Both `influxd-ctl` and `curl`, the command line tools used when replacing nodes, have options that facilitate the use of TLS.
### Replace a node in a cluster with security enable
Many InfluxDB Enterprise clusters are configured with security enabled, forcing
secure TLS encryption between all nodes in the cluster.
Both `influxd-ctl` and `curl`, the command line tools used when replacing nodes,
have options that facilitate the use of TLS.
#### `influxd-ctl -bind-tls`
In order to manage your cluster over TLS, pass the `-bind-tls` flag with any `influxd-ctl` command.
> If using a self-signed certificate, pass the `-k` flag to skip certificate verification.
To manage your cluster over TLS, pass the `-bind-tls` flag with any `influxd-ctl` command.
{{% note %}}
If using a self-signed certificate, pass the `-k` flag to skip certificate verification.
{{% /note %}}
```bash
# Pattern
# Syntax
influxd-ctl -bind-tls [-k] <command>
# Example
@ -45,29 +63,38 @@ influxd-ctl -bind-tls remove-meta enterprise-meta-02:8091
#### `curl -k`
`curl` natively supports TLS/SSL connections, but if using a self-signed certificate, pass the `-k`/`--insecure` flag to allow for "insecure" SSL connections.
`curl` natively supports TLS/SSL connections, but if using a self-signed certificate,
pass the `-k`/`--insecure` flag to allow for "insecure" SSL connections.
> Self-signed certificates are considered "insecure" due to their lack of a valid chain of authority. However, data is still encrypted when using self-signed certificates.
{{% note %}}
Self-signed certificates are considered "insecure" due to their lack of a valid
chain of authority. However, data is still encrypted when using self-signed certificates.
{{% /note %}}
```bash
# Pattern
# Syntax
curl [-k, --insecure] <url>
# Example
curl -k https://localhost:8091/status
```
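The two options above can be collected in one place so that every command in this guide is issued with the same TLS settings. The following is a minimal sketch, not part of the InfluxDB tooling: the function names `ctl_cmd` and `status_cmd` and the variables `USE_TLS` and `SELF_SIGNED` are hypothetical, and the functions only print the command they would run.

```bash
# Hypothetical toggles (not InfluxDB settings): set to 1 to enable.
USE_TLS="${USE_TLS:-0}"          # the cluster enforces TLS
SELF_SIGNED="${SELF_SIGNED:-0}"  # certificates are self-signed

# Print an influxd-ctl command line with the TLS flags described above.
ctl_cmd() {
  local flags=""
  if [ "$USE_TLS" = "1" ]; then
    flags=" -bind-tls"
    if [ "$SELF_SIGNED" = "1" ]; then
      flags="$flags -k"
    fi
  fi
  echo "influxd-ctl${flags} $*"
}

# Print a curl command for the /status endpoint of a meta node.
status_cmd() {
  local host="$1"
  if [ "$USE_TLS" = "1" ]; then
    if [ "$SELF_SIGNED" = "1" ]; then
      echo "curl -k https://${host}/status"
    else
      echo "curl https://${host}/status"
    fi
  else
    echo "curl http://${host}/status"
  fi
}
```

For example, with `USE_TLS=1` and `SELF_SIGNED=1`, `ctl_cmd remove-meta enterprise-meta-02:8091` prints the same command as the `-bind-tls` pattern with `-k` added. Printing rather than executing keeps the sketch safe to try before touching a live cluster.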
### Replace meta nodes in a functional cluster
### Replace a meta node in a functional cluster
If all meta nodes in the cluster are fully functional, simply follow the steps for [replacing meta nodes](#replace-meta-nodes-in-an-influxdb-enterprise-cluster).
If all meta nodes in the cluster are fully functional, complete the following
steps to [replace meta nodes](#replace-meta-nodes-in-an-influxdb-enterprise-cluster).
### Replace an unresponsive meta node
If replacing a meta node that is either unreachable or unrecoverable, you need to forcefully remove it from the meta cluster. Instructions for forcefully removing meta nodes are provided in the [step 2.2](#2-2-remove-the-non-leader-meta-node) of the [replacing meta nodes](#replace-meta-nodes-in-an-influxdb-enterprise-cluster) process.
If replacing a meta node that is either unreachable or unrecoverable, you must
forcefully remove the node from the meta cluster.
See [step 2.2](#22-remove-the-non-leader-meta-node) of the
[replace meta nodes](#replace-meta-nodes-in-an-influxdb-enterprise-cluster) process.
### Replace responsive and unresponsive data nodes in a cluster
The process of replacing both responsive and unresponsive data nodes is the same. Simply follow the instructions for [replacing data nodes](#replace-data-nodes-in-an-influxdb-enterprise-cluster).
The process of replacing both responsive and unresponsive data nodes is the same.
Follow the instructions for [replacing data nodes](#replace-data-nodes-in-an-influxdb-enterprise-cluster).
### Reconnect a data node with a failed disk
@ -82,34 +109,40 @@ To resolve this, sign in to a meta node and use the [`influxd-ctl update-data`](
to [update the failed data node to itself](#2-replace-the-old-data-node-with-the-new-data-node).
```bash
# Pattern
# Syntax
influxd-ctl update-data <data-node-tcp-bind-address> <data-node-tcp-bind-address>
# Example
influxd-ctl update-data enterprise-data-01:8088 enterprise-data-01:8088
```
This will connect the `influxd` process running on the newly replaced disk to the cluster.
The AE process will detect the missing shards and begin to sync data from other
This connects the `influxd` process running on the newly replaced disk to the cluster.
The AE process detects the missing shards and begins to sync data from other
shards in the same shard group.
{{% note %}}
If the AE process is disabled, use [`influxd-ctl copy-shard`](/enterprise_influxdb/v1/tools/influxd-ctl/copy-shard/)
to manually copy shards from existing data nodes to the new data node.
{{% /note %}}
## Replace meta nodes in an InfluxDB Enterprise cluster
[Meta nodes](/enterprise_influxdb/v1/concepts/clustering/#meta-nodes) together form a [Raft](https://raft.github.io/) cluster in which nodes elect a leader through consensus vote.
The leader oversees the management of the meta cluster, so it is important to replace non-leader nodes before the leader node.
[Meta nodes](/enterprise_influxdb/v1/concepts/clustering/#meta-nodes) together
form a [Raft](https://raft.github.io/) cluster in which nodes elect a leader through consensus vote.
The leader oversees the management of the meta cluster, so it is important to
replace non-leader nodes before the leader node.
The process for replacing meta nodes is as follows:
1. [Identify the leader node](#1-identify-the-leader-node)
2. [Replace all non-leader nodes](#2-replace-all-non-leader-nodes)
2.1. [Provision a new meta node](#2-1-provision-a-new-meta-node)
2.2. [Remove the non-leader meta node](#2-2-remove-the-non-leader-meta-node)
2.3. [Add the new meta node](#2-3-add-the-new-meta-node)
2.4. [Confirm the meta node was added](#2-4-confirm-the-meta-node-was-added)
2.5. [Remove and replace all other non-leader meta nodes](#2-5-remove-and-replace-all-other-non-leader-meta-nodes)
2.1. [Provision a new meta node](#21-provision-a-new-meta-node)
2.2. [Remove the non-leader meta node](#22-remove-the-non-leader-meta-node)
2.3. [Add the new meta node](#23-add-the-new-meta-node)
2.4. [Confirm the meta node was added](#24-confirm-the-meta-node-was-added)
2.5. [Remove and replace all other non-leader meta nodes](#25-remove-and-replace-all-other-non-leader-meta-nodes)
3. [Replace the leader node](#3-replace-the-leader-node)
3.1. [Kill the meta process on the leader node](#3-1-kill-the-meta-process-on-the-leader-node)
3.2. [Remove and replace the old leader node](#3-2-remove-and-replace-the-old-leader-node)
3.1. [Kill the meta process on the leader node](#31-kill-the-meta-process-on-the-leader-node)
3.2. [Remove and replace the old leader node](#32-remove-and-replace-the-old-leader-node)
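
The ordering in the steps above (every non-leader first, the leader last) can be sketched as a small shell function. This is an illustration only: `replacement_order` is a hypothetical helper, it assumes you pass the leader's address followed by the addresses of all meta nodes, and the hostnames in the example call are placeholders.

```bash
# Print meta nodes in the order they should be replaced:
# every non-leader node first, the leader node last.
replacement_order() {
  local leader="$1"
  shift
  local node
  for node in "$@"; do
    if [ "$node" != "$leader" ]; then
      echo "$node"
    fi
  done
  echo "$leader"
}

# Placeholder hostnames; the first argument is the current leader.
replacement_order enterprise-meta-01:8091 \
  enterprise-meta-01:8091 enterprise-meta-02:8091 enterprise-meta-03:8091
```

With these placeholder values, the call lists `enterprise-meta-02:8091` and `enterprise-meta-03:8091` before `enterprise-meta-01:8091`.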
### 1. Identify the leader node
@ -119,7 +152,9 @@ Log into any of your meta nodes and run the following:
curl -s localhost:8091/status | jq
```
> Piping the command into `jq` is optional, but does make the JSON output easier to read.
{{% note %}}
Piping the command into `jq` is optional, but does make the JSON output easier to read.
{{% /note %}}
The output will include information about the current meta node, the leader of the meta cluster, and a list of "peers" in the meta cluster.
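
If you only need the leader's address from that output, a minimal sketch such as the following can extract it. The helper name `leader_from_status` is hypothetical, the sample JSON is made up for illustration, and the `sed` pattern assumes the `leader` value is a plain quoted string.

```bash
# Read /status JSON on stdin and print the value of its "leader" field.
leader_from_status() {
  sed -n 's/.*"leader"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Illustrative input only; the field values here are made up.
sample='{"leader":"enterprise-meta-01:8089","peers":[]}'
printf '%s\n' "$sample" | leader_from_status
```

Against a running meta node, the same function could be fed directly with `curl -s localhost:8091/status | leader_from_status`. If `jq` is installed, `jq -r '.leader'` is a sturdier way to read the same field.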
@ -143,7 +178,8 @@ Identify the `leader` of the cluster. When replacing nodes in a cluster, non-lea
#### 2.1. Provision a new meta node
[Provision and start a new meta node](/enterprise_influxdb/v1/installation/meta_node_installation/), but **do not** add it to the cluster yet.
[Provision and start a new meta node](/enterprise_influxdb/v1/installation/meta_node_installation/),
but **do not** add it to the cluster yet.
For this guide, the new meta node's hostname will be `enterprise-meta-04`.
#### 2.2. Remove the non-leader meta node
@ -151,29 +187,33 @@ For this guide, the new meta node's hostname will be `enterprise-meta-04`.
Now remove the non-leader node you are replacing by using the `influxd-ctl remove-meta` command and the TCP address of the meta node (for example, `enterprise-meta-02:8091`):
```bash
# Pattern
# Syntax
influxd-ctl remove-meta <meta-node-tcp-bind-address>
# Example
influxd-ctl remove-meta enterprise-meta-02:8091
```
> Only use `remove-meta` if you want to permanently remove a meta node from a cluster.
{{% note %}}
Only use `remove-meta` if you want to permanently remove a meta node from a cluster.
{{% /note %}}
<!-- -->
{{% note %}}
**For unresponsive or unrecoverable meta nodes:**
> **For unresponsive or unrecoverable meta nodes:**
>If the meta process is not running on the node you are trying to remove or the node is neither reachable nor recoverable, use the `-force` flag.
When forcefully removing a meta node, you must also pass the `-tcpAddr` flag with the TCP and HTTP bind addresses of the node you are removing.
If the meta process is not running on the node you are trying to remove or the
node is neither reachable nor recoverable, use the `-force` flag.
When forcefully removing a meta node, you must also pass the `-tcpAddr` flag with
the TCP and HTTP bind addresses of the node you are removing.
```bash
# Pattern
# Syntax
influxd-ctl remove-meta -force -tcpAddr <meta-node-tcp-bind-address> <meta-node-http-bind-address>
# Example
influxd-ctl remove-meta -force -tcpAddr enterprise-meta-02:8089 enterprise-meta-02:8091
```
{{% /note %}}
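
The choice between the regular and the forced form of `remove-meta` can be sketched as follows. `remove_meta_cmd` is a hypothetical helper that only prints the command; it assumes the first argument is the address used in the regular form (`enterprise-meta-02:8091` in this guide) and that an optional second argument is the node's TCP bind address, supplied only when the node is unresponsive.

```bash
# Print the remove-meta command for a meta node.
# $1: address used in the regular form (host:8091 in this guide)
# $2: optional TCP bind address; when given, the node is treated as
#     unresponsive and the forced form is printed instead.
remove_meta_cmd() {
  local http_addr="$1"
  local tcp_addr="${2:-}"
  if [ -n "$tcp_addr" ]; then
    echo "influxd-ctl remove-meta -force -tcpAddr $tcp_addr $http_addr"
  else
    echo "influxd-ctl remove-meta $http_addr"
  fi
}

# Responsive node: regular form.
remove_meta_cmd enterprise-meta-02:8091
# Unresponsive node: forced form with both addresses.
remove_meta_cmd enterprise-meta-02:8091 enterprise-meta-02:8089
```

Reviewing the printed command before running it is a cheap safeguard, because removing a meta node is intended to be permanent.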
#### 2.3. Add the new meta node
@ -181,7 +221,7 @@ Once the non-leader meta node has been removed, on **one of the existing meta no
run `influxd-ctl add-meta` to replace the node removed with the new meta node:
```bash
# Pattern
# Syntax
influxd-ctl add-meta <meta-node-tcp-bind-address>
# Example
@ -191,7 +231,7 @@ influxd-ctl add-meta enterprise-meta-04:8091
You can also add a meta node remotely through another meta node:
```bash
# Pattern
# Syntax
influxd-ctl -bind <remote-meta-node-bind-address> add-meta <meta-node-tcp-bind-address>
# Example
@ -280,7 +320,7 @@ The process of replacing data nodes is as follows:
Log into any of your cluster's meta nodes and use `influxd-ctl update-data` to replace the old data node with the new data node:
```bash
# Pattern
# Syntax
influxd-ctl update-data <old-node-tcp-bind-address> <new-node-tcp-bind-address>
# Example
@ -330,8 +370,15 @@ ID Database Retention Policy Desired Replicas Shard Group Start
```
Within the duration defined by [`anti-entropy.check-interval`](/enterprise_influxdb/v1/administration/config-data-nodes#check-interval-10m),
the AE service will begin copying shards from other shard owners to the new node.
The time it takes for copying to complete is determined by the number of shards copied and how much data is stored in each.
the AE service begins copying shards from other shard owners to the new node.
The time it takes for copying to complete is determined by the number of shards
copied and how much data is stored in each.
{{% note %}}
**Tip:** If unexpected shard issues occur (for example, when AE is disabled or
causing unexpected results), use [`influxd-ctl copy-shard`](/enterprise_influxdb/v1/tools/influxd-ctl/copy-shard/)
to manually replace shards on a node.
{{% /note %}}
### 4. Check the `copy-shard-status`
@ -348,8 +395,12 @@ Source Dest Database Policy ShardID To
enterprise-data-02:8088 enterprise-data-03:8088 telegraf autogen 3 119624324 119624324 2018-04-17 23:45:09.470696179 +0000 UTC
```
> **Important:** If replacing other data nodes in the cluster, make sure shards are completely copied from nodes in the same shard group before replacing the other nodes.
View the [Anti-entropy](/enterprise_influxdb/v1/administration/anti-entropy/#concepts) documentation for important information regarding anti-entropy and your database's replication factor.
{{% note %}}
**Important:** If replacing other data nodes in the cluster, make sure shards
are completely copied from nodes in the same shard group before replacing the other nodes.
View the [Anti-entropy documentation](/enterprise_influxdb/v1/administration/configure/anti-entropy/)
for important information regarding anti-entropy and your database's replication factor.
{{% /note %}}
## Troubleshoot
@ -377,7 +428,7 @@ curl localhost:8091/user
You can also check the permissions of a specific user by passing the username with the `name` parameter:
```bash
# Pattern
# Syntax
curl localhost:8091/user?name=<username>
# Example