Merge pull request #23498 from sftim/20200827_revise_multiple_zones_setup

Revise page about multiple zones
pull/23522/head
Kubernetes Prow Robot 2020-10-07 23:12:15 -07:00 committed by GitHub
commit 56b0a2fb86
1 changed file with 101 additions and 361 deletions


@@ -4,401 +4,141 @@ reviewers:
- justinsb
- quinton-hoole
title: Running in multiple zones
weight: 10
weight: 20
content_type: concept
---
<!-- overview -->
This page describes how to run a cluster in multiple zones.
This page describes running Kubernetes across multiple zones.
<!-- body -->
## Introduction
## Background
Kubernetes 1.2 adds support for running a single cluster in multiple failure zones
(GCE calls them simply "zones", AWS calls them "availability zones", here we'll refer to them as "zones").
This is a lightweight version of a broader Cluster Federation feature (previously referred to by the affectionate
nickname ["Ubernetes"](https://github.com/kubernetes/community/blob/{{< param "githubbranch" >}}/contributors/design-proposals/multicluster/federation.md)).
Full Cluster Federation allows combining separate
Kubernetes clusters running in different regions or cloud providers
(or on-premises data centers). However, many
users simply want to run a more available Kubernetes cluster in multiple zones
of their single cloud provider, and this is what the multizone support in 1.2 allows
(this previously went by the nickname "Ubernetes Lite").
Kubernetes is designed so that a single Kubernetes cluster can run
across multiple failure zones, typically where these zones fit within
a logical grouping called a _region_. Major cloud providers define a region
as a set of failure zones (also called _availability zones_) that provide
a consistent set of features: within a region, each zone offers the same
APIs and services.
Multizone support is deliberately limited: a single Kubernetes cluster can run
in multiple zones, but only within the same region (and cloud provider). Only
GCE and AWS are currently supported automatically (though it is easy to
add similar support for other clouds or even bare metal, by simply arranging
for the appropriate labels to be added to nodes and volumes).
Typical cloud architectures aim to minimize the chance that a failure in
one zone also impairs services in another zone.
## Control plane behavior
## Functionality
All [control plane components](/docs/concepts/overview/components/#control-plane-components)
support running as a pool of interchangeable resources, replicated per
component.
When nodes are started, the kubelet automatically adds labels to them with
zone information.
Kubernetes will automatically spread the pods in a replication controller
or service across nodes in a single-zone cluster (to reduce the impact of
failures). With multiple-zone clusters, this spreading behavior is
extended across zones (to reduce the impact of zone failures). This is
achieved via `SelectorSpreadPriority`. This is a best-effort
placement, and so if the zones in your cluster are heterogeneous
(for example, different numbers of nodes, different types of nodes, or
different pod resource requirements), this might prevent perfectly
even spreading of your pods across zones. If desired, you can use
homogeneous zones (same number and types of nodes) to reduce the
probability of unequal spreading.
When persistent volumes are created, the `PersistentVolumeLabel`
admission controller automatically adds zone labels to them. The scheduler (via the
`VolumeZonePredicate` predicate) will then ensure that pods that claim a
given volume are only placed into the same zone as that volume, as volumes
cannot be attached across zones.
## Limitations
There are some important limitations of the multizone support:
* We assume that the different zones are located close to each other in the
network, so we don't perform any zone-aware routing. In particular, traffic
that goes via services might cross zones (even if some pods backing that service
exist in the same zone as the client), and this may incur additional latency and cost.
* Volume zone-affinity will only work with a `PersistentVolume`, and will not
work if you directly specify an EBS volume in the pod spec (for example).
* Clusters cannot span clouds or regions (this functionality will require full
federation support).
* Although your nodes are in multiple zones, kube-up currently builds
a single master node by default. While services are highly
available and can tolerate the loss of a zone, the control plane is
located in a single zone. Users that want a highly available control
plane should follow the [high availability](/docs/setup/production-environment/tools/kubeadm/high-availability/) instructions.
### Volume limitations
The following limitations are addressed with [topology-aware volume binding](/docs/concepts/storage/storage-classes/#volume-binding-mode).
* StatefulSet volume zone spreading when using dynamic provisioning is currently not compatible with
pod affinity or anti-affinity policies.
* If the name of the StatefulSet contains dashes ("-"), volume zone spreading
may not provide a uniform distribution of storage across zones.
* When specifying multiple PVCs in a Deployment or Pod spec, the StorageClass
needs to be configured for a specific single zone, or the PVs need to be
statically provisioned in a specific zone. Another workaround is to use a
StatefulSet, which will ensure that all the volumes for a replica are
provisioned in the same zone.
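As a sketch of the topology-aware volume binding mentioned above (the class
name and provisioner shown are illustrative), a StorageClass can delay volume
binding until a Pod using the claim is scheduled, so the volume is provisioned
in that Pod's zone:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topology-aware-standard     # illustrative name
provisioner: kubernetes.io/gce-pd   # illustrative in-tree provisioner
# Wait until a Pod that uses the claim is scheduled, then provision the
# volume in that Pod's zone.
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: pd-standard
```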
## Walkthrough
We're now going to walk through setting up and using a multi-zone
cluster on both GCE & AWS. To do so, you bring up a full cluster
(specifying `MULTIZONE=true`), and then you add nodes in additional zones
by running `kube-up` again (specifying `KUBE_USE_EXISTING_MASTER=true`).
### Bringing up your cluster
Create the cluster as normal, but pass `MULTIZONE=true` to tell the cluster to manage multiple zones; this creates nodes in us-central1-a (GCE) or us-west-2a (AWS).
GCE:
```shell
curl -sS https://get.k8s.io | MULTIZONE=true KUBERNETES_PROVIDER=gce KUBE_GCE_ZONE=us-central1-a NUM_NODES=3 bash
```
AWS:
```shell
curl -sS https://get.k8s.io | MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2a NUM_NODES=3 bash
```
This step brings up a cluster as normal, still running in a single zone
(but `MULTIZONE=true` has enabled multi-zone capabilities).
### Nodes are labeled
View the nodes; you can see that they are labeled with zone information.
They are all in `us-central1-a` (GCE) or `us-west-2a` (AWS) so far. The
labels are `failure-domain.beta.kubernetes.io/region` for the region,
and `failure-domain.beta.kubernetes.io/zone` for the zone:
```shell
kubectl get nodes --show-labels
```
The output is similar to this:
```shell
NAME STATUS ROLES AGE VERSION LABELS
kubernetes-master Ready,SchedulingDisabled <none> 6m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-1,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-master
kubernetes-minion-87j9 Ready <none> 6m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-87j9
kubernetes-minion-9vlv Ready <none> 6m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-9vlv
kubernetes-minion-a12q Ready <none> 6m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-a12q
```
### Add more nodes in a second zone
Let's add another set of nodes to the existing cluster, reusing the
existing master, running in a different zone (us-central1-b or us-west-2b).
We run kube-up again; by specifying `KUBE_USE_EXISTING_MASTER=true`,
kube-up will reuse the previously created master instead of creating
a new one.
GCE:
```shell
KUBE_USE_EXISTING_MASTER=true MULTIZONE=true KUBERNETES_PROVIDER=gce KUBE_GCE_ZONE=us-central1-b NUM_NODES=3 kubernetes/cluster/kube-up.sh
```
On AWS we also need to specify the network CIDR for the additional
subnet, along with the master internal IP address:
```shell
KUBE_USE_EXISTING_MASTER=true MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2b NUM_NODES=3 KUBE_SUBNET_CIDR=172.20.1.0/24 MASTER_INTERNAL_IP=172.20.0.9 kubernetes/cluster/kube-up.sh
```
View the nodes again; 3 more nodes should have launched and be tagged
in us-central1-b:
```shell
kubectl get nodes --show-labels
```
The output is similar to this:
```shell
NAME STATUS ROLES AGE VERSION LABELS
kubernetes-master Ready,SchedulingDisabled <none> 16m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-1,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-master
kubernetes-minion-281d Ready <none> 2m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=kubernetes-minion-281d
kubernetes-minion-87j9 Ready <none> 16m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-87j9
kubernetes-minion-9vlv Ready <none> 16m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-9vlv
kubernetes-minion-a12q Ready <none> 17m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-a12q
kubernetes-minion-pp2f Ready <none> 2m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=kubernetes-minion-pp2f
kubernetes-minion-wf8i Ready <none> 2m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=kubernetes-minion-wf8i
```
### Volume affinity
Create a volume using dynamic volume provisioning (only PersistentVolumes are supported for zone affinity):
```bash
kubectl apply -f - <<EOF
{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": {
    "name": "claim1",
    "annotations": {
      "volume.alpha.kubernetes.io/storage-class": "foo"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "5Gi"
      }
    }
  }
}
EOF
```
When you deploy a cluster control plane, place replicas of
control plane components across multiple failure zones. If availability is
an important concern, select at least three failure zones and replicate
each individual control plane component (API server, scheduler, etcd,
cluster controller manager) across at least three failure zones.
If you are running a cloud controller manager then you should
also replicate this across all the failure zones you selected.
{{< note >}}
For version 1.3+ Kubernetes will distribute dynamic PV claims across
the configured zones. For version 1.2, dynamic persistent volumes were
always created in the zone of the cluster master
(here us-central1-a / us-west-2a); that issue
([#23330](https://github.com/kubernetes/kubernetes/issues/23330))
was addressed in 1.3+.
{{< /note >}}

{{< note >}}
Kubernetes does not provide cross-zone resilience for the API server
endpoints. You can use various techniques to improve availability for
the cluster API server, including DNS round-robin, SRV records, or
a third-party load balancing solution with health checking.
{{< /note >}}
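One approach, sketched here with kubeadm (the version and DNS name are
placeholders, not taken from this page), is to point every control plane
member and client at a single stable endpoint and back that endpoint with a
cross-zone load balancer or health-checked DNS:
```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.19.0
# Placeholder address: back this name with a load balancer or round-robin
# DNS that health-checks each API server replica across zones.
controlPlaneEndpoint: "k8s-api.example.com:6443"
```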
Now let's validate that Kubernetes automatically labeled the PV with the zone and region it was created in.
## Node behavior
```shell
kubectl get pv --show-labels
```
Kubernetes automatically spreads the Pods for
workload resources (such as {{< glossary_tooltip text="Deployment" term_id="deployment" >}}
or {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}})
across different nodes in a cluster. This spreading helps
reduce the impact of failures.
The output is similar to this:
When nodes start up, the kubelet on each node automatically adds
{{< glossary_tooltip text="labels" term_id="label" >}} to the Node object
that represents that specific kubelet in the Kubernetes API.
These labels can include
[zone information](/docs/reference/kubernetes-api/labels-annotations-taints/#topologykubernetesiozone).
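For illustration, the zone-related labels a kubelet sets on its Node object
might look like this (the node name and label values are hypothetical):
```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-a                                 # hypothetical node name
  labels:
    topology.kubernetes.io/region: us-east-1   # illustrative region
    topology.kubernetes.io/zone: us-east-1c    # illustrative zone
```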
```shell
NAME CAPACITY ACCESSMODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE LABELS
pv-gce-mj4gm 5Gi RWO Retain Bound default/claim1 manual 46s failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a
```
If your cluster spans multiple zones or regions, you can use node labels
in conjunction with
[Pod topology spread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
to control how Pods are spread across your cluster among fault domains:
regions, zones, and even specific nodes.
These hints enable the
{{< glossary_tooltip text="scheduler" term_id="kube-scheduler" >}} to place
Pods for better expected availability, reducing the risk that a correlated
failure affects your whole workload.
So now we will create a pod that uses the persistent volume claim.
Because GCE PDs / AWS EBS volumes cannot be attached across zones,
the pod can only be scheduled in the same zone as the volume:
For example, you can set a constraint to make sure that the
3 replicas of a StatefulSet are all running in different zones to each
other, whenever that is feasible. You can define this declaratively
without explicitly defining which availability zones are in use for
each workload.
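A minimal sketch of such a constraint on a StatefulSet (the name, label, and
image are hypothetical, not taken from this page):
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zone-spread-demo           # hypothetical name
spec:
  serviceName: zone-spread-demo
  replicas: 3
  selector:
    matchLabels:
      app: zone-spread-demo
  template:
    metadata:
      labels:
        app: zone-spread-demo
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        # ScheduleAnyway keeps this best-effort: spread across zones when
        # feasible, but still schedule if the zones cannot be balanced.
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: zone-spread-demo
      containers:
      - name: app
        image: nginx               # placeholder image
```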
```shell
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: myfrontend
    image: nginx
    volumeMounts:
    - mountPath: "/var/www/html"
      name: mypd
  volumes:
  - name: mypd
    persistentVolumeClaim:
      claimName: claim1
EOF
```
### Distributing nodes across zones
Note that the pod was automatically created in the same zone as the volume, as
cross-zone attachments are not generally permitted by cloud providers:
Kubernetes' core does not create nodes for you; you need to do that yourself,
or use a tool such as the [Cluster API](https://cluster-api.sigs.k8s.io/) to
manage nodes on your behalf.
```shell
kubectl describe pod mypod | grep Node
```
Using tools such as the Cluster API you can define sets of machines to run as
worker nodes for your cluster across multiple failure domains, and rules to
automatically heal the cluster in case of whole-zone service disruption.
```shell
Node: kubernetes-minion-9vlv/10.240.0.5
```
## Manual zone assignment for Pods
And check node labels:
You can apply [node selector constraints](/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector)
to Pods that you create, as well as to Pod templates in workload resources
such as Deployment, StatefulSet, or Job.
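For example (the Pod name and zone value are illustrative), a node selector
can pin a Pod to a single zone:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: zone-pinned                              # hypothetical name
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-central1-b   # illustrative zone
  containers:
  - name: app
    image: nginx                                 # placeholder image
```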
```shell
kubectl get node kubernetes-minion-9vlv --show-labels
```
## Storage access for zones
```shell
NAME STATUS AGE VERSION LABELS
kubernetes-minion-9vlv Ready 22m v1.6.0+fff5156 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-9vlv
```
When persistent volumes are created, the `PersistentVolumeLabel`
[admission controller](/docs/reference/access-authn-authz/admission-controllers/)
automatically adds zone labels to any PersistentVolumes that are linked to a specific
zone. The {{< glossary_tooltip text="scheduler" term_id="kube-scheduler" >}} then ensures,
through its `NoVolumeZoneConflict` predicate, that pods which claim a given PersistentVolume
are only placed into the same zone as that volume.
### Pods are spread across zones
You can specify a {{< glossary_tooltip text="StorageClass" term_id="storage-class" >}}
for PersistentVolumeClaims that sets the failure domains (zones) that the
storage in that class may use.
To learn about configuring a StorageClass that is aware of failure domains or zones,
see [Allowed topologies](/docs/concepts/storage/storage-classes/#allowed-topologies).
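A sketch of a StorageClass restricted to specific zones via allowed
topologies (the class name, provisioner, and zone values are illustrative):
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-zoned              # illustrative name
provisioner: kubernetes.io/gce-pd   # illustrative provisioner
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.kubernetes.io/zone
    values:                         # only provision volumes in these zones
    - us-central1-a
    - us-central1-b
```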
Pods in a replication controller or service are automatically spread
across zones. First, let's launch more nodes in a third zone:
## Networking
GCE:
By itself, Kubernetes does not include zone-aware networking. You can use a
[network plugin](/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/)
to configure cluster networking, and that network solution might have zone-specific
elements. For example, if your cloud provider supports Services with
`type=LoadBalancer`, the load balancer might only send traffic to Pods running in the
same zone as the load balancer element processing a given connection.
Check your cloud provider's documentation for details.
```shell
KUBE_USE_EXISTING_MASTER=true MULTIZONE=true KUBERNETES_PROVIDER=gce KUBE_GCE_ZONE=us-central1-f NUM_NODES=3 kubernetes/cluster/kube-up.sh
```
For custom or on-premises deployments, similar considerations apply.
{{< glossary_tooltip text="Service" term_id="service" >}} and
{{< glossary_tooltip text="Ingress" term_id="ingress" >}} behavior, including handling
of different failure zones, does vary depending on exactly how your cluster is set up.
AWS:
## Fault recovery
```shell
KUBE_USE_EXISTING_MASTER=true MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2c NUM_NODES=3 KUBE_SUBNET_CIDR=172.20.2.0/24 MASTER_INTERNAL_IP=172.20.0.9 kubernetes/cluster/kube-up.sh
```
When you set up your cluster, you might also need to consider whether and how
your setup can restore service if all of the failure zones in a region go
off-line at the same time. For example, do you rely on there being at least
one node able to run Pods in a zone?
Make sure that any cluster-critical repair work does not rely
on there being at least one healthy node in your cluster. For example: if all nodes
are unhealthy, you might need to run a repair Job with a special
{{< glossary_tooltip text="toleration" term_id="toleration" >}} so that the repair
can complete enough to bring at least one node into service.
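A sketch of such a repair Job (the name, image, and command are hypothetical);
tolerating every taint lets its Pod land even on nodes marked unhealthy:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: node-repair                 # hypothetical name
spec:
  template:
    spec:
      tolerations:
      # Tolerate all taints (including node.kubernetes.io/not-ready and
      # node.kubernetes.io/unreachable) so the Pod can run on an unhealthy node.
      - operator: Exists
      containers:
      - name: repair
        image: busybox:1.28                                   # placeholder image
        command: ["sh", "-c", "echo repair steps go here"]    # illustrative only
      restartPolicy: Never
  backoffLimit: 0
```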
Verify that you now have nodes in 3 zones:
```shell
kubectl get nodes --show-labels
```
Create the guestbook-go example, which includes a ReplicationController with 3 replicas, running a simple web app:
```shell
find kubernetes/examples/guestbook-go/ -name '*.json' | xargs -I {} kubectl apply -f {}
```
The pods should be spread across all 3 zones:
```shell
kubectl describe pod -l app=guestbook | grep Node
```
```shell
Node: kubernetes-minion-9vlv/10.240.0.5
Node: kubernetes-minion-281d/10.240.0.8
Node: kubernetes-minion-olsh/10.240.0.11
```
```shell
kubectl get node kubernetes-minion-9vlv kubernetes-minion-281d kubernetes-minion-olsh --show-labels
```
```shell
NAME STATUS ROLES AGE VERSION LABELS
kubernetes-minion-9vlv Ready <none> 34m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-9vlv
kubernetes-minion-281d Ready <none> 20m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=kubernetes-minion-281d
kubernetes-minion-olsh Ready <none> 3m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-f,kubernetes.io/hostname=kubernetes-minion-olsh
```
Load balancers span all zones in a cluster; the guestbook-go example
includes an example load-balanced service:
```shell
kubectl describe service guestbook | grep LoadBalancer.Ingress
```
The output is similar to this:
```shell
LoadBalancer Ingress: 130.211.126.21
```
Export the above IP address:
```shell
export IP=130.211.126.21
```
Explore the service with curl via that IP:
```shell
curl -s http://${IP}:3000/env | grep HOSTNAME
```
The output is similar to this:
```shell
"HOSTNAME": "guestbook-44sep",
```
Query the service again, multiple times:
```shell
(for i in `seq 20`; do curl -s http://${IP}:3000/env | grep HOSTNAME; done) | sort | uniq
```
The output is similar to this:
```shell
"HOSTNAME": "guestbook-44sep",
"HOSTNAME": "guestbook-hum5n",
"HOSTNAME": "guestbook-ppm40",
```
The load balancer correctly targets all the pods, even though they are in multiple zones.
### Shutting down the cluster
When you're done, clean up:
GCE:
```shell
KUBERNETES_PROVIDER=gce KUBE_USE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-f kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=gce KUBE_USE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=gce KUBE_GCE_ZONE=us-central1-a kubernetes/cluster/kube-down.sh
```
AWS:
```shell
KUBERNETES_PROVIDER=aws KUBE_USE_EXISTING_MASTER=true KUBE_AWS_ZONE=us-west-2c kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=aws KUBE_USE_EXISTING_MASTER=true KUBE_AWS_ZONE=us-west-2b kubernetes/cluster/kube-down.sh
KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2a kubernetes/cluster/kube-down.sh
```
Kubernetes doesn't come with an answer for this fault recovery challenge; however, it's
something to consider.
## {{% heading "whatsnext" %}}
To learn how the scheduler places Pods in a cluster, honoring the configured constraints,
visit [Scheduling and Eviction](/docs/concepts/scheduling-eviction/).