---
reviewers:
- bprashanth
- enisoc
- erictune
- foxish
- janetkuo
- kow3ns
- smarterclayton
title: Running ZooKeeper, A Distributed System Coordinator
content_type: tutorial
weight: 40
---

<!-- overview -->
|
|
This tutorial demonstrates running [Apache ZooKeeper](https://zookeeper.apache.org) on
Kubernetes using [StatefulSets](/docs/concepts/workloads/controllers/statefulset/),
[PodDisruptionBudgets](/docs/concepts/workloads/pods/disruptions/#specifying-a-poddisruptionbudget),
and [PodAntiAffinity](/docs/user-guide/node-selection/#inter-pod-affinity-and-anti-affinity-beta-feature).
|
|
|
|
|
|
## {{% heading "prerequisites" %}}
|
|
|
|
|
|
Before starting this tutorial, you should be familiar with the following
Kubernetes concepts.

- [Pods](/docs/user-guide/pods/single-container/)
- [Cluster DNS](/docs/concepts/services-networking/dns-pod-service/)
- [Headless Services](/docs/concepts/services-networking/service/#headless-services)
- [PersistentVolumes](/docs/concepts/storage/volumes/)
- [PersistentVolume Provisioning](https://github.com/kubernetes/examples/tree/{{< param "githubbranch" >}}/staging/persistent-volume-provisioning/)
- [StatefulSets](/docs/concepts/workloads/controllers/statefulset/)
- [PodDisruptionBudgets](/docs/concepts/workloads/pods/disruptions/#specifying-a-poddisruptionbudget)
- [PodAntiAffinity](/docs/user-guide/node-selection/#inter-pod-affinity-and-anti-affinity-beta-feature)
- [kubectl CLI](/docs/user-guide/kubectl/)
|
|
|
|
You will require a cluster with at least four nodes, and each node requires at least 2 CPUs and 4 GiB of memory. In this tutorial you will cordon and drain the cluster's nodes. **This means that the cluster will terminate and evict all Pods on its nodes, and the nodes will temporarily become unschedulable.** You should use a dedicated cluster for this tutorial, or you should ensure that the disruption you cause will not interfere with other tenants.
|
|
|
|
This tutorial assumes that you have configured your cluster to dynamically provision
|
|
PersistentVolumes. If your cluster is not configured to do so, you
|
|
will have to manually provision three 20 GiB volumes before starting this
|
|
tutorial.
|
|
|
|
|
|
## {{% heading "objectives" %}}
|
|
|
|
After this tutorial, you will know the following.
|
|
|
|
- How to deploy a ZooKeeper ensemble using StatefulSet.
|
|
- How to consistently configure the ensemble using ConfigMaps.
|
|
- How to spread the deployment of ZooKeeper servers in the ensemble.
|
|
- How to use PodDisruptionBudgets to ensure service availability during planned maintenance.
|
|
|
|
|
|
<!-- lessoncontent -->
|
|
|
|
### ZooKeeper Basics
|
|
|
|
[Apache ZooKeeper](https://zookeeper.apache.org/doc/current/) is a
distributed, open-source coordination service for distributed applications.
ZooKeeper allows you to read, write, and observe updates to data. Data are
organized in a file-system-like hierarchy and replicated to all ZooKeeper
servers in the ensemble (a set of ZooKeeper servers). All operations on data
are atomic and sequentially consistent. ZooKeeper ensures this by using the
[Zab](https://pdfs.semanticscholar.org/b02c/6b00bd5dbdbd951fddb00b906c82fa80f0b3.pdf)
consensus protocol to replicate a state machine across all servers in the ensemble.
|
|
|
|
The ensemble uses the Zab protocol to elect a leader, and the ensemble cannot write data until that election is complete. Once the election is complete, the ensemble uses Zab to ensure that it replicates all writes to a quorum before it acknowledges them and makes them visible to clients. Leaving weighted quorums aside, a quorum is a majority component of the ensemble containing the current leader. For instance, if the ensemble has three servers, a component that contains the leader and one other server constitutes a quorum. If the ensemble cannot achieve a quorum, the ensemble cannot write data.
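
The majority rule reduces to integer arithmetic. The sketch below is illustrative only: `quorum_size` and `max_failures` are hypothetical helper names, not part of ZooKeeper or kubectl, and weighted quorums are ignored.

```shell
# Hypothetical helpers (not part of ZooKeeper): majority-quorum arithmetic.

# Smallest number of servers that forms a majority of an ensemble of size $1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Largest number of servers that can fail while a majority remains.
max_failures() {
  echo $(( ($1 - 1) / 2 ))
}

quorum_size 3    # prints 2
max_failures 3   # prints 1
quorum_size 5    # prints 3
```

A three-server ensemble therefore keeps accepting writes with one server down. Adding a fourth server does not raise the tolerance, because `max_failures 4` is still 1.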
|
|
|
|
ZooKeeper servers keep their entire state machine in memory, and write every mutation to a durable WAL (Write-Ahead Log) on storage media. When a server crashes, it can recover its previous state by replaying the WAL. To prevent the WAL from growing without bound, ZooKeeper servers periodically snapshot their in-memory state to storage media. These snapshots can be loaded directly into memory, and all WAL entries that preceded the snapshot may be discarded.
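
Recovery by replay can be pictured with a toy log. The sketch below is a simplified model, not ZooKeeper's actual transaction log, which is a binary format. The `replay_wal` function and the `set`/`delete` line format are assumptions made for this example.

```shell
# Toy model only: ZooKeeper's real transaction log is a binary format.
# Each input line is one logged mutation: "set <path> <value>" or "delete <path>".
replay_wal() {
  awk '
    $1 == "set"    { state[$2] = $3 }      # apply a write
    $1 == "delete" { delete state[$2] }    # apply a removal
    END { for (path in state) print path "=" state[path] }
  ' | sort
}

# Replaying the log from the beginning reconstructs the final state.
printf 'set /hello world\nset /tmp scratch\ndelete /tmp\n' | replay_wal
# prints /hello=world
```

In this model a snapshot is the output of `replay_wal` at some point in time. Once it is saved, the log lines that produced it are no longer needed for recovery.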
|
|
|
|
## Creating a ZooKeeper Ensemble
|
|
|
|
The manifest below contains a
[Headless Service](/docs/concepts/services-networking/service/#headless-services),
a [Service](/docs/concepts/services-networking/service/),
a [PodDisruptionBudget](/docs/concepts/workloads/pods/disruptions/#specifying-a-poddisruptionbudget),
and a [StatefulSet](/docs/concepts/workloads/controllers/statefulset/).
|
|
|
|
{{< codenew file="application/zookeeper/zookeeper.yaml" >}}
|
|
|
|
Open a terminal, and use the
[`kubectl apply`](/docs/reference/generated/kubectl/kubectl-commands/#apply) command to create the
objects defined in the manifest.
|
|
|
|
```shell
|
|
kubectl apply -f https://k8s.io/examples/application/zookeeper/zookeeper.yaml
|
|
```
|
|
|
|
This creates the `zk-hs` Headless Service, the `zk-cs` Service,
|
|
the `zk-pdb` PodDisruptionBudget, and the `zk` StatefulSet.
|
|
|
|
```
|
|
service/zk-hs created
|
|
service/zk-cs created
|
|
poddisruptionbudget.policy/zk-pdb created
|
|
statefulset.apps/zk created
|
|
```
|
|
|
|
Use [`kubectl get`](/docs/reference/generated/kubectl/kubectl-commands/#get) to watch the
|
|
StatefulSet controller create the StatefulSet's Pods.
|
|
|
|
```shell
|
|
kubectl get pods -w -l app=zk
|
|
```
|
|
|
|
Once the `zk-2` Pod is Running and Ready, use `CTRL-C` to terminate kubectl.
|
|
|
|
```
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 ContainerCreating 0 0s
|
|
zk-0 0/1 Running 0 19s
|
|
zk-0 1/1 Running 0 40s
|
|
zk-1 0/1 Pending 0 0s
|
|
zk-1 0/1 Pending 0 0s
|
|
zk-1 0/1 ContainerCreating 0 0s
|
|
zk-1 0/1 Running 0 18s
|
|
zk-1 1/1 Running 0 40s
|
|
zk-2 0/1 Pending 0 0s
|
|
zk-2 0/1 Pending 0 0s
|
|
zk-2 0/1 ContainerCreating 0 0s
|
|
zk-2 0/1 Running 0 19s
|
|
zk-2 1/1 Running 0 40s
|
|
```
|
|
|
|
The StatefulSet controller creates three Pods, and each Pod has a container with
|
|
a [ZooKeeper](http://www-us.apache.org/dist/zookeeper/stable/) server.
|
|
|
|
### Facilitating Leader Election
|
|
|
|
Because there is no terminating algorithm for electing a leader in an anonymous network, Zab requires explicit membership configuration to perform leader election. Each server in the ensemble needs to have a unique identifier, all servers need to know the global set of identifiers, and each identifier needs to be associated with a network address.
|
|
|
|
Use [`kubectl exec`](/docs/reference/generated/kubectl/kubectl-commands/#exec) to get the hostnames
|
|
of the Pods in the `zk` StatefulSet.
|
|
|
|
```shell
|
|
for i in 0 1 2; do kubectl exec zk-$i -- hostname; done
|
|
```
|
|
|
|
The StatefulSet controller provides each Pod with a unique hostname based on its ordinal index. The hostnames take the form of `<statefulset name>-<ordinal index>`. Because the `replicas` field of the `zk` StatefulSet is set to `3`, the StatefulSet controller creates three Pods with their hostnames set to `zk-0`, `zk-1`, and `zk-2`.
|
|
|
|
```
|
|
zk-0
|
|
zk-1
|
|
zk-2
|
|
```
|
|
|
|
The servers in a ZooKeeper ensemble use natural numbers as unique identifiers, and store each server's identifier in a file called `myid` in the server's data directory.
|
|
|
|
To examine the contents of the `myid` file for each server use the following command.
|
|
|
|
```shell
|
|
for i in 0 1 2; do echo "myid zk-$i";kubectl exec zk-$i -- cat /var/lib/zookeeper/data/myid; done
|
|
```
|
|
|
|
Because the identifiers are natural numbers and the ordinal indices are non-negative integers, you can generate an identifier by adding 1 to the ordinal.
|
|
|
|
```
|
|
myid zk-0
|
|
1
|
|
myid zk-1
|
|
2
|
|
myid zk-2
|
|
3
|
|
```
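
The mapping from ordinal to identifier can be written as a small shell function. This is a sketch of the rule described above, not the script that the container image runs. The name `myid_from_hostname` is hypothetical.

```shell
# Hypothetical helper: derive a ZooKeeper server id from a StatefulSet Pod hostname.
# Assumes hostnames of the form <statefulset name>-<ordinal index>, e.g. zk-0.
myid_from_hostname() {
  ordinal="${1##*-}"          # strip everything up to the last "-"
  echo $(( ordinal + 1 ))     # ordinals start at 0, ids start at 1
}

for host in zk-0 zk-1 zk-2; do
  echo "$host -> $(myid_from_hostname "$host")"
done
# prints:
# zk-0 -> 1
# zk-1 -> 2
# zk-2 -> 3
```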
|
|
|
|
To get the Fully Qualified Domain Name (FQDN) of each Pod in the `zk` StatefulSet use the following command.
|
|
|
|
```shell
|
|
for i in 0 1 2; do kubectl exec zk-$i -- hostname -f; done
|
|
```
|
|
|
|
The `zk-hs` Service creates a domain for all of the Pods,
|
|
`zk-hs.default.svc.cluster.local`.
|
|
|
|
```
|
|
zk-0.zk-hs.default.svc.cluster.local
|
|
zk-1.zk-hs.default.svc.cluster.local
|
|
zk-2.zk-hs.default.svc.cluster.local
|
|
```
|
|
|
|
The A records in [Kubernetes DNS](/docs/concepts/services-networking/dns-pod-service/) resolve the FQDNs to the Pods' IP addresses. If Kubernetes reschedules the Pods, it will update the A records with the Pods' new IP addresses, but the A records' names will not change.
|
|
|
|
ZooKeeper stores its application configuration in a file named `zoo.cfg`. Use `kubectl exec` to view the contents of the `zoo.cfg` file in the `zk-0` Pod.
|
|
|
|
```shell
|
|
kubectl exec zk-0 -- cat /opt/zookeeper/conf/zoo.cfg
|
|
```
|
|
|
|
In the `server.1`, `server.2`, and `server.3` properties at the bottom of
|
|
the file, the `1`, `2`, and `3` correspond to the identifiers in the
|
|
ZooKeeper servers' `myid` files. They are set to the FQDNs for the Pods in
|
|
the `zk` StatefulSet.
|
|
|
|
```
|
|
clientPort=2181
|
|
dataDir=/var/lib/zookeeper/data
|
|
dataLogDir=/var/lib/zookeeper/log
|
|
tickTime=2000
|
|
initLimit=10
|
|
syncLimit=2000
|
|
maxClientCnxns=60
|
|
minSessionTimeout= 4000
|
|
maxSessionTimeout= 40000
|
|
autopurge.snapRetainCount=3
|
|
autopurge.purgeInterval=0
|
|
server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
|
|
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
|
|
server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888
|
|
```
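
Because the identifiers and FQDNs both follow from the Pod ordinals, the `server.N` entries can be generated instead of written by hand. The sketch below is a hypothetical helper, not the script used by the container image. It assumes the default `svc.cluster.local` cluster domain and the ports 2888 and 3888 shown in the file above.

```shell
# Hypothetical helper: print the server.N lines for a StatefulSet-backed ensemble.
# $1 = StatefulSet name, $2 = headless Service name, $3 = namespace, $4 = replicas.
# Assumes the default cluster domain (svc.cluster.local).
print_servers() {
  i=0
  while [ "$i" -lt "$4" ]; do
    echo "server.$(( i + 1 ))=$1-$i.$2.$3.svc.cluster.local:2888:3888"
    i=$(( i + 1 ))
  done
}

print_servers zk zk-hs default 3
# prints:
# server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
# server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
# server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888
```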
|
|
|
|
### Achieving Consensus
|
|
|
|
Consensus protocols require that the identifiers of each participant be unique. No two participants in the Zab protocol should claim the same unique identifier. This is necessary to allow the processes in the system to agree on which processes have committed which data. If two Pods were launched with the same ordinal, two ZooKeeper servers would both identify themselves as the same server.
|
|
|
|
```shell
|
|
kubectl get pods -w -l app=zk
|
|
```
|
|
|
|
```
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 ContainerCreating 0 0s
|
|
zk-0 0/1 Running 0 19s
|
|
zk-0 1/1 Running 0 40s
|
|
zk-1 0/1 Pending 0 0s
|
|
zk-1 0/1 Pending 0 0s
|
|
zk-1 0/1 ContainerCreating 0 0s
|
|
zk-1 0/1 Running 0 18s
|
|
zk-1 1/1 Running 0 40s
|
|
zk-2 0/1 Pending 0 0s
|
|
zk-2 0/1 Pending 0 0s
|
|
zk-2 0/1 ContainerCreating 0 0s
|
|
zk-2 0/1 Running 0 19s
|
|
zk-2 1/1 Running 0 40s
|
|
```
|
|
|
|
The A records for each Pod are entered when the Pod becomes Ready. Therefore,
|
|
the FQDNs of the ZooKeeper servers will resolve to a single endpoint, and that
|
|
endpoint will be the unique ZooKeeper server claiming the identity configured
|
|
in its `myid` file.
|
|
|
|
```
|
|
zk-0.zk-hs.default.svc.cluster.local
|
|
zk-1.zk-hs.default.svc.cluster.local
|
|
zk-2.zk-hs.default.svc.cluster.local
|
|
```
|
|
|
|
This ensures that the `server.1`, `server.2`, and `server.3` properties in the ZooKeeper servers' `zoo.cfg` files
represent a correctly configured ensemble.
|
|
|
|
```
|
|
server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
|
|
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
|
|
server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888
|
|
```
|
|
|
|
When the servers use the Zab protocol to attempt to commit a value, they will either achieve consensus and commit the value (if leader election has succeeded and at least two of the Pods are Running and Ready), or they will fail to do so (if either of the conditions is not met). No state will arise where one server acknowledges a write on behalf of another.
|
|
|
|
### Sanity Testing the Ensemble
|
|
|
|
The most basic sanity test is to write data to one ZooKeeper server and
|
|
to read the data from another.
|
|
|
|
The command below executes the `zkCli.sh` script to write `world` to the path `/hello` on the `zk-0` Pod in the ensemble.
|
|
|
|
```shell
kubectl exec zk-0 -- zkCli.sh create /hello world
```
|
|
```
|
|
WATCHER::
|
|
|
|
WatchedEvent state:SyncConnected type:None path:null
|
|
Created /hello
|
|
```
|
|
|
|
To get the data from the `zk-1` Pod use the following command.
|
|
|
|
```shell
kubectl exec zk-1 -- zkCli.sh get /hello
```
|
|
|
|
The data that you created on `zk-0` is available on all the servers in the
|
|
ensemble.
|
|
|
|
```
|
|
WATCHER::
|
|
|
|
WatchedEvent state:SyncConnected type:None path:null
|
|
world
|
|
cZxid = 0x100000002
|
|
ctime = Thu Dec 08 15:13:30 UTC 2016
|
|
mZxid = 0x100000002
|
|
mtime = Thu Dec 08 15:13:30 UTC 2016
|
|
pZxid = 0x100000002
|
|
cversion = 0
|
|
dataVersion = 0
|
|
aclVersion = 0
|
|
ephemeralOwner = 0x0
|
|
dataLength = 5
|
|
numChildren = 0
|
|
```
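
To check a single field of this output in a script, you can filter it with `awk`. The `stat_field` function below is a hypothetical helper written for this tutorial's output format. It assumes each stat line has the form `name = value`.

```shell
# Hypothetical helper: read `zkCli.sh get` output on stdin and print the
# value of the stat field named by $1 (for example dataLength).
stat_field() {
  awk -F ' = ' -v key="$1" '$1 == key { print $2 }'
}

# Sample input copied from the output above.
printf 'world\ncZxid = 0x100000002\ndataVersion = 0\ndataLength = 5\n' | stat_field dataLength
# prints 5
```

Against a live ensemble, the same filter could be applied as `kubectl exec zk-1 -- zkCli.sh get /hello | stat_field dataLength`.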
|
|
|
|
### Providing Durable Storage
|
|
|
|
As mentioned in the [ZooKeeper Basics](#zookeeper-basics) section,
ZooKeeper commits all entries to a durable WAL, and periodically writes snapshots
of its in-memory state to storage media. Using WALs to provide durability is a common
technique for applications that use consensus protocols to achieve a replicated
state machine.
|
|
|
|
Use the [`kubectl delete`](/docs/reference/generated/kubectl/kubectl-commands/#delete) command to delete the
|
|
`zk` StatefulSet.
|
|
|
|
```shell
|
|
kubectl delete statefulset zk
|
|
```
|
|
|
|
```
|
|
statefulset.apps "zk" deleted
|
|
```
|
|
|
|
Watch the termination of the Pods in the StatefulSet.
|
|
|
|
```shell
|
|
kubectl get pods -w -l app=zk
|
|
```
|
|
|
|
When `zk-0` is fully terminated, use `CTRL-C` to terminate kubectl.
|
|
|
|
```
|
|
zk-2 1/1 Terminating 0 9m
|
|
zk-0 1/1 Terminating 0 11m
|
|
zk-1 1/1 Terminating 0 10m
|
|
zk-2 0/1 Terminating 0 9m
|
|
zk-2 0/1 Terminating 0 9m
|
|
zk-2 0/1 Terminating 0 9m
|
|
zk-1 0/1 Terminating 0 10m
|
|
zk-1 0/1 Terminating 0 10m
|
|
zk-1 0/1 Terminating 0 10m
|
|
zk-0 0/1 Terminating 0 11m
|
|
zk-0 0/1 Terminating 0 11m
|
|
zk-0 0/1 Terminating 0 11m
|
|
```
|
|
|
|
Reapply the manifest in `zookeeper.yaml`.
|
|
|
|
```shell
|
|
kubectl apply -f https://k8s.io/examples/application/zookeeper/zookeeper.yaml
|
|
```
|
|
|
|
This creates the `zk` StatefulSet object, but the other API objects in the manifest are not modified because they already exist.
|
|
|
|
Watch the StatefulSet controller recreate the StatefulSet's Pods.
|
|
|
|
```shell
|
|
kubectl get pods -w -l app=zk
|
|
```
|
|
|
|
Once the `zk-2` Pod is Running and Ready, use `CTRL-C` to terminate kubectl.
|
|
|
|
```
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 ContainerCreating 0 0s
|
|
zk-0 0/1 Running 0 19s
|
|
zk-0 1/1 Running 0 40s
|
|
zk-1 0/1 Pending 0 0s
|
|
zk-1 0/1 Pending 0 0s
|
|
zk-1 0/1 ContainerCreating 0 0s
|
|
zk-1 0/1 Running 0 18s
|
|
zk-1 1/1 Running 0 40s
|
|
zk-2 0/1 Pending 0 0s
|
|
zk-2 0/1 Pending 0 0s
|
|
zk-2 0/1 ContainerCreating 0 0s
|
|
zk-2 0/1 Running 0 19s
|
|
zk-2 1/1 Running 0 40s
|
|
```
|
|
|
|
Use the command below to get the value you entered during the [sanity test](#sanity-testing-the-ensemble),
|
|
from the `zk-2` Pod.
|
|
|
|
```shell
kubectl exec zk-2 -- zkCli.sh get /hello
```
|
|
|
|
Even though you terminated and recreated all of the Pods in the `zk` StatefulSet, the ensemble still serves the original value.
|
|
|
|
```
|
|
WATCHER::
|
|
|
|
WatchedEvent state:SyncConnected type:None path:null
|
|
world
|
|
cZxid = 0x100000002
|
|
ctime = Thu Dec 08 15:13:30 UTC 2016
|
|
mZxid = 0x100000002
|
|
mtime = Thu Dec 08 15:13:30 UTC 2016
|
|
pZxid = 0x100000002
|
|
cversion = 0
|
|
dataVersion = 0
|
|
aclVersion = 0
|
|
ephemeralOwner = 0x0
|
|
dataLength = 5
|
|
numChildren = 0
|
|
```
|
|
|
|
The `volumeClaimTemplates` field of the `zk` StatefulSet's `spec` specifies a PersistentVolume provisioned for each Pod.
|
|
|
|
```yaml
volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.alpha.kubernetes.io/storage-class: anything
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 20Gi
```
|
|
|
|
The `StatefulSet` controller generates a `PersistentVolumeClaim` for each Pod in
|
|
the `StatefulSet`.
|
|
|
|
Use the following command to get the `StatefulSet`'s `PersistentVolumeClaims`.
|
|
|
|
```shell
|
|
kubectl get pvc -l app=zk
|
|
```
|
|
|
|
When the `StatefulSet` recreated its Pods, it remounted the Pods' PersistentVolumes.
|
|
|
|
```
NAME           STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
datadir-zk-0   Bound     pvc-bed742cd-bcb1-11e6-994f-42010a800002   20Gi       RWO           1h
datadir-zk-1   Bound     pvc-bedd27d2-bcb1-11e6-994f-42010a800002   20Gi       RWO           1h
datadir-zk-2   Bound     pvc-bee0817e-bcb1-11e6-994f-42010a800002   20Gi       RWO           1h
```
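
The claim names in this output follow the pattern `<volumeClaimTemplate name>-<statefulset name>-<ordinal index>`. The sketch below lists the names to expect for a given replica count. `pvc_names` is a hypothetical helper, not a kubectl feature.

```shell
# Hypothetical helper: list the PersistentVolumeClaim names a StatefulSet produces.
# $1 = volumeClaimTemplate name, $2 = StatefulSet name, $3 = replica count.
pvc_names() {
  i=0
  while [ "$i" -lt "$3" ]; do
    echo "$1-$2-$i"
    i=$(( i + 1 ))
  done
}

pvc_names datadir zk 3
# prints:
# datadir-zk-0
# datadir-zk-1
# datadir-zk-2
```

Because the name depends only on the template, the StatefulSet, and the ordinal, a recreated Pod with the same ordinal binds to the same claim.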
|
|
|
|
The `volumeMounts` section of the `StatefulSet`'s container `template` mounts the PersistentVolumes in the ZooKeeper servers' data directories.
|
|
|
|
```yaml
volumeMounts:
- name: datadir
  mountPath: /var/lib/zookeeper
```
|
|
|
|
When a Pod in the `zk` `StatefulSet` is (re)scheduled, it will always have the
|
|
same `PersistentVolume` mounted to the ZooKeeper server's data directory.
|
|
Even when the Pods are rescheduled, all the writes made to the ZooKeeper
|
|
servers' WALs, and all their snapshots, remain durable.
|
|
|
|
## Ensuring Consistent Configuration
|
|
|
|
As noted in the [Facilitating Leader Election](#facilitating-leader-election) and
|
|
[Achieving Consensus](#achieving-consensus) sections, the servers in a
|
|
ZooKeeper ensemble require consistent configuration to elect a leader
|
|
and form a quorum. They also require consistent configuration of the Zab protocol
|
|
in order for the protocol to work correctly over a network. In our example we
|
|
achieve consistent configuration by embedding the configuration directly into
|
|
the manifest.
|
|
|
|
Get the `zk` StatefulSet.
|
|
|
|
```shell
|
|
kubectl get sts zk -o yaml
|
|
```
|
|
```
…
command:
- sh
- -c
- "start-zookeeper \
  --servers=3 \
  --data_dir=/var/lib/zookeeper/data \
  --data_log_dir=/var/lib/zookeeper/data/log \
  --conf_dir=/opt/zookeeper/conf \
  --client_port=2181 \
  --election_port=3888 \
  --server_port=2888 \
  --tick_time=2000 \
  --init_limit=10 \
  --sync_limit=5 \
  --heap=512M \
  --max_client_cnxns=60 \
  --snap_retain_count=3 \
  --purge_interval=12 \
  --max_session_timeout=40000 \
  --min_session_timeout=4000 \
  --log_level=INFO"
…
```
|
|
|
|
The command used to start the ZooKeeper servers passes the configuration as command-line parameters. You can also use environment variables to pass configuration to the ensemble.
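
One way to picture the environment-variable approach is a wrapper that builds the command line from variables with defaults. This is a sketch under stated assumptions: `ZK_SERVERS`, `ZK_TICK_TIME`, and `ZK_CLIENT_PORT` are hypothetical names chosen for this example, and only the flags themselves come from the `start-zookeeper` command shown above.

```shell
# Sketch only: ZK_SERVERS, ZK_TICK_TIME, and ZK_CLIENT_PORT are hypothetical
# variable names chosen for this example. The flags are taken from the
# start-zookeeper command shown above.
zk_start_command() {
  echo "start-zookeeper" \
    "--servers=${ZK_SERVERS:-3}" \
    "--tick_time=${ZK_TICK_TIME:-2000}" \
    "--client_port=${ZK_CLIENT_PORT:-2181}"
}

zk_start_command
# with none of the variables set, prints:
# start-zookeeper --servers=3 --tick_time=2000 --client_port=2181
```

In a Pod, such variables would be set through the container's `env` field, so that every replica starts with the same values.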
|
|
|
|
### Configuring Logging
|
|
|
|
One of the files generated by the `zkGenConfig.sh` script controls ZooKeeper's logging.
ZooKeeper uses [Log4j](http://logging.apache.org/log4j/2.x/), and, by default,
it uses a time- and size-based rolling file appender for its logging configuration.
|
|
|
|
Use the command below to get the logging configuration from one of the Pods in the `zk` `StatefulSet`.
|
|
|
|
```shell
kubectl exec zk-0 -- cat /usr/etc/zookeeper/log4j.properties
```
|
|
|
|
The logging configuration below will cause the ZooKeeper process to write all
|
|
of its logs to the standard output file stream.
|
|
|
|
```
|
|
zookeeper.root.logger=CONSOLE
|
|
zookeeper.console.threshold=INFO
|
|
log4j.rootLogger=${zookeeper.root.logger}
|
|
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
|
|
log4j.appender.CONSOLE.Threshold=${zookeeper.console.threshold}
|
|
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
|
|
log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} [myid:%X{myid}] - %-5p [%t:%C{1}@%L] - %m%n
|
|
```
|
|
|
|
This is the simplest possible way to safely log inside the container. Because the applications write logs to standard out, Kubernetes will handle log rotation for you. Kubernetes also implements a sane retention policy that ensures application logs written to standard out and standard error do not exhaust local storage media.
|
|
|
|
Use [`kubectl logs`](/docs/reference/generated/kubectl/kubectl-commands/#logs) to retrieve the last 20 log lines from one of the Pods.
|
|
|
|
```shell
|
|
kubectl logs zk-0 --tail 20
|
|
```
|
|
|
|
You can view application logs written to standard out or standard error using `kubectl logs` and from the Kubernetes Dashboard.
|
|
|
|
```
|
|
2016-12-06 19:34:16,236 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52740
|
|
2016-12-06 19:34:16,237 [myid:1] - INFO [Thread-1136:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52740 (no session established for client)
|
|
2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52749
|
|
2016-12-06 19:34:26,155 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52749
|
|
2016-12-06 19:34:26,156 [myid:1] - INFO [Thread-1137:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52749 (no session established for client)
|
|
2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52750
|
|
2016-12-06 19:34:26,222 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52750
|
|
2016-12-06 19:34:26,226 [myid:1] - INFO [Thread-1138:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52750 (no session established for client)
|
|
2016-12-06 19:34:36,151 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52760
|
|
2016-12-06 19:34:36,152 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52760
|
|
2016-12-06 19:34:36,152 [myid:1] - INFO [Thread-1139:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52760 (no session established for client)
|
|
2016-12-06 19:34:36,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52761
|
|
2016-12-06 19:34:36,231 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52761
|
|
2016-12-06 19:34:36,231 [myid:1] - INFO [Thread-1140:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52761 (no session established for client)
|
|
2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52767
|
|
2016-12-06 19:34:46,149 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52767
|
|
2016-12-06 19:34:46,149 [myid:1] - INFO [Thread-1141:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52767 (no session established for client)
|
|
2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:52768
|
|
2016-12-06 19:34:46,230 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing ruok command from /127.0.0.1:52768
|
|
2016-12-06 19:34:46,230 [myid:1] - INFO [Thread-1142:NIOServerCnxn@1008] - Closed socket connection for client /127.0.0.1:52768 (no session established for client)
|
|
```
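
Most of the lines in this sample record `ruok` health checks. Because the logs go to standard out, ordinary text tools can summarize them. The `count_ruok` function below is a hypothetical helper, and the sample lines fed to it are abbreviated copies of the log lines above.

```shell
# Hypothetical helper: count health checks in ZooKeeper log lines read from stdin.
count_ruok() {
  grep -c 'Processing ruok command'
}

# Abbreviated copies of log lines from the output above.
printf '%s\n' \
  'INFO - Accepted socket connection from /127.0.0.1:52749' \
  'INFO - Processing ruok command from /127.0.0.1:52749' \
  'INFO - Processing ruok command from /127.0.0.1:52750' | count_ruok
# prints 2
```

Against a running Pod, the same filter could be applied as `kubectl logs zk-0 --tail 20 | count_ruok`.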
|
|
|
|
Kubernetes supports more powerful, but more complex, logging integrations
|
|
with [Stackdriver](/docs/tasks/debug-application-cluster/logging-stackdriver/)
|
|
and [Elasticsearch and Kibana](/docs/tasks/debug-application-cluster/logging-elasticsearch-kibana/).
|
|
For cluster level log shipping and aggregation, consider deploying a [sidecar](https://kubernetes.io/blog/2015/06/the-distributed-system-toolkit-patterns)
|
|
container to rotate and ship your logs.
|
|
|
|
### Configuring a Non-Privileged User
|
|
|
|
The best practices to allow an application to run as a privileged
|
|
user inside of a container are a matter of debate. If your organization requires
|
|
that applications run as a non-privileged user you can use a
|
|
[SecurityContext](/docs/tasks/configure-pod-container/security-context/) to control the user that
|
|
the entry point runs as.
|
|
|
|
The `zk` `StatefulSet`'s Pod `template` contains a `SecurityContext`.
|
|
|
|
```yaml
securityContext:
  runAsUser: 1000
  fsGroup: 1000
```
|
|
|
|
In the Pods' containers, UID 1000 corresponds to the zookeeper user and GID 1000
|
|
corresponds to the zookeeper group.
|
|
|
|
Get the ZooKeeper process information from the `zk-0` Pod.
|
|
|
|
```shell
|
|
kubectl exec zk-0 -- ps -elf
|
|
```
|
|
|
|
As the `runAsUser` field of the `securityContext` object is set to 1000,
|
|
instead of running as root, the ZooKeeper process runs as the zookeeper user.
|
|
|
|
```
|
|
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
|
|
4 S zookeep+ 1 0 0 80 0 - 1127 - 20:46 ? 00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground
|
|
0 S zookeep+ 27 1 0 80 0 - 1155556 - 20:46 ? 00:00:19 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg
|
|
```
|
|
|
|
By default, when the Pod's PersistentVolume is mounted to the ZooKeeper server's data directory, it is only accessible by the root user. This configuration prevents the ZooKeeper process from writing to its WAL and storing its snapshots.
|
|
|
|
Use the command below to get the file permissions of the ZooKeeper data directory on the `zk-0` Pod.
|
|
|
|
```shell
|
|
kubectl exec -ti zk-0 -- ls -ld /var/lib/zookeeper/data
|
|
```
|
|
|
|
Because the `fsGroup` field of the `securityContext` object is set to 1000, the ownership of the Pods' PersistentVolumes is set to the zookeeper group, and the ZooKeeper process is able to read and write its data.
|
|
|
|
```
|
|
drwxr-sr-x 3 zookeeper zookeeper 4096 Dec 5 20:45 /var/lib/zookeeper/data
|
|
```
|
|
|
|
## Managing the ZooKeeper Process
|
|
|
|
The [ZooKeeper documentation](https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_supervision)
|
|
mentions that "You will want to have a supervisory process that
|
|
manages each of your ZooKeeper server processes (JVM)." Utilizing a watchdog
|
|
(supervisory process) to restart failed processes in a distributed system is a
|
|
common pattern. When deploying an application in Kubernetes, rather than using
|
|
an external utility as a supervisory process, you should use Kubernetes as the
|
|
watchdog for your application.
|
|
|
|
### Updating the Ensemble
|
|
|
|
The `zk` `StatefulSet` is configured to use the `RollingUpdate` update strategy.
|
|
|
|
You can use `kubectl patch` to update the number of `cpus` allocated to the servers.
|
|
|
|
```shell
|
|
kubectl patch sts zk --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value":"0.3"}]'
|
|
```
|
|
```
|
|
statefulset.apps/zk patched
|
|
```
|
|
|
|
Use `kubectl rollout status` to watch the status of the update.
|
|
|
|
```shell
|
|
kubectl rollout status sts/zk
|
|
```
|
|
```
|
|
waiting for statefulset rolling update to complete 0 pods at revision zk-5db4499664...
|
|
Waiting for 1 pods to be ready...
|
|
Waiting for 1 pods to be ready...
|
|
waiting for statefulset rolling update to complete 1 pods at revision zk-5db4499664...
|
|
Waiting for 1 pods to be ready...
|
|
Waiting for 1 pods to be ready...
|
|
waiting for statefulset rolling update to complete 2 pods at revision zk-5db4499664...
|
|
Waiting for 1 pods to be ready...
|
|
Waiting for 1 pods to be ready...
|
|
statefulset rolling update complete 3 pods at revision zk-5db4499664...
|
|
```
|
|
|
|
This terminates the Pods, one at a time, in reverse ordinal order, and recreates them with the new configuration. This ensures that quorum is maintained during a rolling update.
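
The replacement order can be sketched as a countdown over the ordinals. The `rolling_update_order` function is a hypothetical illustration of the rule stated above. It does not query or modify the cluster.

```shell
# Illustration only: the order in which a RollingUpdate replaces Pods.
# $1 = StatefulSet name, $2 = replica count.
rolling_update_order() {
  i=$(( $2 - 1 ))
  while [ "$i" -ge 0 ]; do
    echo "$1-$i"
    i=$(( i - 1 ))
  done
}

rolling_update_order zk 3
# prints:
# zk-2
# zk-1
# zk-0
```

With only one Pod down at any time, two of the three servers stay available, which is a majority of the ensemble.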
|
|
|
|
Use the `kubectl rollout history` command to view a history of previous configurations.
|
|
|
|
```shell
|
|
kubectl rollout history sts/zk
|
|
```
|
|
|
|
```
|
|
statefulsets "zk"
|
|
REVISION
|
|
1
|
|
2
|
|
```
|
|
|
|
Use the `kubectl rollout undo` command to roll back the modification.
|
|
|
|
```shell
|
|
kubectl rollout undo sts/zk
|
|
```
|
|
|
|
```
|
|
statefulset.apps/zk rolled back
|
|
```
|
|
|
|
### Handling Process Failure
|
|
|
|
[Restart Policies](/docs/user-guide/pod-states/#restartpolicy) control how
|
|
Kubernetes handles process failures for the entry point of the container in a Pod.
|
|
For Pods in a `StatefulSet`, the only appropriate `RestartPolicy` is Always, and this
|
|
is the default value. For stateful applications you should **never** override
|
|
the default policy.
|
|
|
|
Use the following command to examine the process tree for the ZooKeeper server running in the `zk-0` Pod.
|
|
|
|
```shell
|
|
kubectl exec zk-0 -- ps -ef
|
|
```
|
|
|
|
The command used as the container's entry point has PID 1, and
|
|
the ZooKeeper process, a child of the entry point, has PID 27.
|
|
|
|
```
|
|
UID PID PPID C STIME TTY TIME CMD
|
|
zookeep+ 1 0 0 15:03 ? 00:00:00 sh -c zkGenConfig.sh && zkServer.sh start-foreground
|
|
zookeep+ 27 1 0 15:03 ? 00:00:03 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.root.logger=INFO,CONSOLE -cp /usr/bin/../build/classes:/usr/bin/../build/lib/*.jar:/usr/bin/../share/zookeeper/zookeeper-3.4.9.jar:/usr/bin/../share/zookeeper/slf4j-log4j12-1.6.1.jar:/usr/bin/../share/zookeeper/slf4j-api-1.6.1.jar:/usr/bin/../share/zookeeper/netty-3.10.5.Final.jar:/usr/bin/../share/zookeeper/log4j-1.2.16.jar:/usr/bin/../share/zookeeper/jline-0.9.94.jar:/usr/bin/../src/java/lib/*.jar:/usr/bin/../etc/zookeeper: -Xmx2G -Xms2G -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /usr/bin/../etc/zookeeper/zoo.cfg
|
|
```
|
|
|
|
In another terminal watch the Pods in the `zk` `StatefulSet` with the following command.
|
|
|
|
```shell
|
|
kubectl get pod -w -l app=zk
|
|
```
|
|
|
|
In another terminal, terminate the ZooKeeper process in Pod `zk-0` with the following command.
|
|
|
|
```shell
|
|
kubectl exec zk-0 -- pkill java
|
|
```
|
|
|
|
The termination of the ZooKeeper process caused its parent process to terminate. Because the `RestartPolicy` of the container is Always, Kubernetes restarted the parent process.
|
|
|
|
```
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 1/1 Running 0 21m
|
|
zk-1 1/1 Running 0 20m
|
|
zk-2 1/1 Running 0 19m
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 0/1 Error 0 29m
|
|
zk-0 0/1 Running 1 29m
|
|
zk-0 1/1 Running 1 29m
|
|
```
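
The RESTARTS column is the quickest way to confirm that a restart happened. The sketch below extracts it from `kubectl get pod` rows. `restart_count` is a hypothetical helper that assumes the column layout shown above, with RESTARTS in the fourth column.

```shell
# Hypothetical helper: read `kubectl get pod` rows on stdin and print the most
# recent RESTARTS value (column 4) reported for the Pod named by $1.
restart_count() {
  awk -v pod="$1" '$1 == pod { count = $4 } END { print count }'
}

# Sample rows copied from the output above.
printf 'zk-0 1/1 Running 0 21m\nzk-0 0/1 Error 0 29m\nzk-0 0/1 Running 1 29m\n' | restart_count zk-0
# prints 1
```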
|
|
|
|
If your application uses a script (such as `zkServer.sh`) to launch the process
|
|
that implements the application's business logic, the script must terminate with the
|
|
child process. This ensures that Kubernetes will restart the application's
|
|
container when the process implementing the application's business logic fails.
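
As a minimal sketch of that requirement, a wrapper can run the server in the foreground and return the server's exit status, so the wrapper never outlives its child. The function name `run_in_foreground` is hypothetical and is not part of the tutorial's image, and the entry point command in the closing comment is only an example.

```shell
#!/bin/sh
# Hypothetical wrapper, for illustration only.
# Runs the given command in the foreground and returns its exit status,
# so the wrapper terminates when the process it launched terminates.
run_in_foreground() {
    "$@"
    status=$?
    echo "child exited with status ${status}" >&2
    return "${status}"
}

# Example use as a container entry point (the command is an assumption):
# run_in_foreground zkServer.sh start-foreground; exit $?
```

A wrapper that started the server in the background and then slept or waited on something else would keep PID 1 alive after the server died, and Kubernetes would not see the container fail.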
|
|
|
|
### Testing for Liveness
|
|
|
|
Configuring your application to restart failed processes is not enough to
|
|
keep a distributed system healthy. There are scenarios where
|
|
a system's processes can be both alive and unresponsive, or otherwise
|
|
unhealthy. You should use liveness probes to notify Kubernetes
|
|
that your application's processes are unhealthy and it should restart them.
|
|
|
|
The Pod `template` for the `zk` `StatefulSet` specifies a liveness probe.
|
|
|
|
```yaml
|
|
livenessProbe:
|
|
exec:
|
|
command:
|
|
- sh
|
|
- -c
|
|
- "zookeeper-ready 2181"
|
|
initialDelaySeconds: 15
|
|
timeoutSeconds: 5
|
|
```
|
|
|
|
The probe calls a bash script that uses the ZooKeeper `ruok` four letter
|
|
word to test the server's health.
|
|
|
|
```
|
|
OK=$(echo ruok | nc 127.0.0.1 $1)
|
|
if [ "$OK" == "imok" ]; then
|
|
exit 0
|
|
else
|
|
exit 1
|
|
fi
|
|
```
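
You can also run the same check by hand to see what the kubelet sees. The sketch below wraps the probe command from the manifest in a helper; the function name `probe_pod` is hypothetical, and it assumes the `zookeeper-ready` script is on the container's `PATH`, as the probe definition implies.

```shell
#!/bin/sh
# probe_pod POD: run the liveness command inside the Pod and report the result.
# Hypothetical helper; it relies only on the exit status of the probe command.
probe_pod() {
    if kubectl exec "$1" -- sh -c "zookeeper-ready 2181"; then
        echo "$1: healthy"
    else
        echo "$1: unhealthy"
    fi
}

# Example: probe_pod zk-0
```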
|
|
|
|
In one terminal window, use the following command to watch the Pods in the `zk` `StatefulSet`.
|
|
|
|
```shell
|
|
kubectl get pod -w -l app=zk
|
|
```
|
|
|
|
In another window, use the following command to delete the `zookeeper-ready` script from the file system of Pod `zk-0`.
|
|
|
|
```shell
|
|
kubectl exec zk-0 -- rm /usr/bin/zookeeper-ready
|
|
```
|
|
|
|
When the liveness probe for the ZooKeeper process fails, Kubernetes will
automatically restart the container for you, ensuring that unhealthy processes in
the ensemble are restarted.
|
|
|
|
```shell
|
|
kubectl get pod -w -l app=zk
|
|
```
|
|
|
|
```
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 1/1 Running 0 1h
|
|
zk-1 1/1 Running 0 1h
|
|
zk-2 1/1 Running 0 1h
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 0/1 Running 0 1h
|
|
zk-0 0/1 Running 1 1h
|
|
zk-0 1/1 Running 1 1h
|
|
```
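
If you prefer not to read the watch stream, you can query the restart count directly. This is a small sketch: the function name `restart_count` is hypothetical, and it assumes the Pod runs a single container, so index `0` is the ZooKeeper container.

```shell
#!/bin/sh
# restart_count POD: print how many times the Pod's first container restarted.
# Assumes a single-container Pod (index 0).
restart_count() {
    kubectl get pod "$1" \
        -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# Example: restart_count zk-0
```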
|
|
|
|
### Testing for Readiness
|
|
|
|
Readiness is not the same as liveness. If a process is alive, it is scheduled
|
|
and healthy. If a process is ready, it is able to process input. Liveness is
|
|
a necessary, but not sufficient, condition for readiness. There are cases,
|
|
particularly during initialization and termination, when a process can be
|
|
alive but not ready.
|
|
|
|
If you specify a readiness probe, Kubernetes will ensure that your application's
|
|
processes will not receive network traffic until their readiness checks pass.
|
|
|
|
For a ZooKeeper server, liveness implies readiness. Therefore, the readiness
|
|
probe from the `zookeeper.yaml` manifest is identical to the liveness probe.
|
|
|
|
```yaml
|
|
readinessProbe:
|
|
exec:
|
|
command:
|
|
- sh
|
|
- -c
|
|
- "zookeeper-ready 2181"
|
|
initialDelaySeconds: 15
|
|
timeoutSeconds: 5
|
|
```
|
|
|
|
Even though the liveness and readiness probes are identical, it is important
|
|
to specify both. This ensures that only healthy servers in the ZooKeeper
|
|
ensemble receive network traffic.
|
|
|
|
## Tolerating Node Failure
|
|
|
|
ZooKeeper needs a quorum of servers to successfully commit mutations
|
|
to data. For a three server ensemble, two servers must be healthy for
|
|
writes to succeed. In quorum based systems, members are deployed across failure
|
|
domains to ensure availability. To avoid an outage caused by the loss of an
individual machine, best practices preclude co-locating multiple instances of the
application on the same machine.
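
The arithmetic behind the quorum requirement is short enough to sketch. The helper names below are hypothetical; they compute the smallest majority of an ensemble and the number of servers that can fail while a majority remains.

```shell
#!/bin/sh
# quorum_size N: smallest majority of an N-server ensemble.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: servers that can be lost while a majority remains.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

quorum_size 3          # prints 2
tolerated_failures 3   # prints 1
```

A three-server ensemble therefore survives the loss of one server but not two, which is why two servers on the same node are a single point of failure.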
|
|
|
|
By default, Kubernetes may co-locate Pods in a `StatefulSet` on the same node. For the three server ensemble you created, if two servers are on the same node, and that node fails, the clients of your ZooKeeper service will experience an outage until at least one of the Pods can be rescheduled.
|
|
|
|
You should always provision additional capacity to allow the processes of critical
|
|
systems to be rescheduled in the event of node failures. If you do so, then the
|
|
outage will only last until the Kubernetes scheduler reschedules one of the ZooKeeper
|
|
servers. However, if you want your service to tolerate node failures with no downtime,
|
|
you should set `podAntiAffinity`.
|
|
|
|
Use the command below to get the nodes for Pods in the `zk` `StatefulSet`.
|
|
|
|
```shell
|
|
for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done
|
|
```
|
|
|
|
All of the Pods in the `zk` `StatefulSet` are deployed on different nodes.
|
|
|
|
```
|
|
kubernetes-node-cxpk
|
|
kubernetes-node-a5aq
|
|
kubernetes-node-2g2d
|
|
```
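
For larger ensembles it can be easier to check for co-location mechanically. The sketch below reads one node name per line and prints any node that appears more than once; the function name `colocated_nodes` is hypothetical, and the usage comment assumes the Pods carry the `app=zk` label used elsewhere in this tutorial.

```shell
#!/bin/sh
# colocated_nodes: read one node name per line on stdin and print every
# node that appears more than once (a node hosting more than one Pod).
# Empty lines (Pods that are not yet scheduled) are ignored.
colocated_nodes() {
    sed '/^$/d' | sort | uniq -d
}

# Usage against a live cluster:
# kubectl get pods -l app=zk \
#   -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | colocated_nodes
```

Empty output means that no two Pods share a node.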
|
|
|
|
This is because the Pods in the `zk` `StatefulSet` have a `PodAntiAffinity` specified.
|
|
|
|
```yaml
|
|
affinity:
|
|
podAntiAffinity:
|
|
requiredDuringSchedulingIgnoredDuringExecution:
|
|
- labelSelector:
|
|
matchExpressions:
|
|
- key: "app"
|
|
operator: In
|
|
values:
|
|
- zk
|
|
topologyKey: "kubernetes.io/hostname"
|
|
```
|
|
|
|
The `requiredDuringSchedulingIgnoredDuringExecution` field tells the
Kubernetes Scheduler that it should never co-locate two Pods that have the `app` label
set to `zk` in the domain defined by the `topologyKey`. The `topologyKey`
|
|
`kubernetes.io/hostname` indicates that the domain is an individual node. Using
|
|
different rules, labels, and selectors, you can extend this technique to spread
|
|
your ensemble across physical, network, and power failure domains.
|
|
|
|
## Surviving Maintenance
|
|
|
|
**In this section you will cordon and drain nodes. If you are using this tutorial
|
|
on a shared cluster, be sure that this will not adversely affect other tenants.**
|
|
|
|
The previous section showed you how to spread your Pods across nodes to survive
|
|
unplanned node failures, but you also need to plan for temporary node failures
|
|
that occur due to planned maintenance.
|
|
|
|
Use this command to get the nodes in your cluster.
|
|
|
|
```shell
|
|
kubectl get nodes
|
|
```
|
|
|
|
Use [`kubectl cordon`](/docs/reference/generated/kubectl/kubectl-commands/#cordon) to
|
|
cordon all but four of the nodes in your cluster.
|
|
|
|
```shell
|
|
kubectl cordon <node-name>
|
|
```
|
|
|
|
Use this command to get the `zk-pdb` `PodDisruptionBudget`.
|
|
|
|
```shell
|
|
kubectl get pdb zk-pdb
|
|
```
|
|
|
|
The `max-unavailable` field indicates to Kubernetes that at most one Pod from
|
|
the `zk` `StatefulSet` can be unavailable at any time.
|
|
|
|
```
|
|
NAME MIN-AVAILABLE MAX-UNAVAILABLE ALLOWED-DISRUPTIONS AGE
|
|
zk-pdb N/A 1 1
|
|
```
|
|
|
|
In one terminal, use this command to watch the Pods in the `zk` `StatefulSet`.
|
|
|
|
```shell
|
|
kubectl get pods -w -l app=zk
|
|
```
|
|
|
|
In another terminal, use this command to get the nodes that the Pods are currently scheduled on.
|
|
|
|
```shell
|
|
for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done
|
|
```
|
|
|
|
```
|
|
kubernetes-node-pb41
|
|
kubernetes-node-ixsl
|
|
kubernetes-node-i4c4
|
|
```
|
|
|
|
Use [`kubectl drain`](/docs/reference/generated/kubectl/kubectl-commands/#drain) to cordon and
|
|
drain the node on which the `zk-0` Pod is scheduled.
|
|
|
|
```shell
|
|
kubectl drain $(kubectl get pod zk-0 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
|
|
```
|
|
|
|
```
|
|
node "kubernetes-node-pb41" cordoned
|
|
|
|
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-pb41, kube-proxy-kubernetes-node-pb41; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-o5elz
|
|
pod "zk-0" deleted
|
|
node "kubernetes-node-pb41" drained
|
|
```
|
|
|
|
As there are four nodes in your cluster, `kubectl drain` succeeds and
`zk-0` is rescheduled to another node.
|
|
|
|
```
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 1/1 Running 2 1h
|
|
zk-1 1/1 Running 0 1h
|
|
zk-2 1/1 Running 0 1h
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 1/1 Terminating 2 2h
|
|
zk-0 0/1 Terminating 2 2h
|
|
zk-0 0/1 Terminating 2 2h
|
|
zk-0 0/1 Terminating 2 2h
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 ContainerCreating 0 0s
|
|
zk-0 0/1 Running 0 51s
|
|
zk-0 1/1 Running 0 1m
|
|
```
|
|
|
|
Keep watching the `StatefulSet`'s Pods in the first terminal and drain the node on which
|
|
`zk-1` is scheduled.
|
|
|
|
```shell
|
|
kubectl drain $(kubectl get pod zk-1 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
```

```
node "kubernetes-node-ixsl" cordoned

WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-ixsl, kube-proxy-kubernetes-node-ixsl; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-voc74
|
|
pod "zk-1" deleted
|
|
node "kubernetes-node-ixsl" drained
|
|
```
|
|
|
|
The `zk-1` Pod cannot be scheduled because the `zk` `StatefulSet` contains a `PodAntiAffinity` rule preventing co-location of the Pods. Only two nodes are schedulable, and each already runs a `zk` Pod, so `zk-1` remains in a Pending state.
|
|
|
|
```shell
|
|
kubectl get pods -w -l app=zk
|
|
```
|
|
|
|
```
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 1/1 Running 2 1h
|
|
zk-1 1/1 Running 0 1h
|
|
zk-2 1/1 Running 0 1h
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 1/1 Terminating 2 2h
|
|
zk-0 0/1 Terminating 2 2h
|
|
zk-0 0/1 Terminating 2 2h
|
|
zk-0 0/1 Terminating 2 2h
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 ContainerCreating 0 0s
|
|
zk-0 0/1 Running 0 51s
|
|
zk-0 1/1 Running 0 1m
|
|
zk-1 1/1 Terminating 0 2h
|
|
zk-1 0/1 Terminating 0 2h
|
|
zk-1 0/1 Terminating 0 2h
|
|
zk-1 0/1 Terminating 0 2h
|
|
zk-1 0/1 Pending 0 0s
|
|
zk-1 0/1 Pending 0 0s
|
|
```
|
|
|
|
Continue to watch the Pods of the `StatefulSet`, and drain the node on which
|
|
`zk-2` is scheduled.
|
|
|
|
```shell
|
|
kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
|
|
```
|
|
```
|
|
node "kubernetes-node-i4c4" cordoned
|
|
|
|
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-i4c4, kube-proxy-kubernetes-node-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog
|
|
WARNING: Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog; Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-i4c4, kube-proxy-kubernetes-node-i4c4
|
|
There are pending pods when an error occurred: Cannot evict pod as it would violate the pod's disruption budget.
|
|
pod/zk-2
|
|
```
|
|
|
|
Use `CTRL-C` to terminate kubectl.
|
|
|
|
You cannot drain the third node because evicting `zk-2` would violate `zk-pdb`. However, the node will remain cordoned.
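
The refusal follows from simple arithmetic. The sketch below is a simplified model of how a `maxUnavailable` budget translates into allowed disruptions; it is an assumption for illustration rather than the controller's actual code, and the function name `allowed_disruptions` is hypothetical.

```shell
#!/bin/sh
# allowed_disruptions EXPECTED HEALTHY MAX_UNAVAILABLE
# Simplified model: an eviction is allowed only while the number of healthy
# Pods exceeds (expected - maxUnavailable). The result is never negative.
allowed_disruptions() {
    expected=$1
    healthy=$2
    max_unavailable=$3
    allowed=$(( healthy - (expected - max_unavailable) ))
    if [ "$allowed" -lt 0 ]; then
        allowed=0
    fi
    echo "$allowed"
}

allowed_disruptions 3 3 1   # all three servers healthy: prints 1
allowed_disruptions 3 2 1   # zk-1 is Pending: prints 0
```

With `zk-1` stuck in Pending, only two of the three Pods are healthy, so no further eviction fits within the budget.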
|
|
|
|
Use `zkCli.sh` to retrieve the value you entered during the sanity test from `zk-0`.
|
|
|
|
```shell
|
|
kubectl exec zk-0 -- zkCli.sh get /hello
|
|
```
|
|
|
|
The service is still available because its `PodDisruptionBudget` is respected.
|
|
|
|
```
|
|
WatchedEvent state:SyncConnected type:None path:null
|
|
world
|
|
cZxid = 0x200000002
|
|
ctime = Wed Dec 07 00:08:59 UTC 2016
|
|
mZxid = 0x200000002
|
|
mtime = Wed Dec 07 00:08:59 UTC 2016
|
|
pZxid = 0x200000002
|
|
cversion = 0
|
|
dataVersion = 0
|
|
aclVersion = 0
|
|
ephemeralOwner = 0x0
|
|
dataLength = 5
|
|
numChildren = 0
|
|
```
|
|
|
|
Use [`kubectl uncordon`](/docs/reference/generated/kubectl/kubectl-commands/#uncordon) to uncordon the first node.
|
|
|
|
```shell
|
|
kubectl uncordon kubernetes-node-pb41
|
|
```
|
|
```
|
|
node "kubernetes-node-pb41" uncordoned
|
|
```
|
|
|
|
`zk-1` is rescheduled on this node. Wait until `zk-1` is Running and Ready.
|
|
|
|
```shell
|
|
kubectl get pods -w -l app=zk
|
|
```
|
|
```
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 1/1 Running 2 1h
|
|
zk-1 1/1 Running 0 1h
|
|
zk-2 1/1 Running 0 1h
|
|
NAME READY STATUS RESTARTS AGE
|
|
zk-0 1/1 Terminating 2 2h
|
|
zk-0 0/1 Terminating 2 2h
|
|
zk-0 0/1 Terminating 2 2h
|
|
zk-0 0/1 Terminating 2 2h
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 Pending 0 0s
|
|
zk-0 0/1 ContainerCreating 0 0s
|
|
zk-0 0/1 Running 0 51s
|
|
zk-0 1/1 Running 0 1m
|
|
zk-1 1/1 Terminating 0 2h
|
|
zk-1 0/1 Terminating 0 2h
|
|
zk-1 0/1 Terminating 0 2h
|
|
zk-1 0/1 Terminating 0 2h
|
|
zk-1 0/1 Pending 0 0s
|
|
zk-1 0/1 Pending 0 0s
|
|
zk-1 0/1 Pending 0 12m
|
|
zk-1 0/1 ContainerCreating 0 12m
|
|
zk-1 0/1 Running 0 13m
|
|
zk-1 1/1 Running 0 13m
|
|
```
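
Instead of watching the stream until `zk-1` reports `1/1`, you can block on the Pod's Ready condition with `kubectl wait`. The wrapper below is a sketch; the function name `wait_for_ready` and the default timeout of `300s` are illustrative assumptions.

```shell
#!/bin/sh
# wait_for_ready POD [TIMEOUT]: block until the Pod reports the Ready
# condition, or fail when the timeout expires.
wait_for_ready() {
    pod=$1
    timeout=${2:-300s}
    kubectl wait --for=condition=Ready "pod/${pod}" --timeout="${timeout}"
}

# Example: wait_for_ready zk-1 600s
```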
|
|
|
|
Attempt to drain the node on which `zk-2` is scheduled.
|
|
|
|
```shell
|
|
kubectl drain $(kubectl get pod zk-2 --template {{.spec.nodeName}}) --ignore-daemonsets --force --delete-local-data
|
|
```
|
|
|
|
The output:
|
|
|
|
```
|
|
node "kubernetes-node-i4c4" already cordoned
|
|
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, or DaemonSet: fluentd-cloud-logging-kubernetes-node-i4c4, kube-proxy-kubernetes-node-i4c4; Ignoring DaemonSet-managed pods: node-problem-detector-v0.1-dyrog
|
|
pod "heapster-v1.2.0-2604621511-wht1r" deleted
|
|
pod "zk-2" deleted
|
|
node "kubernetes-node-i4c4" drained
|
|
```
|
|
|
|
This time `kubectl drain` succeeds.
|
|
|
|
Uncordon the second node to allow `zk-2` to be rescheduled.
|
|
|
|
```shell
|
|
kubectl uncordon kubernetes-node-ixsl
|
|
```
|
|
|
|
```
|
|
node "kubernetes-node-ixsl" uncordoned
|
|
```
|
|
|
|
You can use `kubectl drain` in conjunction with `PodDisruptionBudgets` to ensure that your services remain available during maintenance. If drain is used to cordon nodes and evict pods prior to taking the node offline for maintenance, services that express a disruption budget will have that budget respected. You should always allocate additional capacity for critical services so that their Pods can be immediately rescheduled.
|
|
|
|
|
|
|
|
## {{% heading "cleanup" %}}
|
|
|
|
|
|
- Use `kubectl uncordon` to uncordon all the nodes in your cluster.
|
|
- You must delete the persistent storage media for the PersistentVolumes
|
|
used in this tutorial. Follow the necessary steps, based on your environment,
|
|
storage configuration, and provisioning method, to ensure that all storage is
|
|
reclaimed.
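
The first cleanup step can be scripted. The sketch below uncordons every node that `kubectl get nodes` returns; the function name `uncordon_all` is hypothetical, and the script assumes you really do want every node in the cluster schedulable again.

```shell
#!/bin/sh
# uncordon_all: mark every node in the cluster as schedulable again.
# `kubectl get nodes -o name` prints names as node/<name>; the prefix is
# stripped before each name is passed to `kubectl uncordon`.
uncordon_all() {
    kubectl get nodes -o name | sed 's|^node/||' | while read -r node; do
        kubectl uncordon "${node}" </dev/null
    done
}

# Example: uncordon_all
```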
|