---
title: Specifying a Disruption Budget for your Application
content_type: task
weight: 110
min-kubernetes-server-version: v1.21
---
|
|
<!-- overview -->
|
|
{{< feature-state for_k8s_version="v1.21" state="stable" >}}
|
|
This page shows how to limit the number of concurrent disruptions
that your application experiences, allowing for higher availability
while permitting the cluster administrator to manage the cluster's
nodes.
|
|
## {{% heading "prerequisites" %}}
|
|
{{< version-check >}}
|
|
- You are the owner of an application running on a Kubernetes cluster that requires
  high availability.
- You should know how to deploy [Replicated Stateless Applications](/docs/tasks/run-application/run-stateless-application-deployment/)
  and/or [Replicated Stateful Applications](/docs/tasks/run-application/run-replicated-stateful-application/).
- You should have read about [Pod Disruptions](/docs/concepts/workloads/pods/disruptions/).
- You should confirm with your cluster owner or service provider that they respect
  Pod Disruption Budgets.
|
|
<!-- steps -->
|
|
## Protecting an Application with a PodDisruptionBudget
|
|
1. Identify what application you want to protect with a PodDisruptionBudget (PDB).
1. Think about how your application reacts to disruptions.
1. Create a PDB definition as a YAML file.
1. Create the PDB object from the YAML file.
|
|
<!-- discussion -->
|
|
## Identify an Application to Protect
|
|
The most common use case is when you want to protect an application
specified by one of the built-in Kubernetes controllers:
|
|
- Deployment
- ReplicationController
- ReplicaSet
- StatefulSet
|
|
In this case, make a note of the controller's `.spec.selector`; the same
selector goes into the PDB's `.spec.selector`.
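
For instance, a minimal sketch of how the two selectors line up (the names and
labels here are illustrative, not taken from the samples later in this page):

```yaml
# Fragment of a controller (e.g. a Deployment):
spec:
  selector:
    matchLabels:
      app: my-app
---
# A PDB guarding the pods of that controller:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2        # or maxUnavailable; see below
  selector:
    matchLabels:
      app: my-app        # same selector as the controller
```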
|
|
From version 1.15, PDBs support custom controllers where the
[scale subresource](/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource)
is enabled.
|
|
You can also use PDBs with pods which are not controlled by one of the above
controllers, or arbitrary groups of pods, but there are some restrictions,
described in [Arbitrary workloads and arbitrary selectors](#arbitrary-controllers-and-selectors).
|
|
## Think about how your application reacts to disruptions
|
|
Decide how many instances can be down at the same time for a short period
due to a voluntary disruption.
|
|
- Stateless frontends:
  - Concern: don't reduce serving capacity by more than 10%.
  - Solution: use a PDB with `minAvailable` of 90%, for example.
- Single-instance Stateful Application:
  - Concern: do not terminate this application without talking to me.
  - Possible Solution 1: Do not use a PDB and tolerate occasional downtime.
  - Possible Solution 2: Set the PDB with `maxUnavailable=0`. Have an understanding
    (outside of Kubernetes) that the cluster operator needs to consult you before
    termination. When the cluster operator contacts you, prepare for downtime,
    and then delete the PDB to indicate readiness for disruption. Recreate it afterwards.
- Multiple-instance Stateful application such as Consul, ZooKeeper, or etcd:
  - Concern: Do not reduce the number of instances below quorum, otherwise writes fail.
  - Possible Solution 1: set `maxUnavailable` to 1 (works with varying scale of
    application); see the sketch after this list.
  - Possible Solution 2: set `minAvailable` to the quorum size (e.g. 3 when the scale is 5).
    (Allows more disruptions at once.)
- Restartable Batch Job:
  - Concern: the Job needs to complete in case of voluntary disruption.
  - Possible solution: Do not create a PDB. The Job controller will create a replacement pod.
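
As a minimal sketch of Possible Solution 1 for the quorum-based case (the name
and label are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: quorum-app-pdb    # illustrative name
spec:
  maxUnavailable: 1       # never allow more than one member down voluntarily
  selector:
    matchLabels:
      app: quorum-app     # must match the StatefulSet's .spec.selector
```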
|
|
### Rounding logic when specifying percentages
|
|
Values for `minAvailable` or `maxUnavailable` can be expressed as integers or as a percentage.
|
|
- When you specify an integer, it represents a number of Pods. For instance, if you set
  `minAvailable` to 10, then 10 Pods must always be available, even during a disruption.
- When you specify a percentage by setting the value to a string representation of a
  percentage (e.g. `"50%"`), it represents a percentage of total Pods. For instance, if
  you set `minAvailable` to `"50%"`, then at least 50% of the Pods remain available
  during a disruption.
|
|
When you specify the value as a percentage, it may not map to an exact number of Pods.
For example, if you have 7 Pods and you set `minAvailable` to `"50%"`, it's not
immediately obvious whether that means 3 Pods or 4 Pods must be available. Kubernetes
rounds up to the nearest integer, so in this case, 4 Pods must be available. When you
specify the value `maxUnavailable` as a percentage, Kubernetes rounds up the number of
Pods that may be disrupted. As a result, a disruption can exceed your defined
`maxUnavailable` percentage. You can examine the
[code](https://github.com/kubernetes/kubernetes/blob/23be9587a0f8677eb8091464098881df939c44a9/pkg/controller/disruption/disruption.go#L539)
that controls this behavior.
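
A short sketch of the arithmetic, reusing the 7-Pod example above (the values
are only for illustration):

```yaml
# Assuming the PDB's selector matches 7 Pods:
spec:
  minAvailable: "50%"    # ceil(7 * 0.50) = 4 Pods must remain available
---
# Or, in a different PDB:
spec:
  maxUnavailable: "30%"  # ceil(7 * 0.30) = 3 Pods may be disrupted;
                         # 3/7 ≈ 43%, which exceeds the stated 30%
```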
|
|
## Specifying a PodDisruptionBudget
|
|
A `PodDisruptionBudget` has three fields:
|
|
- A label selector `.spec.selector` to specify the set of
  pods to which it applies. This field is required.
- `.spec.minAvailable` which is a description of the number of pods from that
  set that must still be available after the eviction, even in the absence
  of the evicted pod. `minAvailable` can be either an absolute number or a percentage.
- `.spec.maxUnavailable` (available in Kubernetes 1.7 and higher) which is a description
  of the number of pods from that set that can be unavailable after the eviction.
  It can be either an absolute number or a percentage.
|
|
{{< note >}}
The behavior for an empty selector differs between the policy/v1beta1 and policy/v1 APIs for
PodDisruptionBudgets. For policy/v1beta1 an empty selector matches zero pods, while
for policy/v1 an empty selector matches every pod in the namespace.
{{< /note >}}
|
|
You can specify only one of `maxUnavailable` and `minAvailable` in a single `PodDisruptionBudget`.
`maxUnavailable` can only be used to control the eviction of pods
that have an associated controller managing them. In the examples below, "desired replicas"
is the `scale` of the controller managing the pods being selected by the
`PodDisruptionBudget`.
|
|
Example 1: With a `minAvailable` of 5, evictions are allowed as long as they leave behind
5 or more [healthy](#healthiness-of-a-pod) pods among those selected by the PodDisruptionBudget's `selector`.
|
|
Example 2: With a `minAvailable` of 30%, evictions are allowed as long as at least 30%
of the number of desired replicas are healthy.
|
|
Example 3: With a `maxUnavailable` of 5, evictions are allowed as long as there are at most 5
unhealthy replicas among the total number of desired replicas.
|
|
Example 4: With a `maxUnavailable` of 30%, evictions are allowed as long as the number of
unhealthy replicas does not exceed 30% of the total number of desired replicas, rounded up to
the nearest integer. If the total number of desired replicas is just one, that single replica
is still allowed for disruption, leading to an effective unavailability of 100%.
|
|
In typical usage, a single budget would be used for a collection of pods managed by
a controller—for example, the pods in a single ReplicaSet or StatefulSet.
|
|
{{< note >}}
A disruption budget does not truly guarantee that the specified
number/percentage of pods will always be up. For example, a node that hosts a
pod from the collection may fail when the collection is at the minimum size
specified in the budget, thus bringing the number of available pods from the
collection below the specified size. The budget can only protect against
voluntary evictions, not all causes of unavailability.
{{< /note >}}
|
|
If you set `maxUnavailable` to 0% or 0, or you set `minAvailable` to 100% or the number of replicas,
you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload
object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods.
If you try to drain a Node where an unevictable Pod is running, the drain never completes.
This is permitted as per the semantics of `PodDisruptionBudget`.
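
For instance, a sketch of a budget that blocks all voluntary evictions (the name
and label are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: no-evictions-pdb
spec:
  maxUnavailable: 0      # zero voluntary evictions; node drains will block
  selector:
    matchLabels:
      app: critical-app
```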
|
|
You can find examples of pod disruption budgets defined below. They match pods with the label
`app: zookeeper`.
|
|
Example PDB Using minAvailable:
|
|
{{% code_sample file="policy/zookeeper-pod-disruption-budget-minavailable.yaml" %}}
|
|
Example PDB Using maxUnavailable:
|
|
{{% code_sample file="policy/zookeeper-pod-disruption-budget-maxunavailable.yaml" %}}
|
|
For example, if the above `zk-pdb` object selects the pods of a StatefulSet of size 3, both
specifications have the exact same meaning. The use of `maxUnavailable` is recommended as it
automatically responds to changes in the number of replicas of the corresponding controller.
|
|
## Create the PDB object
|
|
You can create or update the PDB object using kubectl.

```shell
kubectl apply -f mypdb.yaml
```
|
|
## Check the status of the PDB
|
|
Use kubectl to check that your PDB is created.
|
|
Assuming you don't actually have pods matching `app: zookeeper` in your namespace,
then you'll see something like this:
|
|
```shell
kubectl get poddisruptionbudgets
```

```
NAME     MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
zk-pdb   2               N/A               0                     7s
```
|
|
If there are matching pods (say, 3), then you would see something like this:
|
|
```shell
kubectl get poddisruptionbudgets
```

```
NAME     MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
zk-pdb   2               N/A               1                     7s
```
|
|
The non-zero value for `ALLOWED DISRUPTIONS` means that the disruption controller has seen the pods,
counted the matching pods, and updated the status of the PDB.
|
|
You can get more information about the status of a PDB with this command:
|
|
```shell
kubectl get poddisruptionbudgets zk-pdb -o yaml
```

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  annotations:
…
  creationTimestamp: "2020-03-04T04:22:56Z"
  generation: 1
  name: zk-pdb
…
status:
  currentHealthy: 3
  desiredHealthy: 2
  disruptionsAllowed: 1
  expectedPods: 3
  observedGeneration: 1
```
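
If you only need a single field from the status, a `jsonpath` query is a handy
sketch (`zk-pdb` as above):

```shell
kubectl get poddisruptionbudgets zk-pdb -o jsonpath='{.status.disruptionsAllowed}'
```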
|
|
### Healthiness of a Pod
|
|
The current implementation considers a pod to be healthy if it has a `.status.conditions`
item with `type="Ready"` and `status="True"`.
These pods are tracked via the `.status.currentHealthy` field in the PDB status.
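
To see which selected pods currently count as healthy, one option is to inspect
the `Ready` condition directly (a sketch, assuming the `app: zookeeper` label
from the examples above):

```shell
# Print each matching pod together with the status of its Ready condition
kubectl get pods -l app=zookeeper \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```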
|
|
## Unhealthy Pod Eviction Policy
|
|
{{< feature-state for_k8s_version="v1.27" state="beta" >}}
|
|
{{< note >}}
This feature is enabled by default. You can disable it by disabling the `PDBUnhealthyPodEvictionPolicy`
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
on the [API server](/docs/reference/command-line-tools-reference/kube-apiserver/).
{{< /note >}}
|
|
A PodDisruptionBudget guarding an application ensures that the `.status.currentHealthy` number of pods
does not fall below the number specified in `.status.desiredHealthy`, by disallowing eviction of healthy pods.
By using `.spec.unhealthyPodEvictionPolicy`, you can also define the criteria for when unhealthy pods
should be considered for eviction. The default behavior when no policy is specified corresponds
to the `IfHealthyBudget` policy.
|
|
Policies:
|
|
`IfHealthyBudget`
: Running pods (`.status.phase="Running"`) that are not yet healthy can be evicted only
  if the guarded application is not disrupted (`.status.currentHealthy` is at least
  equal to `.status.desiredHealthy`).

: This policy ensures that running pods of an already disrupted application have
  the best chance to become healthy. This has negative implications for draining
  nodes, which can be blocked by misbehaving applications that are guarded by a PDB.
  More specifically, applications with pods in `CrashLoopBackOff` state
  (due to a bug or misconfiguration), or pods that are just failing to report the
  `Ready` condition.
|
|
`AlwaysAllow`
: Running pods (`.status.phase="Running"`) that are not yet healthy are considered
  disrupted and can be evicted regardless of whether the criteria in a PDB are met.

: This means prospective running pods of a disrupted application might not get a
  chance to become healthy. By using this policy, cluster managers can easily evict
  misbehaving applications that are guarded by a PDB. More specifically, applications
  with pods in `CrashLoopBackOff` state (due to a bug or misconfiguration), or pods
  that are just failing to report the `Ready` condition.
|
|
{{< note >}}
Pods in `Pending`, `Succeeded` or `Failed` phase are always considered for eviction.
{{< /note >}}
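
As a sketch of how the field is set (reusing the illustrative `zk-pdb` naming
from earlier):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  unhealthyPodEvictionPolicy: AlwaysAllow   # evict not-yet-healthy pods freely
  selector:
    matchLabels:
      app: zookeeper
```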
|
|
## Arbitrary workloads and arbitrary selectors {#arbitrary-controllers-and-selectors}
|
|
You can skip this section if you only use PDBs with the built-in
workload resources (Deployment, ReplicaSet, StatefulSet and ReplicationController)
or with {{< glossary_tooltip term_id="CustomResourceDefinition" text="custom resources" >}}
that implement a `scale` [subresource](/docs/concepts/extend-kubernetes/api-extension/custom-resources/#advanced-features-and-flexibility),
and where the PDB selector exactly matches the selector of the Pod's owning resource.
|
|
You can use a PDB with pods controlled by another resource, by an
"operator", or bare pods, but with these restrictions:
|
|
- only `.spec.minAvailable` can be used, not `.spec.maxUnavailable`.
- only an integer value can be used with `.spec.minAvailable`, not a percentage.
|
|
It is not possible to use other availability configurations,
because Kubernetes cannot derive a total number of pods without a supported owning resource.
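
A minimal sketch for such pods, assuming a group of bare pods labeled
`app: bare-worker` (the name and label are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bare-worker-pdb
spec:
  minAvailable: 3        # must be an integer; a percentage needs a scale-aware owner
  selector:
    matchLabels:
      app: bare-worker
```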
|
|
You can use a selector which selects a subset or superset of the pods belonging to a
workload resource. The eviction API will disallow eviction of any pod covered by multiple PDBs,
so most users will want to avoid overlapping selectors. One reasonable use of overlapping
PDBs is when pods are being transitioned from one PDB to another.