Merge remote-tracking branch 'upstream/master' into release-1.6
commit 4956e43fb1
@@ -3,45 +3,63 @@ abstract: "Detailed explanations of Kubernetes system concepts and abstractions.
toc:
- docs/concepts/index.md

- title: Kubectl Command Line
- title: Overview
section:
- docs/concepts/tools/kubectl/object-management-overview.md
- docs/concepts/tools/kubectl/object-management-using-imperative-commands.md
- docs/concepts/tools/kubectl/object-management-using-imperative-config.md
- docs/concepts/tools/kubectl/object-management-using-declarative-config.md
- docs/concepts/overview/what-is-kubernetes.md
- docs/concepts/overview/components.md
- title: Working with Kubernetes Objects
section:
- docs/concepts/overview/working-with-objects/kubernetes-objects.md
- docs/concepts/overview/working-with-objects/labels.md
- docs/concepts/overview/working-with-objects/annotations.md
- docs/concepts/overview/kubernetes-api.md

- title: Kubernetes Objects
section:
- docs/concepts/abstractions/overview.md
- title: Pods
section:
- docs/concepts/abstractions/pod.md
- docs/concepts/abstractions/init-containers.md
- title: Controllers
section:
- docs/concepts/abstractions/controllers/statefulsets.md
- docs/concepts/abstractions/controllers/petsets.md
- docs/concepts/abstractions/controllers/garbage-collection.md

- title: Object Metadata
section:
- docs/concepts/object-metadata/annotations.md

- title: Workloads
section:
- title: Pods
section:
- docs/concepts/workloads/pods/pod-overview.md
- docs/concepts/workloads/pods/pod-lifecycle.md
- docs/concepts/workloads/pods/init-containers.md
- title: Jobs
section:
- docs/concepts/jobs/run-to-completion-finite-workloads.md

- title: Clusters
- title: Cluster Administration
section:
- docs/concepts/clusters/logging.md
- docs/concepts/cluster-administration/manage-deployment.md
- docs/concepts/cluster-administration/networking.md
- docs/concepts/cluster-administration/network-plugins.md
- docs/concepts/cluster-administration/logging.md
- docs/concepts/cluster-administration/audit.md
- docs/concepts/cluster-administration/out-of-resource.md
- docs/concepts/cluster-administration/multiple-clusters.md
- docs/concepts/cluster-administration/federation.md
- docs/concepts/cluster-administration/guaranteed-scheduling-critical-addon-pods.md
- docs/concepts/cluster-administration/static-pod.md
- docs/concepts/cluster-administration/sysctl-cluster.md

- title: Services, Load Balancing, and Networking
section:
- docs/concepts/services-networking/dns-pod-service.md

- title: Configuration
section:
- docs/concepts/configuration/overview.md
- docs/concepts/configuration/container-command-args.md
- docs/concepts/configuration/manage-compute-resources-container.md

- title: Policies
section:
- docs/concepts/policy/container-capabilities.md
- docs/concepts/policy/resource-quotas.md

@@ -170,6 +170,7 @@ toc:
section:
- docs/admin/index.md
- docs/admin/cluster-management.md
- docs/admin/upgrade-1-6.md
- docs/admin/kubeadm.md
- docs/admin/addons.md
- docs/admin/node-allocatable.md

@@ -13,6 +13,8 @@ toc:
- docs/tasks/configure-pod-container/define-environment-variable-container.md
- docs/tasks/configure-pod-container/define-command-argument-container.md
- docs/tasks/configure-pod-container/assign-cpu-ram-container.md
- docs/tasks/configure-pod-container/limit-range.md
- docs/tasks/configure-pod-container/apply-resource-quota-limit.md
- docs/tasks/configure-pod-container/configure-volume-storage.md
- docs/tasks/configure-pod-container/configure-persistent-volume-storage.md
- docs/tasks/configure-pod-container/environment-variable-expose-pod-information.md

@@ -23,6 +25,17 @@ toc:
- docs/tasks/configure-pod-container/communicate-containers-same-pod.md
- docs/tasks/configure-pod-container/configure-pod-initialization.md
- docs/tasks/configure-pod-container/attach-handler-lifecycle-event.md
- docs/tasks/configure-pod-container/configure-pod-disruption-budget.md

- title: Running Applications
section:
- docs/tasks/run-application/rolling-update-replication-controller.md

- title: Running Jobs
section:
- docs/tasks/job/parallel-processing-expansion.md
- docs/tasks/job/work-queue-1/index.md
- docs/tasks/job/fine-parallel-processing-work-queue/index.md

- title: Accessing Applications in a Cluster
section:

@@ -34,6 +47,7 @@ toc:
- docs/tasks/debug-application-cluster/determine-reason-pod-failure.md
- docs/tasks/debug-application-cluster/debug-init-containers.md
- docs/tasks/debug-application-cluster/logging-stackdriver.md
- docs/tasks/debug-application-cluster/monitor-node-health.md
- docs/tasks/debug-application-cluster/logging-elasticsearch-kibana.md

- title: Accessing the Kubernetes API

@@ -46,6 +60,18 @@ toc:
- docs/tasks/administer-cluster/dns-horizontal-autoscaling.md
- docs/tasks/administer-cluster/safely-drain-node.md
- docs/tasks/administer-cluster/change-pv-reclaim-policy.md
- docs/tasks/administer-cluster/limit-storage-consumption.md

- title: Administering Federation
section:
- docs/tasks/administer-federation/configmap.md
- docs/tasks/administer-federation/daemonset.md
- docs/tasks/administer-federation/deployment.md
- docs/tasks/administer-federation/events.md
- docs/tasks/administer-federation/ingress.md
- docs/tasks/administer-federation/namespaces.md
- docs/tasks/administer-federation/replicaset.md
- docs/tasks/administer-federation/secret.md

- title: Managing Stateful Applications
section:

@@ -32,11 +32,18 @@ toc:
- title: Online Training Course
path: https://www.udacity.com/course/scalable-microservices-with-kubernetes--ud615
- docs/tutorials/stateless-application/hello-minikube.md
- title: Object Management Using kubectl
section:
- docs/tutorials/object-management-kubectl/object-management.md
- docs/tutorials/object-management-kubectl/imperative-object-management-command.md
- docs/tutorials/object-management-kubectl/imperative-object-management-configuration.md
- docs/tutorials/object-management-kubectl/declarative-object-management-configuration.md
- title: Stateless Applications
section:
- docs/tutorials/stateless-application/run-stateless-application-deployment.md
- docs/tutorials/stateless-application/expose-external-ip-address-service.md
- docs/tutorials/stateless-application/expose-external-ip-address.md
- docs/tutorials/stateless-application/run-stateless-ap-replication-controller.md
- title: Stateful Applications
section:
- docs/tutorials/stateful-application/basic-stateful-set.md

@@ -46,6 +53,12 @@ toc:
- title: Connecting Applications
section:
- docs/tutorials/connecting-apps/connecting-frontend-backend.md
- title: Clusters
section:
- docs/tutorials/clusters/apparmor.md
- title: Services
section:
- docs/tutorials/services/source-ip.md
- title: Federated Cluster Administration
section:
- docs/tutorials/federation/set-up-cluster-federation-kubefed.md

@@ -0,0 +1,12 @@

<table style="background-color:#eeeeee">
<tr>
<td>
<p><b>NOTICE</b></p>
<p>As of March 14, 2017, the <a href="https://github.com/orgs/kubernetes/teams/sig-docs-maintainers">@kubernetes/sig-docs-maintainers</a> have begun the previously announced migration of the User Guide content to the <a href="https://github.com/kubernetes/community/tree/master/sig-docs">SIG Docs community</a> through the <a href="https://groups.google.com/forum/#!forum/kubernetes-sig-docs">kubernetes-sig-docs</a> group and the <a href="https://kubernetes.slack.com/messages/sig-docs/">kubernetes.slack.com #sig-docs</a> channel.</p>
<p>The user guides within this section are being refactored into topics within Tutorials, Tasks, and Concepts. Anything that has been moved will have a notice placed in its previous location as well as a link to its new location. The reorganization implements the table of contents (ToC) outlined in the <a href="https://docs.google.com/a/google.com/document/d/18hRCIorVarExB2eBVHTUR6eEJ2VVk5xq1iBmkQv8O6I/edit?usp=sharing">kubernetes-docs-toc</a> document and should improve the documentation's findability and readability for a wider range of audiences.</p>
<p>For any questions, please contact: <a href="mailto:kubernetes-sig-docs@googlegroups.com">kubernetes-sig-docs@googlegroups.com</a></p>
</td>
</tr>
</table>

@@ -1280,7 +1280,7 @@ $feature-box-div-margin-bottom: 40px
background-color: $white
box-shadow: 0 5px 5px rgba(0,0,0,.24),0 0 5px rgba(0,0,0,.12)

#calendarWrapper
#calendarMeetings
position: relative
width: 80vw
height: 60vw

@@ -1288,6 +1288,14 @@ $feature-box-div-margin-bottom: 40px
max-height: 900px
margin: 20px auto

#calendarEvents
position: relative
width: 80vw
height: 30vw
max-width: 1200px
max-height: 450px
margin: 20px auto

iframe
position: absolute
border: 0

@@ -30,14 +30,14 @@ cid: community

<p>As a member of the Kubernetes community, you are welcome to join any of the SIG meetings
you are interested in. No registration required.</p>
<div id="calendarWrapper">
<div id="calendarMeetings">
<iframe src="https://calendar.google.com/calendar/embed?src=cgnt364vd8s86hr2phapfjc6uk%40group.calendar.google.com&ctz=America/Los_Angeles"
frameborder="0" scrolling="no"></iframe>
</div>
</div>
<div class="content">
<h3>Events</h3>
<div id="calendarWrapper">
<div id="calendarEvents">
<iframe src="https://calendar.google.com/calendar/embed?src=nt2tcnbtbied3l6gi2h29slvc0%40group.calendar.google.com&ctz=America/Los_Angeles"
frameborder="0" scrolling="no"></iframe>
</div>

@@ -4,389 +4,6 @@ assignees:
title: AppArmor
---

AppArmor is a Linux kernel enhancement that can reduce the potential attack surface of an
application and provide greater defense in depth. Beta support for AppArmor was
added in Kubernetes v1.4.
{% include user-guide-content-moved.md %}

* TOC
{:toc}

## What is AppArmor

AppArmor is a Linux kernel security module that supplements the standard Linux user and group based
permissions to confine programs to a limited set of resources. AppArmor can be configured for any
application to reduce its potential attack surface and provide greater defense in depth. It is
configured through profiles tuned to whitelist the access needed by a specific program or container,
such as Linux capabilities, network access, file permissions, etc. Each profile can be run in either
enforcing mode, which blocks access to disallowed resources, or complain mode, which only reports
violations.

AppArmor can help you to run a more secure deployment by restricting what containers are allowed to
do, and/or providing better auditing through system logs. However, it is important to keep in mind
that AppArmor is not a silver bullet, and can only do so much to protect against exploits in your
application code. It is important to provide good, restrictive profiles, and harden your
applications and cluster from other angles as well.

AppArmor support in Kubernetes is currently in beta.

## Prerequisites

1. **Kubernetes version is at least v1.4**. Kubernetes support for AppArmor was added in
v1.4. Kubernetes components older than v1.4 are not aware of the new AppArmor annotations, and
will **silently ignore** any AppArmor settings that are provided. To ensure that your Pods are
receiving the expected protections, it is important to verify the Kubelet version of your nodes:

$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.kubeletVersion}\n{end}'
gke-test-default-pool-239f5d02-gyn2: v1.4.0
gke-test-default-pool-239f5d02-x1kf: v1.4.0
gke-test-default-pool-239f5d02-xwux: v1.4.0

2. **AppArmor kernel module is enabled**. For the Linux kernel to enforce an AppArmor profile, the
AppArmor kernel module must be installed and enabled. Several distributions, such as Ubuntu and SUSE,
enable the module by default, and many others provide optional support. To check whether the
module is enabled, check the `/sys/module/apparmor/parameters/enabled` file:

$ cat /sys/module/apparmor/parameters/enabled
Y

If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor
options if the kernel module is not enabled.

*Note: Ubuntu carries many AppArmor patches that have not been merged into the upstream Linux
kernel, including patches that add additional hooks and features. Kubernetes has only been
tested with the upstream version, and does not promise support for other features.*

3. **Container runtime is Docker**. Currently the only Kubernetes-supported container runtime that
also supports AppArmor is Docker. As more runtimes add AppArmor support, the options will be
expanded. You can verify that your nodes are running docker with:

$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.containerRuntimeVersion}\n{end}'
gke-test-default-pool-239f5d02-gyn2: docker://1.11.2
gke-test-default-pool-239f5d02-x1kf: docker://1.11.2
gke-test-default-pool-239f5d02-xwux: docker://1.11.2

If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor
options if the runtime is not Docker.

4. **Profile is loaded**. AppArmor is applied to a Pod by specifying an AppArmor profile that each
container should be run with. If any of the specified profiles is not already loaded in the
kernel, the Kubelet (>= v1.4) will reject the Pod. You can view which profiles are loaded on a
node by checking the `/sys/kernel/security/apparmor/profiles` file. For example:

$ ssh gke-test-default-pool-239f5d02-gyn2 "sudo cat /sys/kernel/security/apparmor/profiles | sort"
apparmor-test-deny-write (enforce)
apparmor-test-audit-write (enforce)
docker-default (enforce)
k8s-nginx (enforce)

For more details on loading profiles on nodes, see
[Setting up nodes with profiles](#setting-up-nodes-with-profiles).

As long as the Kubelet version includes AppArmor support (>= v1.4), the Kubelet will reject a Pod
with AppArmor options if any of the prerequisites are not met. You can also verify AppArmor support
on nodes by checking the node ready condition message (though this is likely to be removed in a
later release):

$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {.status.conditions[?(@.reason=="KubeletReady")].message}\n{end}'
gke-test-default-pool-239f5d02-gyn2: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-x1kf: kubelet is posting ready status. AppArmor enabled
gke-test-default-pool-239f5d02-xwux: kubelet is posting ready status. AppArmor enabled

## Securing a Pod

*Note: AppArmor is currently in beta, so options are specified as annotations. Once support graduates to
general availability, the annotations will be replaced with first-class fields (more details in
[Upgrade path to GA](#upgrade-path-to-general-availability)).*

AppArmor profiles are specified *per-container*. To specify the AppArmor profile to run a Pod
container with, add an annotation to the Pod's metadata:

container.apparmor.security.beta.kubernetes.io/<container_name>: <profile_ref>

Where `<container_name>` is the name of the container to apply the profile to, and `<profile_ref>`
specifies the profile to apply. The `profile_ref` can be one of:

- `runtime/default` to apply the runtime's default profile.
- `localhost/<profile_name>` to apply the profile loaded on the host with the name `<profile_name>`

See the [API Reference](#api-reference) for the full details on the annotation and profile name formats.

The Kubernetes AppArmor enforcement works by first checking that all the prerequisites have been
met, and then forwarding the profile selection to the container runtime for enforcement. If the
prerequisites have not been met, the Pod will be rejected, and will not run.

To verify that the profile was applied, you can expect to see the AppArmor security option listed in the container created event:

$ kubectl get events | grep Created
22s 22s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet e2e-test-stclair-minion-group-31nt} Created container with docker id 269a53b202d3; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]

You can also verify directly that the container's root process is running with the correct profile by checking its proc attr:

$ kubectl exec <pod_name> cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)

## Example

In this example you'll see:

- One way to load a profile on a node
- How to enforce the profile on a Pod
- How to check that the profile is loaded
- What happens when a profile is violated
- What happens when a profile cannot be loaded

*This example assumes you have already set up a cluster with AppArmor support.*

First, we need to load the profile we want to use onto our nodes. The profile we'll use simply
denies all file writes:

{% include code.html language="text" file="deny-write.profile" ghlink="/docs/admin/apparmor/deny-write.profile" %}

Since we don't know where the Pod will be scheduled, we'll need to load the profile on all our
nodes. For this example we'll just use SSH to install the profiles, but other approaches are
discussed in [Setting up nodes with profiles](#setting-up-nodes-with-profiles).

$ NODES=(
    # The SSH-accessible domain names of your nodes
    gke-test-default-pool-239f5d02-gyn2.us-central1-a.my-k8s
    gke-test-default-pool-239f5d02-x1kf.us-central1-a.my-k8s
    gke-test-default-pool-239f5d02-xwux.us-central1-a.my-k8s)
$ for NODE in ${NODES[*]}; do ssh $NODE 'sudo apparmor_parser -q <<EOF
#include <tunables/global>

profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
  #include <abstractions/base>

  file,

  # Deny all file writes.
  deny /** w,
}
EOF'
done

Next, we'll run a simple "Hello AppArmor" pod with the deny-write profile:

{% include code.html language="yaml" file="hello-apparmor-pod.yaml" ghlink="/docs/admin/apparmor/hello-apparmor-pod.yaml" %}

$ kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
spec:
  containers:
  - name: hello
    image: busybox
    command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
pod "hello-apparmor" created

If we look at the pod events, we can see that the Pod container was created with the AppArmor
profile "k8s-apparmor-example-deny-write":

$ kubectl get events | grep hello-apparmor
14s 14s 1 hello-apparmor Pod Normal Scheduled {default-scheduler } Successfully assigned hello-apparmor to gke-test-default-pool-239f5d02-gyn2
14s 14s 1 hello-apparmor Pod spec.containers{hello} Normal Pulling {kubelet gke-test-default-pool-239f5d02-gyn2} pulling image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Pulled {kubelet gke-test-default-pool-239f5d02-gyn2} Successfully pulled image "busybox"
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet gke-test-default-pool-239f5d02-gyn2} Created container with docker id 06b6cd1c0989; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Started {kubelet gke-test-default-pool-239f5d02-gyn2} Started container with docker id 06b6cd1c0989

We can verify that the container is actually running with that profile by checking its proc attr:

$ kubectl exec hello-apparmor cat /proc/1/attr/current
k8s-apparmor-example-deny-write (enforce)

Finally, we can see what happens if we try to violate the profile by writing to a file:

$ kubectl exec hello-apparmor touch /tmp/test
touch: /tmp/test: Permission denied
error: error executing remote command: command terminated with non-zero exit code: Error executing in Docker Container: 1

To wrap up, let's look at what happens if we try to specify a profile that hasn't been loaded:

$ kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor-2
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-allow-write
spec:
  containers:
  - name: hello
    image: busybox
    command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
pod "hello-apparmor-2" created

$ kubectl describe pod hello-apparmor-2
Name: hello-apparmor-2
Namespace: default
Node: gke-test-default-pool-239f5d02-x1kf/
Start Time: Tue, 30 Aug 2016 17:58:56 -0700
Labels: <none>
Status: Failed
Reason: AppArmor
Message: Pod Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
IP:
Controllers: <none>
Containers:
  hello:
    Image: busybox
    Port:
    Command:
      sh
      -c
      echo 'Hello AppArmor!' && sleep 1h
    Requests:
      cpu: 100m
    Environment Variables: <none>
Volumes:
  default-token-dnz7v:
    Type: Secret (a volume populated by a Secret)
    SecretName: default-token-dnz7v
QoS Tier: Burstable
Events:
  FirstSeen LastSeen Count From SubobjectPath Type Reason Message
  --------- -------- ----- ---- ------------- -------- ------ -------
  23s 23s 1 {default-scheduler } Normal Scheduled Successfully assigned hello-apparmor-2 to e2e-test-stclair-minion-group-t1f5
  23s 23s 1 {kubelet e2e-test-stclair-minion-group-t1f5} Warning AppArmor Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded

Note the pod status is Failed, with a helpful error message: `Pod Cannot enforce AppArmor: profile
"k8s-apparmor-example-allow-write" is not loaded`. An event was also recorded with the same message.

## Administration

### Setting up nodes with profiles

Kubernetes does not currently provide any native mechanisms for loading AppArmor profiles onto
nodes. There are lots of ways to set up the profiles, though, such as:

- Through a [DaemonSet](../daemons/) that runs a Pod on each node to
ensure the correct profiles are loaded. An example implementation can be found
[here](https://github.com/kubernetes/contrib/tree/master/apparmor/loader).
- At node initialization time, using your node initialization scripts (e.g. Salt, Ansible, etc.) or
image.
- By copying the profiles to each node and loading them through SSH, as demonstrated in the
[Example](#example).

The scheduler is not aware of which profiles are loaded onto which node, so the full set of profiles
must be loaded onto every node. An alternative approach is to add a node label for each profile (or
class of profiles) on the node, and use a
[node selector](../../user-guide/node-selection/) to ensure the Pod is run on a
node with the required profile.

### Restricting profiles with the PodSecurityPolicy

If the PodSecurityPolicy extension is enabled, cluster-wide AppArmor restrictions can be applied. To
enable the PodSecurityPolicy, two flags must be set on the `apiserver`:

--admission-control=PodSecurityPolicy[,others...]
--runtime-config=extensions/v1beta1/podsecuritypolicy[,others...]

With the extension enabled, the AppArmor options can be specified as annotations on the PodSecurityPolicy:

apparmor.security.beta.kubernetes.io/defaultProfileName: <profile_ref>
apparmor.security.beta.kubernetes.io/allowedProfileNames: <profile_ref>[,others...]

The default profile name option specifies the profile to apply to containers by default when none is
specified. The allowed profile names option specifies a list of profiles that Pod containers are
allowed to be run with. If both options are provided, the default must be allowed. The profiles are
specified in the same format as on containers. See the [API Reference](#api-reference) for the full
specification.
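
For illustration only, a minimal sketch of a PodSecurityPolicy carrying these annotations might look like the following. The policy name is hypothetical, the allowed list reuses the `k8s-apparmor-example-deny-write` profile from the example above, and the remaining required `spec` fields (unrelated to AppArmor) are left as permissive placeholders:

```yaml
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: apparmor-example          # hypothetical policy name
  annotations:
    # Profile applied when a Pod does not request one explicitly.
    apparmor.security.beta.kubernetes.io/defaultProfileName: runtime/default
    # Profiles Pods may request; the default must appear in this list.
    apparmor.security.beta.kubernetes.io/allowedProfileNames: runtime/default,localhost/k8s-apparmor-example-deny-write
spec:
  # The non-AppArmor fields are intentionally wide open in this sketch.
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```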

### Disabling AppArmor

If you do not want AppArmor to be available on your cluster, it can be disabled by a command-line flag:

--feature-gates=AppArmor=false

When disabled, any Pod that includes an AppArmor profile will fail validation with a "Forbidden"
error. Note that by default docker always enables the "docker-default" profile on non-privileged
pods (if the AppArmor kernel module is enabled), and will continue to do so even if the feature-gate
is disabled. The option to disable AppArmor will be removed when AppArmor graduates to general
availability (GA).

### Upgrading to Kubernetes v1.4 with AppArmor

No action is required with respect to AppArmor to upgrade your cluster to v1.4. However, if any
existing pods had an AppArmor annotation, they will not go through validation (or PodSecurityPolicy
admission). If permissive profiles are loaded on the nodes, a malicious user could pre-apply a
permissive profile to escalate the pod privileges above the docker-default. If this is a concern, it
is recommended to scrub the cluster of any pods containing an annotation with
`apparmor.security.beta.kubernetes.io`.

### Upgrade path to General Availability

When AppArmor is ready to be graduated to general availability (GA), the options currently specified
through annotations will be converted to fields. Supporting all the upgrade and downgrade paths
through the transition is very nuanced, and will be explained in detail when the transition
occurs. We will commit to supporting both fields and annotations for at least 2 releases, and will
explicitly reject the annotations for at least 2 releases after that.

## Authoring Profiles

Getting AppArmor profiles specified correctly can be a tricky business. Fortunately, there are some
tools to help with that:

- `aa-genprof` and `aa-logprof` generate profile rules by monitoring an application's activity and
logs, and admitting the actions it takes. Further instructions are provided by the
[AppArmor documentation](http://wiki.apparmor.net/index.php/Profiling_with_tools).
- [bane](https://github.com/jfrazelle/bane) is an AppArmor profile generator for Docker that uses a
simplified profile language.

It is recommended to run your application through Docker on a development workstation to generate
the profiles, but there is nothing preventing running the tools on the Kubernetes node where your
Pod is running.

To debug problems with AppArmor, you can check the system logs to see what, specifically, was
denied. AppArmor logs verbose messages to `dmesg`, and errors can usually be found in the system
logs or through `journalctl`. More information is provided in
[AppArmor failures](http://wiki.apparmor.net/index.php/AppArmor_Failures).

Additional resources:

- [Quick guide to the AppArmor profile language](http://wiki.apparmor.net/index.php/QuickProfileLanguage)
- [AppArmor core policy reference](http://wiki.apparmor.net/index.php/ProfileLanguage)

## API Reference

**Pod Annotation**:

Specifying the profile a container will run with:

- **key**: `container.apparmor.security.beta.kubernetes.io/<container_name>`
  Where `<container_name>` matches the name of a container in the Pod.
  A separate profile can be specified for each container in the Pod.
- **value**: a profile reference, described below

**Profile Reference**:

- `runtime/default`: Refers to the default runtime profile.
  - Equivalent to not specifying a profile (without a PodSecurityPolicy default), except it still
  requires AppArmor to be enabled.
  - For Docker, this resolves to the
  [`docker-default`](https://docs.docker.com/engine/security/apparmor/) profile for non-privileged
  containers, and unconfined (no profile) for privileged containers.
- `localhost/<profile_name>`: Refers to a profile loaded on the node (localhost) by name.
  - The possible profile names are detailed in the
  [core policy reference](http://wiki.apparmor.net/index.php/AppArmor_Core_Policy_Reference#Profile_names_and_attachment_specifications)

Any other profile reference format is invalid.

**PodSecurityPolicy Annotations**

Specifying the default profile to apply to containers when none is provided:

- **key**: `apparmor.security.beta.kubernetes.io/defaultProfileName`
- **value**: a profile reference, described above

Specifying the list of profiles Pod containers are allowed to specify:

- **key**: `apparmor.security.beta.kubernetes.io/allowedProfileNames`
- **value**: a comma-separated list of profile references (described above)
  - Although an escaped comma is a legal character in a profile name, it cannot be explicitly
  allowed here
[AppArmor](/docs/tutorials/clusters/apparmor/)

@@ -5,63 +5,6 @@ assignees:
title: Audit in Kubernetes
---

* TOC
{:toc}
{% include user-guide-content-moved.md %}

Kubernetes Audit provides a security-relevant chronological set of records documenting
the sequence of activities that have affected the system, whether initiated by individual users, administrators
or other components of the system. It allows the cluster administrator to
answer the following questions:
- what happened?
- when did it happen?
- who initiated it?
- on what did it happen?
- where was it observed?
- from where was it initiated?
- to where was it going?

NOTE: Currently, Kubernetes provides only basic audit capabilities; there is still a lot
of work going on to provide fully featured auditing capabilities (see [this issue](https://github.com/kubernetes/features/issues/22)).

Kubernetes audit is part of the [kube-apiserver](/docs/admin/kube-apiserver) and logs all requests
coming to the server. Each audit log contains two entries:

1. The request line containing:
   - a unique id, used to match the corresponding response line (see 2)
   - source ip of the request
   - HTTP method being invoked
   - original user invoking the operation
   - impersonated user for the operation
   - namespace of the request or <none>
   - URI as requested
2. The response line containing:
   - the unique id from 1
   - response code

Example output for user `admin` asking for a list of pods:

```
2016-09-07T13:03:57.400333046Z AUDIT: id="5c3b8227-4af9-4322-8a71-542231c3887b" ip="127.0.0.1" method="GET" user="admin" as="<self>" namespace="default" uri="/api/v1/namespaces/default/pods"
2016-09-07T13:03:57.400710987Z AUDIT: id="5c3b8227-4af9-4322-8a71-542231c3887b" response="200"
```

NOTE: The audit capabilities are available *only* for the secured endpoint of the API server.

## Configuration

[Kube-apiserver](/docs/admin/kube-apiserver) provides the following options, which are responsible
for configuring where and how audit logs are handled:

- `audit-log-path` - enables the audit log, pointing to a file where the requests are logged.
- `audit-log-maxage` - specifies maximum number of days to retain old audit log files based on the timestamp encoded in their filename.
- `audit-log-maxbackup` - specifies maximum number of old audit log files to retain.
- `audit-log-maxsize` - specifies maximum size in megabytes of the audit log file before it gets rotated. Defaults to 100MB.

If an audit log file already exists, Kubernetes appends new audit logs to that file.
Otherwise, Kubernetes creates an audit log file at the location you specified in
`audit-log-path`. If the audit log file exceeds the size you specify in `audit-log-maxsize`,
Kubernetes will rename the current log file by appending the current timestamp on
the file name (before the file extension) and create a new audit log file.
Kubernetes may delete old log files when creating a new log file; you can configure
how many files are retained and how old they can be by specifying the `audit-log-maxbackup`
and `audit-log-maxage` options.
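
As a rough sketch of how these flags might be wired together (assuming the API server is run from a static pod manifest, which varies by installation; the image tag and paths below are examples only):

```yaml
# Hypothetical excerpt from a kube-apiserver static pod manifest.
spec:
  containers:
  - name: kube-apiserver
    image: gcr.io/google_containers/kube-apiserver:v1.6.0   # example image/tag
    command:
    - kube-apiserver
    - --audit-log-path=/var/log/kubernetes/audit.log   # enables auditing; requests are logged here
    - --audit-log-maxage=30       # keep rotated files for at most 30 days
    - --audit-log-maxbackup=10    # keep at most 10 rotated files
    - --audit-log-maxsize=100     # rotate once the current file reaches 100MB
```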

[Auditing](/docs/concepts/cluster-administration/audit/)

@@ -4,133 +4,6 @@ assignees:
title: Kubernetes Components
---

This document outlines the various binary components that need to run to
deliver a functioning Kubernetes cluster.
{% include user-guide-content-moved.md %}

## Master Components

Master components are those that provide the cluster's control plane. For
example, master components are responsible for making global decisions about the
cluster (e.g., scheduling), and detecting and responding to cluster events
(e.g., starting up a new pod when a replication controller's 'replicas' field is
unsatisfied).

In theory, Master components can be run on any node in the cluster. However,
for simplicity, current set up scripts typically start all master components on
the same VM, and do not run user containers on this VM. See
[high-availability.md](/docs/admin/high-availability) for an example multi-master-VM setup.

Even in the future, when Kubernetes is fully self-hosting, it will probably be
wise to only allow master components to schedule on a subset of nodes, to limit
co-running with user-run pods, reducing the possible scope of a
node-compromising security exploit.

### kube-apiserver

[kube-apiserver](/docs/admin/kube-apiserver) exposes the Kubernetes API; it is the front-end for the
Kubernetes control plane. It is designed to scale horizontally (i.e., one scales
it by running more of them; see [high-availability.md](/docs/admin/high-availability)).

### etcd

[etcd](/docs/admin/etcd) is used as Kubernetes' backing store. All cluster data is stored here.
Proper administration of a Kubernetes cluster includes a backup plan for etcd's
data.

### kube-controller-manager

[kube-controller-manager](/docs/admin/kube-controller-manager) is a binary that runs controllers, which are the
background threads that handle routine tasks in the cluster. Logically, each
controller is a separate process, but to reduce the number of moving pieces in
the system, they are all compiled into a single binary and run in a single
process.

These controllers include:

* Node Controller: Responsible for noticing & responding when nodes go down.
* Replication Controller: Responsible for maintaining the correct number of pods for every replication
controller object in the system.
* Endpoints Controller: Populates the Endpoints object (i.e., join Services & Pods).
* Service Account & Token Controllers: Create default accounts and API access tokens for new namespaces.
* ... and others.

### kube-scheduler

[kube-scheduler](/docs/admin/kube-scheduler) watches newly created pods that have no node assigned, and
selects a node for them to run on.

### addons

Addons are pods and services that implement cluster features. The pods may be managed
by Deployments, ReplicationControllers, etc. Namespaced addon objects are created in
the "kube-system" namespace.

Addon manager takes the responsibility for creating and maintaining addon resources.
See [here](http://releases.k8s.io/HEAD/cluster/addons) for more details.

#### DNS

While the other addons are not strictly required, all Kubernetes
clusters should have [cluster DNS](/docs/admin/dns/), as many examples rely on it.

Cluster DNS is a DNS server, in addition to the other DNS server(s) in your
environment, which serves DNS records for Kubernetes services.

Containers started by Kubernetes automatically include this DNS server
in their DNS searches.

#### User interface

The kube-ui provides a read-only overview of the cluster state. Access
[the UI using kubectl proxy](/docs/user-guide/connecting-to-applications-proxy/#connecting-to-the-kube-ui-service-from-your-local-workstation)

#### Container Resource Monitoring

[Container Resource Monitoring](/docs/user-guide/monitoring) records generic time-series metrics
about containers in a central database, and provides a UI for browsing that data.

#### Cluster-level Logging

A [Cluster-level logging](/docs/user-guide/logging/overview) mechanism is responsible for
saving container logs to a central log store with search/browsing interface.

## Node components

Node components run on every node, maintaining running pods and providing them
the Kubernetes runtime environment.

### kubelet

[kubelet](/docs/admin/kubelet) is the primary node agent. It:

* Watches for pods that have been assigned to its node (either by apiserver
or via local configuration file) and:
* Mounts the pod's required volumes
* Downloads the pod's secrets
* Runs the pod's containers via docker (or, experimentally, rkt).
* Periodically executes any requested container liveness probes.
* Reports the status of the pod back to the rest of the system, by creating a
"mirror pod" if necessary.
* Reports the status of the node back to the rest of the system.

### kube-proxy

[kube-proxy](/docs/admin/kube-proxy) enables the Kubernetes service abstraction by maintaining
network rules on the host and performing connection forwarding.

### docker

`docker` is of course used for actually running containers.

### rkt

`rkt` is supported experimentally as an alternative to docker.

### supervisord

`supervisord` is a lightweight process babysitting system for keeping kubelet and docker
running.

### fluentd

`fluentd` is a daemon which helps provide [cluster-level logging](#cluster-level-logging).
[Kubernetes Components](/docs/concepts/overview/components/)

@@ -19,7 +19,9 @@ To install Kubernetes on a set of machines, consult one of the existing [Getting

## Upgrading a cluster

The current state of cluster upgrades is provider dependent.
The current state of cluster upgrades is provider dependent, and some releases may require special care when upgrading. It is recommended that administrators consult both the [release notes](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md) and the version-specific upgrade notes prior to upgrading their clusters.

* [Upgrading to 1.6](/docs/admin/upgrade)

### Upgrading Google Compute Engine clusters

@@ -56,8 +58,12 @@ The node upgrade process is user-initiated and is described in the [GKE document

### Upgrading clusters on other platforms

The `cluster/kube-push.sh` script will do a rudimentary update. This process is still quite experimental, we
recommend testing the upgrade on an experimental cluster before performing the update on a production cluster.
Different providers, and tools, will manage upgrades differently. It is recommended that you consult their main documentation regarding upgrades.

* [kops](https://github.com/kubernetes/kops)
* [kargo](https://github.com/kubernetes-incubator/kargo)
* [CoreOS Tectonic](https://coreos.com/tectonic/docs/latest/admin/upgrade.html)
* ...

## Resizing a cluster

@@ -51,7 +51,7 @@ A pod template in a DaemonSet must have a [`RestartPolicy`](/docs/user-guide/pod
### Pod Selector

The `.spec.selector` field is a pod selector. It works the same as the `.spec.selector` of
a [Job](/docs/user-guide/jobs/) or other new resources.
a [Job](/docs/concepts/jobs/run-to-completion-finite-workloads/) or other new resources.

The `spec.selector` is an object consisting of two fields:

@@ -3,93 +3,7 @@ assignees:
- davidopp
title: Pod Disruption Budget
---
This guide is for anyone wishing to specify safety constraints on pods or anyone
wishing to write software (typically automation software) that respects those
constraints.

* TOC
{:toc}
{% include user-guide-content-moved.md %}

## Rationale

Various cluster management operations may voluntarily evict pods. "Voluntary"
means an eviction can be safely delayed for a reasonable period of time. The
principal examples today are draining a node for maintenance or upgrade
(`kubectl drain`), and cluster autoscaling down. In the future the
[rescheduler](https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/rescheduling.md)
may also perform voluntary evictions. By contrast, something like evicting pods
because a node has become unreachable or reports `NotReady` is not "voluntary."

For voluntary evictions, it can be useful for applications to be able to limit
the number of pods that are down simultaneously. For example, a quorum-based application would
like to ensure that the number of replicas running is never brought below the
number needed for a quorum, even temporarily. Or a web front end might want to
ensure that the number of replicas serving load never falls below a certain
percentage of the total, even briefly. `PodDisruptionBudget` is an API object
that specifies the minimum number or percentage of replicas of a collection that
must be up at a time. Components that wish to evict a pod subject to disruption
budget use the `/eviction` subresource; unlike a regular pod deletion, this
operation may be rejected by the API server if the eviction would cause a
disruption budget to be violated.

## Specifying a PodDisruptionBudget

A `PodDisruptionBudget` has two components: a label selector `selector` to specify the set of
pods to which it applies, and `minAvailable`, which is a description of the number of pods from that
set that must still be available after the eviction, i.e. even in the absence
of the evicted pod. `minAvailable` can be either an absolute number or a percentage.
So for example, 100% means no voluntary evictions from the set are permitted. In
typical usage, a single budget would be used for a collection of pods managed by
a controller—for example, the pods in a single ReplicaSet.
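
As a minimal sketch, a budget that keeps at least two pods of a hypothetical application labeled `app: zookeeper` available at all times could be written as:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb                # hypothetical name
spec:
  minAvailable: 2             # an absolute number; a percentage such as "50%" also works
  selector:
    matchLabels:
      app: zookeeper          # hypothetical label selecting the pods this budget covers
```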

Note that a disruption budget does not truly guarantee that the specified
number/percentage of pods will always be up. For example, a node that hosts a
pod from the collection may fail when the collection is at the minimum size
specified in the budget, thus bringing the number of available pods from the
collection below the specified size. The budget can only protect against
voluntary evictions, not all causes of unavailability.

## Requesting an eviction

If you are writing infrastructure software that wants to produce these voluntary
evictions, you will need to use the eviction API. The eviction subresource of a
pod can be thought of as a kind of policy-controlled DELETE operation on the pod
itself. To attempt an eviction (perhaps more REST-precisely, to attempt to
*create* an eviction), you POST an attempted operation. Here's an example:

```json
{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}
```

You can attempt an eviction using `curl`:

```bash
$ curl -v -H 'Content-type: application/json' http://127.0.0.1:8080/api/v1/namespaces/default/pods/quux/eviction -d @eviction.json
```

The API can respond in one of three ways.

1. If the eviction is granted, then the pod is deleted just as if you had sent
a `DELETE` request to the pod's URL and you get back `200 OK`.
2. If the current state of affairs wouldn't allow an eviction by the rules set
forth in the budget, you get back `429 Too Many Requests`. This is
typically used for generic rate limiting of *any* requests, but here we mean
that this request isn't allowed *right now* but it may be allowed later.
Currently, callers do not get any `Retry-After` advice, but they may in
future versions.
3. If there is some kind of misconfiguration, like multiple budgets pointing at
the same pod, you will get `500 Internal Server Error`.

For a given eviction request, there are two cases.

1. There is no budget that matches this pod. In this case, the server always
returns `200 OK`.
2. There is at least one budget. In this case, any of the three above responses may
apply.
[Configuring a Pod Disruption Budget](/docs/tasks/configure-pod-container/configure-pod-disruption-budget/)

@ -5,385 +5,6 @@ assignees:
|
|||
title: Using DNS Pods and Services
|
||||
---
|
||||
|
||||
## Introduction
|
||||
{% include user-guide-content-moved.md %}
|
||||
|
||||
As of Kubernetes 1.3, DNS is a built-in service launched automatically using the addon manager [cluster add-on](http://releases.k8s.io/{{page.githubbranch}}/cluster/addons/README.md).
|
||||
|
||||
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures
|
||||
the kubelets to tell individual containers to use the DNS Service's IP to
|
||||
resolve DNS names.
|
||||
|
||||
## What things get DNS names?
|
||||
|
||||
Every Service defined in the cluster (including the DNS server itself) is
|
||||
assigned a DNS name. By default, a client Pod's DNS search list will
|
||||
include the Pod's own namespace and the cluster's default domain. This is best
|
||||
illustrated by example:
|
||||
|
||||
Assume a Service named `foo` in the Kubernetes namespace `bar`. A Pod running
|
||||
in namespace `bar` can look up this service by simply doing a DNS query for
|
||||
`foo`. A Pod running in namespace `quux` can look up this service by doing a
|
||||
DNS query for `foo.bar`.
|
||||
|
||||
## Supported DNS schema
|
||||
|
||||
The following sections detail the supported record types and layout that is
|
||||
supported. Any other layout or names or queries that happen to work are
|
||||
considered implementation details and are subject to change without warning.
|
||||
|
||||
### Services
|
||||
|
||||
#### A records
|
||||
|
||||
"Normal" (not headless) Services are assigned a DNS A record for a name of the
|
||||
form `my-svc.my-namespace.svc.cluster.local`. This resolves to the cluster IP
|
||||
of the Service.
|
||||
|
||||
"Headless" (without a cluster IP) Services are also assigned a DNS A record for
|
||||
a name of the form `my-svc.my-namespace.svc.cluster.local`. Unlike normal
|
||||
Services, this resolves to the set of IPs of the pods selected by the Service.
|
||||
Clients are expected to consume the set or else use standard round-robin
|
||||
selection from the set.
|
||||
|
||||
### SRV records
|
||||
|
||||
SRV Records are created for named ports that are part of normal or [Headless
|
||||
Services](http://releases.k8s.io/docs/user-guide/services/#headless-services).
|
||||
For each named port, the SRV record would have the form
|
||||
`_my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster.local`.
|
||||
For a regular service, this resolves to the port number and the CNAME:
|
||||
`my-svc.my-namespace.svc.cluster.local`.
|
||||
For a headless service, this resolves to multiple answers, one for each pod
|
||||
that is backing the service, and contains the port number and a CNAME of the pod
|
||||
of the form `auto-generated-name.my-svc.my-namespace.svc.cluster.local`.
|
||||
|
||||
### Backwards compatibility
|
||||
|
||||
Previous versions of kube-dns made names of the form
|
||||
`my-svc.my-namespace.cluster.local` (the 'svc' level was added later). This
|
||||
is no longer supported.
|
||||
|
||||
### Pods
|
||||
|
||||
#### A Records
|
||||
|
||||
When enabled, pods are assigned a DNS A record in the form of `pod-ip-address.my-namespace.pod.cluster.local`.
|
||||
|
||||
For example, a pod with IP `1.2.3.4` in the namespace `default` with a DNS name of `cluster.local` would have an entry: `1-2-3-4.default.pod.cluster.local`.
|
||||
|
||||
#### A Records and hostname based on Pod's hostname and subdomain fields
|
||||
|
||||
Currently when a pod is created, its hostname is the Pod's `metadata.name` value.
|
||||
|
||||
With v1.2, users can specify a Pod annotation, `pod.beta.kubernetes.io/hostname`, to specify what the Pod's hostname should be.
|
||||
The Pod annotation, if specified, takes precedence over the Pod's name, to be the hostname of the pod.
|
||||
For example, given a Pod with annotation `pod.beta.kubernetes.io/hostname: my-pod-name`, the Pod will have its hostname set to "my-pod-name".
|
||||
|
||||
With v1.3, the PodSpec has a `hostname` field, which can be used to specify the Pod's hostname. This field value takes precedence over the
|
||||
`pod.beta.kubernetes.io/hostname` annotation value.
|
||||
|
||||
v1.2 introduces a beta feature where the user can specify a Pod annotation, `pod.beta.kubernetes.io/subdomain`, to specify the Pod's subdomain.
|
||||
The final domain will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>".
|
||||
For example, a Pod with the hostname annotation set to "foo", and the subdomain annotation set to "bar", in namespace "my-namespace", will have the FQDN "foo.bar.my-namespace.svc.cluster.local"
|
||||
|
||||
With v1.3, the PodSpec has a `subdomain` field, which can be used to specify the Pod's subdomain. This field value takes precedence over the
|
||||
`pod.beta.kubernetes.io/subdomain` annotation value.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: default-subdomain
|
||||
spec:
|
||||
selector:
|
||||
name: busybox
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: foo # Actually, no port is needed.
|
||||
port: 1234
|
||||
targetPort: 1234
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: busybox1
|
||||
labels:
|
||||
name: busybox
|
||||
spec:
|
||||
hostname: busybox-1
|
||||
subdomain: default-subdomain
|
||||
containers:
|
||||
- image: busybox
|
||||
command:
|
||||
- sleep
|
||||
- "3600"
|
||||
name: busybox
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: busybox2
|
||||
labels:
|
||||
name: busybox
|
||||
spec:
|
||||
hostname: busybox-2
|
||||
subdomain: default-subdomain
|
||||
containers:
|
||||
- image: busybox
|
||||
command:
|
||||
- sleep
|
||||
- "3600"
|
||||
name: busybox
|
||||
```
|
||||
|
||||
If there exists a headless service in the same namespace as the pod and with the same name as the subdomain, the cluster's KubeDNS Server also returns an A record for the Pod's fully qualified hostname.
|
||||
Given a Pod with the hostname set to "busybox-1" and the subdomain set to "default-subdomain", and a headless Service named "default-subdomain" in the same namespace, the pod will see it's own FQDN as "busybox-1.default-subdomain.my-namespace.svc.cluster.local". DNS serves an A record at that name, pointing to the Pod's IP. Both pods "busybox1" and "busybox2" can have their distinct A records.
|
||||
|
||||
As of Kubernetes v1.2, the Endpoints object also has the annotation `endpoints.beta.kubernetes.io/hostnames-map`. Its value is the json representation of map[string(IP)][endpoints.HostRecord], for example: '{"10.245.1.6":{HostName: "my-webserver"}}'.
|
||||
If the Endpoints are for a headless service, an A record is created with the format <hostname>.<service name>.<pod namespace>.svc.<cluster domain>
|
||||
For the example json, if endpoints are for a headless service named "bar", and one of the endpoints has IP "10.245.1.6", an A record is created with the name "my-webserver.bar.my-namespace.svc.cluster.local" and the A record lookup would return "10.245.1.6".
|
||||
This endpoints annotation generally does not need to be specified by end-users, but can used by the internal service controller to deliver the aforementioned feature.
|
||||
|
||||
With v1.3, The Endpoints object can specify the `hostname` for any endpoint, along with its IP. The hostname field takes precedence over the hostname value
|
||||
that might have been specified via the `endpoints.beta.kubernetes.io/hostnames-map` annotation.
|
||||
|
||||
With v1.3, the following annotations are deprecated: `pod.beta.kubernetes.io/hostname`, `pod.beta.kubernetes.io/subdomain`, `endpoints.beta.kubernetes.io/hostnames-map`
|
||||
|
||||
## How do I test if it is working?
|
||||
|
||||
### Create a simple Pod to use as a test environment
|
||||
|
||||
Create a file named busybox.yaml with the
|
||||
following contents:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: busybox
|
||||
namespace: default
|
||||
spec:
|
||||
containers:
|
||||
- image: busybox
|
||||
command:
|
||||
- sleep
|
||||
- "3600"
|
||||
imagePullPolicy: IfNotPresent
|
||||
name: busybox
|
||||
restartPolicy: Always
|
||||
```
|
||||
|
||||
Then create a pod using this file:
|
||||
|
||||
```
|
||||
kubectl create -f busybox.yaml
|
||||
```
|
||||
|
||||
### Wait for this pod to go into the running state
|
||||
|
||||
You can get its status with:
|
||||
```
|
||||
kubectl get pods busybox
|
||||
```
|
||||
|
||||
You should see:
|
||||
|
||||
```
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
busybox 1/1 Running 0 <some-time>
|
||||
```
|
||||
|
||||
### Validate that DNS is working
|
||||
|
||||
Once that pod is running, you can exec nslookup in that environment:
|
||||
|
||||
```
|
||||
kubectl exec -ti busybox -- nslookup kubernetes.default
|
||||
```
|
||||
|
||||
You should see something like:
|
||||
|
||||
```
|
||||
Server: 10.0.0.10
|
||||
Address 1: 10.0.0.10
|
||||
|
||||
Name: kubernetes.default
|
||||
Address 1: 10.0.0.1
|
||||
```
|
||||
|
||||
If you see that, DNS is working correctly.
|
||||
|
||||
### Troubleshooting Tips
|
||||
|
||||
If the nslookup command fails, check the following:
|
||||
|
||||
#### Check the local DNS configuration first
|
||||
Take a look inside the resolv.conf file. (See "Inheriting DNS from the node" and "Known issues" below for more information)
|
||||
|
||||
```
|
||||
kubectl exec busybox cat /etc/resolv.conf
|
||||
```
|
||||
|
||||
Verify that the search path and name server are set up like the following (note that search path may vary for different cloud providers):
|
||||
|
||||
```
|
||||
search default.svc.cluster.local svc.cluster.local cluster.local google.internal c.gce_project_id.internal
|
||||
nameserver 10.0.0.10
|
||||
options ndots:5
|
||||
```
|
||||
|
||||
#### Quick diagnosis
|
||||
|
||||
Errors such as the following indicate a problem with the kube-dns add-on or associated Services:
|
||||
|
||||
```
|
||||
$ kubectl exec -ti busybox -- nslookup kubernetes.default
|
||||
Server: 10.0.0.10
|
||||
Address 1: 10.0.0.10
|
||||
|
||||
nslookup: can't resolve 'kubernetes.default'
|
||||
```
|
||||
|
||||
or
|
||||
|
||||
```
|
||||
$ kubectl exec -ti busybox -- nslookup kubernetes.default
|
||||
Server: 10.0.0.10
|
||||
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
|
||||
|
||||
nslookup: can't resolve 'kubernetes.default'
|
||||
```
|
||||
|
||||
#### Check if the DNS pod is running
|
||||
|
||||
Use the `kubectl get pods` command to verify that the DNS pod is running.
|
||||
|
||||
```
|
||||
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
|
||||
```
|
||||
|
||||
You should see something like:
|
||||
|
||||
```
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
...
|
||||
kube-dns-v19-ezo1y 3/3 Running 0 1h
|
||||
...
|
||||
```
|
||||
|
||||
If you see that no pod is running or that the pod has failed/completed, the DNS add-on may not be deployed by default in your current environment and you will have to deploy it manually.
|
||||
|
||||
#### Check for Errors in the DNS pod
|
||||
|
||||
Use the `kubectl logs` command to see the logs for the DNS daemons.
|
||||
|
||||
```
|
||||
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c kubedns
|
||||
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c dnsmasq
|
||||
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c healthz
|
||||
```
|
||||
|
||||
Look for any suspicious log entries. A leading W, E, or F indicates the Warning, Error, or Failure severity level. Search for entries at these levels and use [kubernetes issues](https://github.com/kubernetes/kubernetes/issues) to report unexpected errors.
|
||||
|
||||
#### Is DNS service up?
|
||||
|
||||
Verify that the DNS service is up by using the `kubectl get service` command.
|
||||
|
||||
```
|
||||
kubectl get svc --namespace=kube-system
|
||||
```
|
||||
|
||||
You should see:
|
||||
|
||||
```
|
||||
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
|
||||
...
|
||||
kube-dns 10.0.0.10 <none> 53/UDP,53/TCP 1h
|
||||
...
|
||||
```
|
||||
|
||||
If you have created the Service, or if it should have been created by default but does not appear, see the [debugging services page](http://kubernetes.io/docs/user-guide/debugging-services/) for more information.
|
||||
|
||||
#### Are DNS endpoints exposed?
|
||||
|
||||
You can verify that DNS endpoints are exposed by using the `kubectl get endpoints` command.
|
||||
|
||||
```
|
||||
kubectl get ep kube-dns --namespace=kube-system
|
||||
```
|
||||
|
||||
You should see something like:
|
||||
```
|
||||
NAME ENDPOINTS AGE
|
||||
kube-dns 10.180.3.17:53,10.180.3.17:53 1h
|
||||
```
|
||||
|
||||
If you do not see the endpoints, see the endpoints section in the [debugging services documentation](http://kubernetes.io/docs/user-guide/debugging-services/).
|
||||
|
||||
For additional Kubernetes DNS examples, see the [cluster-dns examples](https://github.com/kubernetes/kubernetes/tree/master/examples/cluster-dns) in the Kubernetes GitHub repository.
|
||||
|
||||
## Kubernetes Federation (Multiple Zone support)
|
||||
|
||||
Release 1.3 introduced Cluster Federation support for multi-site
|
||||
Kubernetes installations. This required some minor
|
||||
(backward-compatible) changes to the way
|
||||
the Kubernetes cluster DNS server processes DNS queries, to facilitate
|
||||
the lookup of federated services (which span multiple Kubernetes clusters).
|
||||
See the [Cluster Federation Administrators' Guide](/docs/admin/federation) for more
|
||||
details on Cluster Federation and multi-site support.
|
||||
|
||||
## How it Works
|
||||
|
||||
The running Kubernetes DNS pod holds 3 containers - kubedns, dnsmasq and a health check called healthz.
|
||||
The kubedns process watches the Kubernetes master for changes in Services and Endpoints, and maintains
|
||||
in-memory lookup structures to service DNS requests. The dnsmasq container adds DNS caching to improve
|
||||
performance. The healthz container provides a single health check endpoint while performing dual healthchecks
|
||||
(for dnsmasq and kubedns).
|
||||
|
||||
The DNS pod is exposed as a Kubernetes Service with a static IP. Once that IP is assigned, the kubelet passes it to each container's DNS configuration using the `--cluster-dns=10.0.0.10` flag.
|
||||
|
||||
DNS names also need domains. The local domain is configured in the kubelet using the `--cluster-domain=<default local domain>` flag.
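Put together, a kubelet invocation might include flags like the following sketch (values are illustrative and depend on your cluster):

```shell
kubelet --cluster-dns=10.0.0.10 --cluster-domain=cluster.local ...
```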
|
||||
|
||||
The Kubernetes cluster DNS server (based on the [SkyDNS](https://github.com/skynetservices/skydns) library)
|
||||
supports forward lookups (A records), service lookups (SRV records) and reverse IP address lookups (PTR records).
|
||||
|
||||
## Inheriting DNS from the node
|
||||
When running a pod, kubelet will prepend the cluster DNS server and search
|
||||
paths to the node's own DNS settings. If the node is able to resolve DNS names
|
||||
specific to the larger environment, pods should be able to, also. See "Known
|
||||
issues" below for a caveat.
|
||||
|
||||
If you don't want this, or if you want a different DNS config for pods, you can
|
||||
use the kubelet's `--resolv-conf` flag. Setting it to "" means that pods will
|
||||
not inherit DNS. Setting it to a valid file path means that kubelet will use
|
||||
this file instead of `/etc/resolv.conf` for DNS inheritance.
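For example (illustrative invocations; the custom file path is hypothetical):

```shell
# Pods do not inherit any DNS settings from the node:
kubelet --resolv-conf="" ...

# Pods inherit DNS settings from a custom file instead of /etc/resolv.conf:
kubelet --resolv-conf=/etc/kubernetes/resolv.conf ...
```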
|
||||
|
||||
## Known issues
|
||||
Kubernetes installs do not configure the nodes' resolv.conf files to use the
|
||||
cluster DNS by default, because that process is inherently distro-specific.
|
||||
This should probably be implemented eventually.
|
||||
|
||||
Linux's libc is impossibly stuck ([see this bug from
|
||||
2005](https://bugzilla.redhat.com/show_bug.cgi?id=168253)) with limits of just
|
||||
3 DNS `nameserver` records and 6 DNS `search` records. Kubernetes needs to
|
||||
consume 1 `nameserver` record and 3 `search` records. This means that if a
|
||||
local installation already uses 3 `nameserver`s or uses more than 3 `search`es,
|
||||
some of those settings will be lost. As a partial workaround, the node can run
|
||||
`dnsmasq` which will provide more `nameserver` entries, but not more `search`
|
||||
entries. You can also use kubelet's `--resolv-conf` flag.
|
||||
|
||||
If you are using Alpine version 3.3 or earlier as your base image, DNS may not
|
||||
work properly owing to a known issue with Alpine. Check [here](https://github.com/kubernetes/kubernetes/issues/30215)
|
||||
for more information.
|
||||
|
||||
## References
|
||||
|
||||
- [Docs for the DNS cluster addon](http://releases.k8s.io/{{page.githubbranch}}/cluster/addons/dns/README.md)
|
||||
|
||||
## What's next
|
||||
- [Autoscaling the DNS Service in a Cluster](/docs/tasks/administer-cluster/dns-horizontal-autoscaling/).
|
||||
[DNS Pods and Services](/docs/concepts/services-networking/dns-pod-service/)
|
||||
|
|
|
@ -4,205 +4,6 @@ assignees:
|
|||
title: Setting up Cluster Federation with Kubefed
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
{% include user-guide-content-moved.md %}
|
||||
|
||||
Kubernetes version 1.5 includes a new command line tool called
|
||||
`kubefed` to help you administrate your federated clusters.
|
||||
`kubefed` helps you to deploy a new Kubernetes cluster federation
|
||||
control plane, and to add clusters to or remove clusters from an
|
||||
existing federation control plane.
|
||||
|
||||
This guide explains how to administer a Kubernetes Cluster Federation
|
||||
using `kubefed`.
|
||||
|
||||
> Note: `kubefed` is an alpha feature in Kubernetes 1.5.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes cluster. Please
|
||||
see one of the [getting started](/docs/getting-started-guides/) guides
|
||||
for installation instructions for your platform.
|
||||
|
||||
|
||||
## Getting `kubefed`
|
||||
|
||||
Download the client tarball corresponding to Kubernetes version 1.5
|
||||
or later
|
||||
[from the release page](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md),
|
||||
extract the binaries in the tarball to one of the directories
|
||||
in your `$PATH` and set the executable permission on those binaries.
|
||||
|
||||
Note: The URL in the curl command below downloads the binaries for
|
||||
Linux amd64. If you are on a different platform, please use the URL
|
||||
for the binaries appropriate for your platform. You can find the list
|
||||
of available binaries on the [release page](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md#client-binaries-1).
|
||||
|
||||
|
||||
```shell
|
||||
curl -O https://storage.googleapis.com/kubernetes-release/release/v1.5.2/kubernetes-client-linux-amd64.tar.gz
|
||||
tar -xzvf kubernetes-client-linux-amd64.tar.gz
|
||||
sudo cp kubernetes/client/bin/kubefed /usr/local/bin
|
||||
sudo chmod +x /usr/local/bin/kubefed
|
||||
sudo cp kubernetes/client/bin/kubectl /usr/local/bin
|
||||
sudo chmod +x /usr/local/bin/kubectl
|
||||
```
|
||||
|
||||
|
||||
## Choosing a host cluster
|
||||
|
||||
You'll need to choose one of your Kubernetes clusters to be the
|
||||
*host cluster*. The host cluster hosts the components that make up
|
||||
your federation control plane. Ensure that you have a `kubeconfig`
|
||||
entry in your local `kubeconfig` that corresponds to the host cluster.
|
||||
You can verify that you have the required `kubeconfig` entry by
|
||||
running:
|
||||
|
||||
```shell
|
||||
kubectl config get-contexts
|
||||
```
|
||||
|
||||
The output should contain an entry corresponding to your host cluster,
|
||||
similar to the following:
|
||||
|
||||
```
|
||||
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
|
||||
gke_myproject_asia-east1-b_gce-asia-east1 gke_myproject_asia-east1-b_gce-asia-east1 gke_myproject_asia-east1-b_gce-asia-east1
|
||||
```
|
||||
|
||||
|
||||
You'll need to provide the `kubeconfig` context (called name in the
|
||||
entry above) for your host cluster when you deploy your federation
|
||||
control plane.
|
||||
|
||||
|
||||
## Deploying a federation control plane
|
||||
|
||||
To deploy a federation control plane on your host cluster, run
|
||||
the `kubefed init` command. When you use `kubefed init`, you must provide
|
||||
the following:
|
||||
|
||||
* Federation name
|
||||
* `--host-cluster-context`, the `kubeconfig` context for the host cluster
|
||||
* `--dns-zone-name`, a domain name suffix for your federated services
|
||||
|
||||
The following example command deploys a federation control plane with
|
||||
the name `fellowship`, a host cluster context `rivendell`, and the
|
||||
domain suffix `example.com`:
|
||||
|
||||
```shell
|
||||
kubefed init fellowship --host-cluster-context=rivendell --dns-zone-name="example.com"
|
||||
```
|
||||
|
||||
The domain suffix specified in `--dns-zone-name` must be an existing
|
||||
domain that you control, and that is programmable by your DNS provider.
|
||||
|
||||
`kubefed init` sets up the federation control plane in the host
|
||||
cluster and also adds an entry for the federation API server in your
|
||||
local kubeconfig. Note that in the alpha release in Kubernetes 1.5,
|
||||
`kubefed init` does not automatically set the current context to the
|
||||
newly deployed federation. You can set the current context manually by
|
||||
running:
|
||||
|
||||
```shell
|
||||
kubectl config use-context fellowship
|
||||
```
|
||||
|
||||
where `fellowship` is the name of your federation.
|
||||
|
||||
|
||||
## Adding a cluster to a federation
|
||||
|
||||
Once you've deployed a federation control plane, you'll need to make
|
||||
that control plane aware of the clusters it should manage. You can add
|
||||
a cluster to your federation by using the `kubefed join` command.
|
||||
|
||||
To use `kubefed join`, you'll need to provide the name of the cluster
|
||||
you want to add to the federation, and the `--host-cluster-context`
|
||||
for the federation control plane's host cluster.
|
||||
|
||||
The following example command adds the cluster `gondor` to the
|
||||
federation with host cluster `rivendell`:
|
||||
|
||||
```shell
|
||||
kubefed join gondor --host-cluster-context=rivendell
|
||||
```
|
||||
|
||||
> Note: Kubernetes requires that you manually join clusters to a
|
||||
federation because the federation control plane manages only those
|
||||
clusters that it is responsible for managing. Adding a cluster tells
|
||||
the federation control plane that it is responsible for managing that
|
||||
cluster.
|
||||
|
||||
### Naming rules and customization
|
||||
|
||||
The cluster name you supply to `kubefed join` must be a valid RFC 1035
|
||||
label.
|
||||
|
||||
Furthermore, the federation control plane requires credentials for the
|
||||
joined clusters to operate on them. These credentials are obtained
|
||||
from the local kubeconfig. `kubefed join` uses the cluster name
|
||||
specified as the argument to look for the cluster's context in the
|
||||
local kubeconfig. If it fails to find a matching context, it exits
|
||||
with an error.
|
||||
|
||||
This might cause issues in cases where context names for each cluster
|
||||
in the federation don't follow
|
||||
[RFC 1035](https://www.ietf.org/rfc/rfc1035.txt) label naming rules.
|
||||
In such cases, you can specify a cluster name that conforms to the
|
||||
[RFC 1035](https://www.ietf.org/rfc/rfc1035.txt) label naming rules
|
||||
and specify the cluster context using the `--cluster-context` flag.
|
||||
For example, if the context of the cluster you are joining is
|
||||
`gondor_needs-no_king`, then you can join the cluster by running:
|
||||
|
||||
```shell
|
||||
kubefed join gondor --host-cluster-context=rivendell --cluster-context=gondor_needs-no_king
|
||||
```
|
||||
|
||||
#### Secret name
|
||||
|
||||
Cluster credentials required by the federation control plane as
|
||||
described above are stored as a secret in the host cluster. The name
|
||||
of the secret is also derived from the cluster name.
|
||||
|
||||
However, the name of a secret object in Kubernetes should conform
|
||||
to the DNS subdomain name specification described in
|
||||
[RFC 1123](https://tools.ietf.org/html/rfc1123). If this isn't the
|
||||
case, you can pass the secret name to `kubefed join` using the
|
||||
`--secret-name` flag. For example, if the cluster name is `noldor` and
|
||||
the secret name is `11kingdom`, you can join the cluster by
|
||||
running:
|
||||
|
||||
```shell
|
||||
kubefed join noldor --host-cluster-context=rivendell --secret-name=11kingdom
|
||||
```
|
||||
|
||||
Note: If your cluster name does not conform to the DNS subdomain name
|
||||
specification, all you need to do is supply the secret name via the
|
||||
`--secret-name` flag. `kubefed join` automatically creates the secret
|
||||
for you.
|
||||
|
||||
|
||||
## Removing a cluster from a federation
|
||||
|
||||
To remove a cluster from a federation, run the `kubefed unjoin`
|
||||
command with the cluster name and the federation's
|
||||
`--host-cluster-context`:
|
||||
|
||||
```shell
|
||||
kubefed unjoin gondor --host-cluster-context=rivendell
|
||||
```
|
||||
|
||||
|
||||
## Turning down the federation control plane
|
||||
|
||||
Proper cleanup of the federation control plane is not fully implemented in
|
||||
this alpha release of `kubefed`. However, for the time being, deleting
|
||||
the federation system namespace should remove all the resources except
|
||||
the persistent storage volume dynamically provisioned for the
|
||||
federation control plane's etcd. You can delete the federation
|
||||
namespace by running the following command:
|
||||
|
||||
```shell
|
||||
$ kubectl delete ns federation-system
|
||||
```
|
||||
[Setting up Cluster Federation with kubefed](/docs/tutorials/federation/set-up-cluster-federation-kubefed/)
|
||||
|
|
|
@ -5,210 +5,6 @@ assignees:
|
|||
title: Setting Pod CPU and Memory Limits
|
||||
---
|
||||
|
||||
By default, pods run with unbounded CPU and memory limits. This means that any pod in the
|
||||
system will be able to consume as much CPU and memory as is available on the node that executes the pod.
|
||||
{% include user-guide-content-moved.md %}
|
||||
|
||||
Users may want to impose restrictions on the amount of resources a single pod in the system may consume
|
||||
for a variety of reasons.
|
||||
|
||||
For example:
|
||||
|
||||
1. Each node in the cluster has 2GB of memory. The cluster operator does not want to accept pods
|
||||
that require more than 2GB of memory since no node in the cluster can support the requirement. To prevent a
|
||||
pod from being permanently unscheduled to a node, the operator instead chooses to reject pods that exceed 2GB
|
||||
of memory as part of admission control.
|
||||
2. A cluster is shared by two communities in an organization that runs production and development workloads
|
||||
respectively. Production workloads may consume up to 8GB of memory, but development workloads may consume up
|
||||
to 512MB of memory. The cluster operator creates a separate namespace for each workload, and applies limits to
|
||||
each namespace.
|
||||
3. Users may create a pod which consumes resources just below the capacity of a machine. The leftover space
|
||||
may be too small to be useful, but big enough for the waste to be costly over the entire cluster. As a result,
|
||||
the cluster operator may want to require that a pod consume at least 20% of the memory and CPU of the
|
||||
average node size in order to provide for more uniform scheduling and limit waste.
|
||||
|
||||
This example demonstrates how limits can be applied to a Kubernetes [namespace](/docs/admin/namespaces/walkthrough/) to control
|
||||
min/max resource limits per pod. In addition, this example demonstrates how you can
|
||||
apply default resource limits to pods in the absence of an end-user specified value.
|
||||
|
||||
See the [LimitRange design doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_limit_range.md) for more information. For a detailed description of the Kubernetes resource model, see [Resources](/docs/user-guide/compute-resources/).
|
||||
|
||||
## Step 0: Prerequisites
|
||||
|
||||
This example requires a running Kubernetes cluster. See the [Getting Started guides](/docs/getting-started-guides/) for how to get started.
|
||||
|
||||
Change to the `<kubernetes>` directory if you're not already there.
|
||||
|
||||
## Step 1: Create a namespace
|
||||
|
||||
This example will work in a custom namespace to demonstrate the concepts involved.
|
||||
|
||||
Let's create a new namespace called limit-example:
|
||||
|
||||
```shell
|
||||
$ kubectl create namespace limit-example
|
||||
namespace "limit-example" created
|
||||
```
|
||||
|
||||
Note that `kubectl` commands will print the type and name of the resource created or mutated, which can then be used in subsequent commands:
|
||||
|
||||
```shell
|
||||
$ kubectl get namespaces
|
||||
NAME STATUS AGE
|
||||
default Active 51s
|
||||
limit-example Active 45s
|
||||
```
|
||||
|
||||
## Step 2: Apply a limit to the namespace
|
||||
|
||||
Let's create a simple limit in our namespace.
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/limitrange/limits.yaml --namespace=limit-example
|
||||
limitrange "mylimits" created
|
||||
```
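The repository file itself is not reproduced on this page; as a sketch, a LimitRange that would produce the values shown by `kubectl describe` below could look like this:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mylimits
spec:
  limits:
  - type: Pod
    min:
      cpu: 200m
      memory: 6Mi
    max:
      cpu: "2"
      memory: 1Gi
  - type: Container
    min:
      cpu: 100m
      memory: 3Mi
    max:
      cpu: "2"
      memory: 1Gi
    defaultRequest:
      cpu: 200m
      memory: 100Mi
    default:
      cpu: 300m
      memory: 200Mi
```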
|
||||
|
||||
Let's describe the limits that we have imposed in our namespace.
|
||||
|
||||
```shell
|
||||
$ kubectl describe limits mylimits --namespace=limit-example
|
||||
Name: mylimits
|
||||
Namespace: limit-example
|
||||
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
|
||||
---- -------- --- --- --------------- ------------- -----------------------
|
||||
Pod cpu 200m 2 - - -
|
||||
Pod memory 6Mi 1Gi - - -
|
||||
Container cpu 100m 2 200m 300m -
|
||||
Container memory 3Mi 1Gi 100Mi 200Mi -
|
||||
```
|
||||
|
||||
In this scenario, we have said the following:
|
||||
|
||||
1. If a max constraint is specified for a resource (2 CPU and 1Gi memory in this case), then a limit
|
||||
must be specified for that resource across all containers. Failure to specify a limit will result in
|
||||
a validation error when attempting to create the pod. Note that a default limit value is set by
|
||||
the *default* field in `limits.yaml` (300m CPU and 200Mi memory).
|
||||
2. If a min constraint is specified for a resource (100m CPU and 3Mi memory in this case), then a
|
||||
request must be specified for that resource across all containers. Failure to specify a request will
|
||||
result in a validation error when attempting to create the pod. Note that a default request value is
|
||||
set by the *defaultRequest* field in `limits.yaml` (200m CPU and 100Mi memory).
|
||||
3. For any pod, the sum of all containers' memory requests must be >= 6Mi and the sum of all containers' memory limits must be <= 1Gi; the sum of all containers' CPU requests must be >= 200m and the sum of all containers' CPU limits must be <= 2.
|
||||
|
||||
## Step 3: Enforcing limits at point of creation
|
||||
|
||||
The limits enumerated in a namespace are only enforced when a pod is created or updated in
|
||||
the cluster. If you change the limits to a different value range, it does not affect pods that
|
||||
were previously created in a namespace.
|
||||
|
||||
If a resource (CPU or memory) is being restricted by a limit, the user will get an error at time
|
||||
of creation explaining why.
|
||||
|
||||
Let's first spin up a [Deployment](/docs/user-guide/deployments) that creates a single container Pod to demonstrate
|
||||
how default values are applied to each pod.
|
||||
|
||||
```shell
|
||||
$ kubectl run nginx --image=nginx --replicas=1 --namespace=limit-example
|
||||
deployment "nginx" created
|
||||
```
|
||||
|
||||
Note that `kubectl run` creates a Deployment named "nginx" on Kubernetes cluster >= v1.2. If you are running older versions, it creates replication controllers instead.
|
||||
If you want to obtain the old behavior, use `--generator=run/v1` to create replication controllers. See [`kubectl run`](/docs/user-guide/kubectl/kubectl_run/) for more details.
|
||||
The Deployment manages 1 replica of a single-container Pod. Let's take a look at the Pod it manages. First, find the name of the Pod:
|
||||
|
||||
```shell
|
||||
$ kubectl get pods --namespace=limit-example
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
nginx-2040093540-s8vzu 1/1 Running 0 11s
|
||||
```
|
||||
|
||||
Let's print this Pod with yaml output format (using `-o yaml` flag), and then `grep` the `resources` field. Note that your pod name will be different.
|
||||
|
||||
```shell
|
||||
$ kubectl get pods nginx-2040093540-s8vzu --namespace=limit-example -o yaml | grep resources -C 8
|
||||
resourceVersion: "57"
|
||||
selfLink: /api/v1/namespaces/limit-example/pods/nginx-2040093540-ivimu
|
||||
uid: 67b20741-f53b-11e5-b066-64510658e388
|
||||
spec:
|
||||
containers:
|
||||
- image: nginx
|
||||
imagePullPolicy: Always
|
||||
name: nginx
|
||||
resources:
|
||||
limits:
|
||||
cpu: 300m
|
||||
memory: 200Mi
|
||||
requests:
|
||||
cpu: 200m
|
||||
memory: 100Mi
|
||||
terminationMessagePath: /dev/termination-log
|
||||
volumeMounts:
|
||||
```
|
||||
|
||||
Note that our nginx container has picked up the namespace default CPU and memory resource *limits* and *requests*.
|
||||
|
||||
Let's create a pod that exceeds our allowed limits by giving it a container that requests 3 CPU cores.
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/limitrange/invalid-pod.yaml --namespace=limit-example
|
||||
Error from server: error when creating "docs/admin/limitrange/invalid-pod.yaml": Pod "invalid-pod" is forbidden: [Maximum cpu usage per Pod is 2, but limit is 3., Maximum cpu usage per Container is 2, but limit is 3.]
|
||||
```
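The contents of `invalid-pod.yaml` are not shown on this page; a pod that would be rejected in the same way could look like the following sketch (the actual file may differ):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: invalid-pod
spec:
  containers:
  - name: invalid-container                      # hypothetical container name
    image: gcr.io/google_containers/serve_hostname
    resources:
      limits:
        cpu: "3"                                 # exceeds the 2-CPU maximum above
        memory: 100Mi
```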
|
||||
|
||||
Let's create a pod that falls within the allowed limit boundaries.
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/limitrange/valid-pod.yaml --namespace=limit-example
|
||||
pod "valid-pod" created
|
||||
```
|
||||
|
||||
Now look at the Pod's resources field:
|
||||
|
||||
```shell
|
||||
$ kubectl get pods valid-pod --namespace=limit-example -o yaml | grep -C 6 resources
|
||||
uid: 3b1bfd7a-f53c-11e5-b066-64510658e388
|
||||
spec:
|
||||
containers:
|
||||
- image: gcr.io/google_containers/serve_hostname
|
||||
imagePullPolicy: Always
|
||||
name: kubernetes-serve-hostname
|
||||
resources:
|
||||
limits:
|
||||
cpu: "1"
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: "1"
|
||||
memory: 512Mi
|
||||
```
|
||||
|
||||
Note that this pod specifies explicit resource *limits* and *requests* so it did not pick up the namespace
|
||||
default values.
|
||||
|
||||
Note: The *limits* for the CPU resource are enforced in the default Kubernetes setup on the physical node
|
||||
that runs the container unless the administrator deploys the kubelet with the following flag:
|
||||
|
||||
```shell
|
||||
$ kubelet --help
|
||||
Usage of kubelet
|
||||
....
|
||||
--cpu-cfs-quota[=true]: Enable CPU CFS quota enforcement for containers that specify CPU limits
|
||||
$ kubelet --cpu-cfs-quota=false ...
|
||||
```
|
||||
|
||||
## Step 4: Cleanup
|
||||
|
||||
To remove the resources used by this example, you can just delete the limit-example namespace.
|
||||
|
||||
```shell
|
||||
$ kubectl delete namespace limit-example
|
||||
namespace "limit-example" deleted
|
||||
$ kubectl get namespaces
|
||||
NAME STATUS AGE
|
||||
default Active 12m
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
Cluster operators that want to restrict the amount of resources a single container or pod may consume
|
||||
are able to define allowable ranges per Kubernetes namespace. In the absence of any explicit assignments,
|
||||
the Kubernetes system is able to apply default resource *limits* and *requests* if desired in order to
|
||||
constrain the amount of resource a pod consumes on a node.
|
||||
[Setting Pod CPU and Memory Limits](/docs/tasks/configure-pod-container/limit-range/)
|
||||
|
|
|
@ -4,63 +4,6 @@ assignees:
|
|||
title: Using Multiple Clusters
|
||||
---
|
||||
|
||||
You may want to set up multiple Kubernetes clusters, both to
|
||||
have clusters in different regions to be nearer to your users, and to tolerate failures and/or invasive maintenance.
|
||||
This document describes some of the issues to consider when making a decision about doing so.
|
||||
{% include user-guide-content-moved.md %}
|
||||
|
||||
If you decide to have multiple clusters, Kubernetes provides a way to [federate them](/docs/admin/federation/).
|
||||
|
||||
## Scope of a single cluster
|
||||
|
||||
On IaaS providers such as Google Compute Engine or Amazon Web Services, a VM exists in a
|
||||
[zone](https://cloud.google.com/compute/docs/zones) or [availability
|
||||
zone](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html).
|
||||
We suggest that all the VMs in a Kubernetes cluster should be in the same availability zone, because:
|
||||
|
||||
- compared to having a single global Kubernetes cluster, there are fewer single points of failure
|
||||
- compared to a cluster that spans availability zones, it is easier to reason about the availability properties of a
|
||||
single-zone cluster.
|
||||
- when the Kubernetes developers are designing the system (e.g. making assumptions about latency, bandwidth, or
|
||||
correlated failures) they are assuming all the machines are in a single data center, or otherwise closely connected.
|
||||
|
||||
It is okay to have multiple clusters per availability zone, though on balance we think fewer is better.
|
||||
Reasons to prefer fewer clusters are:
|
||||
|
||||
- improved bin packing of Pods in some cases with more nodes in one cluster (less resource fragmentation)
|
||||
- reduced operational overhead (though the advantage is diminished as ops tooling and processes mature)
|
||||
- reduced costs for per-cluster fixed resource costs, e.g. apiserver VMs (but small as a percentage
|
||||
of overall cluster cost for medium to large clusters).
|
||||
|
||||
Reasons to have multiple clusters include:
|
||||
|
||||
- strict security policies requiring isolation of one class of work from another (but, see Partitioning Clusters
|
||||
below).
|
||||
- test clusters to canary new Kubernetes releases or other cluster software.
|
||||
|
||||
## Selecting the right number of clusters
|
||||
|
||||
The selection of the number of Kubernetes clusters may be a relatively static choice, only revisited occasionally.
|
||||
By contrast, the number of nodes in a cluster and the number of pods in a service may change frequently according to
|
||||
load and growth.
|
||||
|
||||
To pick the number of clusters, first decide which regions you need to be in to have adequate latency to all your end users, for services that will run
|
||||
on Kubernetes (if you use a Content Distribution Network, the latency requirements for the CDN-hosted content need not
|
||||
be considered). Legal issues might influence this as well. For example, a company with a global customer base might decide to have clusters in US, EU, AP, and SA regions.
|
||||
Call the number of regions to be in `R`.
|
||||
|
||||
Second, decide how many clusters should be able to be unavailable at the same time, while still being available. Call
|
||||
the number that can be unavailable `U`. If you are not sure, then 1 is a fine choice.
|
||||
|
||||
If it is allowable for load-balancing to direct traffic to any region in the event of a cluster failure, then
|
||||
you need at least the larger of `R` or `U + 1` clusters. If it is not (e.g. you want to ensure low latency for all
|
||||
users in the event of a cluster failure), then you need to have `R * (U + 1)` clusters
|
||||
(`U + 1` in each of `R` regions). In any case, try to put each cluster in a different zone.
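For example, with `R = 3` regions and `U = 1`, the first case needs `max(3, 1 + 1) = 3` clusters, while the second case needs `3 * (1 + 1) = 6` clusters, two in each region.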
|
||||
|
||||
Finally, if any of your clusters would need more than the maximum recommended number of nodes for a Kubernetes cluster, then
|
||||
you may need even more clusters. Kubernetes v1.3 supports clusters up to 1000 nodes in size.
|
||||
|
||||
## Working with multiple clusters
|
||||
|
||||
When you have multiple clusters, you would typically create services with the same config in each cluster and put each of those
|
||||
service instances behind a load balancer (AWS Elastic Load Balancer, GCE Forwarding Rule or HTTP Load Balancer) spanning all of them, so that
|
||||
failures of a single cluster are not visible to end users.
|
||||
[Using Multiple Clusters](/docs/concepts/cluster-administration/multiple-clusters/)
|
||||
|
|
|
@ -6,68 +6,6 @@ assignees:
|
|||
title: Network Plugins
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
{% include user-guide-content-moved.md %}
|
||||
|
||||
__Disclaimer__: Network plugins are in alpha. The contents of this document will change rapidly.
|
||||
|
||||
Network plugins in Kubernetes come in a few flavors:
|
||||
|
||||
* CNI plugins: adhere to the appc/CNI specification, designed for interoperability.
|
||||
* Kubenet plugin: implements basic `cbr0` using the `bridge` and `host-local` CNI plugins
|
||||
|
||||
## Installation
|
||||
|
||||
The kubelet has a single default network plugin, and a default network common to the entire cluster. It probes for plugins when it starts up, remembers what it found, and executes the selected plugin at appropriate times in the pod lifecycle (this is only true for docker, as rkt manages its own CNI plugins). There are two Kubelet command line parameters to keep in mind when using plugins:
|
||||
|
||||
* `network-plugin-dir`: Kubelet probes this directory for plugins on startup
|
||||
* `network-plugin`: The network plugin to use from `network-plugin-dir`. It must match the name reported by a plugin probed from the plugin directory. For CNI plugins, this is simply "cni".
|
||||
|
||||
## Network Plugin Requirements
|
||||
|
||||
Besides providing the [`NetworkPlugin` interface](https://github.com/kubernetes/kubernetes/tree/{{page.version}}/pkg/kubelet/network/plugins.go) to configure and clean up pod networking, the plugin may also need specific support for kube-proxy. The iptables proxy obviously depends on iptables, and the plugin may need to ensure that container traffic is made available to iptables. For example, if the plugin connects containers to a Linux bridge, the plugin must set the `net/bridge/bridge-nf-call-iptables` sysctl to `1` to ensure that the iptables proxy functions correctly. If the plugin does not use a Linux bridge (but instead something like Open vSwitch or some other mechanism) it should ensure container traffic is appropriately routed for the proxy.
|
||||
|
||||
By default if no kubelet network plugin is specified, the `noop` plugin is used, which sets `net/bridge/bridge-nf-call-iptables=1` to ensure simple configurations (like docker with a bridge) work correctly with the iptables proxy.
|
||||
|
||||
### CNI
|
||||
|
||||
The CNI plugin is selected by passing Kubelet the `--network-plugin=cni` command-line option. Kubelet reads a file from `--cni-conf-dir` (default `/etc/cni/net.d`) and uses the CNI configuration from that file to set up each pod's network. The CNI configuration file must match the [CNI specification](https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration), and any required CNI plugins referenced by the configuration must be present in `--cni-bin-dir` (default `/opt/cni/bin`).
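As an illustration (adapted from the CNI specification rather than taken from this page; names and subnets are examples), a minimal bridge configuration saved as `/etc/cni/net.d/10-mynet.conf` could look like:

```json
{
  "cniVersion": "0.2.0",
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16",
    "routes": [
      { "dst": "0.0.0.0/0" }
    ]
  }
}
```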
|
||||
|
||||
If there are multiple CNI configuration files in the directory, the first one in lexicographic order of file name is used.
|
||||
|
||||
In addition to the CNI plugin specified by the configuration file, Kubernetes requires the standard CNI [`lo`](https://github.com/containernetworking/cni/blob/master/plugins/main/loopback/loopback.go) plugin, at minimum version 0.2.0.
|
||||
|
||||
Limitation: Due to [#31307](https://github.com/kubernetes/kubernetes/issues/31307), `HostPort` does not work with the CNI networking plugin at the moment. This means that the `hostPort` attribute in a pod is simply ignored.
|
||||
|
||||
### kubenet
|
||||
|
||||
Kubenet is a very basic, simple network plugin, on Linux only. It does not, of itself, implement more advanced features like cross-node networking or network policy. It is typically used together with a cloud provider that sets up routing rules for communication between nodes, or in single-node environments.
|
||||
|
||||
Kubenet creates a Linux bridge named `cbr0` and creates a veth pair for each pod with the host end of each pair connected to `cbr0`. The pod end of the pair is assigned an IP address allocated from a range assigned to the node either through configuration or by the controller-manager. `cbr0` is assigned an MTU matching the smallest MTU of an enabled normal interface on the host.
|
||||
|
||||
The plugin requires a few things (a combined example invocation is sketched after this list):
|
||||
|
||||
* The standard CNI `bridge`, `lo` and `host-local` plugins are required, at minimum version 0.2.0. Kubenet will first search for them in `/opt/cni/bin`. Specify `network-plugin-dir` to supply an additional search path. The first match found takes effect.
|
||||
* Kubelet must be run with the `--network-plugin=kubenet` argument to enable the plugin
|
||||
* Kubelet should also be run with the `--non-masquerade-cidr=<clusterCidr>` argument to ensure traffic to IPs outside this range will use IP masquerade.
|
||||
* The node must be assigned an IP subnet through either the `--pod-cidr` kubelet command-line option or the `--allocate-node-cidrs=true --cluster-cidr=<cidr>` controller-manager command-line options.
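Putting the requirements above together, a kubelet invocation could look like this sketch (the CIDRs are examples only):

```shell
kubelet --network-plugin=kubenet \
        --non-masquerade-cidr=10.0.0.0/8 \
        --pod-cidr=10.180.1.0/24 ...
```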
|
||||
|
||||
### Customizing the MTU (with kubenet)
|
||||
|
||||
The MTU should always be configured correctly to get the best networking performance. Network plugins will usually try
|
||||
to infer a sensible MTU, but sometimes the logic will not result in an optimal MTU. For example, if the
|
||||
Docker bridge or another interface has a small MTU, kubenet will currently select that MTU. Or if you are
|
||||
using IPSEC encapsulation, the MTU must be reduced, and this calculation is out-of-scope for
|
||||
most network plugins.
|
||||
|
||||
Where needed, you can specify the MTU explicitly with the `network-plugin-mtu` kubelet option. For example,
|
||||
on AWS the `eth0` MTU is typically 9001, so you might specify `--network-plugin-mtu=9001`. If you're using IPSEC you
|
||||
might reduce it to allow for encapsulation overhead e.g. `--network-plugin-mtu=8873`.
|
||||
|
||||
This option is provided to the network-plugin; currently **only kubenet supports `network-plugin-mtu`**.
|
||||
|
||||
## Usage Summary
|
||||
|
||||
* `--network-plugin=cni` specifies that we use the `cni` network plugin with actual CNI plugin binaries located in `--cni-bin-dir` (default `/opt/cni/bin`) and CNI plugin configuration located in `--cni-conf-dir` (default `/etc/cni/net.d`).
|
||||
* `--network-plugin=kubenet` specifies that we use the `kubenet` network plugin with CNI `bridge` and `host-local` plugins placed in `/opt/cni/bin` or `network-plugin-dir`.
|
||||
* `--network-plugin-mtu=9001` specifies the MTU to use, currently only used by the `kubenet` network plugin.
|
||||
[Network Plugins](/docs/concepts/cluster-administration/network-plugins/)
|
||||
|
|
|
@ -4,212 +4,6 @@ assignees:
|
|||
title: Networking in Kubernetes
|
||||
---
|
||||
|
||||
Kubernetes approaches networking somewhat differently than Docker does by
|
||||
default. There are 4 distinct networking problems to solve:
|
||||
{% include user-guide-content-moved.md %}
|
||||
|
||||
1. Highly-coupled container-to-container communications: this is solved by
|
||||
[pods](/docs/user-guide/pods/) and `localhost` communications.
|
||||
2. Pod-to-Pod communications: this is the primary focus of this document.
|
||||
3. Pod-to-Service communications: this is covered by [services](/docs/user-guide/services/).
|
||||
4. External-to-Service communications: this is covered by [services](/docs/user-guide/services/).
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
Kubernetes assumes that pods can communicate with other pods, regardless of
|
||||
which host they land on. We give every pod its own IP address so you do not
|
||||
need to explicitly create links between pods and you almost never need to deal
|
||||
with mapping container ports to host ports. This creates a clean,
|
||||
backwards-compatible model where pods can be treated much like VMs or physical
|
||||
hosts from the perspectives of port allocation, naming, service discovery, load
|
||||
balancing, application configuration, and migration.
|
||||
|
||||
To achieve this we must impose some requirements on how you set up your cluster
|
||||
networking.
|
||||
|
||||
## Docker model
|
||||
|
||||
Before discussing the Kubernetes approach to networking, it is worthwhile to
|
||||
review the "normal" way that networking works with Docker. By default, Docker
|
||||
uses host-private networking. It creates a virtual bridge, called `docker0` by
|
||||
default, and allocates a subnet from one of the private address blocks defined
|
||||
in [RFC1918](https://tools.ietf.org/html/rfc1918) for that bridge. For each
|
||||
container that Docker creates, it allocates a virtual ethernet device (called
|
||||
`veth`) which is attached to the bridge. The veth is mapped to appear as `eth0`
|
||||
in the container, using Linux namespaces. The in-container `eth0` interface is
|
||||
given an IP address from the bridge's address range.
|
||||
|
||||
The result is that Docker containers can talk to other containers only if they
|
||||
are on the same machine (and thus the same virtual bridge). Containers on
|
||||
different machines can not reach each other - in fact they may end up with the
|
||||
exact same network ranges and IP addresses.
|
||||
|
||||
In order for Docker containers to communicate across nodes, they must be
|
||||
allocated ports on the machine's own IP address, which are then forwarded or
|
||||
proxied to the containers. This obviously means that containers must either
|
||||
coordinate which ports they use very carefully or else be allocated ports
|
||||
dynamically.
|
||||
|
||||
## Kubernetes model
|
||||
|
||||
Coordinating ports across multiple developers is very difficult to do at
|
||||
scale and exposes users to cluster-level issues outside of their control.
|
||||
Dynamic port allocation brings a lot of complications to the system - every
|
||||
application has to take ports as flags, the API servers have to know how to
|
||||
insert dynamic port numbers into configuration blocks, services have to know
|
||||
how to find each other, etc. Rather than deal with this, Kubernetes takes a
|
||||
different approach.
|
||||
|
||||
Kubernetes imposes the following fundamental requirements on any networking
|
||||
implementation (barring any intentional network segmentation policies):
|
||||
|
||||
* all containers can communicate with all other containers without NAT
|
||||
* all nodes can communicate with all containers (and vice-versa) without NAT
|
||||
* the IP that a container sees itself as is the same IP that others see it as
|
||||
|
||||
What this means in practice is that you can not just take two computers
|
||||
running Docker and expect Kubernetes to work. You must ensure that the
|
||||
fundamental requirements are met.
|
||||
|
||||
This model is not only less complex overall, but it is principally compatible
|
||||
with the desire for Kubernetes to enable low-friction porting of apps from VMs
|
||||
to containers. If your job previously ran in a VM, your VM had an IP and could
|
||||
talk to other VMs in your project. This is the same basic model.
|
||||
|
||||
Until now this document has talked about containers. In reality, Kubernetes
|
||||
applies IP addresses at the `Pod` scope - containers within a `Pod` share their
|
||||
network namespaces - including their IP address. This means that containers
|
||||
within a `Pod` can all reach each other's ports on `localhost`. This does imply
|
||||
that containers within a `Pod` must coordinate port usage, but this is no
|
||||
different than processes in a VM. We call this the "IP-per-pod" model. This
|
||||
is implemented in Docker as a "pod container" which holds the network namespace
|
||||
open while "app containers" (the things the user specified) join that namespace
|
||||
with Docker's `--net=container:<id>` function.
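As a rough illustration of how this maps onto plain Docker commands (the image and container names here are placeholders; Kubernetes uses its own infrastructure image):

```shell
# One container holds the pod's network namespace open...
docker run -d --name pod-infra gcr.io/google_containers/pause
# ...and each app container joins that namespace:
docker run -d --net=container:pod-infra my-app-image
```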
|
||||
|
||||
As with Docker, it is possible to request host ports, but this is reduced to a
|
||||
very niche operation. In this case a port will be allocated on the host `Node`
|
||||
and traffic will be forwarded to the `Pod`. The `Pod` itself is blind to the
|
||||
existence or non-existence of host ports.
|
||||
|
||||
## How to achieve this
|
||||
|
||||
There are a number of ways that this network model can be implemented. This
|
||||
document is not an exhaustive study of the various methods, but hopefully serves
|
||||
as an introduction to various technologies and serves as a jumping-off point.
|
||||
|
||||
The following networking options are sorted alphabetically - the order does not
|
||||
imply any preferential status.
|
||||
|
||||
### Contiv
|
||||
|
||||
[Contiv](https://github.com/contiv/netplugin) provides configurable networking (native l3 using BGP, overlay using vxlan, classic l2, or Cisco-SDN/ACI) for various use cases. [Contiv](http://contiv.io) is fully open source.
|
||||
|
||||
### Flannel
|
||||
|
||||
[Flannel](https://github.com/coreos/flannel#flannel) is a very simple overlay
|
||||
network that satisfies the Kubernetes requirements. Many
|
||||
people have reported success with Flannel and Kubernetes.
|
||||
|
||||
### Google Compute Engine (GCE)
|
||||
|
||||
For the Google Compute Engine cluster configuration scripts, we use [advanced
|
||||
routing](https://cloud.google.com/compute/docs/networking#routing) to
|
||||
assign each VM a subnet (default is `/24` - 254 IPs). Any traffic bound for that
|
||||
subnet will be routed directly to the VM by the GCE network fabric. This is in
|
||||
addition to the "main" IP address assigned to the VM, which is NAT'ed for
|
||||
outbound internet access. A linux bridge (called `cbr0`) is configured to exist
|
||||
on that subnet, and is passed to docker's `--bridge` flag.
|
||||
|
||||
We start Docker with:
|
||||
|
||||
```shell
|
||||
DOCKER_OPTS="--bridge=cbr0 --iptables=false --ip-masq=false"
|
||||
```
|
||||
|
||||
This bridge is created by Kubelet (controlled by the `--network-plugin=kubenet`
|
||||
flag) according to the `Node`'s `spec.podCIDR`.
|
||||
|
||||
Docker will now allocate IPs from the `cbr-cidr` block. Containers can reach
|
||||
each other and `Nodes` over the `cbr0` bridge. Those IPs are all routable
|
||||
within the GCE project network.
|
||||
|
||||
GCE itself does not know anything about these IPs, though, so it will not NAT
|
||||
them for outbound internet traffic. To achieve that we use an iptables rule to
|
||||
masquerade (aka SNAT - to make it seem as if packets came from the `Node`
|
||||
itself) traffic that is bound for IPs outside the GCE project network
|
||||
(10.0.0.0/8).
|
||||
|
||||
```shell
|
||||
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE
|
||||
```
|
||||
|
||||
Lastly we enable IP forwarding in the kernel (so the kernel will process
|
||||
packets for bridged containers):
|
||||
|
||||
```shell
|
||||
sysctl net.ipv4.ip_forward=1
|
||||
```
|
||||
|
||||
The result of all this is that all `Pods` can reach each other and can egress
|
||||
traffic to the internet.
|
||||
|
||||
### L2 networks and linux bridging
|
||||
|
||||
If you have a "dumb" L2 network, such as a simple switch in a "bare-metal"
|
||||
environment, you should be able to do something similar to the above GCE setup.
|
||||
Note that these instructions have only been tried very casually - it seems to
|
||||
work, but has not been thoroughly tested. If you use this technique and
|
||||
perfect the process, please let us know.
|
||||
|
||||
Follow the "With Linux Bridge devices" section of [this very nice
|
||||
tutorial](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/) from
|
||||
Lars Kellogg-Stedman.
|
||||
|
||||
### Nuage Networks VCS (Virtualized Cloud Services)
|
||||
|
||||
[Nuage](http://www.nuagenetworks.net) provides a highly scalable policy-based Software-Defined Networking (SDN) platform. Nuage uses the open source Open vSwitch for the data plane along with a feature rich SDN Controller built on open standards.
|
||||
|
||||
The Nuage platform uses overlays to provide seamless policy-based networking between Kubernetes Pods and non-Kubernetes environments (VMs and bare metal servers). Nuage's policy abstraction model is designed with applications in mind and makes it easy to declare fine-grained policies for applications. The platform's real-time analytics engine enables visibility and security monitoring for Kubernetes applications.
|
||||
|
||||
### OpenVSwitch
|
||||
|
||||
[OpenVSwitch](/docs/admin/ovs-networking) is a somewhat more mature but also
|
||||
complicated way to build an overlay network. This is endorsed by several of the
|
||||
"Big Shops" for networking.
|
||||
|
||||
### OVN (Open Virtual Networking)
|
||||
|
||||
OVN is an open source network virtualization solution developed by the
|
||||
Open vSwitch community. It lets one create logical switches, logical routers,
|
||||
stateful ACLs, load-balancers, etc., to build different virtual networking
|
||||
topologies. The project has a specific Kubernetes plugin and documentation
|
||||
at [ovn-kubernetes](https://github.com/openvswitch/ovn-kubernetes).
|
||||
|
||||
### Project Calico
|
||||
|
||||
[Project Calico](http://docs.projectcalico.org/) is an open source container networking provider and network policy engine.
|
||||
|
||||
Calico provides a highly scalable networking and network policy solution for connecting Kubernetes pods based on the same IP networking principles as the internet. Calico can be deployed without encapsulation or overlays to provide high-performance, high-scale data center networking. Calico also provides fine-grained, intent based network security policy for Kubernetes pods via its distributed firewall.
|
||||
|
||||
Calico can also be run in policy enforcement mode in conjunction with other networking solutions such as Flannel, aka [canal](https://github.com/tigera/canal), or native GCE networking.
|
||||
|
||||
### Romana
|
||||
|
||||
[Romana](http://romana.io) is an open source network and security automation solution that lets you deploy Kubernetes without an overlay network. Romana supports Kubernetes [Network Policy](/docs/user-guide/networkpolicies/) to provide isolation across network namespaces.
|
||||
|
||||
### Weave Net from Weaveworks
|
||||
|
||||
[Weave Net](https://www.weave.works/products/weave-net/) is a
|
||||
resilient and simple to use network for Kubernetes and its hosted applications.
|
||||
Weave Net runs as a [CNI plug-in](https://www.weave.works/docs/net/latest/cni-plugin/)
|
||||
or stand-alone. In either version, it doesn't require any configuration or extra code
|
||||
to run, and in both cases, the network provides one IP address per pod - as is standard for Kubernetes.
|
||||
|
||||
## Other reading
|
||||
|
||||
The early design of the networking model and its rationale, and some future
|
||||
plans are described in more detail in the [networking design
|
||||
document](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/networking.md).
|
||||
[Cluster Networking](/docs/concepts/cluster-administration/networking/)
|
||||
|
|
|
@ -5,244 +5,6 @@ assignees:
|
|||
title: Monitoring Node Health
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
{% include user-guide-content-moved.md %}
|
||||
|
||||
## Node Problem Detector
|
||||
|
||||
*Node problem detector* is a [DaemonSet](/docs/admin/daemons/) monitoring the
|
||||
node health. It collects node problems from various daemons and reports them
|
||||
to the apiserver as [NodeCondition](/docs/admin/node/#node-condition) and
|
||||
[Event](/docs/api-reference/v1/definitions/#_v1_event).
|
||||
|
||||
It supports detection of some known kernel issues now, and will detect more and
|
||||
more node problems over time.
|
||||
|
||||
Currently Kubernetes won't take any action on the node conditions and events
|
||||
generated by node problem detector. In the future, a remedy system could be
|
||||
introduced to deal with node problems.
|
||||
|
||||
See more information
|
||||
[here](https://github.com/kubernetes/node-problem-detector).
|
||||
|
||||
## Limitations
|
||||
|
||||
* The kernel issue detection of node problem detector only supports file-based kernel logs now. It doesn't support log tools like journald.
|
||||
|
||||
* The kernel issue detection of node problem detector makes assumptions about the kernel log format, and currently only works on Ubuntu and Debian. However, it is easy to extend it to [support other log formats](/docs/admin/node-problem/#support-other-log-format).
|
||||
|
||||
## Enable/Disable in GCE cluster
|
||||
|
||||
Node problem detector [runs as a cluster addon](cluster-large.md/#addon-resources) that is enabled by default in GCE clusters.
|
||||
|
||||
You can enable/disable it by setting the environment variable
|
||||
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before `kube-up.sh`.
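For example (an illustrative invocation from a Kubernetes source checkout):

```shell
KUBE_ENABLE_NODE_PROBLEM_DETECTOR=false ./cluster/kube-up.sh   # bring up a cluster with the addon disabled
```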
|
||||
|
||||
## Use in Other Environment
|
||||
|
||||
To enable node problem detector in environments other than GCE, you can use either `kubectl` or an addon pod.
|
||||
|
||||
### Kubectl
|
||||
|
||||
This is the recommended way to start node problem detector outside of GCE. It
|
||||
provides more flexible management, such as overwriting the default
|
||||
configuration to fit it into your environment or to detect
|
||||
customized node problems.
|
||||
|
||||
* **Step 1:** Create `node-problem-detector.yaml`:
|
||||
|
||||
```yaml
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: node-problem-detector-v0.1
|
||||
namespace: kube-system
|
||||
labels:
|
||||
k8s-app: node-problem-detector
|
||||
version: v0.1
|
||||
kubernetes.io/cluster-service: "true"
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
k8s-app: node-problem-detector
|
||||
version: v0.1
|
||||
kubernetes.io/cluster-service: "true"
|
||||
spec:
|
||||
hostNetwork: true
|
||||
containers:
|
||||
- name: node-problem-detector
|
||||
image: gcr.io/google_containers/node-problem-detector:v0.1
|
||||
securityContext:
|
||||
privileged: true
|
||||
resources:
|
||||
limits:
|
||||
cpu: "200m"
|
||||
memory: "100Mi"
|
||||
requests:
|
||||
cpu: "20m"
|
||||
memory: "20Mi"
|
||||
volumeMounts:
|
||||
- name: log
|
||||
mountPath: /log
|
||||
readOnly: true
|
||||
volumes:
|
||||
- name: log
|
||||
hostPath:
|
||||
path: /var/log/
|
||||
```
|
||||
|
||||
***Notice that you should make sure the system log directory is right for your
|
||||
OS distro.***
|
||||
|
||||
* **Step 2:** Start node problem detector with `kubectl`:
|
||||
|
||||
```shell
|
||||
kubectl create -f node-problem-detector.yaml
|
||||
```
|
||||
|
||||
### Addon Pod
|
||||
|
||||
This is for those who have their own cluster bootstrap solution, and don't need
|
||||
to overwrite the default configuration. They could leverage the addon pod to
|
||||
further automate the deployment.
|
||||
|
||||
Just create `node-problem-detector.yaml`, and put it under the addon pods directory
|
||||
`/etc/kubernetes/addons/node-problem-detector` on the master node.
|
||||
|
||||
## Overwrite the Configuration
|
||||
|
||||
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
|
||||
is embedded when building the Docker image of node problem detector.
|
||||
|
||||
However, you can use [ConfigMap](/docs/user-guide/configmap/) to overwrite it
|
||||
following the steps:
|
||||
|
||||
* **Step 1:** Change the config files in `config/`.
|
||||
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
|
||||
node-problem-detector-config --from-file=config/`.
|
||||
* **Step 3:** Change the `node-problem-detector.yaml` to use the ConfigMap:
|
||||
|
||||
```yaml
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: node-problem-detector-v0.1
|
||||
namespace: kube-system
|
||||
labels:
|
||||
k8s-app: node-problem-detector
|
||||
version: v0.1
|
||||
kubernetes.io/cluster-service: "true"
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
k8s-app: node-problem-detector
|
||||
version: v0.1
|
||||
kubernetes.io/cluster-service: "true"
|
||||
spec:
|
||||
hostNetwork: true
|
||||
containers:
|
||||
- name: node-problem-detector
|
||||
image: gcr.io/google_containers/node-problem-detector:v0.1
|
||||
securityContext:
|
||||
privileged: true
|
||||
resources:
|
||||
limits:
|
||||
cpu: "200m"
|
||||
memory: "100Mi"
|
||||
requests:
|
||||
cpu: "20m"
|
||||
memory: "20Mi"
|
||||
volumeMounts:
|
||||
- name: log
|
||||
mountPath: /log
|
||||
readOnly: true
|
||||
- name: config # Overwrite the config/ directory with ConfigMap volume
|
||||
mountPath: /config
|
||||
readOnly: true
|
||||
volumes:
|
||||
- name: log
|
||||
hostPath:
|
||||
path: /var/log/
|
||||
- name: config # Define ConfigMap volume
|
||||
configMap:
|
||||
name: node-problem-detector-config
|
||||
```
|
||||
|
||||
* **Step 4:** Re-create the node problem detector with the new yaml file:
|
||||
|
||||
```shell
|
||||
kubectl delete -f node-problem-detector.yaml # If you have a node-problem-detector running
|
||||
kubectl create -f node-problem-detector.yaml
|
||||
```
|
||||
|
||||
***Notice that this approach only applies to node problem detector started with `kubectl`.***
|
||||
|
||||
For node problem detector running as cluster addon, because addon manager doesn't support
|
||||
ConfigMap, configuration overwriting is not supported now.
|
||||
|
||||
## Kernel Monitor
|
||||
|
||||
*Kernel Monitor* is a problem daemon in node problem detector. It monitors kernel log
|
||||
and detects known kernel issues following predefined rules.
|
||||
|
||||
The Kernel Monitor matches kernel issues according to a set of predefined rule list in
|
||||
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
|
||||
The rule list is extensible, and you can always extend it by [overwriting the
|
||||
configuration](/docs/admin/node-problem/#overwrite-the-configuration).
|
||||
|
||||
### Add New NodeConditions
|
||||
|
||||
To support new node conditions, you can extend the `conditions` field in
|
||||
`config/kernel-monitor.json` with new condition definition:
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "NodeConditionType",
|
||||
"reason": "CamelCaseDefaultNodeConditionReason",
|
||||
"message": "arbitrary default node condition message"
|
||||
}
|
||||
```
|
||||
|
||||
### Detect New Problems
|
||||
|
||||
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
|
||||
with new rule definition:
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "temporary/permanent",
|
||||
"condition": "NodeConditionOfPermanentIssue",
|
||||
"reason": "CamelCaseShortReason",
|
||||
"message": "regexp matching the issue in the kernel log"
|
||||
}
|
||||
```
|
||||
|
||||
### Change Log Path
|
||||
|
||||
Kernel log in different OS distros may locate in different path. The `log`
|
||||
field in `config/kernel-monitor.json` is the log path inside the container.
|
||||
You can always configure it to match your OS distro.
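
Before editing the `log` field, it can help to check where your distro actually writes kernel messages (a hedged sketch; the paths below are common defaults and may not all exist on your node):

```shell
# Look for the kernel log on the node; use whichever path exists on your distro
ls -l /var/log/kern.log /var/log/messages /var/log/syslog 2>/dev/null
```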

### Support Other Log Format

Kernel monitor uses a [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate the kernel log into its internal data structure. It is easy to
implement a new translator for a new log format.

## Caveats

It is recommended to run the node problem detector in your cluster to monitor
node health. However, you should be aware that this will introduce extra
resource overhead on each node. Usually this is fine, because:

* The kernel log is generated relatively slowly.
* A resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable.
  (see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))

[Monitoring Node Health](/docs/tasks/debug-application-cluster/monitor-node-health/)

@@ -6,363 +6,6 @@ assignees:
title: Configuring Out Of Resource Handling
---

* TOC
{:toc}

{% include user-guide-content-moved.md %}

The `kubelet` needs to preserve node stability when available compute resources are low.

This is especially important when dealing with incompressible resources such as memory or disk.

If either resource is exhausted, the node would become unstable.

## Eviction Policy

The `kubelet` can pro-actively monitor for and prevent total starvation of a compute resource. In those cases, the `kubelet` can pro-actively fail one or more pods in order to reclaim
the starved resource. When the `kubelet` fails a pod, it terminates all containers in the pod, and the `PodPhase`
is transitioned to `Failed`.

### Eviction Signals

The `kubelet` supports the ability to trigger eviction decisions on the signals described in the
table below. The value of each signal is described in the description column based on the `kubelet`
summary API.

| Eviction Signal      | Description |
|----------------------|-----------------------------------------------------------------------|
| `memory.available`   | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
| `nodefs.available`   | `nodefs.available` := `node.stats.fs.available` |
| `nodefs.inodesFree`  | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
| `imagefs.available`  | `imagefs.available` := `node.stats.runtime.imagefs.available` |
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |

Each of the above signals supports either a literal or percentage-based value. The percentage-based value
is calculated relative to the total capacity associated with each signal.

`kubelet` supports only two filesystem partitions.

1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.

`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor. `kubelet` does not care about any
other filesystems. Any other types of configurations are not currently supported by the kubelet. For example, it is
*not OK* to store volumes and logs in a dedicated filesystem.

In future releases, the `kubelet` will deprecate the existing [garbage collection](/docs/admin/garbage-collection/)
support in favor of eviction in response to disk pressure.

### Eviction Thresholds

The `kubelet` supports the ability to specify eviction thresholds that trigger the `kubelet` to reclaim resources.

Each threshold is of the following form:

`<eviction-signal><operator><quantity>`

* valid `eviction-signal` tokens are as defined above.
* valid `operator` tokens are `<`
* valid `quantity` tokens must match the quantity representation used by Kubernetes
* an eviction threshold can be expressed as a percentage if it ends with a `%` token.

For example, if a node has `10Gi` of memory, and the desire is to induce eviction
if available memory falls below `1Gi`, an eviction threshold can be specified as either
of the following (but not both).

* `memory.available<10%`
* `memory.available<1Gi`

#### Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required
administrator-specified grace period. No action is taken by the `kubelet`
to reclaim resources associated with the eviction signal until that grace
period has been exceeded. If no grace period is provided, the `kubelet` will
error on startup.

In addition, if a soft eviction threshold has been met, an operator can
specify a maximum allowed pod termination grace period to use when evicting
pods from the node. If specified, the `kubelet` will use the lesser value among
the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period.
If not specified, the `kubelet` will kill pods immediately with no graceful
termination.

To configure soft eviction thresholds, the following flags are supported (see the example after this list):

* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a
corresponding grace period would trigger a pod eviction.
* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that
correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating
pods in response to a soft eviction threshold being met.
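
Putting the three flags together, a soft memory threshold might look like the following (an illustrative kubelet invocation, not a recommended set of values):

```shell
# Evict after memory.available has been below 1.5Gi for 1m30s,
# allowing evicted pods at most 30s of graceful termination
kubelet --eviction-soft=memory.available<1.5Gi \
        --eviction-soft-grace-period=memory.available=1m30s \
        --eviction-max-pod-grace-period=30
```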

#### Hard Eviction Thresholds

A hard eviction threshold has no grace period, and if observed, the `kubelet`
will take immediate action to reclaim the associated starved resource. If a
hard eviction threshold is met, the `kubelet` will kill the pod immediately
with no graceful termination.

To configure hard eviction thresholds, the following flag is supported:

* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met
would trigger a pod eviction.

The `kubelet` has the following default hard eviction threshold:

* `--eviction-hard=memory.available<100Mi`

### Eviction Monitoring Interval

The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.

* `housekeeping-interval` is the interval between container housekeepings.

### Node Conditions

The `kubelet` will map one or more eviction signals to a corresponding node condition.

If a hard eviction threshold has been met, or a soft eviction threshold has been met
independent of its associated grace period, the `kubelet` will report a condition that
reflects the node is under pressure.

The following node conditions are defined that correspond to the specified eviction signal.

| Node Condition   | Eviction Signal | Description |
|------------------|-----------------|--------------------------------------------|
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
| `DiskPressure`   | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |

The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency`, which defaults to `10s`.

### Oscillation of node conditions

If a node oscillates above and below a soft eviction threshold without exceeding
its associated grace period, the corresponding node condition would
constantly oscillate between true and false, which could cause poor scheduling decisions
as a consequence.

To protect against this oscillation, the following flag is defined to control how
long the `kubelet` must wait before transitioning out of a pressure condition.

* `eviction-pressure-transition-period` is the duration for which the `kubelet` has
to wait before transitioning out of an eviction pressure condition.

The `kubelet` would ensure that it has not observed an eviction threshold being met
for the specified pressure condition for the period specified before toggling the
condition back to `false`.
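
For example (an illustrative value, not a recommendation), to require five minutes below the threshold before clearing a pressure condition:

```shell
# Keep reporting MemoryPressure/DiskPressure for at least 5 minutes
# after the last time an eviction threshold was observed
kubelet --eviction-pressure-transition-period=5m0s
```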

### Reclaiming node level resources

If an eviction threshold has been met and the grace period has passed,
the `kubelet` will initiate the process of reclaiming the pressured resource
until it has observed the signal has gone below its defined threshold.

The `kubelet` attempts to reclaim node level resources prior to evicting end-user pods. If
disk pressure is observed, the `kubelet` reclaims node level resources differently if the
machine has a dedicated `imagefs` configured for the container runtime.

#### With Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete dead pods/containers

If the `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete all unused images

#### Without Imagefs

If the `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:

1. Delete dead pods/containers
1. Delete all unused images

### Evicting end-user pods

If the `kubelet` is unable to reclaim sufficient resource on the node,
it will begin evicting pods.

The `kubelet` ranks pods for eviction as follows:

* by their quality of service
* by the consumption of the starved compute resource relative to the pod's scheduling request.

As a result, pod eviction occurs in the following order:

* `BestEffort` pods that consume the most of the starved resource are failed
first.
* `Burstable` pods that consume the greatest amount of the starved resource
relative to their request for that resource are killed first. If no pod
has exceeded its request, the strategy targets the largest consumer of the
starved resource.
* `Guaranteed` pods that consume the greatest amount of the starved resource
relative to their request are killed first. If no pod has exceeded its request,
the strategy targets the largest consumer of the starved resource.

A `Guaranteed` pod is guaranteed to never be evicted because of another pod's
resource consumption. If a system daemon (i.e. `kubelet`, `docker`, `journald`, etc.)
is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations,
and the node only has `Guaranteed` pod(s) remaining, then the node must choose to evict a
`Guaranteed` pod in order to preserve node stability, and to limit the impact
of the unexpected consumption to other `Guaranteed` pod(s).

Local disk is a `BestEffort` resource. If necessary, `kubelet` will evict pods one at a time to reclaim
disk when `DiskPressure` is encountered. The `kubelet` will rank pods by quality of service. If the `kubelet`
is responding to `inode` starvation, it will reclaim `inodes` by evicting pods with the lowest quality of service
first. If the `kubelet` is responding to lack of available disk, it will rank pods within a quality of service
class by which consumes the largest amount of disk and evict those first.

#### With Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on their usage on `nodefs`
- local volumes + logs of all their containers.

If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all their containers.

#### Without Imagefs

If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage
- local volumes + logs & writable layer of all their containers.

### Minimum eviction reclaim

In certain scenarios, eviction of pods could result in reclamation of only a small amount of resources. This can result in
the `kubelet` hitting eviction thresholds in repeated succession. In addition, reclaiming resources like `disk`
is time consuming.

To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
resource pressure, `kubelet` will attempt to reclaim at least `minimum-reclaim` amount of resource below
the configured eviction threshold.

For example, with the following configuration:

```
--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
```

If an eviction threshold is triggered for `memory.available`, the `kubelet` will work to ensure
that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` will work
to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it will
work to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
on their associated resources.

The default `eviction-minimum-reclaim` is `0` for all resources.

### Scheduler

The node will report a condition when a compute resource is under pressure. The
scheduler views that condition as a signal to dissuade placing additional
pods on the node.

| Node Condition   | Scheduler Behavior |
| ---------------- | ------------------------------------------------ |
| `MemoryPressure` | No new `BestEffort` pods are scheduled to the node. |
| `DiskPressure`   | No new pods are scheduled to the node. |

## Node OOM Behavior

If the node experiences a system OOM (out of memory) event before the `kubelet` is able to reclaim memory,
the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respond.

The `kubelet` sets an `oom_score_adj` value for each container based on the quality of service for the pod.

| Quality of Service | oom_score_adj |
|--------------------|-----------------------------------------------------------------------|
| `Guaranteed`       | -998 |
| `BestEffort`       | 1000 |
| `Burstable`        | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |

If the `kubelet` is unable to reclaim memory before the node experiences a system OOM, the `oom_killer` will calculate
an `oom_score` based on the percentage of memory each container is using on the node, add the `oom_score_adj` to get an
effective `oom_score` for the container, and then kill the container with the highest score.

The intended behavior is that containers with the lowest quality of service that
are consuming the largest amount of memory relative to the scheduling request should be killed first in order
to reclaim memory.

Unlike pod eviction, if a pod container is OOM killed, it may be restarted by the `kubelet` based on its `RestartPolicy`.

## Best Practices

### Schedulable resources and eviction policies

Let's imagine the following scenario:

* Node memory capacity: `10Gi`
* Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
* Operator wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.

To facilitate this scenario, the `kubelet` would be launched as follows:

```
--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi
```

Implicit in this configuration is the understanding that "System reserved" should include the amount of memory
covered by the eviction threshold.

To reach that capacity, either some pod is using more than its request, or the system is using more than `500Mi`.

This configuration ensures that the scheduler does not place pods on a node that would immediately induce memory pressure
and trigger eviction, assuming those pods use less than their configured request.
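
Under these flags the scheduler works against the node's allocatable resources rather than its raw capacity; a quick sanity check (the node name is a placeholder, and the exact figure depends on how reservations are configured) is to look at what the node advertises:

```shell
# Roughly 10Gi capacity minus the 1.5Gi system reservation should show up under Allocatable
kubectl describe node my-node1 | grep -A 6 Allocatable
```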

### DaemonSet

It is never desired for the `kubelet` to evict a pod that was derived from
a `DaemonSet`, since the pod will immediately be recreated and rescheduled
back to the same node.

At the moment, the `kubelet` has no ability to distinguish a pod created
from a `DaemonSet` versus any other object. If/when that information is
available, the `kubelet` could pro-actively filter those pods from the
candidate set of pods provided to the eviction strategy.

In general, it is strongly recommended that a `DaemonSet` not
create `BestEffort` pods to avoid being identified as a candidate pod
for eviction. Instead, a `DaemonSet` should ideally launch `Guaranteed` pods.

## Deprecation of existing feature flags to reclaim disk

`kubelet` has been freeing up disk space on demand to keep the node stable.

As disk based eviction matures, the following `kubelet` flags will be marked for deprecation
in favor of the simpler configuration supported around eviction.

| Existing Flag | New Flag |
| ------------- | -------- |
| `--image-gc-high-threshold` | `--eviction-hard` or `--eviction-soft` |
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
| `--maximum-dead-containers` | deprecated |
| `--maximum-dead-containers-per-container` | deprecated |
| `--minimum-container-ttl-duration` | deprecated |
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `--eviction-soft` |
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |

## Known issues

### kubelet may not observe memory pressure right away

The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
increases rapidly within that window, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
will still be invoked. We intend to integrate with the `memcg` notification API in a future release to reduce this
latency, and instead have the kernel tell us immediately when a threshold has been crossed.

If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
to prevent system OOMs, and promotes eviction of workloads so cluster state can rebalance.

### kubelet may evict more pods than needed

The pod eviction may evict more pods than needed due to a stats collection timing gap. This can be mitigated by adding
the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future.

### How kubelet ranks pods for eviction in response to inode exhaustion

At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cadvisor
to track per-container inode consumption (https://github.com/google/cadvisor/issues/1422), which would allow us to rank pods
by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict
that pod over others.

[Configuring Out of Resource Handling](/docs/concepts/cluster-administration/out-of-resource/)

@@ -6,52 +6,6 @@ assignees:
title: Guaranteed Scheduling For Critical Add-On Pods
---

* TOC
{:toc}

{% include user-guide-content-moved.md %}

## Overview

In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on a master machine,
there are a number of add-ons which, for various reasons, must run on a regular cluster node (rather than the Kubernetes master).
Some of these add-ons are critical to a fully functional cluster, such as Heapster, DNS, and UI.
A cluster may stop working properly if a critical add-on is evicted (either manually or as a side effect of another operation like an upgrade)
and becomes pending (for example when the cluster is highly utilized and either there are other pending pods that schedule into the space
vacated by the evicted critical add-on pod or the amount of resources available on the node changed for some other reason).

## Rescheduler: guaranteed scheduling of critical add-ons

The rescheduler ensures that critical add-ons are always scheduled
(assuming the cluster has enough resources to run the critical add-on pods in the absence of regular pods).
If the scheduler determines that no node has enough free resources to run the critical add-on pod
given the pods that are already running in the cluster
(indicated by the critical add-on pod's pod condition PodScheduled set to false, with the reason set to Unschedulable),
the rescheduler tries to free up space for the add-on by evicting some pods; then the scheduler will schedule the add-on pod.

To avoid the situation when another pod is scheduled into the space prepared for the critical add-on,
the chosen node gets a temporary taint "CriticalAddonsOnly" before the eviction(s)
(see [more details](https://github.com/kubernetes/kubernetes/blob/master/docs/design/taint-toleration-dedicated.md)).
Each critical add-on has to tolerate it,
while the other pods shouldn't tolerate the taint. The taint is removed once the add-on is successfully scheduled.

*Warning:* currently there is no guarantee which node is chosen and which pods are being killed
in order to schedule critical pods, so if the rescheduler is enabled your pods might occasionally be
killed for this purpose.

## Config

The rescheduler doesn't have any user-facing configuration (component config) or API.
It's enabled by default. It can be disabled:

* during cluster setup by setting the `ENABLE_RESCHEDULER` flag to `false`
* on a running cluster by deleting its manifest from the master node
(default path `/etc/kubernetes/manifests/rescheduler.manifest`)

### Marking add-on as critical

To be critical, an add-on has to run in the `kube-system` namespace (configurable via flag)
and have the following annotations specified:

* `scheduler.alpha.kubernetes.io/critical-pod` set to empty string
* `scheduler.alpha.kubernetes.io/tolerations` set to `[{"key":"CriticalAddonsOnly", "operator":"Exists"}]`

The first one marks a pod as critical. The second one is required by the rescheduler algorithm.
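
Concretely, a critical add-on pod carries both annotations in its metadata; a minimal sketch (the pod name and image are placeholders, the annotation keys and values come from the list above):

```shell
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: my-critical-addon            # placeholder name
  namespace: kube-system
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
    scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
spec:
  containers:
  - name: addon
    image: nginx                     # placeholder image
EOF
```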

[Guaranteed Scheduling for Critical Add-On Pods](/docs/concepts/cluster-administration/guaranteed-scheduling-critical-addon-pods/)

@@ -4,237 +4,6 @@ assignees:
title: Resource Quotas
---

When several users or teams share a cluster with a fixed number of nodes,
there is a concern that one team could use more than its fair share of resources.

{% include user-guide-content-moved.md %}

Resource quotas are a tool for administrators to address this concern.

A resource quota, defined by a `ResourceQuota` object, provides constraints that limit
aggregate resource consumption per namespace. It can limit the quantity of objects that can
be created in a namespace by type, as well as the total amount of compute resources that may
be consumed by resources in that namespace.

Resource quotas work like this:

- Different teams work in different namespaces. Currently this is voluntary, but
support for making this mandatory via ACLs is planned.
- The administrator creates one or more Resource Quota objects for each namespace.
- Users create resources (pods, services, etc.) in the namespace, and the quota system
tracks usage to ensure it does not exceed hard resource limits defined in a Resource Quota.
- If creating or updating a resource violates a quota constraint, the request will fail with HTTP
status code `403 FORBIDDEN` with a message explaining the constraint that would have been violated.
- If quota is enabled in a namespace for compute resources like `cpu` and `memory`, users must specify
requests or limits for those values; otherwise, the quota system may reject pod creation. Hint: Use
the LimitRange admission controller to force defaults for pods that make no compute resource requirements.
See the [walkthrough](/docs/admin/resourcequota/walkthrough/) for an example of how to avoid this problem.

Examples of policies that could be created using namespaces and quotas are:

- In a cluster with a capacity of 32 GiB RAM and 16 cores, let team A use 20 GiB and 10 cores,
let B use 10 GiB and 4 cores, and hold 2 GiB and 2 cores in reserve for future allocation.
- Limit the "testing" namespace to using 1 core and 1 GiB RAM. Let the "production" namespace
use any amount.

In the case where the total capacity of the cluster is less than the sum of the quotas of the namespaces,
there may be contention for resources. This is handled on a first-come-first-served basis.

Neither contention nor changes to quota will affect already created resources.

## Enabling Resource Quota

Resource Quota support is enabled by default for many Kubernetes distributions. It is
enabled when the apiserver `--admission-control=` flag has `ResourceQuota` as
one of its arguments.
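
For example, an apiserver enabling this admission controller might be started as follows (an illustrative flag value; the exact list and ordering of admission controllers depends on your cluster setup):

```shell
# Other apiserver flags omitted; ResourceQuota is typically listed last in the chain
kube-apiserver --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota
```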

Resource Quota is enforced in a particular namespace when there is a
`ResourceQuota` object in that namespace. There should be at most one
`ResourceQuota` object in a namespace.

## Compute Resource Quota

You can limit the total sum of [compute resources](/docs/user-guide/compute-resources) that can be requested in a given namespace.

The following resource types are supported:

| Resource Name | Description |
| --------------------- | ----------------------------------------------------------- |
| `cpu` | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value. |
| `limits.cpu` | Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value. |
| `limits.memory` | Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value. |
| `memory` | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value. |
| `requests.cpu` | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value. |
| `requests.memory` | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value. |

## Storage Resource Quota

You can limit the total sum of [storage resources](/docs/user-guide/persistent-volumes) that can be requested in a given namespace.

In addition, you can limit consumption of storage resources based on the associated storage class.

| Resource Name | Description |
| --------------------- | ----------------------------------------------------------- |
| `requests.storage` | Across all persistent volume claims, the sum of storage requests cannot exceed this value. |
| `persistentvolumeclaims` | The total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
| `<storage-class-name>.storageclass.storage.k8s.io/requests.storage` | Across all persistent volume claims associated with the storage-class-name, the sum of storage requests cannot exceed this value. |
| `<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims` | Across all persistent volume claims associated with the storage-class-name, the total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |

For example, if an operator wants to quota storage with the `gold` storage class separately from the `bronze` storage class, the operator can
define a quota as follows (a full object sketch appears after these bullets):

* `gold.storageclass.storage.k8s.io/requests.storage: 500Gi`
* `bronze.storageclass.storage.k8s.io/requests.storage: 100Gi`
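
Such a quota could be created like this (a sketch; the namespace and object name are placeholders, and the two resource names come from the bullets above):

```shell
cat <<EOF | kubectl create -f - --namespace=myspace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-by-class
spec:
  hard:
    gold.storageclass.storage.k8s.io/requests.storage: 500Gi
    bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
EOF
```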

## Object Count Quota

The number of objects of a given type can be restricted. The following types
are supported:

| Resource Name | Description |
| ------------------------------- | ------------------------------------------------- |
| `configmaps` | The total number of config maps that can exist in the namespace. |
| `persistentvolumeclaims` | The total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
| `pods` | The total number of pods in a non-terminal state that can exist in the namespace. A pod is in a terminal state if `status.phase in (Failed, Succeeded)` is true. |
| `replicationcontrollers` | The total number of replication controllers that can exist in the namespace. |
| `resourcequotas` | The total number of [resource quotas](/docs/admin/admission-controllers/#resourcequota) that can exist in the namespace. |
| `services` | The total number of services that can exist in the namespace. |
| `services.loadbalancers` | The total number of services of type load balancer that can exist in the namespace. |
| `services.nodeports` | The total number of services of type node port that can exist in the namespace. |
| `secrets` | The total number of secrets that can exist in the namespace. |

For example, `pods` quota counts and enforces a maximum on the number of `pods`
created in a single namespace.

You might want to set a pods quota on a namespace
to avoid the case where a user creates many small pods and exhausts the cluster's
supply of Pod IPs.

## Quota Scopes

Each quota can have an associated set of scopes. A quota will only measure usage for a resource if it matches
the intersection of enumerated scopes.

When a scope is added to the quota, it limits the number of resources it supports to those that pertain to the scope.
Resources specified on the quota outside of the allowed set result in a validation error.

| Scope | Description |
| ----- | ----------- |
| `Terminating` | Match pods where `spec.activeDeadlineSeconds >= 0` |
| `NotTerminating` | Match pods where `spec.activeDeadlineSeconds is nil` |
| `BestEffort` | Match pods that have best effort quality of service. |
| `NotBestEffort` | Match pods that do not have best effort quality of service. |

The `BestEffort` scope restricts a quota to tracking the following resource: `pods` (an example appears after the list below).

The `Terminating`, `NotTerminating`, and `NotBestEffort` scopes restrict a quota to tracking the following resources:

* `cpu`
* `limits.cpu`
* `limits.memory`
* `memory`
* `pods`
* `requests.cpu`
* `requests.memory`
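
For instance, a quota that only tracks `BestEffort` pods can be declared by adding the scope to its spec (a sketch; the object and namespace names are placeholders):

```shell
cat <<EOF | kubectl create -f - --namespace=myspace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: best-effort
spec:
  hard:
    pods: "10"
  scopes:
  - BestEffort
EOF
```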

## Requests vs Limits

When allocating compute resources, each container may specify a request and a limit value for either CPU or memory.
The quota can be configured to quota either value.

If the quota has a value specified for `requests.cpu` or `requests.memory`, then it requires that every incoming
container make an explicit request for those resources. If the quota has a value specified for `limits.cpu` or `limits.memory`,
then it requires that every incoming container specify an explicit limit for those resources.

## Viewing and Setting Quotas

Kubectl supports creating, updating, and viewing quotas:

```shell
$ kubectl create namespace myspace

$ cat <<EOF > compute-resources.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
EOF
$ kubectl create -f ./compute-resources.yaml --namespace=myspace

$ cat <<EOF > object-counts.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    configmaps: "10"
    persistentvolumeclaims: "4"
    replicationcontrollers: "20"
    secrets: "10"
    services: "10"
    services.loadbalancers: "2"
EOF
$ kubectl create -f ./object-counts.yaml --namespace=myspace

$ kubectl get quota --namespace=myspace
NAME                AGE
compute-resources   30s
object-counts       32s

$ kubectl describe quota compute-resources --namespace=myspace
Name:             compute-resources
Namespace:        myspace
Resource          Used  Hard
--------          ----  ----
limits.cpu        0     2
limits.memory     0     2Gi
pods              0     4
requests.cpu      0     1
requests.memory   0     1Gi

$ kubectl describe quota object-counts --namespace=myspace
Name:                    object-counts
Namespace:               myspace
Resource                 Used  Hard
--------                 ----  ----
configmaps               0     10
persistentvolumeclaims   0     4
replicationcontrollers   0     20
secrets                  1     10
services                 0     10
services.loadbalancers   0     2
```

## Quota and Cluster Capacity

Resource Quota objects are independent of the cluster capacity. They are
expressed in absolute units. So, if you add nodes to your cluster, this does *not*
automatically give each namespace the ability to consume more resources.

Sometimes more complex policies may be desired, such as:

- proportionally divide total cluster resources among several teams.
- allow each tenant to grow resource usage as needed, but have a generous
limit to prevent accidental resource exhaustion.
- detect demand from one namespace, add nodes, and increase quota.

Such policies could be implemented using ResourceQuota as a building block, by
writing a 'controller' which watches the quota usage and adjusts the quota
hard limits of each namespace according to other signals.

Note that resource quota divides up aggregate cluster resources, but it creates no
restrictions around nodes: pods from several namespaces may run on the same node.

## Example

See a [detailed example for how to use resource quota](/docs/admin/resourcequota/walkthrough/).

## Read More

See the [ResourceQuota design doc](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/admission_control_resource_quota.md) for more information.

[Resource Quotas](/docs/concepts/policy/resource-quotas/)

@@ -4,75 +4,8 @@ assignees:
- janetkuo
title: Limiting Storage Consumption
---

This example demonstrates an easy way to limit the amount of storage consumed in a namespace.

The following resources are used in the demonstration:

{% include user-guide-content-moved.md %}

* [Resource Quota](/docs/admin/resourcequota/)
* [Limit Range](/docs/admin/limitrange/)
* [Persistent Volume Claim](/docs/user-guide/persistent-volumes/)

[Limiting Storage Consumption](/docs/tasks/administer-cluster/limit-storage-consumption/)

This example assumes you have a functional Kubernetes setup.

## Limiting Storage Consumption

The cluster-admin is operating a cluster on behalf of a user population and the admin wants to control
how much storage a single namespace can consume in order to control cost.

The admin would like to limit:

1. The number of persistent volume claims in a namespace
2. The amount of storage each claim can request
3. The amount of cumulative storage the namespace can have

## LimitRange to limit requests for storage

Adding a `LimitRange` to a namespace enforces storage request sizes to a minimum and maximum. Storage is requested
via `PersistentVolumeClaim`. The admission controller that enforces limit ranges will reject any PVC that is above or below
the values set by the admin.

In this example, a PVC requesting 10Gi of storage would be rejected because it exceeds the 2Gi max.

```
apiVersion: v1
kind: LimitRange
metadata:
  name: storagelimits
spec:
  limits:
  - type: PersistentVolumeClaim
    max:
      storage: 2Gi
    min:
      storage: 1Gi
```

Minimum storage requests are used when the underlying storage provider requires certain minimums. For example,
AWS EBS volumes have a 1Gi minimum requirement.

## StorageQuota to limit PVC count and cumulative storage capacity

Admins can limit the number of PVCs in a namespace as well as the cumulative capacity of those PVCs. New PVCs that exceed
either maximum value will be rejected.

In this example, a 6th PVC in the namespace would be rejected because it exceeds the maximum count of 5. Alternatively,
a 5Gi maximum quota, when combined with the 2Gi max limit above, cannot have 3 PVCs where each has 2Gi. That would be 6Gi requested
for a namespace capped at 5Gi.

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storagequota
spec:
  hard:
    persistentvolumeclaims: "5"
    requests.storage: "5Gi"
```
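
Both objects then need to be created in the namespace being limited; a minimal sketch (the file names and the `team-a` namespace are placeholders):

```shell
kubectl create namespace team-a
kubectl create -f storagelimits.yaml --namespace=team-a
kubectl create -f storagequota.yaml --namespace=team-a
```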

## Summary

A limit range can put a ceiling on how much storage is requested, while a resource quota can effectively cap the storage
consumed by a namespace through claim counts and cumulative storage capacity. This allows a cluster-admin to plan their
cluster's storage budget without risk of any one project going over its allotment.

@@ -5,362 +5,6 @@ assignees:
title: Applying Resource Quotas and Limits
---

This example demonstrates a typical setup to control resource usage in a namespace.

{% include user-guide-content-moved.md %}

It demonstrates using the following resources:

* [Namespace](/docs/admin/namespaces)
* [Resource Quota](/docs/admin/resourcequota/)
* [Limit Range](/docs/admin/limitrange/)

This example assumes you have a functional Kubernetes setup.

## Scenario

The cluster-admin is operating a cluster on behalf of a user population and the cluster-admin
wants to control the amount of resources that can be consumed in a particular namespace to promote
fair sharing of the cluster and control cost.

The cluster-admin has the following goals:

* Limit the amount of compute resources for running pods
* Limit the number of persistent volume claims to control access to storage
* Limit the number of load balancers to control cost
* Prevent the use of node ports to preserve scarce resources
* Provide default compute resource requests to enable better scheduling decisions

## Step 1: Create a namespace

This example will work in a custom namespace to demonstrate the concepts involved.

Let's create a new namespace called quota-example:

```shell
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
namespace "quota-example" created
$ kubectl get namespaces
NAME            STATUS    AGE
default         Active    2m
kube-system     Active    2m
quota-example   Active    39s
```

## Step 2: Apply an object-count quota to the namespace

The cluster-admin wants to control the following resources:

* persistent volume claims
* load balancers
* node ports

Let's create a simple quota that controls object counts for those resource types in this namespace.

```shell
$ kubectl create -f docs/admin/resourcequota/object-counts.yaml --namespace=quota-example
resourcequota "object-counts" created
```
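
The contents of `object-counts.yaml` are not reproduced in this walkthrough; a definition equivalent to the hard limits reported by `kubectl describe` below would look roughly like this (a sketch inferred from that output):

```shell
cat <<EOF > object-counts.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    persistentvolumeclaims: "2"
    services.loadbalancers: "2"
    services.nodeports: "0"
EOF
```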

The quota system will observe that a quota has been created, and will calculate consumption
in the namespace in response. This should happen quickly.

Let's describe the quota to see what is currently being consumed in this namespace:

```shell
$ kubectl describe quota object-counts --namespace=quota-example
Name:                    object-counts
Namespace:               quota-example
Resource                 Used  Hard
--------                 ----  ----
persistentvolumeclaims   0     2
services.loadbalancers   0     2
services.nodeports       0     0
```

The quota system will now prevent users from creating more than the specified amount for each resource.

## Step 3: Apply a compute-resource quota to the namespace

To limit the amount of compute resources that can be consumed in this namespace,
let's create a quota that tracks compute resources.

```shell
$ kubectl create -f docs/admin/resourcequota/compute-resources.yaml --namespace=quota-example
resourcequota "compute-resources" created
```

Let's describe the quota to see what is currently being consumed in this namespace:

```shell
$ kubectl describe quota compute-resources --namespace=quota-example
Name:             compute-resources
Namespace:        quota-example
Resource          Used  Hard
--------          ----  ----
limits.cpu        0     2
limits.memory     0     2Gi
pods              0     4
requests.cpu      0     1
requests.memory   0     1Gi
```

The quota system will now prevent the namespace from having more than 4 non-terminal pods. In
addition, it will enforce that each container in a pod makes a `request` and defines a `limit` for
`cpu` and `memory`.

## Step 4: Applying default resource requests and limits

Pod authors rarely specify resource requests and limits for their pods.

Since we applied a quota to our project, let's see what happens when an end-user creates a pod that has unbounded
cpu and memory by creating an nginx container.

To demonstrate, let's create a deployment that runs nginx:

```shell
$ kubectl run nginx --image=nginx --replicas=1 --namespace=quota-example
deployment "nginx" created
```

Now let's look at the pods that were created.

```shell
$ kubectl get pods --namespace=quota-example
```

What happened? I have no pods! Let's describe the deployment to get a view of what is happening.

```shell
$ kubectl describe deployment nginx --namespace=quota-example
Name:                   nginx
Namespace:              quota-example
CreationTimestamp:      Mon, 06 Jun 2016 16:11:37 -0400
Labels:                 run=nginx
Selector:               run=nginx
Replicas:               0 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 1 max surge
OldReplicaSets:         <none>
NewReplicaSet:          nginx-3137573019 (0/1 replicas created)
...
```

A deployment created a corresponding replica set and attempted to size it to create a single pod.

Let's look at the replica set to get more detail.

```shell
$ kubectl describe rs nginx-3137573019 --namespace=quota-example
Name:           nginx-3137573019
Namespace:      quota-example
Image(s):       nginx
Selector:       pod-template-hash=3137573019,run=nginx
Labels:         pod-template-hash=3137573019
                run=nginx
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
  FirstSeen     LastSeen        Count   From                            SubobjectPath   Type            Reason          Message
  ---------     --------        -----   ----                            -------------   --------        ------          -------
  4m            7s              11      {replicaset-controller }                        Warning         FailedCreate    Error creating: pods "nginx-3137573019-" is forbidden: Failed quota: compute-resources: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
```

The Kubernetes API server is rejecting the replica set's requests to create a pod because our pods
do not specify `requests` or `limits` for `cpu` and `memory`.

So let's set some default values for the amount of `cpu` and `memory` a pod can consume:

```shell
$ kubectl create -f docs/admin/resourcequota/limits.yaml --namespace=quota-example
limitrange "limits" created
$ kubectl describe limits limits --namespace=quota-example
Name:           limits
Namespace:      quota-example
Type            Resource        Min     Max     Default Request Default Limit   Max Limit/Request Ratio
----            --------        ---     ---     --------------- -------------   -----------------------
Container       memory          -       -       256Mi           512Mi           -
Container       cpu             -       -       100m            200m            -
```

If the Kubernetes API server observes a request to create a pod in this namespace, and the containers
in that pod do not make any compute resource requests, a default request and default limit will be applied
as part of admission control.
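
The `limits.yaml` file referenced above is likewise not reproduced here; a `LimitRange` matching the defaults shown by `kubectl describe limits` would look roughly like this (a sketch inferred from that output):

```shell
cat <<EOF > limits.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 256Mi
    default:
      cpu: 200m
      memory: 512Mi
EOF
```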
|
||||
|
||||
In this example, each pod created will have compute resources equivalent to the following:
|
||||
|
||||
```shell
|
||||
$ kubectl run nginx \
|
||||
--image=nginx \
|
||||
--replicas=1 \
|
||||
--requests=cpu=100m,memory=256Mi \
|
||||
--limits=cpu=200m,memory=512Mi \
|
||||
--namespace=quota-example
|
||||
```
|
||||
|
||||
Now that we have applied default compute resources for our namespace, our replica set should be able to create
|
||||
its pods.
|
||||
|
||||
```shell
|
||||
$ kubectl get pods --namespace=quota-example
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
nginx-3137573019-fvrig 1/1 Running 0 6m
|
||||
```
|
||||
|
||||
And if we print out our quota usage in the namespace:
|
||||
|
||||
```shell
|
||||
$ kubectl describe quota --namespace=quota-example
|
||||
Name: compute-resources
|
||||
Namespace: quota-example
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
limits.cpu 200m 2
|
||||
limits.memory 512Mi 2Gi
|
||||
pods 1 4
|
||||
requests.cpu 100m 1
|
||||
requests.memory 256Mi 1Gi
|
||||
|
||||
|
||||
Name: object-counts
|
||||
Namespace: quota-example
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
persistentvolumeclaims 0 2
|
||||
services.loadbalancers 0 2
|
||||
services.nodeports 0 0
|
||||
```
|
||||
|
||||
As you can see, the pod that was created is consuming explicit amounts of compute resources, and the usage is being
|
||||
tracked by Kubernetes properly.
|
||||
|
||||
## Step 5: Advanced quota scopes
|
||||
|
||||
Let's imagine you did not want to specify default compute resource consumption in your namespace.
|
||||
|
||||
Instead, you want to let users run a specific number of `BestEffort` pods in their namespace to take
|
||||
advantage of slack compute resources, and then require that users make an explicit resource request for
|
||||
pods that require a higher quality of service.
|
||||
|
||||
Let's create a new namespace with two quotas to demonstrate this behavior:
|
||||
|
||||
```shell
|
||||
$ kubectl create namespace quota-scopes
|
||||
namespace "quota-scopes" created
|
||||
$ kubectl create -f docs/admin/resourcequota/best-effort.yaml --namespace=quota-scopes
|
||||
resourcequota "best-effort" created
|
||||
$ kubectl create -f docs/admin/resourcequota/not-best-effort.yaml --namespace=quota-scopes
|
||||
resourcequota "not-best-effort" created
|
||||
$ kubectl describe quota --namespace=quota-scopes
|
||||
Name: best-effort
|
||||
Namespace: quota-scopes
|
||||
Scopes: BestEffort
|
||||
* Matches all pods that have best effort quality of service.
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
pods 0 10
|
||||
|
||||
|
||||
Name: not-best-effort
|
||||
Namespace: quota-scopes
|
||||
Scopes: NotBestEffort
|
||||
* Matches all pods that do not have best effort quality of service.
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
limits.cpu 0 2
|
||||
limits.memory 0 2Gi
|
||||
pods 0 4
|
||||
requests.cpu 0 1
|
||||
requests.memory 0 1Gi
|
||||
```
|
||||
|
||||
In this scenario, a pod that makes no compute resource requests will be tracked by the `best-effort` quota.
|
||||
|
||||
A pod that does make compute resource requests will be tracked by the `not-best-effort` quota.
|
||||
|
||||
Let's demonstrate this by creating two deployments:
|
||||
|
||||
```shell
|
||||
$ kubectl run best-effort-nginx --image=nginx --replicas=8 --namespace=quota-scopes
|
||||
deployment "best-effort-nginx" created
|
||||
$ kubectl run not-best-effort-nginx \
|
||||
--image=nginx \
|
||||
--replicas=2 \
|
||||
--requests=cpu=100m,memory=256Mi \
|
||||
--limits=cpu=200m,memory=512Mi \
|
||||
--namespace=quota-scopes
|
||||
deployment "not-best-effort-nginx" created
|
||||
```
|
||||
|
||||
Even though no default limits were specified, the `best-effort-nginx` deployment will create
|
||||
all 8 pods. This is because it is tracked by the `best-effort` quota, and the `not-best-effort`
|
||||
quota will just ignore it. The `not-best-effort` quota will track the `not-best-effort-nginx`
|
||||
deployment since it creates pods with `Burstable` quality of service.
|
||||
|
||||
Let's list the pods in the namespace:
|
||||
|
||||
```shell
|
||||
$ kubectl get pods --namespace=quota-scopes
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
best-effort-nginx-3488455095-2qb41 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-3go7n 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-9o2xg 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-eyg40 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-gcs3v 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-rq8p1 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-udhhd 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-zmk12 1/1 Running 0 51s
|
||||
not-best-effort-nginx-2204666826-7sl61 1/1 Running 0 23s
|
||||
not-best-effort-nginx-2204666826-ke746 1/1 Running 0 23s
|
||||
```
|
||||
|
||||
As you can see, all 10 pods have been allowed to be created.
|
||||
|
||||
Let's describe current quota usage in the namespace:
|
||||
|
||||
```shell
|
||||
$ kubectl describe quota --namespace=quota-scopes
|
||||
Name: best-effort
|
||||
Namespace: quota-scopes
|
||||
Scopes: BestEffort
|
||||
* Matches all pods that have best effort quality of service.
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
pods 8 10
|
||||
|
||||
|
||||
Name: not-best-effort
|
||||
Namespace: quota-scopes
|
||||
Scopes: NotBestEffort
|
||||
* Matches all pods that do not have best effort quality of service.
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
limits.cpu 400m 2
|
||||
limits.memory 1Gi 2Gi
|
||||
pods 2 4
|
||||
requests.cpu 200m 1
|
||||
requests.memory 512Mi 1Gi
|
||||
```
|
||||
|
||||
As you can see, the `best-effort` quota has tracked the usage for the 8 pods we created in
|
||||
the `best-effort-nginx` deployment, and the `not-best-effort` quota has tracked the usage for
|
||||
the 2 pods we created in the `not-best-effort-nginx` deployment.
|
||||
|
||||
Scopes provide a mechanism to subdivide the set of resources that are tracked by
|
||||
any quota document to allow greater flexibility in how operators deploy and track resource
|
||||
consumption.
|
||||
|
||||
In addition to `BestEffort` and `NotBestEffort` scopes, there are scopes to restrict
|
||||
long-running versus time-bound pods. The `Terminating` scope will match any pod
|
||||
where `spec.activeDeadlineSeconds` is not nil. The `NotTerminating` scope will match any pod
|
||||
where `spec.activeDeadlineSeconds` is nil. These scopes allow you to quota pods based on their
|
||||
anticipated permanence on a node in your cluster.
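As a minimal sketch, a quota that only counts time-bound pods could be created like this (the quota name and the hard limit are illustrative, not part of the walkthrough above):

```shell
$ cat <<EOF | kubectl create --namespace=quota-scopes -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: terminating-quota        # hypothetical name
spec:
  hard:
    pods: "2"                    # illustrative limit
  scopes:
  - Terminating                  # only pods with spec.activeDeadlineSeconds set are counted
EOF
```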
|
||||
|
||||
## Summary
|
||||
|
||||
Actions that consume node resources such as CPU and memory can be subject to hard quota limits defined by the namespace quota.
|
||||
|
||||
Any action that consumes those resources can be tweaked, or can pick up namespace-level defaults, to meet your end goal.
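Namespace-level defaults are typically provided by a `LimitRange`; a minimal sketch, with illustrative names and values, might look like this:

```shell
$ cat <<EOF | kubectl create --namespace=quota-scopes -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits           # hypothetical name
spec:
  limits:
  - type: Container
    defaultRequest:              # applied when a container omits resource requests
      cpu: 100m
      memory: 128Mi
    default:                     # applied when a container omits resource limits
      cpu: 200m
      memory: 256Mi
EOF
```

A pod that picks up these defaults is no longer best-effort, so it would be tracked by the `not-best-effort` quota rather than the `best-effort` one.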
|
||||
|
||||
Quota can be apportioned based on quality of service and anticipated permanence on a node in your cluster.
|
||||
[Applying Resource Quotas and Limits](/docs/tasks/configure-pod-container/apply-resource-quota-limit/)
|
||||
|
|
|
@ -4,125 +4,6 @@ assignees:
|
|||
title: Static Pods
|
||||
---
|
||||
|
||||
**If you are running clustered Kubernetes and are using static pods to run a pod on every node, you should probably be using a [DaemonSet](/docs/admin/daemons/)!**
|
||||
{% include user-guide-content-moved.md %}
|
||||
|
||||
*Static pods* are managed directly by the kubelet daemon on a specific node, without the API server observing them. They do not have an associated replication controller; the kubelet daemon itself watches each static pod and restarts it if it crashes. No health checks are performed, however. Static pods are always bound to one kubelet daemon and always run on the same node as it.
|
||||
|
||||
The kubelet automatically creates a so-called *mirror pod* on the Kubernetes API server for each static pod, so the pods are visible there, but they cannot be controlled from the API server.
|
||||
|
||||
## Static pod creation
|
||||
|
||||
Static pods can be created in two ways: from configuration file(s) or over HTTP.
|
||||
|
||||
### Configuration files
|
||||
|
||||
The configuration files are just standard pod definitions in JSON or YAML format, placed in a specific directory. Start the kubelet daemon with `kubelet --pod-manifest-path=<the directory>`; it periodically scans the directory and creates or deletes static pods as YAML/JSON files appear or disappear there.
|
||||
|
||||
For example, this is how to start a simple web server as a static pod:
|
||||
|
||||
1. Choose a node where we want to run the static pod. In this example, it's `my-node1`.
|
||||
|
||||
```
|
||||
[joe@host ~] $ ssh my-node1
|
||||
```
|
||||
|
||||
2. Choose a directory, say `/etc/kubelet.d`, and place a web server pod definition there, e.g. `/etc/kubelet.d/static-web.yaml`:
|
||||
|
||||
```
|
||||
[root@my-node1 ~] $ mkdir /etc/kubelet.d/
|
||||
[root@my-node1 ~] $ cat <<EOF >/etc/kubelet.d/static-web.yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: static-web
|
||||
labels:
|
||||
role: myrole
|
||||
spec:
|
||||
containers:
|
||||
- name: web
|
||||
image: nginx
|
||||
ports:
|
||||
- name: web
|
||||
containerPort: 80
|
||||
protocol: TCP
|
||||
EOF
|
||||
```
|
||||
|
||||
3. Configure your kubelet daemon on the node to use this directory by running it with `--pod-manifest-path=/etc/kubelet.d/` argument.
|
||||
On Fedora edit `/etc/kubernetes/kubelet` to include this line:
|
||||
|
||||
```
|
||||
KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --pod-manifest-path=/etc/kubelet.d/"
|
||||
```
|
||||
|
||||
Instructions for other distributions or Kubernetes installations may vary.
|
||||
|
||||
4. Restart kubelet. On Fedora, this is:
|
||||
|
||||
```
|
||||
[root@my-node1 ~] $ systemctl restart kubelet
|
||||
```
|
||||
|
||||
## Pods created via HTTP
|
||||
|
||||
The kubelet periodically downloads a file specified by the `--manifest-url=<URL>` argument and interprets it as a JSON/YAML file containing a pod definition. This works the same way as `--pod-manifest-path=<directory>`: the manifest is re-read periodically, and changes are applied to running static pods (see below).
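For example, the kubelet could be pointed at a remotely hosted manifest like this (the URL is a placeholder and the other kubelet flags are elided):

```shell
$ kubelet --manifest-url=http://my-config-server.example.com/static-web.yaml ...
```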
|
||||
|
||||
## Behavior of static pods
|
||||
|
||||
When the kubelet starts, it automatically starts all pods defined in the directory specified by `--pod-manifest-path=` (or fetched from the `--manifest-url=` location), i.e. our static-web pod. (It may take some time to pull the nginx image, so be patient):
|
||||
|
||||
```shell
|
||||
[joe@my-node1 ~] $ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
|
||||
f6d05272b57e nginx:latest "nginx" 8 minutes ago Up 8 minutes k8s_web.6f802af4_static-web-fk-node1_default_67e24ed9466ba55986d120c867395f3c_378e5f3c
|
||||
```
|
||||
|
||||
If we look at our Kubernetes API server (running on host `my-master`), we see that a new mirror-pod was created there too:
|
||||
|
||||
```shell
|
||||
[joe@host ~] $ ssh my-master
|
||||
[joe@my-master ~] $ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
static-web-my-node1 1/1 Running 0 2m
|
||||
|
||||
```
|
||||
|
||||
Labels from the static pod are propagated into the mirror-pod and can be used as usual for filtering.
|
||||
|
||||
Notice that we cannot delete the pod through the API server (e.g. via the [`kubectl`](/docs/user-guide/kubectl/) command); the kubelet simply won't remove it.
|
||||
|
||||
```shell
|
||||
[joe@my-master ~] $ kubectl delete pod static-web-my-node1
|
||||
pods/static-web-my-node1
|
||||
[joe@my-master ~] $ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
static-web-my-node1 1/1 Running 0 12s
|
||||
|
||||
```
|
||||
|
||||
Back on our `my-node1` host, we can try to stop the container manually and see that the kubelet automatically restarts it after a while:
|
||||
|
||||
```shell
|
||||
[joe@host ~] $ ssh my-node1
|
||||
[joe@my-node1 ~] $ docker stop f6d05272b57e
|
||||
[joe@my-node1 ~] $ sleep 20
|
||||
[joe@my-node1 ~] $ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED ...
|
||||
5b920cbaf8b1 nginx:latest "nginx -g 'daemon of 2 seconds ago ...
|
||||
```
|
||||
|
||||
## Dynamic addition and removal of static pods
|
||||
|
||||
The running kubelet periodically scans the configured directory (`/etc/kubelet.d` in our example) for changes, and adds or removes pods as files appear or disappear in this directory.
|
||||
|
||||
```shell
|
||||
[joe@my-node1 ~] $ mv /etc/kubelet.d/static-web.yaml /tmp
|
||||
[joe@my-node1 ~] $ sleep 20
|
||||
[joe@my-node1 ~] $ docker ps
|
||||
// no nginx container is running
|
||||
[joe@my-node1 ~] $ mv /tmp/static-web.yaml /etc/kubelet.d/
|
||||
[joe@my-node1 ~] $ sleep 20
|
||||
[joe@my-node1 ~] $ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED ...
|
||||
e7a62e3427f1 nginx:latest "nginx -g 'daemon of 27 seconds ago
|
||||
```
|
||||
[Static Pods](/docs/concepts/cluster-administration/static-pod/)
|
||||
|
|
|
@ -4,119 +4,6 @@ assignees:
|
|||
title: Using Sysctls in a Kubernetes Cluster
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
{% include user-guide-content-moved.md %}
|
||||
|
||||
This document describes how sysctls are used within a Kubernetes cluster.
|
||||
|
||||
## What is a Sysctl?
|
||||
|
||||
In Linux, the sysctl interface allows an administrator to modify kernel
|
||||
parameters at runtime. Parameters are available via the `/proc/sys/` virtual
|
||||
process file system. The parameters cover various subsystems such as:
|
||||
|
||||
- kernel (common prefix: `kernel.`)
|
||||
- networking (common prefix: `net.`)
|
||||
- virtual memory (common prefix: `vm.`)
|
||||
- MDADM (common prefix: `dev.`)
|
||||
- More subsystems are described in [Kernel docs](https://www.kernel.org/doc/Documentation/sysctl/README).
|
||||
|
||||
To get a list of all parameters, you can run
|
||||
|
||||
```
|
||||
$ sudo sysctl -a
|
||||
```
|
||||
|
||||
## Namespaced vs. Node-Level Sysctls
|
||||
|
||||
A number of sysctls are _namespaced_ in today's Linux kernels. This means that
|
||||
they can be set independently for each pod on a node. Being namespaced is a
|
||||
requirement for sysctls to be accessible in a pod context within Kubernetes.
|
||||
|
||||
The following sysctls are known to be _namespaced_:
|
||||
|
||||
- `kernel.shm*`,
|
||||
- `kernel.msg*`,
|
||||
- `kernel.sem`,
|
||||
- `fs.mqueue.*`,
|
||||
- `net.*`.
|
||||
|
||||
Sysctls which are not namespaced are called _node-level_ and must be set
|
||||
manually by the cluster admin, either by means of the underlying Linux
|
||||
distribution of the nodes (e.g. via `/etc/sysctl.conf`) or using a DaemonSet
|
||||
with privileged containers.
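A rough sketch of the DaemonSet approach is shown below; the name, image, and the particular sysctl value are assumptions for illustration only.

```shell
$ cat <<EOF | kubectl create --namespace=kube-system -f -
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-sysctl-tuner            # hypothetical name
spec:
  template:
    metadata:
      labels:
        app: node-sysctl-tuner
    spec:
      containers:
      - name: sysctl
        image: busybox               # any image that ships the sysctl binary
        # writes the node-level sysctl, then sleeps; the container restarts and
        # re-applies the setting periodically
        command: ["sh", "-c", "sysctl -w vm.max_map_count=262144 && sleep 3600"]
        securityContext:
          privileged: true           # required to write node-level sysctls
EOF
```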
|
||||
|
||||
**Note**: it is good practice to consider nodes with special sysctl settings as
|
||||
_tainted_ within a cluster, and only schedule pods onto them which need those
|
||||
sysctl settings. It is suggested to use the Kubernetes [_taints and toleration_
|
||||
feature](/docs/user-guide/kubectl/kubectl_taint.md) to implement this.
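For instance, such a node could be tainted like this (the node name and the taint key/value are illustrative):

```shell
$ kubectl taint nodes my-sysctl-node dedicated=sysctl-tuned:NoSchedule
```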
|
||||
|
||||
## Safe vs. Unsafe Sysctls
|
||||
|
||||
Sysctls are grouped into _safe_ and _unsafe_ sysctls. In addition to proper
|
||||
namespacing a _safe_ sysctl must be properly _isolated_ between pods on the same
|
||||
node. This means that setting a _safe_ sysctl for one pod
|
||||
|
||||
- must not have any influence on any other pod on the node
|
||||
- must not allow harming the node's health
|
||||
- must not allow gaining CPU or memory resources outside of the resource limits
|
||||
of a pod.
|
||||
|
||||
The vast majority of _namespaced_ sysctls are not necessarily considered _safe_.
|
||||
|
||||
For Kubernetes 1.4, the following sysctls are supported in the _safe_ set:
|
||||
|
||||
- `kernel.shm_rmid_forced`,
|
||||
- `net.ipv4.ip_local_port_range`,
|
||||
- `net.ipv4.tcp_syncookies`.
|
||||
|
||||
This list will be extended in future Kubernetes versions when the kubelet
|
||||
supports better isolation mechanisms.
|
||||
|
||||
All _safe_ sysctls are enabled by default.
|
||||
|
||||
All _unsafe_ sysctls are disabled by default and must be allowed manually by the
|
||||
cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be
|
||||
scheduled, but will fail to launch.
|
||||
|
||||
**Warning**: Due to their nature of being _unsafe_, the use of _unsafe_ sysctls
|
||||
is at your own risk and can lead to severe problems, like incorrect behavior of
|
||||
containers, resource shortages, or complete breakage of a node.
|
||||
|
||||
## Enabling Unsafe Sysctls
|
||||
|
||||
With the warning above in mind, the cluster admin can allow certain _unsafe_
|
||||
sysctls for very special situations such as high-performance or real-time
|
||||
application tuning. _Unsafe_ sysctls are enabled on a node-by-node basis with a
|
||||
flag of the kubelet, e.g.:
|
||||
|
||||
```shell
|
||||
$ kubelet --experimental-allowed-unsafe-sysctls 'kernel.msg*,net.ipv4.route.min_pmtu' ...
|
||||
```
|
||||
|
||||
Only _namespaced_ sysctls can be enabled this way.
|
||||
|
||||
## Setting Sysctls for a Pod
|
||||
|
||||
The sysctl feature is an alpha API in Kubernetes 1.4. Therefore, sysctls are set
|
||||
using annotations on pods. They apply to all containers in the same pod.
|
||||
|
||||
Here is an example, with different annotations for _safe_ and _unsafe_ sysctls:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: sysctl-example
|
||||
annotations:
|
||||
security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
|
||||
security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
|
||||
spec:
|
||||
...
|
||||
```
|
||||
|
||||
**Note**: a pod with the _unsafe_ sysctls specified above will fail to launch on
|
||||
any node which has not enabled those two _unsafe_ sysctls explicitly. As with
|
||||
_node-level_ sysctls it is recommended to use [_taints and toleration_
|
||||
feature](/docs/user-guide/kubectl/kubectl_taint.md) or [labels on nodes](/docs/user-guide/labels.md)
|
||||
to schedule those pods onto the right nodes.
|
||||
[Using Sysctls in a Kubernetes Cluster](/docs/concepts/cluster-administration/sysctl-cluster/)
|
||||
|
|
|
@ -0,0 +1,26 @@
|
|||
---
|
||||
assignees:
|
||||
- mml
|
||||
title: Cluster Management Guide
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
This document outlines the potentially disruptive changes that exist in the 1.6 release cycle. Operators, administrators, and developers should
|
||||
take note of the changes below in order to maintain continuity across their upgrade process.
|
||||
|
||||
## Cluster defaults set to etcd 3
|
||||
|
||||
In the 1.6 release cycle, the default backend storage layer has been upgraded to fully leverage [etcd 3 capabilities](https://coreos.com/blog/etcd3-a-new-etcd.html) by default.
|
||||
For new clusters, there is nothing an operator will need to do; it should "just work". However, if you are upgrading from a 1.5 cluster, care should be taken to ensure
|
||||
continuity.
|
||||
|
||||
It is possible to maintain v2 compatibility mode while running etcd 3 for an interim period of time. To do this, you will simply need to update an argument passed to your apiserver during
|
||||
startup:
|
||||
|
||||
```
|
||||
$ kube-apiserver --storage-backend='etcd2' $(EXISTING_ARGS)
|
||||
```
|
||||
|
||||
However, for long-term maintenance of the cluster, we recommend that the operator plan an outage window in order to perform a [v2->v3 data upgrade](https://coreos.com/etcd/docs/latest/upgrades/upgrade_3_0.html).
|
103
docs/api.md
|
@ -6,105 +6,4 @@ assignees:
|
|||
title: Kubernetes API Overview
|
||||
---
|
||||
|
||||
Primary system and API concepts are documented in the [User guide](/docs/user-guide/).
|
||||
|
||||
Overall API conventions are described in the [API conventions doc](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api-conventions.md).
|
||||
|
||||
Remote access to the API is discussed in the [access doc](/docs/admin/accessing-the-api).
|
||||
|
||||
The Kubernetes API also serves as the foundation for the declarative configuration schema for the system. The [Kubectl](/docs/user-guide/kubectl) command-line tool can be used to create, update, delete, and get API objects.
|
||||
|
||||
Kubernetes also stores its serialized state (currently in [etcd](https://coreos.com/docs/distributed-configuration/getting-started-with-etcd/)) in terms of the API resources.
|
||||
|
||||
Kubernetes itself is decomposed into multiple components, which interact through its API.
|
||||
|
||||
## API changes
|
||||
|
||||
In our experience, any system that is successful needs to grow and change as new use cases emerge or existing ones change. Therefore, we expect the Kubernetes API to continuously change and grow. However, we intend to not break compatibility with existing clients, for an extended period of time. In general, new API resources and new resource fields can be expected to be added frequently. Elimination of resources or fields will require following a deprecation process. The precise deprecation policy for eliminating features is TBD, but once we reach our 1.0 milestone, there will be a specific policy.
|
||||
|
||||
What constitutes a compatible change and how to change the API are detailed by the [API change document](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api_changes.md).
|
||||
|
||||
## OpenAPI and Swagger definitions
|
||||
|
||||
Complete API details are documented using [Swagger v1.2](http://swagger.io/) and [OpenAPI](https://www.openapis.org/). The Kubernetes apiserver (aka "master") exposes an API that can be used to retrieve the Swagger v1.2 Kubernetes API spec located at `/swaggerapi`. You can also enable a UI to browse the API documentation at `/swagger-ui` by passing the `--enable-swagger-ui=true` flag to apiserver.
|
||||
|
||||
We also host a version of the [latest v1.2 API documentation UI](http://kubernetes.io/kubernetes/third_party/swagger-ui/). This is updated with the latest release, so if you are using a different version of Kubernetes you will want to use the spec from your apiserver.
|
||||
|
||||
Starting with Kubernetes 1.4, the OpenAPI spec is also available at `/swagger.json`. While we are transitioning from Swagger v1.2 to OpenAPI (aka Swagger v2.0), some tools such as kubectl and swagger-ui still use the v1.2 spec. The OpenAPI spec is in Beta as of Kubernetes 1.5.
|
||||
|
||||
Kubernetes implements an alternative Protobuf based serialization format for the API that is primarily intended for intra-cluster communication, documented in the [design proposal](https://github.com/kubernetes/kubernetes/blob/{{ page.githubbranch }}/docs/proposals/protobuf.md) and the IDL files for each schema are located in the Go packages that define the API objects.
|
||||
|
||||
## API versioning
|
||||
|
||||
To make it easier to eliminate fields or restructure resource representations, Kubernetes supports
|
||||
multiple API versions, each at a different API path, such as `/api/v1` or
|
||||
`/apis/extensions/v1beta1`.
|
||||
|
||||
We chose to version at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-lifed and/or experimental APIs. The JSON and Protobuf serialization schemas follow the same guidelines for schema changes - all descriptions below cover both formats.
|
||||
|
||||
Note that API versioning and Software versioning are only indirectly related. The [API and release
|
||||
versioning proposal](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/versioning.md) describes the relationship between API versioning and
|
||||
software versioning.
|
||||
|
||||
|
||||
Different API versions imply different levels of stability and support. The criteria for each level are described
|
||||
in more detail in the [API Changes documentation](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api_changes.md#alpha-beta-and-stable-versions). They are summarized here:
|
||||
|
||||
- Alpha level:
|
||||
- The version names contain `alpha` (e.g. `v1alpha1`).
|
||||
- May be buggy. Enabling the feature may expose bugs. Disabled by default.
|
||||
- Support for feature may be dropped at any time without notice.
|
||||
- The API may change in incompatible ways in a later software release without notice.
|
||||
- Recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.
|
||||
- Beta level:
|
||||
- The version names contain `beta` (e.g. `v2beta3`).
|
||||
- Code is well tested. Enabling the feature is considered safe. Enabled by default.
|
||||
- Support for the overall feature will not be dropped, though details may change.
|
||||
- The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens,
|
||||
we will provide instructions for migrating to the next version. This may require deleting, editing, and re-creating
|
||||
API objects. The editing process may require some thought. This may require downtime for applications that rely on the feature.
|
||||
- Recommended for only non-business-critical uses because of potential for incompatible changes in subsequent releases. If you have
|
||||
multiple clusters which can be upgraded independently, you may be able to relax this restriction.
|
||||
- **Please do try our beta features and give feedback on them! Once they exit beta, it may not be practical for us to make more changes.**
|
||||
- Stable level:
|
||||
- The version name is `vX` where `X` is an integer.
|
||||
- Stable versions of features will appear in released software for many subsequent versions.
|
||||
|
||||
## API groups
|
||||
|
||||
To make it easier to extend the Kubernetes API, we implemented [*API groups*](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-group.md).
|
||||
The API group is specified in a REST path and in the `apiVersion` field of a serialized object.
|
||||
|
||||
Currently there are several API groups in use:
|
||||
|
||||
1. the "core" (oftentimes called "legacy", due to not having explicit group name) group, which is at
|
||||
REST path `/api/v1` and is not specified as part of the `apiVersion` field, e.g. `apiVersion: v1`.
|
||||
1. the named groups are at REST path `/apis/$GROUP_NAME/$VERSION`, and use `apiVersion: $GROUP_NAME/$VERSION`
|
||||
(e.g. `apiVersion: batch/v1`). The full list of supported API groups can be seen in the [Kubernetes API reference](/docs/reference/).
|
||||
|
||||
|
||||
There are two supported paths to extending the API.
|
||||
1. [Third Party Resources](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/extending-api.md)
|
||||
are for users with very basic CRUD needs.
|
||||
1. Coming soon: users needing the full set of Kubernetes API semantics can implement their own apiserver
|
||||
and use the [aggregator](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/aggregated-api-servers.md)
|
||||
to make it seamless for clients.
|
||||
|
||||
|
||||
## Enabling API groups
|
||||
|
||||
Certain resources and API groups are enabled by default. They can be enabled or disabled by setting `--runtime-config`
|
||||
on the apiserver. `--runtime-config` accepts comma-separated values. For example, to disable batch/v1, set
|
||||
`--runtime-config=batch/v1=false`, to enable batch/v2alpha1, set `--runtime-config=batch/v2alpha1`.
|
||||
The flag accepts a comma-separated set of key=value pairs describing the runtime configuration of the apiserver.
|
||||
|
||||
IMPORTANT: Enabling or disabling groups or resources requires restarting apiserver and controller-manager
|
||||
to pick up the `--runtime-config` changes.
|
||||
|
||||
## Enabling resources in the groups
|
||||
|
||||
DaemonSets, Deployments, HorizontalPodAutoscalers, Ingress, Jobs and ReplicaSets are enabled by default.
|
||||
|
||||
Other extensions resources can be enabled by setting `--runtime-config` on
|
||||
apiserver. `--runtime-config` accepts comma-separated values. For example, to disable deployments and ingress, set
|
||||
`--runtime-config=extensions/v1beta1/deployments=false,extensions/v1beta1/ingress=false`
|
||||
{% include user-guide-content-moved.md %}
|
|
@ -0,0 +1,67 @@
|
|||
---
|
||||
assignees:
|
||||
- soltysh
|
||||
- sttts
|
||||
title: Auditing
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
Kubernetes Audit provides a security-relevant chronological set of records documenting
|
||||
the sequence of activities that have affected the system, by individual users, administrators,
|
||||
or other components of the system. It allows the cluster administrator to
|
||||
answer the following questions:
|
||||
- what happened?
|
||||
- when did it happen?
|
||||
- who initiated it?
|
||||
- on what did it happen?
|
||||
- where was it observed?
|
||||
- from where was it initiated?
|
||||
- to where was it going?
|
||||
|
||||
NOTE: Currently, Kubernetes provides only basic audit capabilities; there is still a lot
|
||||
of work going on to provide fully featured auditing capabilities (see [this issue](https://github.com/kubernetes/features/issues/22)).
|
||||
|
||||
Kubernetes audit is part of the [kube-apiserver](/docs/admin/kube-apiserver), logging all requests
|
||||
coming to the server. Each audited request produces two entries in the audit log:
|
||||
|
||||
1. The request line containing:
|
||||
- a unique id allowing the response line to be matched (see 2)
|
||||
- source ip of the request
|
||||
- HTTP method being invoked
|
||||
- original user invoking the operation
|
||||
- impersonated user for the operation
|
||||
- namespace of the request or <none>
|
||||
- URI as requested
|
||||
2. The response line containing:
|
||||
- the unique id from 1
|
||||
- response code
|
||||
|
||||
Example output for user `admin` asking for a list of pods:
|
||||
|
||||
```
|
||||
2016-09-07T13:03:57.400333046Z AUDIT: id="5c3b8227-4af9-4322-8a71-542231c3887b" ip="127.0.0.1" method="GET" user="admin" as="<self>" namespace="default" uri="/api/v1/namespaces/default/pods"
|
||||
2016-09-07T13:03:57.400710987Z AUDIT: id="5c3b8227-4af9-4322-8a71-542231c3887b" response="200"
|
||||
```
|
||||
|
||||
NOTE: The audit capabilities are available *only* for the secured endpoint of the API server.
|
||||
|
||||
## Configuration
|
||||
|
||||
[Kube-apiserver](/docs/admin/kube-apiserver) provides the following options, which are responsible
|
||||
for configuring where and how audit logs are handled:
|
||||
|
||||
- `audit-log-path` - enables the audit log, pointing to the file where requests are logged.
|
||||
- `audit-log-maxage` - specifies maximum number of days to retain old audit log files based on the timestamp encoded in their filename.
|
||||
- `audit-log-maxbackup` - specifies maximum number of old audit log files to retain.
|
||||
- `audit-log-maxsize` - specifies the maximum size in megabytes of the audit log file before it gets rotated. Defaults to 100MB.
|
||||
|
||||
If an audit log file already exists, Kubernetes appends new audit logs to that file.
|
||||
Otherwise, Kubernetes creates an audit log file at the location you specified in
|
||||
`audit-log-path`. If the audit log file exceeds the size you specify in `audit-log-maxsize`,
|
||||
Kubernetes will rename the current log file by appending the current timestamp on
|
||||
the file name (before the file extension) and create a new audit log file.
|
||||
Kubernetes may delete old log files when creating a new log file; you can configure
|
||||
how many files are retained and how old they can be by specifying the `audit-log-maxbackup`
|
||||
and `audit-log-maxage` options.
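Putting these options together, an apiserver could be started with flags like the following, in addition to its existing arguments (the path and retention values are illustrative):

```shell
$ kube-apiserver --audit-log-path=/var/log/kubernetes/audit.log \
    --audit-log-maxage=30 \
    --audit-log-maxbackup=10 \
    --audit-log-maxsize=100 \
    $(EXISTING_ARGS)
```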
|
|
@ -0,0 +1,137 @@
|
|||
---
|
||||
title: Federation
|
||||
---
|
||||
|
||||
This guide explains why and how to manage multiple Kubernetes clusters using
|
||||
federation.
|
||||
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
|
||||
## Why federation
|
||||
|
||||
Federation makes it easy to manage multiple clusters. It does so by providing two
|
||||
major building blocks:
|
||||
|
||||
* Sync resources across clusters: Federation provides the ability to keep
|
||||
resources in multiple clusters in sync. This can be used, for example, to
|
||||
ensure that the same deployment exists in multiple clusters.
|
||||
* Cross cluster discovery: It provides the ability to auto-configure DNS
|
||||
servers and load balancers with backends from all clusters. This can be used,
|
||||
for example, to ensure that a global VIP or DNS record can be used to access
|
||||
backends from multiple clusters.
|
||||
|
||||
Some other use cases that federation enables are:
|
||||
|
||||
* High Availability: By spreading load across clusters and auto configuring DNS
|
||||
servers and load balancers, federation minimises the impact of cluster
|
||||
failure.
|
||||
* Avoiding provider lock-in: By making it easier to migrate applications across
|
||||
clusters, federation prevents cluster provider lock-in.
|
||||
|
||||
|
||||
Federation is not helpful unless you have multiple clusters. Some of the reasons
|
||||
why you might want multiple clusters are:
|
||||
|
||||
* Low latency: Having clusters in multiple regions minimises latency by serving
|
||||
users from the cluster that is closest to them.
|
||||
* Fault isolation: It might be better to have multiple small clusters rather
|
||||
than a single large cluster for fault isolation (for example: multiple
|
||||
clusters in different availability zones of a cloud provider).
|
||||
[Multi cluster guide](/docs/admin/multi-cluster) has more details on this.
|
||||
* Scalability: There are scalability limits to a single Kubernetes cluster (this
|
||||
should not be an issue for most users; for more details, see
|
||||
[Kubernetes Scaling and Performance Goals](https://github.com/kubernetes/community/blob/master/sig-scalability/goals.md)).
|
||||
* Hybrid cloud: You can have multiple clusters on different cloud providers or
|
||||
on-premises data centers.
|
||||
|
||||
|
||||
### Caveats
|
||||
|
||||
While there are a lot of attractive use cases for federation, there are also
|
||||
some caveats.
|
||||
|
||||
* Increased network bandwidth and cost: The federation control plane watches all
|
||||
clusters to ensure that the current state is as expected. This can lead to
|
||||
significant network cost if the clusters are running in different regions on
|
||||
a cloud provider or on different cloud providers.
|
||||
* Reduced cross cluster isolation: A bug in the federation control plane can
|
||||
impact all clusters. This is mitigated by keeping the logic in federation
|
||||
control plane to a minimum. It mostly delegates to the control plane in
|
||||
kubernetes clusters whenever it can. The design and implementation also errs
|
||||
on the side of safety and avoiding multicluster outage.
|
||||
* Maturity: The federation project is relatively new and is not very mature.
|
||||
Not all resources are available and many are still alpha. [Issue
|
||||
38893](https://github.com/kubernetes/kubernetes/issues/38893) enumerates
|
||||
known issues with the system that the team is busy solving.
|
||||
|
||||
## Setup
|
||||
|
||||
To be able to federate multiple clusters, we first need to set up a federation
|
||||
control plane.
|
||||
Follow the [setup guide](/docs/admin/federation/) to set up the
|
||||
federation control plane.
|
||||
|
||||
## Hybrid cloud capabilities
|
||||
|
||||
Federations of Kubernetes Clusters can include clusters running in
|
||||
different cloud providers (e.g. Google Cloud, AWS), and on-premises
|
||||
(e.g. on OpenStack). Simply create all of the clusters that you
|
||||
require, in the appropriate cloud providers and/or locations, and
|
||||
register each cluster's API endpoint and credentials with your
|
||||
Federation API Server (See the
|
||||
[federation admin guide](/docs/admin/federation/) for details).
|
||||
|
||||
Thereafter, your API resources can span different clusters
|
||||
and cloud providers.
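As a sketch, registering a cluster with an existing federation control plane via the `kubefed` tool looks roughly like the following; the cluster and context names are placeholders, and the exact flags and workflow are described in the [federation admin guide](/docs/admin/federation/):

```shell
$ kubefed join my-gce-cluster --host-cluster-context=my-host-cluster
```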
|
||||
|
||||
## API resources
|
||||
|
||||
Once we have the control plane setup, we can start creating federation API
|
||||
resources.
|
||||
The following guides explain some of the resources in detail:
|
||||
|
||||
* [ConfigMap](https://kubernetes.io/docs/user-guide/federation/configmap/)
|
||||
* [DaemonSets](https://kubernetes.io/docs/user-guide/federation/daemonsets/)
|
||||
* [Deployment](https://kubernetes.io/docs/user-guide/federation/deployment/)
|
||||
* [Events](https://kubernetes.io/docs/user-guide/federation/events/)
|
||||
* [Ingress](https://kubernetes.io/docs/user-guide/federation/federated-ingress/)
|
||||
* [Namespaces](https://kubernetes.io/docs/user-guide/federation/namespaces/)
|
||||
* [ReplicaSets](https://kubernetes.io/docs/user-guide/federation/replicasets/)
|
||||
* [Secrets](https://kubernetes.io/docs/user-guide/federation/secrets/)
|
||||
* [Services](https://kubernetes.io/docs/user-guide/federation/federated-services/)
|
||||
|
||||
[API reference docs](/docs/federation/api-reference/) lists all the
|
||||
resources supported by the federation apiserver.
|
||||
|
||||
## Cascading deletion
|
||||
|
||||
Kubernetes version 1.5 includes support for cascading deletion of federated
|
||||
resources. With cascading deletion, when you delete a resource from the
|
||||
federation control plane, the corresponding resources in all underlying clusters
|
||||
are also deleted.
|
||||
|
||||
To enable cascading deletion, set the option
|
||||
`DeleteOptions.orphanDependents=false` when you delete a resource from the
|
||||
federation control plane.
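For illustration, a raw API call that sets this option when deleting a federated ReplicaSet might look like the sketch below; the apiserver address, resource path, and authentication are placeholders:

```shell
$ curl -k -X DELETE \
    -H "Content-Type: application/json" \
    -d '{"kind": "DeleteOptions", "apiVersion": "v1", "orphanDependents": false}' \
    https://my-federation-apiserver/apis/extensions/v1beta1/namespaces/default/replicasets/my-rs
```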
|
||||
|
||||
The following Federated resources are affected by cascading deletion:
|
||||
|
||||
* [Ingress](https://kubernetes.io/docs/user-guide/federation/federated-ingress/)
|
||||
* [Namespaces](https://kubernetes.io/docs/user-guide/federation/namespaces/)
|
||||
* [ReplicaSets](https://kubernetes.io/docs/user-guide/federation/replicasets/)
|
||||
* [Secrets](https://kubernetes.io/docs/user-guide/federation/secrets/)
|
||||
* [Deployment](https://kubernetes.io/docs/user-guide/federation/deployment/)
|
||||
* [DaemonSets](https://kubernetes.io/docs/user-guide/federation/daemonsets/)
|
||||
|
||||
Note: By default, deleting a resource from federation control plane does not
|
||||
delete the corresponding resources from underlying clusters.
|
||||
|
||||
|
||||
## For more information
|
||||
|
||||
* [Federation
|
||||
proposal](https://github.com/kubernetes/community/blob/{{page.githubbranch}}/contributors/design-proposals/federation.md)
|
||||
* [Kubecon2016 talk on federation](https://www.youtube.com/watch?v=pq9lbkmxpS8)
|
|
@ -0,0 +1,57 @@
|
|||
---
|
||||
assignees:
|
||||
- davidopp
|
||||
- filipg
|
||||
- piosz
|
||||
title: Guaranteed Scheduling For Critical Add-On Pods
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Overview
|
||||
|
||||
In addition to Kubernetes core components like the api-server, scheduler, and controller-manager running on a master machine,
|
||||
there are a number of add-ons which, for various reasons, must run on a regular cluster node (rather than the Kubernetes master).
|
||||
Some of these add-ons are critical to a fully functional cluster, such as Heapster, DNS, and UI.
|
||||
A cluster may stop working properly if a critical add-on is evicted (either manually or as a side effect of another operation like upgrade)
|
||||
and becomes pending (for example, when the cluster is highly utilized and either other pending pods schedule into the space
|
||||
vacated by the evicted critical add-on pod, or the amount of resources available on the node has changed for some other reason).
|
||||
|
||||
## Rescheduler: guaranteed scheduling of critical add-ons
|
||||
|
||||
Rescheduler ensures that critical add-ons are always scheduled
|
||||
(assuming the cluster has enough resources to run the critical add-on pods in the absence of regular pods).
|
||||
If the scheduler determines that no node has enough free resources to run the critical add-on pod
|
||||
given the pods that are already running in the cluster
|
||||
(indicated by the critical add-on pod's condition `PodScheduled` set to false, with the reason set to `Unschedulable`),
|
||||
the rescheduler tries to free up space for the add-on by evicting some pods; then the scheduler will schedule the add-on pod.
|
||||
|
||||
To avoid a situation where another pod is scheduled into the space prepared for the critical add-on,
|
||||
the chosen node gets a temporary taint "CriticalAddonsOnly" before the eviction(s)
|
||||
(see [more details](https://github.com/kubernetes/kubernetes/blob/master/docs/design/taint-toleration-dedicated.md)).
|
||||
Each critical add-on has to tolerate it,
|
||||
while the other pods shouldn't tolerate the taint. The taint is removed once the add-on is successfully scheduled.
|
||||
|
||||
*Warning:* currently there is no guarantee which node is chosen and which pods are being killed
|
||||
in order to schedule critical pods, so if the rescheduler is enabled, your pods might occasionally be
|
||||
killed for this purpose.
|
||||
|
||||
## Config
|
||||
|
||||
The rescheduler doesn't have any user-facing configuration (component config) or API.
|
||||
It's enabled by default. It can be disabled:
|
||||
|
||||
* during cluster setup by setting the `ENABLE_RESCHEDULER` flag to `false`
|
||||
* on a running cluster by deleting its manifest from the master node
|
||||
(default path `/etc/kubernetes/manifests/rescheduler.manifest`)
|
||||
|
||||
### Marking add-on as critical
|
||||
|
||||
To be critical, an add-on has to run in the `kube-system` namespace (configurable via flag)
|
||||
and have the following annotations specified:
|
||||
|
||||
* `scheduler.alpha.kubernetes.io/critical-pod` set to empty string
|
||||
* `scheduler.alpha.kubernetes.io/tolerations` set to `[{"key":"CriticalAddonsOnly", "operator":"Exists"}]`
|
||||
|
||||
The first one marks a pod as critical. The second one is required by the rescheduler algorithm.
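For example, a critical add-on pod could be declared like this (the pod name and image are placeholders; the annotations are the two listed above):

```shell
$ cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: my-critical-addon          # hypothetical add-on pod
  namespace: kube-system
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ''
    scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
spec:
  containers:
  - name: addon
    image: gcr.io/google_containers/pause-amd64:3.0   # placeholder image
EOF
```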
|
|
@ -3,6 +3,9 @@ assignees:
|
|||
- crassirostris
|
||||
- piosz
|
||||
title: Logging and Monitoring Cluster Activity
|
||||
redirect_from:
|
||||
- "/docs/concepts/clusters/logging/"
|
||||
- "/docs/concepts/clusters/logging.html"
|
||||
---
|
||||
|
||||
Application and systems logs can help you understand what is happening inside your cluster. The logs are particularly useful for debugging problems and monitoring cluster activity. Most modern applications have some kind of logging mechanism; as such, most container engines are likewise designed to support some kind of logging. The easiest and most embraced logging method for containerized applications is to write to the standard output and standard error streams.
|
||||
|
@ -21,7 +24,7 @@ The guidance for cluster-level logging assumes that a logging backend is present
|
|||
|
||||
In this section, you can see an example of basic logging in Kubernetes that
|
||||
outputs data to the standard output stream. This demonstration uses
|
||||
a [pod specification](/docs/concepts/clusters/counter-pod.yaml) with
|
||||
a [pod specification](/docs/concepts/cluster-administration/counter-pod.yaml) with
|
||||
a container that writes some text to standard output once per second.
|
||||
|
||||
{% include code.html language="yaml" file="counter-pod.yaml" ghlink="/docs/tasks/debug-application-cluster/counter-pod.yaml" %}
|
||||
|
@ -131,7 +134,7 @@ Consider the following example. A pod runs a single container, and the container
|
|||
writes to two different log files, using two different formats. Here's a
|
||||
configuration file for the Pod:
|
||||
|
||||
{% include code.html language="yaml" file="two-files-counter-pod.yaml" ghlink="/docs/concepts/clusters/two-files-counter-pod.yaml" %}
|
||||
{% include code.html language="yaml" file="two-files-counter-pod.yaml" ghlink="/docs/concepts/cluster-administration/two-files-counter-pod.yaml" %}
|
||||
|
||||
It would be a mess to have log entries of different formats in the same log
|
||||
stream, even if you managed to redirect both components to the `stdout` stream of
|
||||
|
@ -141,7 +144,7 @@ the logs to its own `stdout` stream.
|
|||
|
||||
Here's a configuration file for a pod that has two sidecar containers:
|
||||
|
||||
{% include code.html language="yaml" file="two-files-counter-pod-streaming-sidecar.yaml" ghlink="/docs/concepts/clusters/two-files-counter-pod-streaming-sidecar.yaml" %}
|
||||
{% include code.html language="yaml" file="two-files-counter-pod-streaming-sidecar.yaml" ghlink="/docs/concepts/cluster-administration/two-files-counter-pod-streaming-sidecar.yaml" %}
|
||||
|
||||
Now when you run this pod, you can access each log stream separately by
|
||||
running the following commands:
|
||||
|
@ -197,7 +200,7 @@ which uses fluentd as a logging agent. Here are two configuration files that
|
|||
you can use to implement this approach. The first file contains
|
||||
a [ConfigMap](/docs/user-guide/configmap/) to configure fluentd.
|
||||
|
||||
{% include code.html language="yaml" file="fluentd-sidecar-config.yaml" ghlink="/docs/concepts/clusters/fluentd-sidecar-config.yaml" %}
|
||||
{% include code.html language="yaml" file="fluentd-sidecar-config.yaml" ghlink="/docs/concepts/cluster-administration/fluentd-sidecar-config.yaml" %}
|
||||
|
||||
**Note**: The configuration of fluentd is beyond the scope of this article. For
|
||||
information about configuring fluentd, see the
|
||||
|
@ -206,7 +209,7 @@ information about configuring fluentd, see the
|
|||
The second file describes a pod that has a sidecar container running fluentd.
|
||||
The pod mounts a volume where fluentd can pick up its configuration data.
|
||||
|
||||
{% include code.html language="yaml" file="two-files-counter-pod-agent-sidecar.yaml" ghlink="/docs/concepts/clusters/two-files-counter-pod-agent-sidecar.yaml" %}
|
||||
{% include code.html language="yaml" file="two-files-counter-pod-agent-sidecar.yaml" ghlink="/docs/concepts/cluster-administration/two-files-counter-pod-agent-sidecar.yaml" %}
|
||||
|
||||
After some time you can find log messages in the Stackdriver interface.
|
||||
|
|
@ -0,0 +1,438 @@
|
|||
---
|
||||
assignees:
|
||||
- bgrant0607
|
||||
- janetkuo
|
||||
- mikedanese
|
||||
title: Managing Resources
|
||||
---
|
||||
|
||||
You've deployed your application and exposed it via a service. Now what? Kubernetes provides a number of tools to help you manage your application deployment, including scaling and updating. Among the features we'll discuss in more depth are [configuration files](/docs/user-guide/configuring-containers/#configuration-in-kubernetes) and [labels](/docs/user-guide/deploying-applications/#labels).
|
||||
|
||||
You can find all the files for this example [in our docs
|
||||
repo here](https://github.com/kubernetes/kubernetes.github.io/tree/{{page.docsbranch}}/docs/user-guide/).
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Organizing resource configurations
|
||||
|
||||
Many applications require multiple resources to be created, such as a Deployment and a Service. Management of multiple resources can be simplified by grouping them together in the same file (separated by `---` in YAML). For example:
|
||||
|
||||
{% include code.html language="yaml" file="nginx-app.yaml" ghlink="/docs/user-guide/nginx-app.yaml" %}
|
||||
|
||||
Multiple resources can be created the same way as a single resource:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/user-guide/nginx-app.yaml
|
||||
service "my-nginx-svc" created
|
||||
deployment "my-nginx" created
|
||||
```
|
||||
|
||||
The resources will be created in the order they appear in the file. Therefore, it's best to specify the service first, since that will ensure the scheduler can spread the pods associated with the service as they are created by the controller(s), such as Deployment.
|
||||
|
||||
`kubectl create` also accepts multiple `-f` arguments:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/user-guide/nginx/nginx-svc.yaml -f docs/user-guide/nginx/nginx-deployment.yaml
|
||||
```
|
||||
|
||||
And a directory can be specified rather than or in addition to individual files:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/user-guide/nginx/
|
||||
```
|
||||
|
||||
`kubectl` will read any files with suffixes `.yaml`, `.yml`, or `.json`.
|
||||
|
||||
It is a recommended practice to put resources related to the same microservice or application tier into the same file, and to group all of the files associated with your application in the same directory. If the tiers of your application bind to each other using DNS, then you can then simply deploy all of the components of your stack en masse.
|
||||
|
||||
A URL can also be specified as a configuration source, which is handy for deploying directly from configuration files checked into github:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/docs/user-guide/nginx-deployment.yaml
|
||||
deployment "nginx-deployment" created
|
||||
```
|
||||
|
||||
## Bulk operations in kubectl
|
||||
|
||||
Resource creation isn't the only operation that `kubectl` can perform in bulk. It can also extract resource names from configuration files in order to perform other operations, in particular to delete the same resources you created:
|
||||
|
||||
```shell
|
||||
$ kubectl delete -f docs/user-guide/nginx/
|
||||
deployment "my-nginx" deleted
|
||||
service "my-nginx-svc" deleted
|
||||
```
|
||||
|
||||
In the case of just two resources, it's also easy to specify both on the command line using the resource/name syntax:
|
||||
|
||||
```shell
|
||||
$ kubectl delete deployments/my-nginx services/my-nginx-svc
|
||||
```
|
||||
|
||||
For larger numbers of resources, you'll find it easier to use a selector (label query), specified with `-l` or `--selector`, to filter resources by their labels:
|
||||
|
||||
```shell
|
||||
$ kubectl delete deployment,services -l app=nginx
|
||||
deployment "my-nginx" deleted
|
||||
service "my-nginx-svc" deleted
|
||||
```
|
||||
|
||||
Because `kubectl` outputs resource names in the same syntax it accepts, it's easy to chain operations using `$()` or `xargs`:
|
||||
|
||||
```shell
|
||||
$ kubectl get $(kubectl create -f docs/user-guide/nginx/ -o name | grep service)
|
||||
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
|
||||
my-nginx-svc 10.0.0.208 80/TCP 0s
|
||||
```
|
||||
|
||||
With the above commands, we first create resources under docs/user-guide/nginx/ and print the resources created with `-o name` output format
|
||||
(which prints each resource as resource/name). Then we `grep` only the "service", and print it with `kubectl get`.
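The same chaining can be done with `xargs` instead of `$()`; this is equivalent to the command above, so run one or the other, not both:

```shell
$ kubectl create -f docs/user-guide/nginx/ -o name | grep service | xargs kubectl get
```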
|
||||
|
||||
If you happen to organize your resources across several subdirectories within a particular directory, you can recursively perform the operations on the subdirectories also, by specifying `--recursive` or `-R` alongside the `--filename,-f` flag.
|
||||
|
||||
For instance, assume there is a directory `project/k8s/development` that holds all of the manifests needed for the development environment, organized by resource type:
|
||||
|
||||
```
|
||||
project/k8s/development
|
||||
├── configmap
|
||||
│ └── my-configmap.yaml
|
||||
├── deployment
|
||||
│ └── my-deployment.yaml
|
||||
└── pvc
|
||||
└── my-pvc.yaml
|
||||
```
|
||||
|
||||
By default, performing a bulk operation on `project/k8s/development` will stop at the first level of the directory, not processing any subdirectories. If we tried to create the resources in this directory using the following command, we'd encounter an error:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f project/k8s/development
|
||||
error: you must provide one or more resources by argument or filename (.json|.yaml|.yml|stdin)
|
||||
```
|
||||
|
||||
Instead, specify the `--recursive` or `-R` flag with the `--filename,-f` flag as such:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f project/k8s/development --recursive
|
||||
configmap "my-config" created
|
||||
deployment "my-deployment" created
|
||||
persistentvolumeclaim "my-pvc" created
|
||||
```
|
||||
|
||||
The `--recursive` flag works with any operation that accepts the `--filename,-f` flag, such as `kubectl {create,get,delete,describe,rollout}`.
|
||||
|
||||
The `--recursive` flag also works when multiple `-f` arguments are provided:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f project/k8s/namespaces -f project/k8s/development --recursive
|
||||
namespace "development" created
|
||||
namespace "staging" created
|
||||
configmap "my-config" created
|
||||
deployment "my-deployment" created
|
||||
persistentvolumeclaim "my-pvc" created
|
||||
```
|
||||
|
||||
If you're interested in learning more about `kubectl`, go ahead and read [kubectl Overview](/docs/user-guide/kubectl-overview).
|
||||
|
||||
## Using labels effectively
|
||||
|
||||
The examples we've used so far apply at most a single label to any resource. There are many scenarios where multiple labels should be used to distinguish sets from one another.
|
||||
|
||||
For instance, different applications would use different values for the `app` label, but a multi-tier application, such as the [guestbook example](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/), would additionally need to distinguish each tier. The frontend could carry the following labels:
|
||||
|
||||
```yaml
|
||||
labels:
|
||||
app: guestbook
|
||||
tier: frontend
|
||||
```
|
||||
|
||||
while the Redis master and slave would have different `tier` labels, and perhaps even an additional `role` label:
|
||||
|
||||
```yaml
|
||||
labels:
|
||||
app: guestbook
|
||||
tier: backend
|
||||
role: master
|
||||
```
|
||||
|
||||
and
|
||||
|
||||
```yaml
|
||||
labels:
|
||||
app: guestbook
|
||||
tier: backend
|
||||
role: slave
|
||||
```
|
||||
|
||||
The labels allow us to slice and dice our resources along any dimension specified by a label:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f examples/guestbook/all-in-one/guestbook-all-in-one.yaml
|
||||
$ kubectl get pods -Lapp -Ltier -Lrole
|
||||
NAME READY STATUS RESTARTS AGE APP TIER ROLE
|
||||
guestbook-fe-4nlpb 1/1 Running 0 1m guestbook frontend <none>
|
||||
guestbook-fe-ght6d 1/1 Running 0 1m guestbook frontend <none>
|
||||
guestbook-fe-jpy62 1/1 Running 0 1m guestbook frontend <none>
|
||||
guestbook-redis-master-5pg3b 1/1 Running 0 1m guestbook backend master
|
||||
guestbook-redis-slave-2q2yf 1/1 Running 0 1m guestbook backend slave
|
||||
guestbook-redis-slave-qgazl 1/1 Running 0 1m guestbook backend slave
|
||||
my-nginx-divi2 1/1 Running 0 29m nginx <none> <none>
|
||||
my-nginx-o0ef1 1/1 Running 0 29m nginx <none> <none>
|
||||
$ kubectl get pods -lapp=guestbook,role=slave
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
guestbook-redis-slave-2q2yf 1/1 Running 0 3m
|
||||
guestbook-redis-slave-qgazl 1/1 Running 0 3m
|
||||
```
|
||||
|
||||
## Canary deployments
|
||||
|
||||
Another scenario where multiple labels are needed is to distinguish deployments of different releases or configurations of the same component. It is common practice to deploy a *canary* of a new application release (specified via image tag in the pod template) side by side with the previous release so that the new release can receive live production traffic before fully rolling it out.
|
||||
|
||||
For instance, you can use a `track` label to differentiate different releases.
|
||||
|
||||
The primary, stable release would have a `track` label with value as `stable`:
|
||||
|
||||
```yaml
|
||||
name: frontend
|
||||
replicas: 3
|
||||
...
|
||||
labels:
|
||||
app: guestbook
|
||||
tier: frontend
|
||||
track: stable
|
||||
...
|
||||
image: gb-frontend:v3
|
||||
```
|
||||
|
||||
and then you can create a new release of the guestbook frontend that carries the `track` label with a different value (i.e. `canary`), so that the two sets of pods do not overlap:
|
||||
|
||||
```yaml
|
||||
name: frontend-canary
|
||||
replicas: 1
|
||||
...
|
||||
labels:
|
||||
app: guestbook
|
||||
tier: frontend
|
||||
track: canary
|
||||
...
|
||||
image: gb-frontend:v4
|
||||
```
|
||||
|
||||
|
||||
The frontend service would span both sets of replicas by selecting the common subset of their labels (i.e. omitting the `track` label), so that the traffic will be redirected to both applications:
|
||||
|
||||
```yaml
|
||||
selector:
|
||||
app: guestbook
|
||||
tier: frontend
|
||||
```
|
||||
|
||||
You can tweak the number of replicas of the stable and canary releases to determine the ratio of each release that will receive live production traffic (in this case, 3:1).
|
||||
Once you're confident, you can update the stable track to the new application release and remove the canary one.
|
||||
|
||||
For a more concrete example, check the [tutorial of deploying Ghost](https://github.com/kelseyhightower/talks/tree/master/kubecon-eu-2016/demo#deploy-a-canary).
|
||||
|
||||
## Updating labels
|
||||
|
||||
Sometimes existing pods and other resources need to be relabeled before creating new resources. This can be done with `kubectl label`.
|
||||
For example, if you want to label all your nginx pods as frontend tier, simply run:
|
||||
|
||||
```shell
|
||||
$ kubectl label pods -l app=nginx tier=fe
|
||||
pod "my-nginx-2035384211-j5fhi" labeled
|
||||
pod "my-nginx-2035384211-u2c7e" labeled
|
||||
pod "my-nginx-2035384211-u3t6x" labeled
|
||||
```
|
||||
|
||||
This first filters all pods with the label "app=nginx", and then labels them with "tier=fe".
|
||||
To see the pods you just labeled, run:
|
||||
|
||||
```shell
|
||||
$ kubectl get pods -l app=nginx -L tier
|
||||
NAME READY STATUS RESTARTS AGE TIER
|
||||
my-nginx-2035384211-j5fhi 1/1 Running 0 23m fe
|
||||
my-nginx-2035384211-u2c7e 1/1 Running 0 23m fe
|
||||
my-nginx-2035384211-u3t6x 1/1 Running 0 23m fe
|
||||
```
|
||||
|
||||
This outputs all "app=nginx" pods, with an additional label column of pods' tier (specified with `-L` or `--label-columns`).
|
||||
|
||||
For more information, please see the [labels](/docs/user-guide/labels/) and [kubectl label](/docs/user-guide/kubectl/kubectl_label/) documents.
|
||||
|
||||
## Updating annotations
|
||||
|
||||
Sometimes you may want to attach annotations to resources. Annotations are arbitrary non-identifying metadata for retrieval by API clients such as tools, libraries, etc. This can be done with `kubectl annotate`. For example:
|
||||
|
||||
```shell
|
||||
$ kubectl annotate pods my-nginx-v4-9gw19 description='my frontend running nginx'
|
||||
$ kubectl get pods my-nginx-v4-9gw19 -o yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
annotations:
|
||||
description: my frontend running nginx
|
||||
...
|
||||
```
|
||||
|
||||
For more information, please see the [annotations](/docs/user-guide/annotations/) and [kubectl annotate](/docs/user-guide/kubectl/kubectl_annotate/) documents.
|
||||
|
||||
## Scaling your application
|
||||
|
||||
When load on your application grows or shrinks, it's easy to scale with `kubectl`. For instance, to decrease the number of nginx replicas from 3 to 1, do:
|
||||
|
||||
```shell
|
||||
$ kubectl scale deployment/my-nginx --replicas=1
|
||||
deployment "my-nginx" scaled
|
||||
```
|
||||
|
||||
Now you only have one pod managed by the deployment.
|
||||
|
||||
```shell
|
||||
$ kubectl get pods -l app=nginx
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
my-nginx-2035384211-j5fhi 1/1 Running 0 30m
|
||||
```
|
||||
|
||||
To have the system automatically choose the number of nginx replicas as needed, ranging from 1 to 3, do:
|
||||
|
||||
```shell
|
||||
$ kubectl autoscale deployment/my-nginx --min=1 --max=3
|
||||
deployment "my-nginx" autoscaled
|
||||
```
|
||||
|
||||
Now your nginx replicas will be scaled up and down as needed, automatically.
|
||||
|
||||
For more information, please see the [kubectl scale](/docs/user-guide/kubectl/kubectl_scale/), [kubectl autoscale](/docs/user-guide/kubectl/kubectl_autoscale/), and [horizontal pod autoscaler](/docs/user-guide/horizontal-pod-autoscaler/) documents.
|
||||
|
||||
|
||||
## In-place updates of resources
|
||||
|
||||
Sometimes it's necessary to make narrow, non-disruptive updates to resources you've created.
|
||||
|
||||
### kubectl apply
|
||||
|
||||
It is suggested to maintain a set of configuration files in source control (see [configuration as code](http://martinfowler.com/bliki/InfrastructureAsCode.html)),
|
||||
so that they can be maintained and versioned along with the code for the resources they configure.
|
||||
Then, you can use [`kubectl apply`](/docs/user-guide/kubectl/kubectl_apply/) to push your configuration changes to the cluster.
|
||||
|
||||
This command will compare the version of the configuration that you're pushing with the previous version and apply the changes you've made, without overwriting any automated changes to properties you haven't specified.
|
||||
|
||||
```shell
|
||||
$ kubectl apply -f docs/user-guide/nginx/nginx-deployment.yaml
|
||||
deployment "my-nginx" configured
|
||||
```
|
||||
|
||||
Note that `kubectl apply` attaches an annotation to the resource in order to determine the changes to the configuration since the previous invocation. When it's invoked, `kubectl apply` does a three-way diff between the previous configuration, the provided input and the current configuration of the resource, in order to determine how to modify the resource.
|
||||
|
||||
Currently, resources are created without this annotation, so the first invocation of `kubectl apply` will fall back to a two-way diff between the provided input and the current configuration of the resource. During this first invocation, it cannot detect the deletion of properties set when the resource was created. For this reason, it will not remove them.
|
||||
|
||||
All subsequent calls to `kubectl apply`, and other commands that modify the configuration, such as `kubectl replace` and `kubectl edit`, will update the annotation, allowing subsequent calls to `kubectl apply` to detect and perform deletions using a three-way diff.
|
||||
|
||||
**Note:** To use apply, always create the resource initially with either `kubectl apply` or `kubectl create --save-config`.
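For example, either of these commands records the configuration needed by later `kubectl apply` calls:

```shell
$ kubectl apply -f docs/user-guide/nginx/nginx-deployment.yaml
# or, equivalently for this purpose:
$ kubectl create --save-config -f docs/user-guide/nginx/nginx-deployment.yaml
```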
|
||||
|
||||
### kubectl edit
|
||||
|
||||
Alternatively, you may also update resources with `kubectl edit`:
|
||||
|
||||
```shell
|
||||
$ kubectl edit deployment/my-nginx
|
||||
```
|
||||
|
||||
This is equivalent to first `get`ting the resource, editing it in a text editor, and then `apply`ing the updated version:
|
||||
|
||||
```shell
|
||||
$ kubectl get deployment my-nginx -o yaml > /tmp/nginx.yaml
|
||||
$ vi /tmp/nginx.yaml
|
||||
# do some edit, and then save the file
|
||||
$ kubectl apply -f /tmp/nginx.yaml
|
||||
deployment "my-nginx" configured
|
||||
$ rm /tmp/nginx.yaml
|
||||
```
|
||||
|
||||
This allows you to make more significant changes more easily. Note that you can specify the editor with your `EDITOR` or `KUBE_EDITOR` environment variables.
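For example, to pick the editor for a single invocation (assuming `nano` is installed on your workstation):

```shell
$ KUBE_EDITOR="nano" kubectl edit deployment/my-nginx
```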
|
||||
|
||||
For more information, please see the [kubectl edit](/docs/user-guide/kubectl/kubectl_edit/) document.
|
||||
|
||||
### kubectl patch
|
||||
|
||||
Suppose you want to fix a typo in a Deployment's container image. One way to do that is with `kubectl patch`:
|
||||
|
||||
```shell
|
||||
# Suppose you have a Deployment with a container named "nginx" and its image "nignx" (typo),
|
||||
# use container name "nginx" as a key to update the image from "nignx" (typo) to "nginx"
|
||||
$ kubectl get deployment my-nginx -o yaml
|
||||
```
|
||||
|
||||
```yaml
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: Deployment
|
||||
...
|
||||
spec:
|
||||
template:
|
||||
spec:
|
||||
containers:
|
||||
- image: nignx
|
||||
name: nginx
|
||||
...
|
||||
```
|
||||
|
||||
```shell
|
||||
$ kubectl patch deployment my-nginx -p'{"spec":{"template":{"spec":{"containers":[{"name":"nginx","image":"nginx"}]}}}}'
|
||||
"my-nginx" patched
|
||||
$ kubectl get deployment my-nginx -o yaml
|
||||
```
|
||||
|
||||
```yaml
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: Deployment
|
||||
...
|
||||
spec:
|
||||
template:
|
||||
spec:
|
||||
containers:
|
||||
- image: nginx
|
||||
name: nginx
|
||||
...
|
||||
```
|
||||
|
||||
The patch is specified using JSON.
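By default, the patch above is interpreted as a strategic merge patch. `kubectl patch` also accepts other patch formats via its `--type` flag; for example, a rough JSON Patch equivalent (the container index `0` is an assumption based on the spec shown above) would be:

```shell
$ kubectl patch deployment my-nginx --type='json' \
    -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "nginx"}]'
```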
|
||||
|
||||
The system ensures that you don't clobber changes made by other users or components by confirming that the `resourceVersion` doesn't differ from the version you edited. If you want to update regardless of other changes, remove the `resourceVersion` field when you edit the resource. However, if you do this, don't use your original configuration file as the source since additional fields most likely were set in the live state.
|
||||
|
||||
For more information, please see the [kubectl patch](/docs/user-guide/kubectl/kubectl_patch/) document.
|
||||
|
||||
## Disruptive updates
|
||||
|
||||
In some cases, you may need to update resource fields that cannot be updated once initialized, or you may just want to make a recursive change immediately, such as to fix broken pods created by a Deployment. To change such fields, use `replace --force`, which deletes and re-creates the resource. In this case, you can simply modify your original configuration file:
|
||||
|
||||
```shell
|
||||
$ kubectl replace -f docs/user-guide/nginx/nginx-deployment.yaml --force
|
||||
deployment "my-nginx" deleted
|
||||
deployment "my-nginx" replaced
|
||||
```
|
||||
|
||||
## Updating your application without a service outage
|
||||
|
||||
At some point, you'll eventually need to update your deployed application, typically by specifying a new image or image tag, as in the canary deployment scenario above. `kubectl` supports several update operations, each of which is applicable to different scenarios.
|
||||
|
||||
We'll guide you through how to create and update applications with Deployments. If your deployed application is managed by Replication Controllers,
|
||||
you should read [how to use `kubectl rolling-update`](/docs/tasks/run-application/rolling-update-replication-controller/) instead.
|
||||
|
||||
Let's say you were running version 1.7.9 of nginx:
|
||||
|
||||
```shell
|
||||
$ kubectl run my-nginx --image=nginx:1.7.9 --replicas=3
|
||||
deployment "my-nginx" created
|
||||
```
|
||||
|
||||
To update to version 1.9.1, simply change `.spec.template.spec.containers[0].image` from `nginx:1.7.9` to `nginx:1.9.1`, with the kubectl commands we learned above.
|
||||
|
||||
```shell
|
||||
$ kubectl edit deployment/my-nginx
|
||||
```
|
||||
|
||||
That's it! The Deployment will declaratively update the deployed nginx application progressively behind the scenes. It ensures that only a certain number of old replicas may be down while they are being updated, and only a certain number of new replicas may be created above the desired number of pods. To learn more about it, visit the [Deployment page](/docs/user-guide/deployments/).
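If you want to follow the progress of such a rollout, one option (using the Deployment name from this example) is:

```shell
$ kubectl rollout status deployment/my-nginx
```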
|
||||
|
||||
## What's next?
|
||||
|
||||
- [Learn about how to use `kubectl` for application introspection and debugging.](/docs/user-guide/introspection-and-debugging/)
|
||||
- [Configuration Best Practices and Tips](/docs/concepts/configuration/overview/)
|
|
@ -0,0 +1,66 @@
|
|||
---
|
||||
assignees:
|
||||
- davidopp
|
||||
title: Using Multiple Clusters
|
||||
---
|
||||
|
||||
You may want to set up multiple Kubernetes clusters, both to
|
||||
have clusters in different regions to be nearer to your users, and to tolerate failures and/or invasive maintenance.
|
||||
This document describes some of the issues to consider when making a decision about doing so.
|
||||
|
||||
If you decide to have multiple clusters, Kubernetes provides a way to [federate them](/docs/admin/federation/).
|
||||
|
||||
## Scope of a single cluster
|
||||
|
||||
On IaaS providers such as Google Compute Engine or Amazon Web Services, a VM exists in a
|
||||
[zone](https://cloud.google.com/compute/docs/zones) or [availability
|
||||
zone](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html).
|
||||
We suggest that all the VMs in a Kubernetes cluster should be in the same availability zone, because:
|
||||
|
||||
- compared to having a single global Kubernetes cluster, there are fewer single points of failure
|
||||
- compared to a cluster that spans availability zones, it is easier to reason about the availability properties of a
|
||||
single-zone cluster.
|
||||
- when the Kubernetes developers are designing the system (e.g. making assumptions about latency, bandwidth, or
|
||||
correlated failures) they are assuming all the machines are in a single data center, or otherwise closely connected.
|
||||
|
||||
It is okay to have multiple clusters per availability zone, though on balance we think fewer is better.
|
||||
Reasons to prefer fewer clusters are:
|
||||
|
||||
- improved bin packing of Pods in some cases with more nodes in one cluster (less resource fragmentation)
|
||||
- reduced operational overhead (though the advantage is diminished as ops tooling and processes mature)
|
||||
- reduced per-cluster fixed resource costs, e.g. apiserver VMs (though these are small as a percentage
|
||||
of overall cluster cost for medium to large clusters).
|
||||
|
||||
Reasons to have multiple clusters include:
|
||||
|
||||
- strict security policies requiring isolation of one class of work from another (but, see Partitioning Clusters
|
||||
below).
|
||||
- test clusters to canary new Kubernetes releases or other cluster software.
|
||||
|
||||
## Selecting the right number of clusters
|
||||
|
||||
The selection of the number of Kubernetes clusters may be a relatively static choice, only revisited occasionally.
|
||||
By contrast, the number of nodes in a cluster and the number of pods in a service may change frequently according to
|
||||
load and growth.
|
||||
|
||||
To pick the number of clusters, first, decide which regions you need to be in to have adequate latency to all your end users, for services that will run
|
||||
on Kubernetes (if you use a Content Distribution Network, the latency requirements for the CDN-hosted content need not
|
||||
be considered). Legal issues might influence this as well. For example, a company with a global customer base might decide to have clusters in US, EU, AP, and SA regions.
|
||||
Call the number of regions to be in `R`.
|
||||
|
||||
Second, decide how many clusters can be unavailable at the same time while your services remain available overall. Call
|
||||
the number that can be unavailable `U`. If you are not sure, then 1 is a fine choice.
|
||||
|
||||
If it is allowable for load-balancing to direct traffic to any region in the event of a cluster failure, then
|
||||
you need at least the larger of `R` or `U + 1` clusters. If it is not (e.g. you want to ensure low latency for all
|
||||
users in the event of a cluster failure), then you need to have `R * (U + 1)` clusters
|
||||
(`U + 1` in each of `R` regions). In any case, try to put each cluster in a different zone.
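As a worked example, suppose `R = 3` and `U = 1`: if traffic may be directed to any region during a cluster failure, `max(3, 1 + 1) = 3` clusters suffice; if it may not, you need `3 * (1 + 1) = 6` clusters, two in each region.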
|
||||
|
||||
Finally, if any of your clusters would need more than the maximum recommended number of nodes for a Kubernetes cluster, then
|
||||
you may need even more clusters. Kubernetes v1.3 supports clusters up to 1000 nodes in size.
|
||||
|
||||
## Working with multiple clusters
|
||||
|
||||
When you have multiple clusters, you would typically create services with the same config in each cluster and put each of those
|
||||
service instances behind a load balancer (AWS Elastic Load Balancer, GCE Forwarding Rule or HTTP Load Balancer) spanning all of them, so that
|
||||
failures of a single cluster are not visible to end users.
|
|
@ -0,0 +1,73 @@
|
|||
---
|
||||
assignees:
|
||||
- dcbw
|
||||
- freehan
|
||||
- thockin
|
||||
title: Network Plugins
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
__Disclaimer__: Network plugins are in alpha, and their contents will change rapidly.
|
||||
|
||||
Network plugins in Kubernetes come in a few flavors:
|
||||
|
||||
* CNI plugins: adhere to the appc/CNI specification, designed for interoperability.
|
||||
* Kubenet plugin: implements basic `cbr0` using the `bridge` and `host-local` CNI plugins
|
||||
|
||||
## Installation
|
||||
|
||||
The kubelet has a single default network plugin, and a default network common to the entire cluster. It probes for plugins when it starts up, remembers what it found, and executes the selected plugin at appropriate times in the pod lifecycle (this is only true for docker, as rkt manages its own CNI plugins). There are two Kubelet command line parameters to keep in mind when using plugins:
|
||||
|
||||
* `network-plugin-dir`: Kubelet probes this directory for plugins on startup
|
||||
* `network-plugin`: The network plugin to use from `network-plugin-dir`. It must match the name reported by a plugin probed from the plugin directory. For CNI plugins, this is simply "cni".
|
||||
|
||||
## Network Plugin Requirements
|
||||
|
||||
Besides providing the [`NetworkPlugin` interface](https://github.com/kubernetes/kubernetes/tree/{{page.version}}/pkg/kubelet/network/plugins.go) to configure and clean up pod networking, the plugin may also need specific support for kube-proxy. The iptables proxy obviously depends on iptables, and the plugin may need to ensure that container traffic is made available to iptables. For example, if the plugin connects containers to a Linux bridge, the plugin must set the `net/bridge/bridge-nf-call-iptables` sysctl to `1` to ensure that the iptables proxy functions correctly. If the plugin does not use a Linux bridge (but instead something like Open vSwitch or some other mechanism) it should ensure container traffic is appropriately routed for the proxy.
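For example, a bridge-based plugin might ensure this on the node with something like the following (run as root; assumes the bridge netfilter module is loaded):

```shell
sysctl -w net.bridge.bridge-nf-call-iptables=1
```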
|
||||
|
||||
By default if no kubelet network plugin is specified, the `noop` plugin is used, which sets `net/bridge/bridge-nf-call-iptables=1` to ensure simple configurations (like docker with a bridge) work correctly with the iptables proxy.
|
||||
|
||||
### CNI
|
||||
|
||||
The CNI plugin is selected by passing Kubelet the `--network-plugin=cni` command-line option. Kubelet reads a file from `--cni-conf-dir` (default `/etc/cni/net.d`) and uses the CNI configuration from that file to set up each pod's network. The CNI configuration file must match the [CNI specification](https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration), and any required CNI plugins referenced by the configuration must be present in `--cni-bin-dir` (default `/opt/cni/bin`).
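Putting those options together, a kubelet configured for CNI might be started with something like the following sketch (the paths shown are the defaults mentioned above):

```shell
kubelet --network-plugin=cni \
  --cni-conf-dir=/etc/cni/net.d \
  --cni-bin-dir=/opt/cni/bin ...
```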
|
||||
|
||||
If there are multiple CNI configuration files in the directory, the first one in lexicographic order of file name is used.
|
||||
|
||||
In addition to the CNI plugin specified by the configuration file, Kubernetes requires the standard CNI [`lo`](https://github.com/containernetworking/cni/blob/master/plugins/main/loopback/loopback.go) plugin, at minimum version 0.2.0
|
||||
|
||||
Limitation: Due to [#31307](https://github.com/kubernetes/kubernetes/issues/31307), `HostPort` won't work with the CNI networking plugin at the moment. That means all `hostPort` attributes in pods are simply ignored.
|
||||
|
||||
### kubenet
|
||||
|
||||
Kubenet is a very basic, simple network plugin, on Linux only. It does not, of itself, implement more advanced features like cross-node networking or network policy. It is typically used together with a cloud provider that sets up routing rules for communication between nodes, or in single-node environments.
|
||||
|
||||
Kubenet creates a Linux bridge named `cbr0` and creates a veth pair for each pod with the host end of each pair connected to `cbr0`. The pod end of the pair is assigned an IP address allocated from a range assigned to the node either through configuration or by the controller-manager. `cbr0` is assigned an MTU matching the smallest MTU of an enabled normal interface on the host.
|
||||
|
||||
The plugin requires a few things (an example invocation is sketched after this list):
|
||||
|
||||
* The standard CNI `bridge`, `lo` and `host-local` plugins are required, at minimum version 0.2.0. Kubenet will first search for them in `/opt/cni/bin`. Specify `network-plugin-dir` to supply additional search paths. The first match found will take effect.
|
||||
* Kubelet must be run with the `--network-plugin=kubenet` argument to enable the plugin
|
||||
* Kubelet should also be run with the `--non-masquerade-cidr=<clusterCidr>` argument to ensure traffic to IPs outside this range will use IP masquerade.
|
||||
* The node must be assigned an IP subnet through either the `--pod-cidr` kubelet command-line option or the `--allocate-node-cidrs=true --cluster-cidr=<cidr>` controller-manager command-line options.
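For illustration, a kubelet invocation that combines the requirements above might look like this (the CIDR values are assumptions for this example, not recommendations):

```shell
kubelet --network-plugin=kubenet \
  --non-masquerade-cidr=10.0.0.0/8 \
  --pod-cidr=10.123.45.0/24 ...
```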
|
||||
|
||||
### Customizing the MTU (with kubenet)
|
||||
|
||||
The MTU should always be configured correctly to get the best networking performance. Network plugins will usually try
|
||||
to infer a sensible MTU, but sometimes the logic will not result in an optimal MTU. For example, if the
|
||||
Docker bridge or another interface has a small MTU, kubenet will currently select that MTU. Or if you are
|
||||
using IPSEC encapsulation, the MTU must be reduced, and this calculation is out-of-scope for
|
||||
most network plugins.
|
||||
|
||||
Where needed, you can specify the MTU explicitly with the `network-plugin-mtu` kubelet option. For example,
|
||||
on AWS the `eth0` MTU is typically 9001, so you might specify `--network-plugin-mtu=9001`. If you're using IPSEC you
|
||||
might reduce it to allow for encapsulation overhead e.g. `--network-plugin-mtu=8873`.
|
||||
|
||||
This option is provided to the network-plugin; currently **only kubenet supports `network-plugin-mtu`**.
|
||||
|
||||
## Usage Summary
|
||||
|
||||
* `--network-plugin=cni` specifies that we use the `cni` network plugin with actual CNI plugin binaries located in `--cni-bin-dir` (default `/opt/cni/bin`) and CNI plugin configuration located in `--cni-conf-dir` (default `/etc/cni/net.d`).
|
||||
* `--network-plugin=kubenet` specifies that we use the `kubenet` network plugin with CNI `bridge` and `host-local` plugins placed in `/opt/cni/bin` or `network-plugin-dir`.
|
||||
* `--network-plugin-mtu=9001` specifies the MTU to use, currently only used by the `kubenet` network plugin.
|
|
@ -0,0 +1,215 @@
|
|||
---
|
||||
assignees:
|
||||
- thockin
|
||||
title: Cluster Networking
|
||||
---
|
||||
|
||||
Kubernetes approaches networking somewhat differently than Docker does by
|
||||
default. There are 4 distinct networking problems to solve:
|
||||
|
||||
1. Highly-coupled container-to-container communications: this is solved by
|
||||
[pods](/docs/user-guide/pods/) and `localhost` communications.
|
||||
2. Pod-to-Pod communications: this is the primary focus of this document.
|
||||
3. Pod-to-Service communications: this is covered by [services](/docs/user-guide/services/).
|
||||
4. External-to-Service communications: this is covered by [services](/docs/user-guide/services/).
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
Kubernetes assumes that pods can communicate with other pods, regardless of
|
||||
which host they land on. We give every pod its own IP address so you do not
|
||||
need to explicitly create links between pods and you almost never need to deal
|
||||
with mapping container ports to host ports. This creates a clean,
|
||||
backwards-compatible model where pods can be treated much like VMs or physical
|
||||
hosts from the perspectives of port allocation, naming, service discovery, load
|
||||
balancing, application configuration, and migration.
|
||||
|
||||
To achieve this we must impose some requirements on how you set up your cluster
|
||||
networking.
|
||||
|
||||
## Docker model
|
||||
|
||||
Before discussing the Kubernetes approach to networking, it is worthwhile to
|
||||
review the "normal" way that networking works with Docker. By default, Docker
|
||||
uses host-private networking. It creates a virtual bridge, called `docker0` by
|
||||
default, and allocates a subnet from one of the private address blocks defined
|
||||
in [RFC1918](https://tools.ietf.org/html/rfc1918) for that bridge. For each
|
||||
container that Docker creates, it allocates a virtual ethernet device (called
|
||||
`veth`) which is attached to the bridge. The veth is mapped to appear as `eth0`
|
||||
in the container, using Linux namespaces. The in-container `eth0` interface is
|
||||
given an IP address from the bridge's address range.
|
||||
|
||||
The result is that Docker containers can talk to other containers only if they
|
||||
are on the same machine (and thus the same virtual bridge). Containers on
|
||||
different machines cannot reach each other - in fact they may end up with the
|
||||
exact same network ranges and IP addresses.
|
||||
|
||||
In order for Docker containers to communicate across nodes, they must be
|
||||
allocated ports on the machine's own IP address, which are then forwarded or
|
||||
proxied to the containers. This obviously means that containers must either
|
||||
coordinate which ports they use very carefully or else be allocated ports
|
||||
dynamically.
|
||||
|
||||
## Kubernetes model
|
||||
|
||||
Coordinating ports across multiple developers is very difficult to do at
|
||||
scale and exposes users to cluster-level issues outside of their control.
|
||||
Dynamic port allocation brings a lot of complications to the system - every
|
||||
application has to take ports as flags, the API servers have to know how to
|
||||
insert dynamic port numbers into configuration blocks, services have to know
|
||||
how to find each other, etc. Rather than deal with this, Kubernetes takes a
|
||||
different approach.
|
||||
|
||||
Kubernetes imposes the following fundamental requirements on any networking
|
||||
implementation (barring any intentional network segmentation policies):
|
||||
|
||||
* all containers can communicate with all other containers without NAT
|
||||
* all nodes can communicate with all containers (and vice-versa) without NAT
|
||||
* the IP that a container sees itself as is the same IP that others see it as
|
||||
|
||||
What this means in practice is that you cannot just take two computers
|
||||
running Docker and expect Kubernetes to work. You must ensure that the
|
||||
fundamental requirements are met.
|
||||
|
||||
This model is not only less complex overall, but it is principally compatible
|
||||
with the desire for Kubernetes to enable low-friction porting of apps from VMs
|
||||
to containers. If your job previously ran in a VM, your VM had an IP and could
|
||||
talk to other VMs in your project. This is the same basic model.
|
||||
|
||||
Until now this document has talked about containers. In reality, Kubernetes
|
||||
applies IP addresses at the `Pod` scope - containers within a `Pod` share their
|
||||
network namespaces - including their IP address. This means that containers
|
||||
within a `Pod` can all reach each other's ports on `localhost`. This does imply
|
||||
that containers within a `Pod` must coordinate port usage, but this is no
|
||||
different than processes in a VM. We call this the "IP-per-pod" model. This
|
||||
is implemented in Docker as a "pod container" which holds the network namespace
|
||||
open while "app containers" (the things the user specified) join that namespace
|
||||
with Docker's `--net=container:<id>` function.
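Outside of Kubernetes, the same mechanism can be sketched directly with Docker (the image and container names here are purely illustrative; this is not how Kubernetes itself launches pods):

```shell
# Start a container that owns a network namespace, then join it from a second container;
# both now share the same IP and can reach each other on localhost.
docker run -d --name netns-holder nginx
docker run --rm --net=container:netns-holder busybox ip addr
```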
|
||||
|
||||
As with Docker, it is possible to request host ports, but this is reduced to a
|
||||
very niche operation. In this case a port will be allocated on the host `Node`
|
||||
and traffic will be forwarded to the `Pod`. The `Pod` itself is blind to the
|
||||
existence or non-existence of host ports.
|
||||
|
||||
## How to achieve this
|
||||
|
||||
There are a number of ways that this network model can be implemented. This
|
||||
document is not an exhaustive study of the various methods, but hopefully serves
|
||||
as an introduction to various technologies and serves as a jumping-off point.
|
||||
|
||||
The following networking options are sorted alphabetically - the order does not
|
||||
imply any preferential status.
|
||||
|
||||
### Contiv
|
||||
|
||||
[Contiv](https://github.com/contiv/netplugin) provides configurable networking (native l3 using BGP, overlay using vxlan, classic l2, or Cisco-SDN/ACI) for various use cases. [Contiv](http://contiv.io) is fully open source.
|
||||
|
||||
### Flannel
|
||||
|
||||
[Flannel](https://github.com/coreos/flannel#flannel) is a very simple overlay
|
||||
network that satisfies the Kubernetes requirements. Many
|
||||
people have reported success with Flannel and Kubernetes.
|
||||
|
||||
### Google Compute Engine (GCE)
|
||||
|
||||
For the Google Compute Engine cluster configuration scripts, we use [advanced
|
||||
routing](https://cloud.google.com/compute/docs/networking#routing) to
|
||||
assign each VM a subnet (default is `/24` - 254 IPs). Any traffic bound for that
|
||||
subnet will be routed directly to the VM by the GCE network fabric. This is in
|
||||
addition to the "main" IP address assigned to the VM, which is NAT'ed for
|
||||
outbound internet access. A Linux bridge (called `cbr0`) is configured to exist
|
||||
on that subnet, and is passed to docker's `--bridge` flag.
|
||||
|
||||
We start Docker with:
|
||||
|
||||
```shell
|
||||
DOCKER_OPTS="--bridge=cbr0 --iptables=false --ip-masq=false"
|
||||
```
|
||||
|
||||
This bridge is created by Kubelet (controlled by the `--network-plugin=kubenet`
|
||||
flag) according to the `Node`'s `spec.podCIDR`.
|
||||
|
||||
Docker will now allocate IPs from the `cbr-cidr` block. Containers can reach
|
||||
each other and `Nodes` over the `cbr0` bridge. Those IPs are all routable
|
||||
within the GCE project network.
|
||||
|
||||
GCE itself does not know anything about these IPs, though, so it will not NAT
|
||||
them for outbound internet traffic. To achieve that we use an iptables rule to
|
||||
masquerade (aka SNAT - to make it seem as if packets came from the `Node`
|
||||
itself) traffic that is bound for IPs outside the GCE project network
|
||||
(10.0.0.0/8).
|
||||
|
||||
```shell
|
||||
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE
|
||||
```
|
||||
|
||||
Lastly we enable IP forwarding in the kernel (so the kernel will process
|
||||
packets for bridged containers):
|
||||
|
||||
```shell
|
||||
sysctl net.ipv4.ip_forward=1
|
||||
```
|
||||
|
||||
The result of all this is that all `Pods` can reach each other and can egress
|
||||
traffic to the internet.
|
||||
|
||||
### L2 networks and linux bridging
|
||||
|
||||
If you have a "dumb" L2 network, such as a simple switch in a "bare-metal"
|
||||
environment, you should be able to do something similar to the above GCE setup.
|
||||
Note that these instructions have only been tried very casually - it seems to
|
||||
work, but has not been thoroughly tested. If you use this technique and
|
||||
perfect the process, please let us know.
|
||||
|
||||
Follow the "With Linux Bridge devices" section of [this very nice
|
||||
tutorial](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/) from
|
||||
Lars Kellogg-Stedman.
|
||||
|
||||
### Nuage Networks VCS (Virtualized Cloud Services)
|
||||
|
||||
[Nuage](http://www.nuagenetworks.net) provides a highly scalable policy-based Software-Defined Networking (SDN) platform. Nuage uses the open source Open vSwitch for the data plane along with a feature rich SDN Controller built on open standards.
|
||||
|
||||
The Nuage platform uses overlays to provide seamless policy-based networking between Kubernetes Pods and non-Kubernetes environments (VMs and bare metal servers). Nuage's policy abstraction model is designed with applications in mind and makes it easy to declare fine-grained policies for applications. The platform's real-time analytics engine enables visibility and security monitoring for Kubernetes applications.
|
||||
|
||||
### OpenVSwitch
|
||||
|
||||
[OpenVSwitch](/docs/admin/ovs-networking) is a somewhat more mature but also
|
||||
complicated way to build an overlay network. This is endorsed by several of the
|
||||
"Big Shops" for networking.
|
||||
|
||||
### OVN (Open Virtual Networking)
|
||||
|
||||
OVN is an open source network virtualization solution developed by the
|
||||
Open vSwitch community. It lets one create logical switches, logical routers,
|
||||
stateful ACLs, load balancers, etc., to build different virtual networking
|
||||
topologies. The project has a specific Kubernetes plugin and documentation
|
||||
at [ovn-kubernetes](https://github.com/openvswitch/ovn-kubernetes).
|
||||
|
||||
### Project Calico
|
||||
|
||||
[Project Calico](http://docs.projectcalico.org/) is an open source container networking provider and network policy engine.
|
||||
|
||||
Calico provides a highly scalable networking and network policy solution for connecting Kubernetes pods based on the same IP networking principles as the internet. Calico can be deployed without encapsulation or overlays to provide high-performance, high-scale data center networking. Calico also provides fine-grained, intent based network security policy for Kubernetes pods via its distributed firewall.
|
||||
|
||||
Calico can also be run in policy enforcement mode in conjunction with other networking solutions such as Flannel, aka [canal](https://github.com/tigera/canal), or native GCE networking.
|
||||
|
||||
### Romana
|
||||
|
||||
[Romana](http://romana.io) is an open source network and security automation solution that lets you deploy Kubernetes without an overlay network. Romana supports Kubernetes [Network Policy](/docs/user-guide/networkpolicies/) to provide isolation across network namespaces.
|
||||
|
||||
### Weave Net from Weaveworks
|
||||
|
||||
[Weave Net](https://www.weave.works/products/weave-net/) is a
|
||||
resilient and simple to use network for Kubernetes and its hosted applications.
|
||||
Weave Net runs as a [CNI plug-in](https://www.weave.works/docs/net/latest/cni-plugin/)
|
||||
or stand-alone. In either version, it doesn't require any configuration or extra code
|
||||
to run, and in both cases, the network provides one IP address per pod - as is standard for Kubernetes.
|
||||
|
||||
## Other reading
|
||||
|
||||
The early design of the networking model and its rationale, and some future
|
||||
plans are described in more detail in the [networking design
|
||||
document](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/networking.md).
|
|
@ -0,0 +1,29 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: my-nginx-svc
|
||||
labels:
|
||||
app: nginx
|
||||
spec:
|
||||
type: LoadBalancer
|
||||
ports:
|
||||
- port: 80
|
||||
selector:
|
||||
app: nginx
|
||||
---
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: my-nginx
|
||||
spec:
|
||||
replicas: 3
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: nginx
|
||||
spec:
|
||||
containers:
|
||||
- name: nginx
|
||||
image: nginx:1.7.9
|
||||
ports:
|
||||
- containerPort: 80
|
|
@ -0,0 +1,368 @@
|
|||
---
|
||||
assignees:
|
||||
- derekwaynecarr
|
||||
- vishh
|
||||
- timstclair
|
||||
title: Configuring Out Of Resource Handling
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
The `kubelet` needs to preserve node stability when available compute resources are low.
|
||||
|
||||
This is especially important when dealing with incompressible resources such as memory or disk.
|
||||
|
||||
If either resource is exhausted, the node would become unstable.
|
||||
|
||||
## Eviction Policy
|
||||
|
||||
The `kubelet` can proactively monitor for and prevent total starvation of a compute resource. In those cases, the `kubelet` can proactively fail one or more pods in order to reclaim
|
||||
the starved resource. When the `kubelet` fails a pod, it terminates all containers in the pod, and the `PodPhase`
|
||||
is transitioned to `Failed`.
|
||||
|
||||
### Eviction Signals
|
||||
|
||||
The `kubelet` can support the ability to trigger eviction decisions on the signals described in the
|
||||
table below. The value of each signal is described in the description column based on the `kubelet`
|
||||
summary API.
|
||||
|
||||
| Eviction Signal | Description |
|
||||
|----------------------------|-----------------------------------------------------------------------|
|
||||
| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` |
|
||||
| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` |
|
||||
| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` |
|
||||
| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` |
|
||||
| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` |
|
||||
|
||||
Each of the above signals supports either a literal or percentage based value. The percentage based value
|
||||
is calculated relative to the total capacity associated with each signal.
|
||||
|
||||
`kubelet` supports only two filesystem partitions.
|
||||
|
||||
1. The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
|
||||
1. The `imagefs` filesystem that container runtimes use for storing images and container writable layers.
|
||||
|
||||
`imagefs` is optional. `kubelet` auto-discovers these filesystems using cAdvisor. `kubelet` does not care about any
|
||||
other filesystems. Any other types of configurations are not currently supported by the kubelet. For example, it is
|
||||
*not OK* to store volumes and logs in a dedicated `filesystem`.
|
||||
|
||||
In future releases, the `kubelet` will deprecate the existing [garbage collection](/docs/admin/garbage-collection/)
|
||||
support in favor of eviction in response to disk pressure.
|
||||
|
||||
### Eviction Thresholds
|
||||
|
||||
The `kubelet` supports the ability to specify eviction thresholds that trigger the `kubelet` to reclaim resources.
|
||||
|
||||
Each threshold is of the following form:
|
||||
|
||||
`<eviction-signal><operator><quantity>`
|
||||
|
||||
* valid `eviction-signal` tokens as defined above.
|
||||
* valid `operator` tokens are `<`
|
||||
* valid `quantity` tokens must match the quantity representation used by Kubernetes
|
||||
* an eviction threshold can be expressed as a percentage if it ends with a `%` token.
|
||||
|
||||
For example, if a node has `10Gi` of memory, and the desire is to induce eviction
|
||||
if available memory falls below `1Gi`, an eviction threshold can be specified as either
|
||||
of the following (but not both).
|
||||
|
||||
* `memory.available<10%`
|
||||
* `memory.available<1Gi`
|
||||
|
||||
#### Soft Eviction Thresholds
|
||||
|
||||
A soft eviction threshold pairs an eviction threshold with a required
|
||||
administrator specified grace period. No action is taken by the `kubelet`
|
||||
to reclaim resources associated with the eviction signal until that grace
|
||||
period has been exceeded. If no grace period is provided, the `kubelet` will
|
||||
error on startup.
|
||||
|
||||
In addition, if a soft eviction threshold has been met, an operator can
|
||||
specify a maximum allowed pod termination grace period to use when evicting
|
||||
pods from the node. If specified, the `kubelet` will use the lesser value among
|
||||
the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period.
|
||||
If not specified, the `kubelet` will kill pods immediately with no graceful
|
||||
termination.
|
||||
|
||||
To configure soft eviction thresholds, the following flags are supported (a combined example invocation is sketched after this list):
|
||||
|
||||
* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a
|
||||
corresponding grace period would trigger a pod eviction.
|
||||
* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that
|
||||
correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
|
||||
* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating
|
||||
pods in response to a soft eviction threshold being met.
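A sketch of such an invocation, reusing the example values above (these are illustrations, not recommendations):

```shell
kubelet --eviction-soft="memory.available<1.5Gi" \
  --eviction-soft-grace-period="memory.available=1m30s" \
  --eviction-max-pod-grace-period=30 ...
```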
|
||||
|
||||
#### Hard Eviction Thresholds
|
||||
|
||||
A hard eviction threshold has no grace period, and if observed, the `kubelet`
|
||||
will take immediate action to reclaim the associated starved resource. If a
|
||||
hard eviction threshold is met, the `kubelet` will kill the pod immediately
|
||||
with no graceful termination.
|
||||
|
||||
To configure hard eviction thresholds, the following flag is supported:
|
||||
|
||||
* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met
|
||||
would trigger a pod eviction.
|
||||
|
||||
The `kubelet` has the following default hard eviction thresholds:
|
||||
|
||||
* `--eviction-hard=memory.available<100Mi`
|
||||
|
||||
### Eviction Monitoring Interval
|
||||
|
||||
The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.
|
||||
|
||||
* `housekeeping-interval` is the interval between container housekeepings.
|
||||
|
||||
### Node Conditions
|
||||
|
||||
The `kubelet` will map one or more eviction signals to a corresponding node condition.
|
||||
|
||||
If a hard eviction threshold has been met, or a soft eviction threshold has been met
|
||||
independent of its associated grace period, the `kubelet` will report a condition that
|
||||
reflects the node is under pressure.
|
||||
|
||||
The following node conditions are defined that correspond to the specified eviction signal.
|
||||
|
||||
| Node Condition | Eviction Signal | Description |
|
||||
|-------------------------|-------------------------------|--------------------------------------------|
|
||||
| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
|
||||
| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
|
||||
|
||||
The `kubelet` will continue to report node status updates at the frequency specified by
|
||||
`--node-status-update-frequency` which defaults to `10s`.
|
||||
|
||||
### Oscillation of node conditions
|
||||
|
||||
If a node is oscillating above and below a soft eviction threshold, but not exceeding
|
||||
its associated grace period, it would cause the corresponding node condition to
|
||||
constantly oscillate between true and false, and could cause poor scheduling decisions
|
||||
as a consequence.
|
||||
|
||||
To protect against this oscillation, the following flag is defined to control how
|
||||
long the `kubelet` must wait before transitioning out of a pressure condition.
|
||||
|
||||
* `eviction-pressure-transition-period` is the duration for which the `kubelet` has
|
||||
to wait before transitioning out of an eviction pressure condition.
|
||||
|
||||
The `kubelet` would ensure that it has not observed an eviction threshold being met
|
||||
for the specified pressure condition for the period specified before toggling the
|
||||
condition back to `false`.
|
||||
|
||||
### Reclaiming node level resources
|
||||
|
||||
If an eviction threshold has been met and the grace period has passed,
|
||||
the `kubelet` will initiate the process of reclaiming the pressured resource
|
||||
until it has observed the signal has gone below its defined threshold.
|
||||
|
||||
The `kubelet` attempts to reclaim node level resources prior to evicting end-user pods. If
|
||||
disk pressure is observed, the `kubelet` reclaims node level resources differently if the
|
||||
machine has a dedicated `imagefs` configured for the container runtime.
|
||||
|
||||
#### With Imagefs
|
||||
|
||||
If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
|
||||
|
||||
1. Delete dead pods/containers
|
||||
|
||||
If `imagefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
|
||||
|
||||
1. Delete all unused images
|
||||
|
||||
#### Without Imagefs
|
||||
|
||||
If `nodefs` filesystem has met eviction thresholds, `kubelet` will free up disk space in the following order:
|
||||
|
||||
1. Delete dead pods/containers
|
||||
1. Delete all unused images
|
||||
|
||||
### Evicting end-user pods
|
||||
|
||||
If the `kubelet` is unable to reclaim sufficient resource on the node,
|
||||
it will begin evicting pods.
|
||||
|
||||
The `kubelet` ranks pods for eviction as follows:
|
||||
|
||||
* by their quality of service
|
||||
* by the consumption of the starved compute resource relative to the pod's scheduling request.
|
||||
|
||||
As a result, pod eviction occurs in the following order:
|
||||
|
||||
* `BestEffort` pods that consume the most of the starved resource are failed
|
||||
first.
|
||||
* `Burstable` pods that consume the greatest amount of the starved resource
|
||||
relative to their request for that resource are killed first. If no pod
|
||||
has exceeded its request, the strategy targets the largest consumer of the
|
||||
starved resource.
|
||||
* `Guaranteed` pods that consume the greatest amount of the starved resource
|
||||
relative to their request are killed first. If no pod has exceeded its request,
|
||||
the strategy targets the largest consumer of the starved resource.
|
||||
|
||||
A `Guaranteed` pod is guaranteed to never be evicted because of another pod's
|
||||
resource consumption. If a system daemon (i.e. `kubelet`, `docker`, `journald`, etc.)
|
||||
is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations,
|
||||
and the node only has `Guaranteed` pod(s) remaining, then the node must choose to evict a
|
||||
`Guaranteed` pod in order to preserve node stability, and to limit the impact
|
||||
of the unexpected consumption to other `Guaranteed` pod(s).
|
||||
|
||||
Local disk is a `BestEffort` resource. If necessary, `kubelet` will evict pods one at a time to reclaim
|
||||
disk when `DiskPressure` is encountered. The `kubelet` will rank pods by quality of service. If the `kubelet`
|
||||
is responding to `inode` starvation, it will reclaim `inodes` by evicting pods with the lowest quality of service
|
||||
first. If the `kubelet` is responding to lack of available disk, it will rank pods within a quality of service
|
||||
class by the amount of disk they consume, and kill the largest consumers first.
|
||||
|
||||
#### With Imagefs
|
||||
|
||||
If `nodefs` is triggering evictions, `kubelet` will sort pods based on their usage of `nodefs` (local volumes + logs of all their containers).
|
||||
|
||||
If `imagefs` is triggering evictions, `kubelet` will sort pods based on the writable layer usage of all its containers.
|
||||
|
||||
#### Without Imagefs
|
||||
|
||||
If `nodefs` is triggering evictions, `kubelet` will sort pods based on their total disk usage (local volumes + logs & writable layer of all their containers).
|
||||
|
||||
### Minimum eviction reclaim
|
||||
|
||||
In certain scenarios, eviction of pods could result in reclamation of only a small amount of resources. This can result in
|
||||
`kubelet` hitting eviction thresholds in repeated succession. In addition to that, reclaiming a resource like `disk`
|
||||
is time consuming.
|
||||
|
||||
To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`. Whenever `kubelet` observes
|
||||
resource pressure, `kubelet` will attempt to reclaim at least `minimum-reclaim` amount of resource below
|
||||
the configured eviction threshold.
|
||||
|
||||
For example, with the following configuration:
|
||||
|
||||
```
|
||||
--eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
|
||||
--eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"
|
||||
```
|
||||
|
||||
If an eviction threshold is triggered for `memory.available`, the `kubelet` will work to ensure
|
||||
that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` will work
|
||||
to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it will
|
||||
work to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
|
||||
on their associated resources.
|
||||
|
||||
The default `eviction-minimum-reclaim` is `0` for all resources.
|
||||
|
||||
### Scheduler
|
||||
|
||||
The node will report a condition when a compute resource is under pressure. The
|
||||
scheduler views that condition as a signal to dissuade placing additional
|
||||
pods on the node.
|
||||
|
||||
| Node Condition | Scheduler Behavior |
|
||||
| ---------------- | ------------------------------------------------ |
|
||||
| `MemoryPressure` | No new `BestEffort` pods are scheduled to the node. |
|
||||
| `DiskPressure` | No new pods are scheduled to the node. |
|
||||
|
||||
## Node OOM Behavior
|
||||
|
||||
If the node experiences a system OOM (out of memory) event prior to the `kubelet` being able to reclaim memory,
|
||||
the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respond.
|
||||
|
||||
The `kubelet` sets an `oom_score_adj` value for each container based on the quality of service for the pod.
|
||||
|
||||
| Quality of Service | oom_score_adj |
|
||||
|----------------------------|-----------------------------------------------------------------------|
|
||||
| `Guaranteed` | -998 |
|
||||
| `BestEffort` | 1000 |
|
||||
| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
|
||||
|
||||
If the `kubelet` is unable to reclaim memory prior to a node experiencing system OOM, the `oom_killer` will calculate
|
||||
an `oom_score` based on the percentage of memory it's using on the node, and then add the `oom_score_adj` to get an
|
||||
effective `oom_score` for the container, and then kills the container with the highest score.
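As a hypothetical illustration, a `Burstable` pod container requesting `4Gi` of memory on a node with `10Gi` of capacity gets `oom_score_adj = min(max(2, 1000 - (1000 * 4Gi) / 10Gi), 999) = 600`, so under a system OOM it would be killed before any `Guaranteed` container (`-998`) but after any `BestEffort` container (`1000`), all else being equal.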
|
||||
|
||||
The intended behavior should be that containers with the lowest quality of service that
|
||||
are consuming the largest amount of memory relative to the scheduling request should be killed first in order
|
||||
to reclaim memory.
|
||||
|
||||
Unlike pod eviction, if a pod container is OOM killed, it may be restarted by the `kubelet` based on its `RestartPolicy`.
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Schedulable resources and eviction policies
|
||||
|
||||
Let's imagine the following scenario:
|
||||
|
||||
* Node memory capacity: `10Gi`
|
||||
* Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
|
||||
* Operator wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.
|
||||
|
||||
To facilitate this scenario, the `kubelet` would be launched as follows:
|
||||
|
||||
```
|
||||
--eviction-hard=memory.available<500Mi
|
||||
--system-reserved=memory=1.5Gi
|
||||
```
|
||||
|
||||
Implicit in this configuration is the understanding that "System reserved" should include the amount of memory
|
||||
covered by the eviction threshold.
|
||||
|
||||
To reach the eviction threshold, either some pod must be using more than its request, or the system must be using more than `1Gi` (its reservation of `1.5Gi` minus the `500Mi` eviction threshold).
|
||||
|
||||
This configuration will ensure that the scheduler does not place pods on a node that immediately induce memory pressure
|
||||
and trigger eviction, assuming those pods use less than their configured request.
|
||||
|
||||
### DaemonSet
|
||||
|
||||
It is never desired for a `kubelet` to evict a pod that was derived from
|
||||
a `DaemonSet` since the pod will immediately be recreated and rescheduled
|
||||
back to the same node.
|
||||
|
||||
At the moment, the `kubelet` has no ability to distinguish a pod created
|
||||
from `DaemonSet` versus any other object. If/when that information is
|
||||
available, the `kubelet` could pro-actively filter those pods from the
|
||||
candidate set of pods provided to the eviction strategy.
|
||||
|
||||
In general, it is strongly recommended that `DaemonSet` not
|
||||
create `BestEffort` pods to avoid being identified as a candidate pod
|
||||
for eviction. Instead `DaemonSet` should ideally launch `Guaranteed` pods.
|
||||
|
||||
## Deprecation of existing feature flags to reclaim disk
|
||||
|
||||
`kubelet` has been freeing up disk space on demand to keep the node stable.
|
||||
|
||||
As disk based eviction matures, the following `kubelet` flags will be marked for deprecation
|
||||
in favor of the simpler configuration supported around eviction.
|
||||
|
||||
| Existing Flag | New Flag |
|
||||
| ------------- | -------- |
|
||||
| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` |
|
||||
| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
|
||||
| `--maximum-dead-containers` | deprecated |
|
||||
| `--maximum-dead-containers-per-container` | deprecated |
|
||||
| `--minimum-container-ttl-duration` | deprecated |
|
||||
| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` |
|
||||
| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |
|
||||
|
||||
## Known issues
|
||||
|
||||
### kubelet may not observe memory pressure right away
|
||||
|
||||
The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
|
||||
increases within that window rapidly, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
|
||||
will still be invoked. We intend to integrate with the `memcg` notification API in a future release to reduce this
|
||||
latency, and instead have the kernel tell us when a threshold has been crossed immediately.
|
||||
|
||||
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
|
||||
this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
|
||||
to prevent system OOMs, and promote eviction of workloads so cluster state can rebalance.
|
||||
|
||||
### kubelet may evict more pods than needed
|
||||
|
||||
The pod eviction may evict more pods than needed due to a stats collection timing gap. This can be mitigated by adding
|
||||
the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future.
|
||||
|
||||
### How kubelet ranks pods for eviction in response to inode exhaustion
|
||||
|
||||
At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
|
||||
inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cadvisor
|
||||
to track per container inode consumption (https://github.com/google/cadvisor/issues/1422) which would allow us to rank pods
|
||||
by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict
|
||||
that pod over others.
|
|
@ -0,0 +1,128 @@
|
|||
---
|
||||
assignees:
|
||||
- jsafrane
|
||||
title: Static Pods
|
||||
---
|
||||
|
||||
**If you are running clustered Kubernetes and are using static pods to run a pod on every node, you should probably be using a [DaemonSet](/docs/admin/daemons/)!**
|
||||
|
||||
*Static pods* are managed directly by the kubelet daemon on a specific node, without the API server observing them. They do not have an associated replication controller; the kubelet daemon itself watches them and restarts them when they crash. There is no health check though. Static pods are always bound to one kubelet daemon and always run on the same node with it.
|
||||
|
||||
The kubelet automatically creates a so-called *mirror pod* on the Kubernetes API server for each static pod, so the pods are visible there, but they cannot be controlled from the API server.
|
||||
|
||||
## Static pod creation
|
||||
|
||||
Static pods can be created in two ways: either by using configuration file(s) or by HTTP.
|
||||
|
||||
### Configuration files
|
||||
|
||||
The configuration files are just standard pod definitions in JSON or YAML format placed in a specific directory. Use `kubelet --pod-manifest-path=<the directory>` to start the kubelet daemon, which periodically scans the directory and creates/deletes static pods as YAML/JSON files appear/disappear there.
|
||||
|
||||
For example, this is how to start a simple web server as a static pod:
|
||||
|
||||
1. Choose a node where we want to run the static pod. In this example, it's `my-node1`.
|
||||
|
||||
```
|
||||
[joe@host ~] $ ssh my-node1
|
||||
```
|
||||
|
||||
2. Choose a directory, say `/etc/kubelet.d` and place a web server pod definition there, e.g. `/etc/kubelet.d/static-web.yaml`:
|
||||
|
||||
```
|
||||
[root@my-node1 ~] $ mkdir /etc/kubelet.d/
|
||||
[root@my-node1 ~] $ cat <<EOF >/etc/kubelet.d/static-web.yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: static-web
|
||||
labels:
|
||||
role: myrole
|
||||
spec:
|
||||
containers:
|
||||
- name: web
|
||||
image: nginx
|
||||
ports:
|
||||
- name: web
|
||||
containerPort: 80
|
||||
protocol: TCP
|
||||
EOF
|
||||
```
|
||||
|
||||
3. Configure your kubelet daemon on the node to use this directory by running it with `--pod-manifest-path=/etc/kubelet.d/` argument.
|
||||
On Fedora edit `/etc/kubernetes/kubelet` to include this line:
|
||||
|
||||
```
|
||||
KUBELET_ARGS="--cluster-dns=10.254.0.10 --cluster-domain=kube.local --pod-manifest-path=/etc/kubelet.d/"
|
||||
```
|
||||
|
||||
Instructions for other distributions or Kubernetes installations may vary.
|
||||
|
||||
4. Restart kubelet. On Fedora, this is:
|
||||
|
||||
```
|
||||
[root@my-node1 ~] $ systemctl restart kubelet
|
||||
```
|
||||
|
||||
## Pods created via HTTP
|
||||
|
||||
The kubelet periodically downloads a file specified by the `--manifest-url=<URL>` argument and interprets it as a JSON/YAML file with a pod definition. It works the same way as `--pod-manifest-path=<directory>`, i.e. the file is reloaded periodically and changes are applied to running static pods (see below).
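For example, the kubelet could be pointed at a manifest hosted on an internal web server (the URL here is hypothetical):

```shell
kubelet --manifest-url=http://my-config-server.example/static-web.yaml ...
```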
|
||||
|
||||
## Behavior of static pods
|
||||
|
||||
When the kubelet starts, it automatically starts all pods defined in the directory specified by the `--pod-manifest-path=` argument or at the URL specified by `--manifest-url=`, i.e. our static-web pod. (It may take some time to pull the nginx image, so be patient…):
|
||||
|
||||
```shell
|
||||
[joe@my-node1 ~] $ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
|
||||
f6d05272b57e nginx:latest "nginx" 8 minutes ago Up 8 minutes k8s_web.6f802af4_static-web-fk-node1_default_67e24ed9466ba55986d120c867395f3c_378e5f3c
|
||||
```
|
||||
|
||||
If we look at our Kubernetes API server (running on host `my-master`), we see that a new mirror-pod was created there too:
|
||||
|
||||
```shell
|
||||
[joe@host ~] $ ssh my-master
|
||||
[joe@my-master ~] $ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
static-web-my-node1 1/1 Running 0 2m
|
||||
|
||||
```
|
||||
|
||||
Labels from the static pod are propagated into the mirror-pod and can be used as usual for filtering.
|
||||
|
||||
Notice that we cannot delete the pod via the API server (e.g. with the [`kubectl`](/docs/user-guide/kubectl/) command); the kubelet simply won't remove it.
|
||||
|
||||
```shell
|
||||
[joe@my-master ~] $ kubectl delete pod static-web-my-node1
|
||||
pods/static-web-my-node1
|
||||
[joe@my-master ~] $ kubectl get pods
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
static-web-my-node1 1/1 Running 0 12s
|
||||
|
||||
```
|
||||
|
||||
Back on our `my-node1` host, we can try to stop the container manually and see that the kubelet automatically restarts it after a short while:
|
||||
|
||||
```shell
|
||||
[joe@host ~] $ ssh my-node1
|
||||
[joe@my-node1 ~] $ docker stop f6d05272b57e
|
||||
[joe@my-node1 ~] $ sleep 20
|
||||
[joe@my-node1 ~] $ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED ...
|
||||
5b920cbaf8b1 nginx:latest "nginx -g 'daemon of 2 seconds ago ...
|
||||
```
|
||||
|
||||
## Dynamic addition and removal of static pods
|
||||
|
||||
The running kubelet periodically scans the configured directory (`/etc/kubelet.d` in our example) for changes and adds/removes pods as files appear/disappear in this directory.
|
||||
|
||||
```shell
|
||||
[joe@my-node1 ~] $ mv /etc/kubelet.d/static-web.yaml /tmp
|
||||
[joe@my-node1 ~] $ sleep 20
|
||||
[joe@my-node1 ~] $ docker ps
|
||||
// no nginx container is running
|
||||
[joe@my-node1 ~] $ mv /tmp/static-web.yaml /etc/kubelet.d/
|
||||
[joe@my-node1 ~] $ sleep 20
|
||||
[joe@my-node1 ~] $ docker ps
|
||||
CONTAINER ID IMAGE COMMAND CREATED ...
|
||||
e7a62e3427f1 nginx:latest "nginx -g 'daemon of 27 seconds ago
|
||||
```
|
|
@ -0,0 +1,122 @@
|
|||
---
|
||||
assignees:
|
||||
- sttts
|
||||
title: Using Sysctls in a Kubernetes Cluster
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
This document describes how sysctls are used within a Kubernetes cluster.
|
||||
|
||||
## What is a Sysctl?
|
||||
|
||||
In Linux, the sysctl interface allows an administrator to modify kernel
|
||||
parameters at runtime. Parameters are available via the `/proc/sys/` virtual
|
||||
process file system. The parameters cover various subsystems such as:
|
||||
|
||||
- kernel (common prefix: `kernel.`)
|
||||
- networking (common prefix: `net.`)
|
||||
- virtual memory (common prefix: `vm.`)
|
||||
- MDADM (common prefix: `dev.`)
|
||||
- More subsystems are described in [Kernel docs](https://www.kernel.org/doc/Documentation/sysctl/README).
|
||||
|
||||
To get a list of all parameters, you can run
|
||||
|
||||
```
|
||||
$ sudo sysctl -a
|
||||
```
|
||||
|
||||
## Namespaced vs. Node-Level Sysctls
|
||||
|
||||
A number of sysctls are _namespaced_ in today's Linux kernels. This means that
|
||||
they can be set independently for each pod on a node. Being namespaced is a
|
||||
requirement for sysctls to be accessible in a pod context within Kubernetes.
|
||||
|
||||
The following sysctls are known to be _namespaced_:
|
||||
|
||||
- `kernel.shm*`,
|
||||
- `kernel.msg*`,
|
||||
- `kernel.sem`,
|
||||
- `fs.mqueue.*`,
|
||||
- `net.*`.
|
||||
|
||||
Sysctls which are not namespaced are called _node-level_ and must be set
|
||||
manually by the cluster admin, either by means of the underlying Linux
|
||||
distribution of the nodes (e.g. via `/etc/sysctl.conf`) or using a DaemonSet
|
||||
with privileged containers.
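For example, an administrator could set a node-level sysctl directly on a node (run as root; the parameter and value are purely illustrative):

```shell
sysctl -w vm.overcommit_memory=1
```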
|
||||
|
||||
**Note**: it is good practice to consider nodes with special sysctl settings as
|
||||
_tainted_ within a cluster, and only schedule pods onto them which need those
|
||||
sysctl settings. It is suggested to use the Kubernetes [_taints and toleration_
|
||||
feature](/docs/user-guide/kubectl/kubectl_taint.md) to implement this.
|
||||
|
||||
## Safe vs. Unsafe Sysctls
|
||||
|
||||
Sysctls are grouped into _safe_ and _unsafe_ sysctls. In addition to proper
|
||||
namespacing a _safe_ sysctl must be properly _isolated_ between pods on the same
|
||||
node. This means that setting a _safe_ sysctl for one pod
|
||||
|
||||
- must not have any influence on any other pod on the node
|
||||
- must not allow harming the node's health
|
||||
- must not allow gaining CPU or memory resources outside of the resource limits
|
||||
of a pod.
|
||||
|
||||
The vast majority of the _namespaced_ sysctls are not necessarily considered _safe_.
|
||||
|
||||
For Kubernetes 1.4, the following sysctls are supported in the _safe_ set:
|
||||
|
||||
- `kernel.shm_rmid_forced`,
|
||||
- `net.ipv4.ip_local_port_range`,
|
||||
- `net.ipv4.tcp_syncookies`.
|
||||
|
||||
This list will be extended in future Kubernetes versions when the kubelet
|
||||
supports better isolation mechanisms.
|
||||
|
||||
All _safe_ sysctls are enabled by default.
|
||||
|
||||
All _unsafe_ sysctls are disabled by default and must be allowed manually by the
|
||||
cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be
|
||||
scheduled, but will fail to launch.
|
||||
|
||||
**Warning**: Due to their nature of being _unsafe_, the use of _unsafe_ sysctls
|
||||
is at your own risk and can lead to severe problems such as incorrect behavior of
|
||||
containers, resource shortages, or the complete breakage of a node.
|
||||
|
||||
## Enabling Unsafe Sysctls
|
||||
|
||||
With the warning above in mind, the cluster admin can allow certain _unsafe_
|
||||
sysctls for very special situations such as high-performance or real-time
|
||||
application tuning. _Unsafe_ sysctls are enabled on a node-by-node basis with a
|
||||
flag of the kubelet, e.g.:
|
||||
|
||||
```shell
|
||||
$ kubelet --experimental-allowed-unsafe-sysctls 'kernel.msg*,net.ipv4.route.min_pmtu' ...
|
||||
```
|
||||
|
||||
Only _namespaced_ sysctls can be enabled this way.
|
||||
|
||||
## Setting Sysctls for a Pod
|
||||
|
||||
The sysctl feature is an alpha API in Kubernetes 1.4. Therefore, sysctls are set
|
||||
using annotations on pods. They apply to all containers in the same pod.
|
||||
|
||||
Here is an example, with different annotations for _safe_ and _unsafe_ sysctls:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: sysctl-example
|
||||
annotations:
|
||||
security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
|
||||
security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
|
||||
spec:
|
||||
...
|
||||
```
|
||||
|
||||
**Note**: a pod with the _unsafe_ sysctls specified above will fail to launch on
|
||||
any node which has not enabled those two _unsafe_ sysctls explicitly. As with
|
||||
_node-level_ sysctls it is recommended to use the [_taints and tolerations_
feature](/docs/user-guide/kubectl/kubectl_taint.md) or [labels on nodes](/docs/user-guide/labels.md)
to schedule those pods onto the right nodes.
|
|
@ -0,0 +1,119 @@
|
|||
---
|
||||
assignees:
|
||||
- mikedanese
|
||||
title: Configuration Best Practices
|
||||
---
|
||||
|
||||
This document highlights and consolidates in one place configuration best practices that are introduced throughout the user guide, the getting-started documentation, and the examples. This is a living document, so if you think of something that is not on this list but might be useful to others, please don't hesitate to file an issue or submit a PR.
|
||||
|
||||
## General Config Tips
|
||||
|
||||
- When defining configurations, specify the latest stable API version (currently v1).
|
||||
|
||||
- Configuration files should be stored in version control before being pushed to the cluster. This allows a configuration to be quickly rolled back if needed, and will aid with cluster re-creation and restoration if necessary.
|
||||
|
||||
- Write your configuration files using YAML rather than JSON. They can be used interchangeably in almost all scenarios, but YAML tends to be more user-friendly for config.
|
||||
|
||||
- Group related objects together in a single file where this makes sense. This format is often easier to manage than separate files. See the [guestbook-all-in-one.yaml](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/all-in-one/guestbook-all-in-one.yaml) file as an example of this syntax, and the short sketch after this list.
(Note also that many `kubectl` commands can be called on a directory, so you can also call
`kubectl create` on a directory of config files; see below for more detail.)
|
||||
|
||||
- Don't specify default values unnecessarily, in order to simplify and minimize configs, and to
|
||||
reduce errors. For example, omit the selector and labels in a `ReplicationController` if you want
|
||||
them to be the same as the labels in its `podTemplate`, since those fields are populated from the
|
||||
`podTemplate` labels by default. See the [guestbook app's](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/) .yaml files for some [examples](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/frontend-deployment.yaml) of this.
|
||||
|
||||
- Put an object description in an annotation to allow better introspection.
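
As a concrete illustration of the grouping and defaulting tips above, here is a minimal sketch of a single all-in-one file (resource names and the image tag are illustrative). The Service comes first so that it exists before the controller's pods, and the `ReplicationController` omits `.spec.selector` and its own labels, which then default from the pod template labels:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-nginx                 # illustrative name
spec:
  selector:
    app: my-nginx
  ports:
  - port: 80
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: my-nginx
spec:
  replicas: 2
  # No .spec.selector and no controller labels: both default from the
  # pod template labels below.
  template:
    metadata:
      labels:
        app: my-nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.11        # pin a tag rather than relying on :latest
        ports:
        - containerPort: 80
```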
|
||||
|
||||
|
||||
## "Naked" Pods vs Replication Controllers and Jobs
|
||||
|
||||
- If there is a viable alternative to naked pods (i.e., pods not bound to a [replication controller](/docs/user-guide/replication-controller)), go with the alternative. Naked pods will not be rescheduled in the
|
||||
event of node failure.
|
||||
|
||||
Replication controllers are almost always preferable to creating pods, except for some explicit
|
||||
[`restartPolicy: Never`](/docs/user-guide/pod-states/#restartpolicy) scenarios. A
|
||||
[Job](/docs/concepts/jobs/run-to-completion-finite-workloads/) object (currently in Beta) may also be appropriate.
|
||||
|
||||
|
||||
## Services
|
||||
|
||||
- It's typically best to create a [service](/docs/user-guide/services/) before corresponding [replication
|
||||
controllers](/docs/user-guide/replication-controller/), so that the scheduler can spread the pods comprising the
|
||||
service. You can also create a replication controller without specifying replicas (this will set
|
||||
replicas=1), create a service, then scale up the replication controller. This can be useful in
|
||||
ensuring that one replica works before creating lots of them.
|
||||
|
||||
- Don't use `hostPort` (which specifies the port number to expose on the host) unless absolutely
|
||||
necessary, e.g., for a node daemon. When you bind a Pod to a `hostPort`, there are a limited
|
||||
number of places that pod can be scheduled, due to port conflicts— you can only schedule as many
|
||||
such Pods as there are nodes in your Kubernetes cluster.
|
||||
|
||||
If you only need access to the port for debugging purposes, you can use the [kubectl proxy and apiserver proxy](/docs/user-guide/connecting-to-applications-proxy/) or [kubectl port-forward](/docs/user-guide/connecting-to-applications-port-forward/).
|
||||
You can use a [Service](/docs/user-guide/services/) object for external service access.
|
||||
If you do need to expose a pod's port on the host machine, consider using a [NodePort](/docs/user-guide/services/#type-nodeport) service before resorting to `hostPort`; a sketch appears after this list.
|
||||
|
||||
- Avoid using `hostNetwork`, for the same reasons as `hostPort`.
|
||||
|
||||
- Use _headless services_ for easy service discovery when you don't need kube-proxy load balancing.
|
||||
See [headless services](/docs/user-guide/services/#headless-services).
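
To make the `hostPort` and headless-service tips concrete, here is a minimal sketch (names and ports are illustrative) of a `NodePort` service that can replace a `hostPort`, followed by a headless service (`clusterIP: None`) for plain DNS-based discovery:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-nodeport          # illustrative name
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080             # the container port to reach
---
apiVersion: v1
kind: Service
metadata:
  name: my-app-headless
spec:
  clusterIP: None                # headless: no virtual IP, no kube-proxy load balancing
  selector:
    app: my-app
  ports:
  - port: 80
```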
|
||||
|
||||
## Using Labels
|
||||
|
||||
- Define and use [labels](/docs/user-guide/labels/) that identify __semantic attributes__ of your application or
|
||||
deployment. For example, instead of attaching a label to a set of pods to explicitly represent
|
||||
some service (e.g., `service: myservice`), or explicitly representing the replication
|
||||
controller managing the pods (e.g., `controller: mycontroller`), attach labels that identify
|
||||
semantic attributes, such as `{ app: myapp, tier: frontend, phase: test, deployment: v3 }`. This
|
||||
will let you select the object groups appropriate to the context— e.g., a service for all "tier:
|
||||
frontend" pods, or all "test" phase components of app "myapp". See the
|
||||
[guestbook](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/) app for an example of this approach, and the sketch after this list.
|
||||
|
||||
A service can be made to span multiple deployments, such as is done across [rolling updates](/docs/user-guide/kubectl/kubectl_rolling-update/), by simply omitting release-specific labels from its selector, rather than updating a service's selector to match the replication controller's selector fully.
|
||||
|
||||
- To facilitate rolling updates, include version info in replication controller names, e.g. as a
|
||||
suffix to the name. It is useful to set a 'version' label as well. The rolling update creates a
|
||||
new controller as opposed to modifying the existing controller, so version-agnostic controller
names will cause problems. See the [documentation](/docs/user-guide/kubectl/kubectl_rolling-update/) on
|
||||
the rolling-update command for more detail.
|
||||
|
||||
Note that the [Deployment](/docs/user-guide/deployments/) object obviates the need to manage replication
|
||||
controller 'version names'. A desired state of an object is described by a Deployment, and if
|
||||
changes to that spec are _applied_, the deployment controller changes the actual state to the
|
||||
desired state at a controlled rate. (Deployment objects are currently part of the [`extensions`
|
||||
API Group](/docs/api/#api-groups).)
|
||||
|
||||
- You can manipulate labels for debugging. Because Kubernetes replication controllers and services
|
||||
match to pods using labels, this allows you to remove a pod from being considered by a
|
||||
controller, or served traffic by a service, by removing the relevant selector labels. If you
|
||||
remove the labels of an existing pod, its controller will create a new pod to take its place.
|
||||
This is a useful way to debug a previously "live" pod in a quarantine environment. See the
|
||||
[`kubectl label`](/docs/user-guide/kubectl/kubectl_label/) command.
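
Picking up the semantic-label tip above, here is a minimal sketch (labels, names, and the image are illustrative): the pod carries the full set of semantic labels, while the service selects only the stable subset, so it keeps spanning releases as the `deployment` label changes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend-v3-example      # illustrative name
  labels:
    app: myapp
    tier: frontend
    phase: test
    deployment: v3
spec:
  containers:
  - name: web
    image: nginx:1.11
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-frontend
spec:
  selector:                      # release-specific labels deliberately omitted
    app: myapp
    tier: frontend
  ports:
  - port: 80
```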
|
||||
|
||||
## Container Images
|
||||
|
||||
- The [default container image pull policy](/docs/user-guide/images/) is `IfNotPresent`, which causes the
|
||||
[Kubelet](/docs/admin/kubelet/) to not pull an image if it already exists. If you would like to
|
||||
always force a pull, you must specify an image pull policy of `Always` in your .yaml file
|
||||
(`imagePullPolicy: Always`) or specify a `:latest` tag on your image.
|
||||
|
||||
That is, if you're specifying an image with a tag other than `:latest`, e.g. `myimage:v1`, and
|
||||
there is an image update to that same tag, the Kubelet won't pull the updated image. You can
|
||||
address this by ensuring that any updates to an image bump the image tag as well (e.g.
|
||||
`myimage:v2`), and ensuring that your configs point to the correct version.
|
||||
|
||||
**Note:** you should avoid using the `:latest` tag when deploying containers in production, because this makes it hard
|
||||
to track which version of the image is running and hard to roll back.
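
For example, here is a minimal container spec sketch reflecting these recommendations, assuming an illustrative `myimage:v2` image that is rebuilt under a new tag for every release:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pull-policy-example      # illustrative name
spec:
  containers:
  - name: app
    image: myimage:v2            # bump the tag on every image update instead of using :latest
    imagePullPolicy: Always      # force a pull even if the tag is already present on the node
```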
|
||||
|
||||
## Using kubectl
|
||||
|
||||
- Use `kubectl create -f <directory>` where possible. This looks for config objects in all `.yaml`, `.yml`, and `.json` files in `<directory>` and passes them to `create`.
|
||||
|
||||
- Use `kubectl delete` rather than `stop`. `Delete` has a superset of the functionality of `stop`, and `stop` is deprecated.
|
||||
|
||||
- Use kubectl bulk operations (via files and/or labels) for get and delete. See [label selectors](/docs/user-guide/labels/#label-selectors) and [using labels effectively](/docs/concepts/cluster-administration/manage-deployment/#using-labels-effectively).
|
||||
|
||||
- Use `kubectl run` and `expose` to quickly create and expose single container Deployments. See the [quick start guide](/docs/user-guide/quick-start/) for an example.
|
||||
|
||||
|
|
@ -0,0 +1,15 @@
|
|||
apiVersion: batch/v1
|
||||
kind: Job
|
||||
metadata:
|
||||
name: pi
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
name: pi
|
||||
spec:
|
||||
containers:
|
||||
- name: pi
|
||||
image: perl
|
||||
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
|
||||
restartPolicy: Never
|
||||
|
|
@ -0,0 +1,385 @@
|
|||
---
|
||||
assignees:
|
||||
- erictune
|
||||
- soltysh
|
||||
title: Run to Completion Finite Workloads
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## What is a Job?
|
||||
|
||||
A _job_ creates one or more pods and ensures that a specified number of them successfully terminate.
|
||||
As pods successfully complete, the _job_ tracks the successful completions. When a specified number
|
||||
of successful completions is reached, the job itself is complete. Deleting a Job will clean up the
|
||||
pods it created.
|
||||
|
||||
A simple case is to create one Job object in order to reliably run one Pod to completion.
|
||||
The Job object will start a new Pod if the first pod fails or is deleted (for example
|
||||
due to a node hardware failure or a node reboot).
|
||||
|
||||
A Job can also be used to run multiple pods in parallel.
|
||||
|
||||
### extensions/v1beta1.Job is deprecated
|
||||
|
||||
Starting from version 1.5, `extensions/v1beta1.Job` is deprecated, with a plan to remove it in
|
||||
version 1.6 of Kubernetes (see this [issue](https://github.com/kubernetes/kubernetes/issues/32763)).
|
||||
Please use `batch/v1.Job` instead.
|
||||
|
||||
## Running an example Job
|
||||
|
||||
Here is an example Job config. It computes π to 2000 places and prints it out.
|
||||
It takes around 10s to complete.
|
||||
|
||||
{% include code.html language="yaml" file="job.yaml" ghlink="/docs/user-guide/job.yaml" %}
|
||||
|
||||
Run the example job by downloading the example file and then running this command:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f ./job.yaml
|
||||
job "pi" created
|
||||
```
|
||||
|
||||
Check on the status of the job using this command:
|
||||
|
||||
```shell
|
||||
$ kubectl describe jobs/pi
|
||||
Name: pi
|
||||
Namespace: default
|
||||
Image(s): perl
|
||||
Selector: controller-uid=b1db589a-2c8d-11e6-b324-0209dc45a495
|
||||
Parallelism: 1
|
||||
Completions: 1
|
||||
Start Time: Tue, 07 Jun 2016 10:56:16 +0200
|
||||
Labels: controller-uid=b1db589a-2c8d-11e6-b324-0209dc45a495,job-name=pi
|
||||
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
|
||||
No volumes.
|
||||
Events:
|
||||
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
|
||||
--------- -------- ----- ---- ------------- -------- ------ -------
|
||||
1m 1m 1 {job-controller } Normal SuccessfulCreate Created pod: pi-dtn4q
|
||||
```
|
||||
|
||||
To view completed pods of a job, use `kubectl get pods --show-all`; the `--show-all` flag includes completed pods in the output.
|
||||
|
||||
To list all the pods that belong to a job in a machine readable form, you can use a command like this:
|
||||
|
||||
```shell
|
||||
$ pods=$(kubectl get pods --show-all --selector=job-name=pi --output=jsonpath={.items..metadata.name})
|
||||
echo $pods
|
||||
pi-aiw0a
|
||||
```
|
||||
|
||||
Here, the selector is the same as the selector for the job. The `--output=jsonpath` option specifies an expression
|
||||
that just gets the name from each pod in the returned list.
|
||||
|
||||
View the standard output of one of the pods:
|
||||
|
||||
```shell
|
||||
$ kubectl logs $pods
|
||||
3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819644288109756659334461284756482337867831652712019091456485669234603486104543266482133936072602491412737245870066063155881748815209209628292540917153643678925903600113305305488204665213841469519415116094330572703657595919530921861173819326117931051185480744623799627495673518857527248912279381830119491298336733624406566430860213949463952247371907021798609437027705392171762931767523846748184676694051320005681271452635608277857713427577896091736371787214684409012249534301465495853710507922796892589235420199561121290219608640344181598136297747713099605187072113499999983729780499510597317328160963185950244594553469083026425223082533446850352619311881710100031378387528865875332083814206171776691473035982534904287554687311595628638823537875937519577818577805321712268066130019278766111959092164201989380952572010654858632788659361533818279682303019520353018529689957736225994138912497217752834791315155748572424541506959508295331168617278558890750983817546374649393192550604009277016711390098488240128583616035637076601047101819429555961989467678374494482553797747268471040475346462080466842590694912933136770289891521047521620569660240580381501935112533824300355876402474964732639141992726042699227967823547816360093417216412199245863150302861829745557067498385054945885869269956909272107975093029553211653449872027559602364806654991198818347977535663698074265425278625518184175746728909777727938000816470600161452491921732172147723501414419735685481613611573525521334757418494684385233239073941433345477624168625189835694855620992192221842725502542568876717904946016534668049886272327917860857843838279679766814541009538837863609506800642251252051173929848960841284886269456042419652850222106611863067442786220391949450471237137869609563643719172874677646575739624138908658326459958133904780275901
|
||||
```
|
||||
|
||||
## Writing a Job Spec
|
||||
|
||||
As with all other Kubernetes config, a Job needs `apiVersion`, `kind`, and `metadata` fields. For
|
||||
general information about working with config files, see [here](/docs/user-guide/simple-yaml),
|
||||
[here](/docs/user-guide/configuring-containers), and [here](/docs/user-guide/working-with-resources).
|
||||
|
||||
A Job also needs a [`.spec` section](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api-conventions.md#spec-and-status).
|
||||
|
||||
### Pod Template
|
||||
|
||||
The `.spec.template` is the only required field of the `.spec`.
|
||||
|
||||
The `.spec.template` is a [pod template](/docs/user-guide/replication-controller/#pod-template). It has exactly
|
||||
the same schema as a [pod](/docs/user-guide/pods), except it is nested and does not have an `apiVersion` or
|
||||
`kind`.
|
||||
|
||||
In addition to required fields for a Pod, a pod template in a job must specify appropriate
|
||||
labels (see [pod selector](#pod-selector)) and an appropriate restart policy.
|
||||
|
||||
Only a [`RestartPolicy`](/docs/user-guide/pod-states/#restartpolicy) equal to `Never` or `OnFailure` is allowed.
|
||||
|
||||
### Pod Selector
|
||||
|
||||
The `.spec.selector` field is optional. In almost all cases you should not specify it.
|
||||
See section [specifying your own pod selector](#specifying-your-own-pod-selector).
|
||||
|
||||
|
||||
### Parallel Jobs
|
||||
|
||||
There are three main types of jobs:
|
||||
|
||||
1. Non-parallel Jobs
|
||||
- normally only one pod is started, unless the pod fails.
|
||||
- job is complete as soon as Pod terminates successfully.
|
||||
1. Parallel Jobs with a *fixed completion count*:
|
||||
- specify a non-zero positive value for `.spec.completions`
|
||||
- the job is complete when there is one successful pod for each value in the range 1 to `.spec.completions`.
|
||||
- **not implemented yet:** each pod passed a different index in the range 1 to `.spec.completions`.
|
||||
1. Parallel Jobs with a *work queue*:
|
||||
  - do not specify `.spec.completions`; it defaults to `.spec.parallelism`
|
||||
  - the pods must coordinate amongst themselves or with an external service to determine what each should work on
|
||||
  - each pod is independently capable of determining whether or not all its peers are done, and thus whether the entire Job is done.
|
||||
- when _any_ pod terminates with success, no new pods are created.
|
||||
- once at least one pod has terminated with success and all pods are terminated, then the job is completed with success.
|
||||
- once any pod has exited with success, no other pod should still be doing any work or writing any output. They should all be
|
||||
in the process of exiting.
|
||||
|
||||
For a Non-parallel job, you can leave both `.spec.completions` and `.spec.parallelism` unset. When both are
|
||||
unset, both are defaulted to 1.
|
||||
|
||||
For a Fixed Completion Count job, you should set `.spec.completions` to the number of completions needed.
|
||||
You can set `.spec.parallelism`, or leave it unset and it will default to 1.
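
For instance, here is a minimal sketch of a fixed completion count Job (the name, image, and command are illustrative): five pods must succeed in total, with at most two running at any one time:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: process-items            # illustrative name
spec:
  completions: 5                 # five successful completions are required in total
  parallelism: 2                 # run at most two pods at once
  template:
    metadata:
      name: process-items
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing one work item && sleep 5"]
      restartPolicy: Never
```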
|
||||
|
||||
For a Work Queue Job, you must leave `.spec.completions` unset, and set `.spec.parallelism` to
|
||||
a non-negative integer.
|
||||
|
||||
For more information about how to make use of the different types of job, see the [job patterns](#job-patterns) section.
|
||||
|
||||
|
||||
#### Controlling Parallelism
|
||||
|
||||
The requested parallelism (`.spec.parallelism`) can be set to any non-negative value.
|
||||
If it is unspecified, it defaults to 1.
|
||||
If it is specified as 0, then the Job is effectively paused until it is increased.
|
||||
|
||||
A job can be scaled up using the `kubectl scale` command. For example, the following
|
||||
command sets `.spec.parallelism` of a job called `myjob` to 10:
|
||||
|
||||
```shell
|
||||
$ kubectl scale --replicas=10 jobs/myjob
|
||||
job "myjob" scaled
|
||||
```
|
||||
|
||||
You can also use the `scale` subresource of the Job resource.
|
||||
|
||||
Actual parallelism (number of pods running at any instant) may be more or less than requested
|
||||
parallelism, for a variety of reasons:
|
||||
|
||||
- For Fixed Completion Count jobs, the actual number of pods running in parallel will not exceed the number of
|
||||
remaining completions. Higher values of `.spec.parallelism` are effectively ignored.
|
||||
- For work queue jobs, no new pods are started after any pod has succeeded -- remaining pods are allowed to complete, however.
|
||||
- If the controller has not had time to react.
|
||||
- If the controller failed to create pods for any reason (lack of ResourceQuota, lack of permission, etc.),
|
||||
then there may be fewer pods than requested.
|
||||
- The controller may throttle new pod creation due to excessive previous pod failures in the same Job.
|
||||
- When a pod is gracefully shutdown, it takes time to stop.
|
||||
|
||||
## Handling Pod and Container Failures
|
||||
|
||||
A Container in a Pod may fail for a number of reasons, such as because the process in it exited with
|
||||
a non-zero exit code, or the Container was killed for exceeding a memory limit, etc. If this
|
||||
happens, and the `.spec.template.spec.restartPolicy = "OnFailure"`, then the Pod stays
|
||||
on the node, but the Container is re-run. Therefore, your program needs to handle the case when it is
|
||||
restarted locally, or else specify `.spec.template.spec.restartPolicy = "Never"`.
|
||||
See [pod states](/docs/user-guide/pod-states) for more information on `restartPolicy`.
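
Here is a minimal sketch of the `OnFailure` variant (the name and the deliberately failing command are illustrative). The container fails on its first run and is then re-run in place within the same pod; the marker file survives the restart because it lives on an `emptyDir` volume:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: retry-in-place           # illustrative name
spec:
  template:
    metadata:
      name: retry-in-place
    spec:
      containers:
      - name: flaky-task
        image: busybox
        # Fails on the first attempt, then succeeds on the retry.
        command: ["sh", "-c", "test -f /work/attempted || { touch /work/attempted; exit 1; }"]
        volumeMounts:
        - name: work
          mountPath: /work
      volumes:
      - name: work
        emptyDir: {}             # survives container restarts within the pod
      restartPolicy: OnFailure
```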
|
||||
|
||||
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node
|
||||
(node is upgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the
|
||||
`.spec.template.spec.restartPolicy = "Never"`. When a Pod fails, then the Job controller
|
||||
starts a new Pod. Therefore, your program needs to handle the case when it is restarted in a new
|
||||
pod. In particular, it needs to handle temporary files, locks, incomplete output and the like
|
||||
caused by previous runs.
|
||||
|
||||
Note that even if you specify `.spec.parallelism = 1` and `.spec.completions = 1` and
|
||||
`.spec.template.spec.restartPolicy = "Never"`, the same program may
|
||||
sometimes be started twice.
|
||||
|
||||
If you do specify `.spec.parallelism` and `.spec.completions` both greater than 1, then there may be
|
||||
multiple pods running at once. Therefore, your pods must also be tolerant of concurrency.
|
||||
|
||||
## Job Termination and Cleanup
|
||||
|
||||
When a Job completes, no more Pods are created, but the Pods are not deleted either. Since they are terminated,
|
||||
they don't show up with `kubectl get pods`, but they will show up with `kubectl get pods -a`. Keeping them around
|
||||
allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output.
|
||||
The job object also remains after it is completed so that you can view its status. It is up to the user to delete
|
||||
old jobs after noting their status. Delete the job with `kubectl` (e.g. `kubectl delete jobs/pi` or `kubectl delete -f ./job.yaml`). When you delete the job using `kubectl`, all the pods it created are deleted too.
|
||||
|
||||
If a Job's pods are failing repeatedly, the Job will keep creating new pods forever, by default.
|
||||
Retrying forever can be a useful pattern. If an external dependency of the Job's
|
||||
pods is missing (for example an input file on a networked storage volume is not present), then the
|
||||
Job will keep trying Pods, and when you later resolve the external dependency (for example, creating
|
||||
the missing file) the Job will then complete without any further action.
|
||||
|
||||
However, if you prefer not to retry forever, you can set a deadline on the job. Do this by setting the
|
||||
`spec.activeDeadlineSeconds` field of the job to a number of seconds. The job will have status with
|
||||
`reason: DeadlineExceeded`. No more pods will be created, and existing pods will be deleted.
|
||||
|
||||
```yaml
|
||||
apiVersion: batch/v1
|
||||
kind: Job
|
||||
metadata:
|
||||
name: pi-with-timeout
|
||||
spec:
|
||||
activeDeadlineSeconds: 100
|
||||
template:
|
||||
metadata:
|
||||
name: pi
|
||||
spec:
|
||||
containers:
|
||||
- name: pi
|
||||
image: perl
|
||||
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
|
||||
restartPolicy: Never
|
||||
```
|
||||
|
||||
Note that both the Job spec and the Pod template spec within the Job have an `activeDeadlineSeconds` field.
Be sure to set the one on the Job.
|
||||
|
||||
## Job Patterns
|
||||
|
||||
The Job object can be used to support reliable parallel execution of Pods. The Job object is not
|
||||
designed to support closely-communicating parallel processes, as commonly found in scientific
|
||||
computing. It does support parallel processing of a set of independent but related *work items*.
|
||||
These might be emails to be sent, frames to be rendered, files to be transcoded, ranges of keys in a
|
||||
NoSQL database to scan, and so on.
|
||||
|
||||
In a complex system, there may be multiple different sets of work items. Here we are just
|
||||
considering one set of work items that the user wants to manage together — a *batch job*.
|
||||
|
||||
There are several different patterns for parallel computation, each with strengths and weaknesses.
|
||||
The tradeoffs are:
|
||||
|
||||
- One Job object for each work item, vs. a single Job object for all work items. The latter is
|
||||
better for large numbers of work items. The former creates some overhead for the user and for the
|
||||
system to manage large numbers of Job objects. Also, with the latter, the resource usage of the job
|
||||
(number of concurrently running pods) can be easily adjusted using the `kubectl scale` command.
|
||||
- Number of pods created equals number of work items, vs. each pod can process multiple work items.
|
||||
The former typically requires less modification to existing code and containers. The latter
|
||||
is better for large numbers of work items, for similar reasons to the previous bullet.
|
||||
- Several approaches use a work queue. This requires running a queue service,
|
||||
and modifications to the existing program or container to make it use the work queue.
|
||||
  Other approaches are easier to adapt to an existing containerized application.
|
||||
|
||||
|
||||
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs.
|
||||
The pattern names are also links to examples and more detailed description.
|
||||
|
||||
| Pattern | Single Job object | Fewer pods than work items? | Use app unmodified? | Works in Kube 1.1? |
|
||||
| -------------------------------------------------------------------- |:-----------------:|:---------------------------:|:-------------------:|:-------------------:|
|
||||
| [Job Template Expansion](/docs/user-guide/jobs/expansions) | | | ✓ | ✓ |
|
||||
| [Queue with Pod Per Work Item](/docs/tasks/job/work-queue-1/) | ✓ | | sometimes | ✓ |
|
||||
| [Queue with Variable Pod Count](/docs/tasks/job/fine-parallel-processing-work-queue/) | ✓ | ✓ | | ✓ |
|
||||
| Single Job with Static Work Assignment | ✓ | | ✓ | |
|
||||
|
||||
When you specify completions with `.spec.completions`, each Pod created by the Job controller
|
||||
has an identical [`spec`](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api-conventions.md#spec-and-status). This means that
|
||||
all pods will have the same command line and the same
|
||||
image, the same volumes, and (almost) the same environment variables. These patterns
|
||||
are different ways to arrange for pods to work on different things.
|
||||
|
||||
This table shows the required settings for `.spec.parallelism` and `.spec.completions` for each of the patterns.
|
||||
Here, `W` is the number of work items.
|
||||
|
||||
| Pattern | `.spec.completions` | `.spec.parallelism` |
|
||||
| -------------------------------------------------------------------- |:-------------------:|:--------------------:|
|
||||
| [Job Template Expansion](/docs/tasks/job/parallel-processing-expansion/) | 1 | should be 1 |
|
||||
| [Queue with Pod Per Work Item](/docs/tasks/job/work-queue-1/) | W | any |
|
||||
| [Queue with Variable Pod Count](/docs/tasks/job/fine-parallel-processing-work-queue/) | 1 | any |
|
||||
| Single Job with Static Work Assignment | W | any |
|
||||
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
### Specifying your own pod selector
|
||||
|
||||
Normally, when you create a job object, you do not specify `spec.selector`.
|
||||
The system defaulting logic adds this field when the job is created.
|
||||
It picks a selector value that will not overlap with any other jobs.
|
||||
|
||||
However, in some cases, you might need to override this automatically set selector.
|
||||
To do this, you can specify the `spec.selector` of the job.
|
||||
|
||||
Be very careful when doing this. If you specify a label selector which is not
|
||||
unique to the pods of that job, and which matches unrelated pods, then pods of the unrelated
|
||||
job may be deleted, or this job may count other pods as completing it, or one or both
|
||||
of the jobs may refuse to create pods or run to completion. If a non-unique selector is
|
||||
chosen, then other controllers (e.g. ReplicationController) and their pods may behave
|
||||
in unpredictable ways too. Kubernetes will not stop you from making a mistake when
|
||||
specifying `spec.selector`.
|
||||
|
||||
Here is an example of a case when you might want to use this feature.
|
||||
|
||||
Say job `old` is already running. You want existing pods
|
||||
to keep running, but you want the rest of the pods it creates
|
||||
to use a different pod template and for the job to have a new name.
|
||||
You cannot update the job because these fields are not updatable.
|
||||
Therefore, you delete job `old` but leave its pods
|
||||
running, using `kubectl delete jobs/old --cascade=false`.
|
||||
Before deleting it, you make a note of what selector it uses:
|
||||
|
||||
```
|
||||
kind: Job
|
||||
metadata:
|
||||
name: old
|
||||
...
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
job-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
|
||||
...
|
||||
```
|
||||
|
||||
Then you create a new job with name `new` and you explicitly specify the same selector.
|
||||
Since the existing pods have label `job-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002`,
|
||||
they are controlled by job `new` as well.
|
||||
|
||||
You need to specify `manualSelector: true` in the new job since you are not using
|
||||
the selector that the system normally generates for you automatically.
|
||||
|
||||
```
|
||||
kind: Job
|
||||
metadata:
|
||||
name: new
|
||||
...
|
||||
spec:
|
||||
manualSelector: true
|
||||
selector:
|
||||
matchLabels:
|
||||
job-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
|
||||
...
|
||||
```
|
||||
|
||||
The new Job itself will have a different uid from `a8f3d00d-c6d2-11e5-9f87-42010af00002`. Setting
|
||||
`manualSelector: true` tells the system that you know what you are doing and to allow this
|
||||
mismatch.
|
||||
|
||||
## Alternatives
|
||||
|
||||
### Bare Pods
|
||||
|
||||
When the node that a pod is running on reboots or fails, the pod is terminated
|
||||
and will not be restarted. However, a Job will create new pods to replace terminated ones.
|
||||
For this reason, we recommend that you use a job rather than a bare pod, even if your application
|
||||
requires only a single pod.
|
||||
|
||||
### Replication Controller
|
||||
|
||||
Jobs are complementary to [Replication Controllers](/docs/user-guide/replication-controller).
|
||||
A Replication Controller manages pods which are not expected to terminate (e.g. web servers), and a Job
|
||||
manages pods that are expected to terminate (e.g. batch jobs).
|
||||
|
||||
As discussed in [life of a pod](/docs/user-guide/pod-states), `Job` is *only* appropriate for pods with
|
||||
`RestartPolicy` equal to `OnFailure` or `Never`. (Note: If `RestartPolicy` is not set, the default
|
||||
value is `Always`.)
|
||||
|
||||
### Single Job starts Controller Pod
|
||||
|
||||
Another pattern is for a single Job to create a pod which then creates other pods, acting as a sort
|
||||
of custom controller for those pods. This allows the most flexibility, but may be somewhat
|
||||
complicated to get started with and offers less integration with Kubernetes.
|
||||
|
||||
One example of this pattern would be a Job which starts a Pod which runs a script that in turn
|
||||
starts a Spark master controller (see [spark example](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/spark/README.md)), runs a spark
|
||||
driver, and then cleans up.
|
||||
|
||||
An advantage of this approach is that the overall process gets the completion guarantee of a Job
|
||||
object, but complete control over what pods are created and how work is assigned to them.
|
||||
|
||||
## Cron Jobs
|
||||
|
||||
Support for creating Jobs at specified times/dates (i.e. cron) is available in Kubernetes [1.4](https://github.com/kubernetes/kubernetes/pull/11980). More information is available in the [cron job documents](http://kubernetes.io/docs/user-guide/cron-jobs/).
|
|
@ -0,0 +1,136 @@
|
|||
---
|
||||
assignees:
|
||||
- lavalamp
|
||||
title: Kubernetes Components
|
||||
---
|
||||
|
||||
This document outlines the various binary components that need to run to
|
||||
deliver a functioning Kubernetes cluster.
|
||||
|
||||
## Master Components
|
||||
|
||||
Master components are those that provide the cluster's control plane. For
|
||||
example, master components are responsible for making global decisions about the
|
||||
cluster (e.g., scheduling), and detecting and responding to cluster events
|
||||
(e.g., starting up a new pod when a replication controller's 'replicas' field is
|
||||
unsatisfied).
|
||||
|
||||
In theory, Master components can be run on any node in the cluster. However,
|
||||
for simplicity, current setup scripts typically start all master components on
|
||||
the same VM, and do not run user containers on this VM. See
|
||||
[high-availability.md](/docs/admin/high-availability) for an example multi-master-VM setup.
|
||||
|
||||
Even in the future, when Kubernetes is fully self-hosting, it will probably be
|
||||
wise to only allow master components to schedule on a subset of nodes, to limit
|
||||
co-running with user-run pods, reducing the possible scope of a
|
||||
node-compromising security exploit.
|
||||
|
||||
### kube-apiserver
|
||||
|
||||
[kube-apiserver](/docs/admin/kube-apiserver) exposes the Kubernetes API; it is the front-end for the
|
||||
Kubernetes control plane. It is designed to scale horizontally (i.e., one scales
|
||||
it by running more of them; see [high-availability](/docs/admin/high-availability)).
|
||||
|
||||
### etcd
|
||||
|
||||
[etcd](/docs/admin/etcd) is used as Kubernetes' backing store. All cluster data is stored here.
|
||||
Proper administration of a Kubernetes cluster includes a backup plan for etcd's
|
||||
data.
|
||||
|
||||
### kube-controller-manager
|
||||
|
||||
[kube-controller-manager](/docs/admin/kube-controller-manager) is a binary that runs controllers, which are the
|
||||
background threads that handle routine tasks in the cluster. Logically, each
|
||||
controller is a separate process, but to reduce the number of moving pieces in
|
||||
the system, they are all compiled into a single binary and run in a single
|
||||
process.
|
||||
|
||||
These controllers include:
|
||||
|
||||
* Node Controller: Responsible for noticing & responding when nodes go down.
|
||||
* Replication Controller: Responsible for maintaining the correct number of pods for every replication
|
||||
controller object in the system.
|
||||
* Endpoints Controller: Populates the Endpoints object (i.e., join Services & Pods).
|
||||
* Service Account & Token Controllers: Create default accounts and API access tokens for new namespaces.
|
||||
* ... and others.
|
||||
|
||||
### kube-scheduler
|
||||
|
||||
[kube-scheduler](/docs/admin/kube-scheduler) watches newly created pods that have no node assigned, and
|
||||
selects a node for them to run on.
|
||||
|
||||
### addons
|
||||
|
||||
Addons are pods and services that implement cluster features. The pods may be managed
|
||||
by Deployments, ReplicationControllers, etc. Namespaced addon objects are created in
|
||||
the "kube-system" namespace.
|
||||
|
||||
Addon manager takes the responsibility for creating and maintaining addon resources.
|
||||
See [here](http://releases.k8s.io/HEAD/cluster/addons) for more details.
|
||||
|
||||
#### DNS
|
||||
|
||||
While the other addons are not strictly required, all Kubernetes
|
||||
clusters should have [cluster DNS](/docs/admin/dns/), as many examples rely on it.
|
||||
|
||||
Cluster DNS is a DNS server, in addition to the other DNS server(s) in your
|
||||
environment, which serves DNS records for Kubernetes services.
|
||||
|
||||
Containers started by Kubernetes automatically include this DNS server
|
||||
in their DNS searches.
|
||||
|
||||
#### User interface
|
||||
|
||||
The kube-ui provides a read-only overview of the cluster state. Access
|
||||
[the UI using kubectl proxy](/docs/user-guide/connecting-to-applications-proxy/#connecting-to-the-kube-ui-service-from-your-local-workstation)
|
||||
|
||||
#### Container Resource Monitoring
|
||||
|
||||
[Container Resource Monitoring](/docs/user-guide/monitoring) records generic time-series metrics
|
||||
about containers in a central database, and provides a UI for browsing that data.
|
||||
|
||||
#### Cluster-level Logging
|
||||
|
||||
A [Cluster-level logging](/docs/user-guide/logging/overview) mechanism is responsible for
|
||||
saving container logs to a central log store with search/browsing interface.
|
||||
|
||||
## Node components
|
||||
|
||||
Node components run on every node, maintaining running pods and providing them
|
||||
the Kubernetes runtime environment.
|
||||
|
||||
### kubelet
|
||||
|
||||
[kubelet](/docs/admin/kubelet) is the primary node agent. It:
|
||||
|
||||
* Watches for pods that have been assigned to its node (either by apiserver
|
||||
or via local configuration file) and:
|
||||
* Mounts the pod's required volumes
|
||||
* Downloads the pod's secrets
|
||||
* Runs the pod's containers via docker (or, experimentally, rkt).
|
||||
* Periodically executes any requested container liveness probes.
|
||||
* Reports the status of the pod back to the rest of the system, by creating a
|
||||
"mirror pod" if necessary.
|
||||
* Reports the status of the node back to the rest of the system.
|
||||
|
||||
### kube-proxy
|
||||
|
||||
[kube-proxy](/docs/admin/kube-proxy) enables the Kubernetes service abstraction by maintaining
|
||||
network rules on the host and performing connection forwarding.
|
||||
|
||||
### docker
|
||||
|
||||
`docker` is of course used for actually running containers.
|
||||
|
||||
### rkt
|
||||
|
||||
`rkt` is supported experimentally as an alternative to docker.
|
||||
|
||||
### supervisord
|
||||
|
||||
`supervisord` is a lightweight process babysitting system for keeping kubelet and docker
|
||||
running.
|
||||
|
||||
### fluentd
|
||||
|
||||
`fluentd` is a daemon which helps provide [cluster-level logging](#cluster-level-logging).
|
|
@ -0,0 +1,109 @@
|
|||
---
|
||||
assignees:
|
||||
- bgrant0607
|
||||
- erictune
|
||||
- lavalamp
|
||||
title: The Kubernetes API
|
||||
---
|
||||
|
||||
Primary system and API concepts are documented in the [User guide](/docs/user-guide/).
|
||||
|
||||
Overall API conventions are described in the [API conventions doc](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api-conventions.md).
|
||||
|
||||
Remote access to the API is discussed in the [access doc](/docs/admin/accessing-the-api).
|
||||
|
||||
The Kubernetes API also serves as the foundation for the declarative configuration schema for the system. The [Kubectl](/docs/user-guide/kubectl) command-line tool can be used to create, update, delete, and get API objects.
|
||||
|
||||
Kubernetes also stores its serialized state (currently in [etcd](https://coreos.com/docs/distributed-configuration/getting-started-with-etcd/)) in terms of the API resources.
|
||||
|
||||
Kubernetes itself is decomposed into multiple components, which interact through its API.
|
||||
|
||||
## API changes
|
||||
|
||||
In our experience, any system that is successful needs to grow and change as new use cases emerge or existing ones change. Therefore, we expect the Kubernetes API to continuously change and grow. However, we intend to not break compatibility with existing clients, for an extended period of time. In general, new API resources and new resource fields can be expected to be added frequently. Elimination of resources or fields will require following a deprecation process. The precise deprecation policy for eliminating features is TBD, but once we reach our 1.0 milestone, there will be a specific policy.
|
||||
|
||||
What constitutes a compatible change and how to change the API are detailed by the [API change document](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api_changes.md).
|
||||
|
||||
## OpenAPI and Swagger definitions
|
||||
|
||||
Complete API details are documented using [Swagger v1.2](http://swagger.io/) and [OpenAPI](https://www.openapis.org/). The Kubernetes apiserver (aka "master") exposes an API that can be used to retrieve the Swagger v1.2 Kubernetes API spec located at `/swaggerapi`. You can also enable a UI to browse the API documentation at `/swagger-ui` by passing the `--enable-swagger-ui=true` flag to apiserver.
|
||||
|
||||
We also host a version of the [latest v1.2 API documentation UI](http://kubernetes.io/kubernetes/third_party/swagger-ui/). This is updated with the latest release, so if you are using a different version of Kubernetes you will want to use the spec from your apiserver.
|
||||
|
||||
Starting with Kubernetes 1.4, the OpenAPI spec is also available at `/swagger.json`. While we are transitioning from Swagger v1.2 to OpenAPI (aka Swagger v2.0), some of the tools, such as kubectl and swagger-ui, are still using the v1.2 spec. The OpenAPI spec is in Beta as of Kubernetes 1.5.
|
||||
|
||||
Kubernetes implements an alternative Protobuf based serialization format for the API that is primarily intended for intra-cluster communication, documented in the [design proposal](https://github.com/kubernetes/kubernetes/blob/{{ page.githubbranch }}/docs/proposals/protobuf.md) and the IDL files for each schema are located in the Go packages that define the API objects.
|
||||
|
||||
## API versioning
|
||||
|
||||
To make it easier to eliminate fields or restructure resource representations, Kubernetes supports
|
||||
multiple API versions, each at a different API path, such as `/api/v1` or
|
||||
`/apis/extensions/v1beta1`.
|
||||
|
||||
We chose to version at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-lifed and/or experimental APIs. The JSON and Protobuf serialization schemas follow the same guidelines for schema changes - all descriptions below cover both formats.
|
||||
|
||||
Note that API versioning and Software versioning are only indirectly related. The [API and release
|
||||
versioning proposal](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/versioning.md) describes the relationship between API versioning and
|
||||
software versioning.
|
||||
|
||||
|
||||
Different API versions imply different levels of stability and support. The criteria for each level are described
|
||||
in more detail in the [API Changes documentation](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/api_changes.md#alpha-beta-and-stable-versions). They are summarized here:
|
||||
|
||||
- Alpha level:
|
||||
- The version names contain `alpha` (e.g. `v1alpha1`).
|
||||
- May be buggy. Enabling the feature may expose bugs. Disabled by default.
|
||||
- Support for feature may be dropped at any time without notice.
|
||||
- The API may change in incompatible ways in a later software release without notice.
|
||||
- Recommended for use only in short-lived testing clusters, due to increased risk of bugs and lack of long-term support.
|
||||
- Beta level:
|
||||
- The version names contain `beta` (e.g. `v2beta3`).
|
||||
- Code is well tested. Enabling the feature is considered safe. Enabled by default.
|
||||
- Support for the overall feature will not be dropped, though details may change.
|
||||
- The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release. When this happens,
|
||||
we will provide instructions for migrating to the next version. This may require deleting, editing, and re-creating
|
||||
API objects. The editing process may require some thought. This may require downtime for applications that rely on the feature.
|
||||
- Recommended for only non-business-critical uses because of potential for incompatible changes in subsequent releases. If you have
|
||||
multiple clusters which can be upgraded independently, you may be able to relax this restriction.
|
||||
- **Please do try our beta features and give feedback on them! Once they exit beta, it may not be practical for us to make more changes.**
|
||||
- Stable level:
|
||||
- The version name is `vX` where `X` is an integer.
|
||||
- Stable versions of features will appear in released software for many subsequent versions.
|
||||
|
||||
## API groups
|
||||
|
||||
To make it easier to extend the Kubernetes API, we implemented [*API groups*](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-group.md).
|
||||
The API group is specified in a REST path and in the `apiVersion` field of a serialized object.
|
||||
|
||||
Currently there are several API groups in use:
|
||||
|
||||
1. the "core" (oftentimes called "legacy", due to not having explicit group name) group, which is at
|
||||
REST path `/api/v1` and is not specified as part of the `apiVersion` field, e.g. `apiVersion: v1`.
|
||||
1. the named groups are at REST path `/apis/$GROUP_NAME/$VERSION`, and use `apiVersion: $GROUP_NAME/$VERSION`
|
||||
   (e.g. `apiVersion: batch/v1`). The full list of supported API groups can be seen in the [Kubernetes API reference](/docs/reference/).
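
For example, here is a minimal sketch of how the group shows up in the `apiVersion` field of objects (names are illustrative): a core-group Pod versus a Job from the named `batch` group:

```yaml
# Core ("legacy") group: served at /api/v1, no group name in apiVersion.
apiVersion: v1
kind: Pod
metadata:
  name: group-example-pod
spec:
  containers:
  - name: nginx
    image: nginx
---
# Named group: served at /apis/batch/v1, apiVersion is $GROUP_NAME/$VERSION.
apiVersion: batch/v1
kind: Job
metadata:
  name: group-example-job
spec:
  template:
    metadata:
      name: group-example-job
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(100)"]
      restartPolicy: Never
```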
|
||||
|
||||
|
||||
There are two supported paths to extending the API.
|
||||
1. [Third Party Resources](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/extending-api.md)
|
||||
are for users with very basic CRUD needs.
|
||||
1. Coming soon: users needing the full set of Kubernetes API semantics can implement their own apiserver
|
||||
and use the [aggregator](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/aggregated-api-servers.md)
|
||||
to make it seamless for clients.
|
||||
|
||||
|
||||
## Enabling API groups
|
||||
|
||||
Certain resources and API groups are enabled by default. They can be enabled or disabled by setting `--runtime-config`
|
||||
on the apiserver. The flag accepts a comma-separated set of key=value pairs describing the runtime configuration of the apiserver. For example, to disable batch/v1, set `--runtime-config=batch/v1=false`; to enable batch/v2alpha1, set `--runtime-config=batch/v2alpha1`.
|
||||
|
||||
IMPORTANT: Enabling or disabling groups or resources requires restarting apiserver and controller-manager
|
||||
to pick up the `--runtime-config` changes.
|
||||
|
||||
## Enabling resources in the groups
|
||||
|
||||
DaemonSets, Deployments, HorizontalPodAutoscalers, Ingress, Jobs and ReplicaSets are enabled by default.
|
||||
Other extensions resources can be enabled by setting `--runtime-config` on
|
||||
the apiserver. `--runtime-config` accepts comma-separated values. For example, to disable deployments and jobs, set
`--runtime-config=extensions/v1beta1/deployments=false,extensions/v1beta1/jobs=false`.
|
|
@ -0,0 +1,117 @@
|
|||
---
|
||||
assignees:
|
||||
- bgrant0607
|
||||
- mikedanese
|
||||
title: What is Kubernetes?
|
||||
---
|
||||
|
||||
Kubernetes is an [open-source platform for automating deployment, scaling, and operations of application containers](http://www.slideshare.net/BrianGrant11/wso2con-us-2015-kubernetes-a-platform-for-automating-deployment-scaling-and-operations) across clusters of hosts, providing container-centric infrastructure.
|
||||
|
||||
With Kubernetes, you are able to quickly and efficiently respond to customer demand:
|
||||
|
||||
- Deploy your applications quickly and predictably.
|
||||
- Scale your applications on the fly.
|
||||
- Seamlessly roll out new features.
|
||||
- Optimize use of your hardware by using only the resources you need.
|
||||
|
||||
Our goal is to foster an ecosystem of components and tools that relieve the burden of running applications in public and private clouds.
|
||||
|
||||
#### Kubernetes is:
|
||||
|
||||
* **portable**: public, private, hybrid, multi-cloud
|
||||
* **extensible**: modular, pluggable, hookable, composable
|
||||
* **self-healing**: auto-placement, auto-restart, auto-replication, auto-scaling
|
||||
|
||||
The Kubernetes project was started by Google in 2014. Kubernetes builds upon a [decade and a half of experience that Google has with running production workloads at scale](https://research.google.com/pubs/pub43438.html), combined with best-of-breed ideas and practices from the community.
|
||||
|
||||
##### Ready to [Get Started](/docs/getting-started-guides/)?
|
||||
|
||||
## Why containers?
|
||||
|
||||
Looking for reasons why you should be using [containers](http://aucouranton.com/2014/06/13/linux-containers-parallels-lxc-openvz-docker-and-more/)?
|
||||
|
||||

|
||||
|
||||
The *Old Way* to deploy applications was to install the applications on a host using the operating system package manager. This had the disadvantage of entangling the applications' executables, configuration, libraries, and lifecycles with each other and with the host OS. One could build immutable virtual-machine images in order to achieve predictable rollouts and rollbacks, but VMs are heavyweight and non-portable.
|
||||
|
||||
The *New Way* is to deploy containers based on operating-system-level virtualization rather than hardware virtualization. These containers are isolated from each other and from the host: they have their own filesystems, they can't see each others' processes, and their computational resource usage can be bounded. They are easier to build than VMs, and because they are decoupled from the underlying infrastructure and from the host filesystem, they are portable across clouds and OS distributions.
|
||||
|
||||
Because containers are small and fast, one application can be packed in each container image. This one-to-one application-to-image relationship unlocks the full benefits of containers. With containers, immutable container images can be created at build/release time rather than deployment time, since each application doesn't need to be composed with the rest of the application stack, nor married to the production infrastructure environment. Generating container images at build/release time enables a consistent environment to be carried from development into production.
|
||||
Similarly, containers are vastly more transparent than VMs, which facilitates monitoring and management. This is especially true when the containers' process lifecycles are managed by the infrastructure rather than hidden by a process supervisor inside the container. Finally, with a single application per container, managing the containers becomes tantamount to managing deployment of the application.
|
||||
|
||||
Summary of container benefits:
|
||||
|
||||
* **Agile application creation and deployment**:
|
||||
Increased ease and efficiency of container image creation compared to VM image use.
|
||||
* **Continuous development, integration, and deployment**:
|
||||
Provides for reliable and frequent container image build and deployment with quick and easy rollbacks (due to image immutability).
|
||||
* **Dev and Ops separation of concerns**:
|
||||
Create application container images at build/release time rather than deployment time, thereby decoupling applications from infrastructure.
|
||||
* **Environmental consistency across development, testing, and production**:
|
||||
Runs the same on a laptop as it does in the cloud.
|
||||
* **Cloud and OS distribution portability**:
|
||||
Runs on Ubuntu, RHEL, CoreOS, on-prem, Google Container Engine, and anywhere else.
|
||||
* **Application-centric management**:
|
||||
    Raises the level of abstraction from running an OS on virtual hardware to running an application on an OS using logical resources.
|
||||
* **Loosely coupled, distributed, elastic, liberated [micro-services](http://martinfowler.com/articles/microservices.html)**:
|
||||
Applications are broken into smaller, independent pieces and can be deployed and managed dynamically -- not a fat monolithic stack running on one big single-purpose machine.
|
||||
* **Resource isolation**:
|
||||
Predictable application performance.
|
||||
* **Resource utilization**:
|
||||
High efficiency and density.
|
||||
|
||||
#### Why do I need Kubernetes and what can it do?
|
||||
|
||||
At a minimum, Kubernetes can schedule and run application containers on clusters of physical or virtual machines. However, Kubernetes also allows developers to 'cut the cord' to physical and virtual machines, moving from a **host-centric** infrastructure to a **container-centric** infrastructure, which provides the full advantages and benefits inherent to containers. Kubernetes provides the infrastructure to build a truly **container-centric** development environment.
|
||||
|
||||
Kubernetes satisfies a number of common needs of applications running in production, such as:
|
||||
|
||||
* [co-locating helper processes](/docs/user-guide/pods/), facilitating composite applications and preserving the one-application-per-container model,
|
||||
* [mounting storage systems](/docs/user-guide/volumes/),
|
||||
* [distributing secrets](/docs/user-guide/secrets/),
|
||||
* [application health checking](/docs/user-guide/production-pods/#liveness-and-readiness-probes-aka-health-checks),
|
||||
* [replicating application instances](/docs/user-guide/replication-controller/),
|
||||
* [horizontal auto-scaling](/docs/user-guide/horizontal-pod-autoscaling/),
|
||||
* [naming and discovery](/docs/user-guide/connecting-applications/),
|
||||
* [load balancing](/docs/user-guide/services/),
|
||||
* [rolling updates](/docs/tasks/run-application/rolling-update-replication-controller/),
|
||||
* [resource monitoring](/docs/user-guide/monitoring/),
|
||||
* [log access and ingestion](/docs/user-guide/logging/overview/),
|
||||
* [support for introspection and debugging](/docs/user-guide/introspection-and-debugging/), and
|
||||
* [identity and authorization](/docs/admin/authorization/).
|
||||
|
||||
This provides the simplicity of Platform as a Service (PaaS) with the flexibility of Infrastructure as a Service (IaaS), and facilitates portability across infrastructure providers.
|
||||
|
||||
For more details, see the [user guide](/docs/user-guide/).
|
||||
|
||||
#### Why and how is Kubernetes a platform?
|
||||
|
||||
Even though Kubernetes provides a lot of functionality, there are always new scenarios that would benefit from new features. Application-specific workflows can be streamlined to accelerate developer velocity. Ad hoc orchestration that is acceptable initially often requires robust automation at scale. This is why Kubernetes was also designed to serve as a platform for building an ecosystem of components and tools to make it easier to deploy, scale, and manage applications.
|
||||
|
||||
[Labels](/docs/user-guide/labels/) empower users to organize their resources however they please. [Annotations](/docs/user-guide/annotations/) enable users to decorate resources with custom information to facilitate their workflows and provide an easy way for management tools to checkpoint state.
|
||||
|
||||
Additionally, the [Kubernetes control plane](/docs/admin/cluster-components) is built upon the same [APIs](/docs/api/) that are available to developers and users. Users can write their own controllers, [schedulers](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/docs/devel/scheduler.md), etc., if they choose, with [their own APIs](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/extending-api.md) that can be targeted by a general-purpose [command-line tool](/docs/user-guide/kubectl-overview/).
|
||||
|
||||
This [design](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/principles.md) has enabled a number of other systems to build atop Kubernetes.
|
||||
|
||||
#### Kubernetes is not:
|
||||
|
||||
Kubernetes is not a traditional, all-inclusive PaaS (Platform as a Service) system. We preserve user choice where it is important.
|
||||
|
||||
* Kubernetes does not limit the types of applications supported. It does not dictate application frameworks (e.g., [Wildfly](http://wildfly.org/)), restrict the set of supported language runtimes (e.g., Java, Python, Ruby), cater to only [12-factor applications](http://12factor.net/), nor distinguish "apps" from "services". Kubernetes aims to support an extremely diverse variety of workloads, including stateless, stateful, and data-processing workloads. If an application can run in a container, it should run great on Kubernetes.
|
||||
* Kubernetes does not provide middleware (e.g., message buses), data-processing frameworks (e.g., Spark), databases (e.g., mysql), nor cluster storage systems (e.g., Ceph) as built-in services. Such applications run on Kubernetes.
|
||||
* Kubernetes does not have a click-to-deploy service marketplace.
|
||||
* Kubernetes is unopinionated in the source-to-image space. It does not deploy source code and does not build your application. Continuous Integration (CI) workflow is an area where different users and projects have their own requirements and preferences, so we support layering CI workflows on Kubernetes but don't dictate how it should work.
|
||||
* Kubernetes allows users to choose the logging, monitoring, and alerting systems of their choice. (Though we do provide some integrations as proof of concept.)
|
||||
* Kubernetes does not provide nor mandate a comprehensive application configuration language/system (e.g., [jsonnet](https://github.com/google/jsonnet)).
|
||||
* Kubernetes does not provide nor adopt any comprehensive machine configuration, maintenance, management, or self-healing systems.
|
||||
|
||||
On the other hand, a number of PaaS systems run *on* Kubernetes, such as [Openshift](https://github.com/openshift/origin), [Deis](http://deis.io/), and [Eldarion](http://eldarion.cloud/). You could also roll your own custom PaaS, integrate with a CI system of your choice, or get along just fine with just Kubernetes: bring your container images and deploy them on Kubernetes.
|
||||
|
||||
Since Kubernetes operates at the application level rather than at just the hardware level, it provides some generally applicable features common to PaaS offerings, such as deployment, scaling, load balancing, logging, monitoring, etc. However, Kubernetes is not monolithic, and these default solutions are optional and pluggable.
|
||||
|
||||
Additionally, Kubernetes is not a mere "orchestration system"; it eliminates the need for orchestration. The technical definition of "orchestration" is execution of a defined workflow: do A, then B, then C. In contrast, Kubernetes is comprised of a set of independent, composable control processes that continuously drive current state towards the provided desired state. It shouldn't matter how you get from A to C: make it so. Centralized control is also not required; the approach is more akin to "choreography". This results in a system that is easier to use and more powerful, robust, resilient, and extensible.
|
||||
|
||||
#### What does *Kubernetes* mean? K8s?
|
||||
|
||||
The name **Kubernetes** originates from Greek, meaning "helmsman" or "pilot", and is the root of "governor" and ["cybernetic"](http://www.etymonline.com/index.php?term=cybernetics). **K8s** is an abbreviation derived by replacing the 8 letters "ubernete" with 8.
|
|
@ -1,5 +1,9 @@
|
|||
---
|
||||
title: Kubernetes Objects
|
||||
title: Understanding Kubernetes Objects
|
||||
|
||||
redirect_from:
|
||||
- "/docs/concepts/abstractions/overview/"
|
||||
- "/docs/concepts/abstractions/overview.html"
|
||||
---
|
||||
|
||||
{% capture overview %}
|
||||
|
@ -23,6 +27,7 @@ To work with Kubernetes objects--whether to create, modify, or delete them--you'
|
|||
|
||||
Every Kubernetes object includes two nested object fields that govern the object's configuration: the object *spec* and the object *status*. The *spec*, which you must provide, describes your *desired state* for the object--the characteristics that you want the object to have. The *status* describes the *actual state* for the object, and is supplied and updated by the Kubernetes system. At any given time, the Kubernetes Control Plane actively manages an object's actual state to match the desired state you supplied.
|
||||
|
||||
|
||||
For example, a Kubernetes Deployment is an object that can represent an application running on your cluster. When you create the Deployment, you might set the Deployment spec to specify that you want three replicas of the application to be running. The Kubernetes system reads the Deployment spec and starts three instances of your desired application--updating the status to match your spec. If any of those instances should fail (a status change), the Kubernetes system responds to the difference between spec and status by making a correction--in this case, starting a replacement instance.
|
||||
|
||||
For more information on the object spec, status, and metadata, see the [Kubernetes API Conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md).
|
||||
|
@ -33,7 +38,7 @@ When you create an object in Kubernetes, you must provide the object spec that d
|
|||
|
||||
Here's an example `.yaml` file that shows the required fields and object spec for a Kubernetes Deployment:
|
||||
|
||||
{% include code.html language="yaml" file="nginx-deployment.yaml" ghlink="/docs/concepts/abstractions/nginx-deployment.yaml" %}
|
||||
{% include code.html language="yaml" file="nginx-deployment.yaml" ghlink="/docs/concepts/overview/working-with-objects/nginx-deployment.yaml" %}
|
||||
|
||||
One way to create a Deployment using a `.yaml` file like the one above is to use the `kubectl create` command in the `kubectl` command-line interface, passing the `.yaml` file as an argument. Here's an example:
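A minimal sketch, assuming the manifest above has been saved locally as `nginx-deployment.yaml`:

```shell
kubectl create -f ./nginx-deployment.yaml --record
```

The `--record` flag saves the command in the resource's annotations, which can be useful later when reviewing rollout history.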
|
||||
|
|
@ -2,6 +2,9 @@
|
|||
assignees:
|
||||
- mikedanese
|
||||
title: Labels and Selectors
|
||||
redirect_from:
|
||||
- "/docs/user-guide/labels/"
|
||||
- "/docs/user-guide/labels.html"
|
||||
---
|
||||
|
||||
_Labels_ are key/value pairs that are attached to objects, such as pods.
|
||||
|
@ -154,7 +157,7 @@ this selector (respectively in `json` or `yaml` format) is equivalent to `compon
|
|||
|
||||
#### Resources that support set-based requirements
|
||||
|
||||
Newer resources, such as [`Job`](/docs/user-guide/jobs), [`Deployment`](/docs/user-guide/deployments/), [`Replica Set`](/docs/user-guide/replicasets/), and [`Daemon Set`](/docs/admin/daemons/), support _set-based_ requirements as well.
|
||||
Newer resources, such as [`Job`](/docs/concepts/jobs/run-to-completion-finite-workloads/), [`Deployment`](/docs/user-guide/deployments/), [`Replica Set`](/docs/user-guide/replicasets/), and [`Daemon Set`](/docs/admin/daemons/), support _set-based_ requirements as well.
|
||||
|
||||
```yaml
|
||||
selector:
|
|
@ -0,0 +1,16 @@
|
|||
apiVersion: extensions/v1beta1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: nginx-deployment
|
||||
spec:
|
||||
replicas: 3
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: nginx
|
||||
spec:
|
||||
containers:
|
||||
- name: nginx
|
||||
image: nginx:1.7.9
|
||||
ports:
|
||||
- containerPort: 80
|
|
@ -0,0 +1,240 @@
|
|||
---
|
||||
assignees:
|
||||
- derekwaynecarr
|
||||
title: Resource Quotas
|
||||
---
|
||||
|
||||
When several users or teams share a cluster with a fixed number of nodes,
|
||||
there is a concern that one team could use more than its fair share of resources.
|
||||
|
||||
Resource quotas are a tool for administrators to address this concern.
|
||||
|
||||
A resource quota, defined by a `ResourceQuota` object, provides constraints that limit
|
||||
aggregate resource consumption per namespace. It can limit the quantity of objects that can
|
||||
be created in a namespace by type, as well as the total amount of compute resources that may
|
||||
be consumed by resources in that project.
|
||||
|
||||
Resource quotas work like this:
|
||||
|
||||
- Different teams work in different namespaces. Currently this is voluntary, but
|
||||
support for making this mandatory via ACLs is planned.
|
||||
- The administrator creates one or more Resource Quota objects for each namespace.
|
||||
- Users create resources (pods, services, etc.) in the namespace, and the quota system
|
||||
tracks usage to ensure it does not exceed hard resource limits defined in a Resource Quota.
|
||||
- If creating or updating a resource violates a quota constraint, the request will fail with HTTP
|
||||
status code `403 FORBIDDEN` with a message explaining the constraint that would have been violated.
|
||||
- If quota is enabled in a namespace for compute resources like `cpu` and `memory`, users must specify
|
||||
requests or limits for those values; otherwise, the quota system may reject pod creation. Hint: Use
|
||||
the LimitRange admission controller to force defaults for pods that make no compute resource requirements.
|
||||
See the [walkthrough](/docs/admin/resourcequota/walkthrough/) for an example to avoid this problem.
|
||||
|
||||
Examples of policies that could be created using namespaces and quotas are:
|
||||
|
||||
- In a cluster with a capacity of 32 GiB RAM and 16 cores, let team A use 20 GiB and 10 cores,
|
||||
let B use 10GiB and 4 cores, and hold 2GiB and 2 cores in reserve for future allocation.
|
||||
- Limit the "testing" namespace to using 1 core and 1GiB RAM. Let the "production" namespace
|
||||
use any amount.
|
||||
|
||||
In the case where the total capacity of the cluster is less than the sum of the quotas of the namespaces,
|
||||
there may be contention for resources. This is handled on a first-come-first-served basis.
|
||||
|
||||
Neither contention nor changes to quota will affect already created resources.
|
||||
|
||||
## Enabling Resource Quota
|
||||
|
||||
Resource Quota support is enabled by default for many Kubernetes distributions. It is
|
||||
enabled when the apiserver `--admission-control=` flag has `ResourceQuota` as
|
||||
one of its arguments.
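For illustration, such an apiserver invocation might include a flag like the following (the other admission plug-ins listed here are examples only, not a recommendation):

```shell
# Fragment of a kube-apiserver command line; remaining flags omitted.
kube-apiserver --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota
```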
|
||||
|
||||
Resource Quota is enforced in a particular namespace when there is a
|
||||
`ResourceQuota` object in that namespace. There should be at most one
|
||||
`ResourceQuota` object in a namespace.
|
||||
|
||||
## Compute Resource Quota
|
||||
|
||||
You can limit the total sum of [compute resources](/docs/user-guide/compute-resources) that can be requested in a given namespace.
|
||||
|
||||
The following resource types are supported:
|
||||
|
||||
| Resource Name | Description |
|
||||
| --------------------- | ----------------------------------------------------------- |
|
||||
| `cpu` | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value. |
|
||||
| `limits.cpu` | Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value. |
|
||||
| `limits.memory` | Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value. |
|
||||
| `memory` | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value. |
|
||||
| `requests.cpu` | Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value. |
|
||||
| `requests.memory` | Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value. |
|
||||
|
||||
## Storage Resource Quota
|
||||
|
||||
You can limit the total sum of [storage resources](/docs/user-guide/persistent-volumes) that can be requested in a given namespace.
|
||||
|
||||
In addition, you can limit consumption of storage resources based on associated storage-class.
|
||||
|
||||
| Resource Name | Description |
|
||||
| --------------------- | ----------------------------------------------------------- |
|
||||
| `requests.storage` | Across all persistent volume claims, the sum of storage requests cannot exceed this value. |
|
||||
| `persistentvolumeclaims` | The total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
|
||||
| `<storage-class-name>.storageclass.storage.k8s.io/requests.storage` | Across all persistent volume claims associated with the storage-class-name, the sum of storage requests cannot exceed this value. |
|
||||
| `<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims` | Across all persistent volume claims associated with the storage-class-name, the total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
|
||||
|
||||
For example, if an operator wants to quota storage with `gold` storage class separate from `bronze` storage class, the operator can
|
||||
define a quota as follows:
|
||||
|
||||
* `gold.storageclass.storage.k8s.io/requests.storage: 500Gi`
|
||||
* `bronze.storageclass.storage.k8s.io/requests.storage: 100Gi`
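A sketch of a single `ResourceQuota` manifest expressing both of these limits (the object name is illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-by-class        # illustrative name
spec:
  hard:
    gold.storageclass.storage.k8s.io/requests.storage: 500Gi
    bronze.storageclass.storage.k8s.io/requests.storage: 100Gi
```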
|
||||
|
||||
## Object Count Quota
|
||||
|
||||
The number of objects of a given type can be restricted. The following types
|
||||
are supported:
|
||||
|
||||
| Resource Name | Description |
|
||||
| ------------------------------- | ------------------------------------------------- |
|
||||
| `configmaps` | The total number of config maps that can exist in the namespace. |
|
||||
| `persistentvolumeclaims` | The total number of [persistent volume claims](/docs/user-guide/persistent-volumes/#persistentvolumeclaims) that can exist in the namespace. |
|
||||
| `pods` | The total number of pods in a non-terminal state that can exist in the namespace. A pod is in a terminal state if `status.phase in (Failed, Succeeded)` is true. |
|
||||
| `replicationcontrollers` | The total number of replication controllers that can exist in the namespace. |
|
||||
| `resourcequotas` | The total number of [resource quotas](/docs/admin/admission-controllers/#resourcequota) that can exist in the namespace. |
|
||||
| `services` | The total number of services that can exist in the namespace. |
|
||||
| `services.loadbalancers` | The total number of services of type load balancer that can exist in the namespace. |
|
||||
| `services.nodeports` | The total number of services of type node port that can exist in the namespace. |
|
||||
| `secrets` | The total number of secrets that can exist in the namespace. |
|
||||
|
||||
For example, `pods` quota counts and enforces a maximum on the number of `pods`
|
||||
created in a single namespace.
|
||||
|
||||
You might want to set a pods quota on a namespace
|
||||
to avoid the case where a user creates many small pods and exhausts the cluster's
|
||||
supply of Pod IPs.
|
||||
|
||||
## Quota Scopes
|
||||
|
||||
Each quota can have an associated set of scopes. A quota will only measure usage for a resource if it matches
|
||||
the intersection of enumerated scopes.
|
||||
|
||||
When a scope is added to the quota, it limits the number of resources it supports to those that pertain to the scope.
|
||||
Resources specified on the quota outside of the allowed set result in a validation error.
|
||||
|
||||
| Scope | Description |
|
||||
| ----- | ----------- |
|
||||
| `Terminating` | Match pods where `spec.activeDeadlineSeconds >= 0` |
|
||||
| `NotTerminating` | Match pods where `spec.activeDeadlineSeconds is nil` |
|
||||
| `BestEffort` | Match pods that have best effort quality of service. |
|
||||
| `NotBestEffort` | Match pods that do not have best effort quality of service. |
|
||||
|
||||
The `BestEffort` scope restricts a quota to tracking the following resource: `pods`
|
||||
|
||||
The `Terminating`, `NotTerminating`, and `NotBestEffort` scopes restrict a quota to tracking the following resources:
|
||||
|
||||
* `cpu`
|
||||
* `limits.cpu`
|
||||
* `limits.memory`
|
||||
* `memory`
|
||||
* `pods`
|
||||
* `requests.cpu`
|
||||
* `requests.memory`
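For example, a quota restricted to the `BestEffort` scope can only set `pods`; a minimal sketch (the object name is illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort-pod-count    # illustrative name
spec:
  hard:
    pods: "10"
  scopes:
  - BestEffort
```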
|
||||
|
||||
## Requests vs Limits
|
||||
|
||||
When allocating compute resources, each container may specify a request and a limit value for either CPU or memory.
|
||||
The quota can be configured to quota either value.
|
||||
|
||||
If the quota has a value specified for `requests.cpu` or `requests.memory`, then it requires that every incoming
|
||||
container makes an explicit request for those resources. If the quota has a value specified for `limits.cpu` or `limits.memory`,
|
||||
then it requires that every incoming container specifies an explicit limit for those resources.
|
||||
|
||||
## Viewing and Setting Quotas
|
||||
|
||||
Kubectl supports creating, updating, and viewing quotas:
|
||||
|
||||
```shell
|
||||
$ kubectl create namespace myspace
|
||||
|
||||
$ cat <<EOF > compute-resources.yaml
|
||||
apiVersion: v1
|
||||
kind: ResourceQuota
|
||||
metadata:
|
||||
name: compute-resources
|
||||
spec:
|
||||
hard:
|
||||
pods: "4"
|
||||
requests.cpu: "1"
|
||||
requests.memory: 1Gi
|
||||
limits.cpu: "2"
|
||||
limits.memory: 2Gi
|
||||
EOF
|
||||
$ kubectl create -f ./compute-resources.yaml --namespace=myspace
|
||||
|
||||
$ cat <<EOF > object-counts.yaml
|
||||
apiVersion: v1
|
||||
kind: ResourceQuota
|
||||
metadata:
|
||||
name: object-counts
|
||||
spec:
|
||||
hard:
|
||||
configmaps: "10"
|
||||
persistentvolumeclaims: "4"
|
||||
replicationcontrollers: "20"
|
||||
secrets: "10"
|
||||
services: "10"
|
||||
services.loadbalancers: "2"
|
||||
EOF
|
||||
$ kubectl create -f ./object-counts.yaml --namespace=myspace
|
||||
|
||||
$ kubectl get quota --namespace=myspace
|
||||
NAME AGE
|
||||
compute-resources 30s
|
||||
object-counts 32s
|
||||
|
||||
$ kubectl describe quota compute-resources --namespace=myspace
|
||||
Name: compute-resources
|
||||
Namespace: myspace
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
limits.cpu 0 2
|
||||
limits.memory 0 2Gi
|
||||
pods 0 4
|
||||
requests.cpu 0 1
|
||||
requests.memory 0 1Gi
|
||||
|
||||
$ kubectl describe quota object-counts --namespace=myspace
|
||||
Name: object-counts
|
||||
Namespace: myspace
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
configmaps 0 10
|
||||
persistentvolumeclaims 0 4
|
||||
replicationcontrollers 0 20
|
||||
secrets 1 10
|
||||
services 0 10
|
||||
services.loadbalancers 0 2
|
||||
```
|
||||
|
||||
## Quota and Cluster Capacity
|
||||
|
||||
Resource Quota objects are independent of the Cluster Capacity. They are
|
||||
expressed in absolute units. So, if you add nodes to your cluster, this does *not*
|
||||
automatically give each namespace the ability to consume more resources.
|
||||
|
||||
Sometimes more complex policies may be desired, such as:
|
||||
|
||||
- proportionally divide total cluster resources among several teams.
|
||||
- allow each tenant to grow resource usage as needed, but have a generous
|
||||
limit to prevent accidental resource exhaustion.
|
||||
- detect demand from one namespace, add nodes, and increase quota.
|
||||
|
||||
Such policies could be implemented using ResourceQuota as a building-block, by
|
||||
writing a 'controller' which watches the quota usage and adjusts the quota
|
||||
hard limits of each namespace according to other signals.
|
||||
|
||||
Note that resource quota divides up aggregate cluster resources, but it creates no
|
||||
restrictions around nodes: pods from several namespaces may run on the same node.
|
||||
|
||||
## Example
|
||||
|
||||
See a [detailed example for how to use resource quota](/docs/admin/resourcequota/walkthrough/).
|
||||
|
||||
## Read More
|
||||
|
||||
See [ResourceQuota design doc](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/admission_control_resource_quota.md) for more information.
|
|
@ -0,0 +1,389 @@
|
|||
---
|
||||
assignees:
|
||||
- davidopp
|
||||
- thockin
|
||||
title: DNS Pods and Services
|
||||
---
|
||||
|
||||
## Introduction
|
||||
|
||||
As of Kubernetes 1.3, DNS is a built-in service launched automatically using the addon manager [cluster add-on](http://releases.k8s.io/{{page.githubbranch}}/cluster/addons/README.md).
|
||||
|
||||
Kubernetes DNS schedules a DNS Pod and Service on the cluster, and configures
|
||||
the kubelets to tell individual containers to use the DNS Service's IP to
|
||||
resolve DNS names.
|
||||
|
||||
## What things get DNS names?
|
||||
|
||||
Every Service defined in the cluster (including the DNS server itself) is
|
||||
assigned a DNS name. By default, a client Pod's DNS search list will
|
||||
include the Pod's own namespace and the cluster's default domain. This is best
|
||||
illustrated by example:
|
||||
|
||||
Assume a Service named `foo` in the Kubernetes namespace `bar`. A Pod running
|
||||
in namespace `bar` can look up this service by simply doing a DNS query for
|
||||
`foo`. A Pod running in namespace `quux` can look up this service by doing a
|
||||
DNS query for `foo.bar`.
|
||||
|
||||
## Supported DNS schema
|
||||
|
||||
The following sections detail the record types and layout that are
|
||||
supported. Any other layout or names or queries that happen to work are
|
||||
considered implementation details and are subject to change without warning.
|
||||
|
||||
### Services
|
||||
|
||||
#### A records
|
||||
|
||||
"Normal" (not headless) Services are assigned a DNS A record for a name of the
|
||||
form `my-svc.my-namespace.svc.cluster.local`. This resolves to the cluster IP
|
||||
of the Service.
|
||||
|
||||
"Headless" (without a cluster IP) Services are also assigned a DNS A record for
|
||||
a name of the form `my-svc.my-namespace.svc.cluster.local`. Unlike normal
|
||||
Services, this resolves to the set of IPs of the pods selected by the Service.
|
||||
Clients are expected to consume the set or else use standard round-robin
|
||||
selection from the set.
|
||||
|
||||
#### SRV records
|
||||
|
||||
SRV Records are created for named ports that are part of normal or [Headless
|
||||
Services](/docs/user-guide/services/#headless-services).
|
||||
For each named port, the SRV record would have the form
|
||||
`_my-port-name._my-port-protocol.my-svc.my-namespace.svc.cluster.local`.
|
||||
For a regular service, this resolves to the port number and the CNAME:
|
||||
`my-svc.my-namespace.svc.cluster.local`.
|
||||
For a headless service, this resolves to multiple answers, one for each pod
|
||||
that is backing the service, and contains the port number and a CNAME of the pod
|
||||
of the form `auto-generated-name.my-svc.my-namespace.svc.cluster.local`.
|
||||
|
||||
### Backwards compatibility
|
||||
|
||||
Previous versions of kube-dns made names of the form
|
||||
`my-svc.my-namespace.cluster.local` (the 'svc' level was added later). This
|
||||
is no longer supported.
|
||||
|
||||
### Pods
|
||||
|
||||
#### A Records
|
||||
|
||||
When enabled, pods are assigned a DNS A record in the form of `pod-ip-address.my-namespace.pod.cluster.local`.
|
||||
|
||||
For example, a pod with IP `1.2.3.4` in the namespace `default` with a DNS name of `cluster.local` would have an entry: `1-2-3-4.default.pod.cluster.local`.
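A quick way to check such a record from inside the cluster (the IP and namespace here match the example above) might be:

```shell
nslookup 1-2-3-4.default.pod.cluster.local
```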
|
||||
|
||||
#### A Records and hostname based on Pod's hostname and subdomain fields
|
||||
|
||||
Currently when a pod is created, its hostname is the Pod's `metadata.name` value.
|
||||
|
||||
With v1.2, users can specify a Pod annotation, `pod.beta.kubernetes.io/hostname`, to specify what the Pod's hostname should be.
|
||||
If specified, the annotation takes precedence over the Pod's name as the hostname of the pod.
|
||||
For example, given a Pod with annotation `pod.beta.kubernetes.io/hostname: my-pod-name`, the Pod will have its hostname set to "my-pod-name".
|
||||
|
||||
With v1.3, the PodSpec has a `hostname` field, which can be used to specify the Pod's hostname. This field value takes precedence over the
|
||||
`pod.beta.kubernetes.io/hostname` annotation value.
|
||||
|
||||
v1.2 introduces a beta feature where the user can specify a Pod annotation, `pod.beta.kubernetes.io/subdomain`, to specify the Pod's subdomain.
|
||||
The final domain will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>".
|
||||
For example, a Pod with the hostname annotation set to "foo", and the subdomain annotation set to "bar", in namespace "my-namespace", will have the FQDN "foo.bar.my-namespace.svc.cluster.local"
|
||||
|
||||
With v1.3, the PodSpec has a `subdomain` field, which can be used to specify the Pod's subdomain. This field value takes precedence over the
|
||||
`pod.beta.kubernetes.io/subdomain` annotation value.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: default-subdomain
|
||||
spec:
|
||||
selector:
|
||||
name: busybox
|
||||
clusterIP: None
|
||||
ports:
|
||||
- name: foo # Actually, no port is needed.
|
||||
port: 1234
|
||||
targetPort: 1234
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: busybox1
|
||||
labels:
|
||||
name: busybox
|
||||
spec:
|
||||
hostname: busybox-1
|
||||
subdomain: default-subdomain
|
||||
containers:
|
||||
- image: busybox
|
||||
command:
|
||||
- sleep
|
||||
- "3600"
|
||||
name: busybox
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: busybox2
|
||||
labels:
|
||||
name: busybox
|
||||
spec:
|
||||
hostname: busybox-2
|
||||
subdomain: default-subdomain
|
||||
containers:
|
||||
- image: busybox
|
||||
command:
|
||||
- sleep
|
||||
- "3600"
|
||||
name: busybox
|
||||
```
|
||||
|
||||
If there exists a headless service in the same namespace as the pod and with the same name as the subdomain, the cluster's KubeDNS Server also returns an A record for the Pod's fully qualified hostname.
|
||||
Given a Pod with the hostname set to "busybox-1" and the subdomain set to "default-subdomain", and a headless Service named "default-subdomain" in the same namespace, the pod will see its own FQDN as "busybox-1.default-subdomain.my-namespace.svc.cluster.local". DNS serves an A record at that name, pointing to the Pod's IP. Both pods "busybox1" and "busybox2" can have their distinct A records.
|
||||
|
||||
As of Kubernetes v1.2, the Endpoints object also has the annotation `endpoints.beta.kubernetes.io/hostnames-map`. Its value is the json representation of map[string(IP)][endpoints.HostRecord], for example: '{"10.245.1.6":{HostName: "my-webserver"}}'.
|
||||
If the Endpoints are for a headless service, an A record is created with the format <hostname>.<service name>.<pod namespace>.svc.<cluster domain>
|
||||
For the example json, if endpoints are for a headless service named "bar", and one of the endpoints has IP "10.245.1.6", an A record is created with the name "my-webserver.bar.my-namespace.svc.cluster.local" and the A record lookup would return "10.245.1.6".
|
||||
This endpoints annotation generally does not need to be specified by end-users, but can used by the internal service controller to deliver the aforementioned feature.
|
||||
|
||||
With v1.3, the Endpoints object can specify the `hostname` for any endpoint, along with its IP. The hostname field takes precedence over the hostname value
|
||||
that might have been specified via the `endpoints.beta.kubernetes.io/hostnames-map` annotation.
|
||||
|
||||
With v1.3, the following annotations are deprecated: `pod.beta.kubernetes.io/hostname`, `pod.beta.kubernetes.io/subdomain`, `endpoints.beta.kubernetes.io/hostnames-map`
|
||||
|
||||
## How do I test if it is working?
|
||||
|
||||
### Create a simple Pod to use as a test environment
|
||||
|
||||
Create a file named busybox.yaml with the
|
||||
following contents:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: busybox
|
||||
namespace: default
|
||||
spec:
|
||||
containers:
|
||||
- image: busybox
|
||||
command:
|
||||
- sleep
|
||||
- "3600"
|
||||
imagePullPolicy: IfNotPresent
|
||||
name: busybox
|
||||
restartPolicy: Always
|
||||
```
|
||||
|
||||
Then create a pod using this file:
|
||||
|
||||
```
|
||||
kubectl create -f busybox.yaml
|
||||
```
|
||||
|
||||
### Wait for this pod to go into the running state
|
||||
|
||||
You can get its status with:
|
||||
```
|
||||
kubectl get pods busybox
|
||||
```
|
||||
|
||||
You should see:
|
||||
|
||||
```
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
busybox 1/1 Running 0 <some-time>
|
||||
```
|
||||
|
||||
### Validate that DNS is working
|
||||
|
||||
Once that pod is running, you can exec nslookup in that environment:
|
||||
|
||||
```
|
||||
kubectl exec -ti busybox -- nslookup kubernetes.default
|
||||
```
|
||||
|
||||
You should see something like:
|
||||
|
||||
```
|
||||
Server: 10.0.0.10
|
||||
Address 1: 10.0.0.10
|
||||
|
||||
Name: kubernetes.default
|
||||
Address 1: 10.0.0.1
|
||||
```
|
||||
|
||||
If you see that, DNS is working correctly.
|
||||
|
||||
### Troubleshooting Tips
|
||||
|
||||
If the nslookup command fails, check the following:
|
||||
|
||||
#### Check the local DNS configuration first
|
||||
Take a look inside the resolv.conf file. (See "Inheriting DNS from the node" and "Known issues" below for more information)
|
||||
|
||||
```
|
||||
kubectl exec busybox cat /etc/resolv.conf
|
||||
```
|
||||
|
||||
Verify that the search path and name server are set up like the following (note that search path may vary for different cloud providers):
|
||||
|
||||
```
|
||||
search default.svc.cluster.local svc.cluster.local cluster.local google.internal c.gce_project_id.internal
|
||||
nameserver 10.0.0.10
|
||||
options ndots:5
|
||||
```
|
||||
|
||||
#### Quick diagnosis
|
||||
|
||||
Errors such as the following indicate a problem with the kube-dns add-on or associated Services:
|
||||
|
||||
```
|
||||
$ kubectl exec -ti busybox -- nslookup kubernetes.default
|
||||
Server: 10.0.0.10
|
||||
Address 1: 10.0.0.10
|
||||
|
||||
nslookup: can't resolve 'kubernetes.default'
|
||||
```
|
||||
|
||||
or
|
||||
|
||||
```
|
||||
$ kubectl exec -ti busybox -- nslookup kubernetes.default
|
||||
Server: 10.0.0.10
|
||||
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
|
||||
|
||||
nslookup: can't resolve 'kubernetes.default'
|
||||
```
|
||||
|
||||
#### Check if the DNS pod is running
|
||||
|
||||
Use the kubectl get pods command to verify that the DNS pod is running.
|
||||
|
||||
```
|
||||
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
|
||||
```
|
||||
|
||||
You should see something like:
|
||||
|
||||
```
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
...
|
||||
kube-dns-v19-ezo1y 3/3 Running 0 1h
|
||||
...
|
||||
```
|
||||
|
||||
If you see that no pod is running or that the pod has failed/completed, the DNS add-on may not be deployed by default in your current environment and you will have to deploy it manually.
|
||||
|
||||
#### Check for Errors in the DNS pod
|
||||
|
||||
Use `kubectl logs` command to see logs for the DNS daemons.
|
||||
|
||||
```
|
||||
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c kubedns
|
||||
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c dnsmasq
|
||||
kubectl logs --namespace=kube-system $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name) -c healthz
|
||||
```
|
||||
|
||||
Check for any suspicious log entries. Lines beginning with the letters W, E, or F indicate Warning, Error, and Fatal severity, respectively. Search for entries with these logging levels and use [kubernetes issues](https://github.com/kubernetes/kubernetes/issues) to report unexpected errors.
|
||||
|
||||
#### Is DNS service up?
|
||||
|
||||
Verify that the DNS service is up by using the `kubectl get service` command.
|
||||
|
||||
```
|
||||
kubectl get svc --namespace=kube-system
|
||||
```
|
||||
|
||||
You should see:
|
||||
|
||||
```
|
||||
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
|
||||
...
|
||||
kube-dns 10.0.0.10 <none> 53/UDP,53/TCP 1h
|
||||
...
|
||||
```
|
||||
|
||||
If you have created the service, or if it should have been created by default but does not appear, see this [debugging services page](http://kubernetes.io/docs/user-guide/debugging-services/) for more information.
|
||||
|
||||
#### Are DNS endpoints exposed?
|
||||
|
||||
You can verify that DNS endpoints are exposed by using the `kubectl get endpoints` command.
|
||||
|
||||
```
|
||||
kubectl get ep kube-dns --namespace=kube-system
|
||||
```
|
||||
|
||||
You should see something like:
|
||||
```
|
||||
NAME ENDPOINTS AGE
|
||||
kube-dns 10.180.3.17:53,10.180.3.17:53 1h
|
||||
```
|
||||
|
||||
If you do not see the endpoints, see endpoints section in the [debugging services documentation](http://kubernetes.io/docs/user-guide/debugging-services/).
|
||||
|
||||
For additional Kubernetes DNS examples, see the [cluster-dns examples](https://github.com/kubernetes/kubernetes/tree/master/examples/cluster-dns) in the Kubernetes GitHub repository.
|
||||
|
||||
## Kubernetes Federation (Multiple Zone support)
|
||||
|
||||
Release 1.3 introduced Cluster Federation support for multi-site
|
||||
Kubernetes installations. This required some minor
|
||||
(backward-compatible) changes to the way
|
||||
the Kubernetes cluster DNS server processes DNS queries, to facilitate
|
||||
the lookup of federated services (which span multiple Kubernetes clusters).
|
||||
See the [Cluster Federation Administrators' Guide](/docs/admin/federation) for more
|
||||
details on Cluster Federation and multi-site support.
|
||||
|
||||
## How it Works
|
||||
|
||||
The running Kubernetes DNS pod holds 3 containers - kubedns, dnsmasq and a health check called healthz.
|
||||
The kubedns process watches the Kubernetes master for changes in Services and Endpoints, and maintains
|
||||
in-memory lookup structures to service DNS requests. The dnsmasq container adds DNS caching to improve
|
||||
performance. The healthz container provides a single health check endpoint while performing dual healthchecks
|
||||
(for dnsmasq and kubedns).
|
||||
|
||||
The DNS pod is exposed as a Kubernetes Service with a static IP. Once assigned, the
|
||||
kubelet configures the containers it starts to use that IP for DNS resolution, via the
|
||||
`--cluster-dns=10.0.0.10` flag.
|
||||
|
||||
DNS names also need domains. The local domain is configurable in the kubelet using
|
||||
the flag `--cluster-domain=<default local domain>`.
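A sketch of how the `--cluster-dns` and `--cluster-domain` flags might appear together on a kubelet command line (the values shown are the conventional examples used elsewhere on this page):

```shell
# Fragment of a kubelet command line; remaining flags omitted.
kubelet --cluster-dns=10.0.0.10 --cluster-domain=cluster.local
```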
|
||||
|
||||
The Kubernetes cluster DNS server (based off the [SkyDNS](https://github.com/skynetservices/skydns) library)
|
||||
supports forward lookups (A records), service lookups (SRV records) and reverse IP address lookups (PTR records).
|
||||
|
||||
## Inheriting DNS from the node
|
||||
When running a pod, kubelet will prepend the cluster DNS server and search
|
||||
paths to the node's own DNS settings. If the node is able to resolve DNS names
|
||||
specific to the larger environment, pods should be able to resolve them as well. See "Known
|
||||
issues" below for a caveat.
|
||||
|
||||
If you don't want this, or if you want a different DNS config for pods, you can
|
||||
use the kubelet's `--resolv-conf` flag. Setting it to "" means that pods will
|
||||
not inherit DNS. Setting it to a valid file path means that kubelet will use
|
||||
this file instead of `/etc/resolv.conf` for DNS inheritance.
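For illustration (the file path is an assumption):

```shell
# Use a custom file instead of /etc/resolv.conf for DNS inheritance:
kubelet --resolv-conf=/etc/kubernetes/kubelet-resolv.conf
# Or disable DNS inheritance entirely:
kubelet --resolv-conf=""
```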
|
||||
|
||||
## Known issues
|
||||
Kubernetes installs do not configure the nodes' resolv.conf files to use the
|
||||
cluster DNS by default, because that process is inherently distro-specific.
|
||||
This should probably be implemented eventually.
|
||||
|
||||
Linux's libc is impossibly stuck ([see this bug from
|
||||
2005](https://bugzilla.redhat.com/show_bug.cgi?id=168253)) with limits of just
|
||||
3 DNS `nameserver` records and 6 DNS `search` records. Kubernetes needs to
|
||||
consume 1 `nameserver` record and 3 `search` records. This means that if a
|
||||
local installation already uses 3 `nameserver`s or uses more than 3 `search`es,
|
||||
some of those settings will be lost. As a partial workaround, the node can run
|
||||
`dnsmasq` which will provide more `nameserver` entries, but not more `search`
|
||||
entries. You can also use kubelet's `--resolv-conf` flag.
|
||||
|
||||
If you are using Alpine version 3.3 or earlier as your base image, DNS may not
|
||||
work properly owing to a known issue with Alpine. Check [here](https://github.com/kubernetes/kubernetes/issues/30215)
|
||||
for more information.
|
||||
|
||||
## References
|
||||
|
||||
- [Docs for the DNS cluster addon](http://releases.k8s.io/{{page.githubbranch}}/cluster/addons/dns/README.md)
|
||||
|
||||
## What's next
|
||||
- [Autoscaling the DNS Service in a Cluster](/docs/tasks/administer-cluster/dns-horizontal-autoscaling/).
|
|
@ -2,6 +2,9 @@
|
|||
assignees:
|
||||
- erictune
|
||||
title: Init Containers
|
||||
redirect_from:
|
||||
- "/docs/concepts/abstractions/init-containers/"
|
||||
- "/docs/concepts/abstractions/init-containers.html"
|
||||
---
|
||||
|
||||
{% capture overview %}
|
|
@ -1,5 +1,8 @@
|
|||
---
|
||||
title: Pods
|
||||
title: Pod Overview
|
||||
redirect_from:
|
||||
- "/docs/concepts/abstractions/pod/"
|
||||
- "/docs/concepts/abstractions/pod.html"
|
||||
---
|
||||
|
||||
{% capture overview %}
|
|
@ -0,0 +1,76 @@
|
|||
---
|
||||
title: Limiting Storage Consumption
|
||||
---
|
||||
|
||||
This example demonstrates an easy way to limit the amount of storage consumed in a namespace.
|
||||
|
||||
The following resources are used in the demonstration:
|
||||
|
||||
* [Resource Quota](/docs/admin/resourcequota/)
|
||||
* [Limit Range](/docs/admin/limitrange/)
|
||||
* [Persistent Volume Claim](/docs/user-guide/persistent-volumes/)
|
||||
|
||||
This example assumes you have a functional Kubernetes setup.
|
||||
|
||||
## Limiting Storage Consumption
|
||||
|
||||
The cluster-admin is operating a cluster on behalf of a user population and the admin wants to control
|
||||
how much storage a single namespace can consume in order to control cost.
|
||||
|
||||
The admin would like to limit:
|
||||
|
||||
1. The number of persistent volume claims in a namespace
|
||||
2. The amount of storage each claim can request
|
||||
3. The amount of cumulative storage the namespace can have
|
||||
|
||||
|
||||
## LimitRange to limit requests for storage
|
||||
|
||||
Adding a `LimitRange` to a namespace constrains storage request sizes to a minimum and maximum. Storage is requested
|
||||
via `PersistentVolumeClaim`. The admission controller that enforces limit ranges will reject any PVC that is above or below
|
||||
the values set by the admin.
|
||||
|
||||
In this example, a PVC requesting 10Gi of storage would be rejected because it exceeds the 2Gi max.
|
||||
|
||||
```
|
||||
apiVersion: v1
|
||||
kind: LimitRange
|
||||
metadata:
|
||||
name: storagelimits
|
||||
spec:
|
||||
limits:
|
||||
- type: PersistentVolumeClaim
|
||||
max:
|
||||
storage: 2Gi
|
||||
min:
|
||||
storage: 1Gi
|
||||
```
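For illustration, a claim like the following sketch (names are hypothetical) would be rejected by the admission controller because its 10Gi request exceeds the 2Gi maximum:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: too-big-claim           # illustrative name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi             # above the 2Gi max, so this PVC is rejected
```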
|
||||
|
||||
Minimum storage requests are used when the underlying storage provider requires certain minimums. For example,
|
||||
AWS EBS volumes have a 1Gi minimum requirement.
|
||||
|
||||
## StorageQuota to limit PVC count and cumulative storage capacity
|
||||
|
||||
Admins can limit the number of PVCs in a namespace as well as the cumulative capacity of those PVCs. New PVCs that exceed
|
||||
either maximum value will be rejected.
|
||||
|
||||
In this example, a 6th PVC in the namespace would be rejected because it exceeds the maximum count of 5. Alternatively,
|
||||
with a 5Gi maximum quota combined with the 2Gi max limit above, the namespace cannot have 3 PVCs of 2Gi each. That would be 6Gi requested
|
||||
for a namespace capped at 5Gi.
|
||||
|
||||
```
|
||||
apiVersion: v1
|
||||
kind: ResourceQuota
|
||||
metadata:
|
||||
name: storagequota
|
||||
spec:
|
||||
hard:
|
||||
persistentvolumeclaims: "5"
|
||||
requests.storage: "5Gi"
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
A limit range can put a ceiling on how much storage is requested while a resource quota can effectively cap the storage
|
||||
consumed by a namespace through claim counts and cumulative storage capacity. This allows a cluster-admin to plan their
|
||||
cluster's storage budget without risk of any one project going over their allotment.
|
|
@ -0,0 +1,87 @@
|
|||
---
|
||||
title: Federated ConfigMap
|
||||
---
|
||||
|
||||
This guide explains how to use ConfigMaps in a Federation control plane.
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes Cluster
|
||||
Federation installation. If not, then head over to the
|
||||
[federation admin guide](/docs/admin/federation/) to learn how to
|
||||
bring up a cluster federation (or have your cluster administrator do
|
||||
this for you).
|
||||
Other tutorials, such as Kelsey Hightower's
|
||||
[Federated Kubernetes Tutorial](https://github.com/kelseyhightower/kubernetes-cluster-federation),
|
||||
might also help you create a Federated Kubernetes cluster.
|
||||
|
||||
You should also have a basic
|
||||
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
|
||||
general and [ConfigMaps](/docs/user-guide/configmap/) in particular.
|
||||
|
||||
## Overview
|
||||
|
||||
Federated ConfigMaps are very similar to the traditional [Kubernetes
|
||||
ConfigMaps](/docs/user-guide/configmap/) and provide the same functionality.
|
||||
Creating them in the federation control plane ensures that they are synchronized
|
||||
across all the clusters in federation.
|
||||
|
||||
|
||||
## Creating a Federated ConfigMap
|
||||
|
||||
The API for Federated ConfigMap is 100% compatible with the
|
||||
API for traditional Kubernetes ConfigMap. You can create a ConfigMap by sending
|
||||
a request to the federation apiserver.
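The manifest itself is an ordinary ConfigMap; a minimal sketch of what a file such as `myconfigmap.yaml` might contain (the keys and values are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: myconfigmap
data:
  log-level: debug              # illustrative key/value
```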
|
||||
|
||||
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
|
||||
|
||||
``` shell
|
||||
kubectl --context=federation-cluster create -f myconfigmap.yaml
|
||||
```
|
||||
|
||||
The `--context=federation-cluster` flag tells kubectl to submit the
|
||||
request to the Federation apiserver instead of sending it to a Kubernetes
|
||||
cluster.
|
||||
|
||||
Once a Federated ConfigMap is created, the federation control plane will create
|
||||
a matching ConfigMap in all underlying Kubernetes clusters.
|
||||
You can verify this by checking each of the underlying clusters, for example:
|
||||
|
||||
``` shell
|
||||
kubectl --context=gce-asia-east1a get configmap myconfigmap
|
||||
```
|
||||
|
||||
The above assumes that you have a context named 'gce-asia-east1a'
|
||||
configured in your client for your cluster in that zone.
|
||||
|
||||
These ConfigMaps in underlying clusters will match the Federated ConfigMap.
|
||||
|
||||
|
||||
## Updating a Federated ConfigMap
|
||||
|
||||
You can update a Federated ConfigMap as you would update a Kubernetes
|
||||
ConfigMap; however, for a Federated ConfigMap, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
The federation control plane ensures that whenever the Federated ConfigMap is
|
||||
updated, it updates the corresponding ConfigMaps in all underlying clusters to
|
||||
match it.
|
||||
|
||||
## Deleting a Federated ConfigMap
|
||||
|
||||
You can delete a Federated ConfigMap as you would delete a Kubernetes
|
||||
ConfigMap; however, for a Federated ConfigMap, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
|
||||
For example, you can do that using kubectl by running:
|
||||
|
||||
```shell
|
||||
kubectl --context=federation-cluster delete configmap myconfigmap
|
||||
```
|
||||
|
||||
Note that at this point, deleting a Federated ConfigMap will not delete the
|
||||
corresponding ConfigMaps from underlying clusters.
|
||||
You must delete the underlying ConfigMaps manually.
|
||||
We intend to fix this in the future.
|
|
@ -0,0 +1,83 @@
|
|||
---
|
||||
title: Federated DaemonSet
|
||||
---
|
||||
|
||||
This guide explains how to use DaemonSets in a federation control plane.
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes Cluster
|
||||
Federation installation. If not, then head over to the
|
||||
[federation admin guide](/docs/admin/federation/) to learn how to
|
||||
bring up a cluster federation (or have your cluster administrator do
|
||||
this for you).
|
||||
Other tutorials, such as Kelsey Hightower's
|
||||
[Federated Kubernetes Tutorial](https://github.com/kelseyhightower/kubernetes-cluster-federation),
|
||||
might also help you create a Federated Kubernetes cluster.
|
||||
|
||||
You should also have a basic
|
||||
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
|
||||
general and DaemonSets in particular.
|
||||
|
||||
## Overview
|
||||
|
||||
DaemonSets in federation control plane ("Federated Daemonsets" in
|
||||
this guide) are very similar to the traditional Kubernetes
|
||||
DaemonSets and provide the same functionality.
|
||||
Creating them in the federation control plane ensures that they are synchronized
|
||||
across all the clusters in federation.
|
||||
|
||||
|
||||
## Creating a Federated Daemonset
|
||||
|
||||
The API for Federated Daemonset is 100% compatible with the
|
||||
API for traditional Kubernetes DaemonSet. You can create a DaemonSet by sending
|
||||
a request to the federation apiserver.
|
||||
|
||||
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
|
||||
|
||||
``` shell
|
||||
kubectl --context=federation-cluster create -f mydaemonset.yaml
|
||||
```
|
||||
|
||||
The `--context=federation-cluster` flag tells kubectl to submit the
|
||||
request to the Federation apiserver instead of sending it to a Kubernetes
|
||||
cluster.
|
||||
|
||||
Once a Federated Daemonset is created, the federation control plane will create
|
||||
a matching DaemonSet in all underlying Kubernetes clusters.
|
||||
You can verify this by checking each of the underlying clusters, for example:
|
||||
|
||||
``` shell
|
||||
kubectl --context=gce-asia-east1a get daemonset mydaemonset
|
||||
```
|
||||
|
||||
The above assumes that you have a context named 'gce-asia-east1a'
|
||||
configured in your client for your cluster in that zone.
|
||||
|
||||
These DaemonSets in underlying clusters will match the Federated Daemonset.
|
||||
|
||||
|
||||
## Updating a Federated Daemonset
|
||||
|
||||
You can update a Federated Daemonset as you would update a Kubernetes
|
||||
DaemonSet; however, for a Federated Daemonset, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
The federation control plane ensures that whenever the Federated Daemonset is
|
||||
updated, it updates the corresponding DaemonSets in all underlying clusters to
|
||||
match it.
|
||||
|
||||
## Deleting a Federated Daemonset
|
||||
|
||||
You can delete a Federated Daemonset as you would delete a Kubernetes
|
||||
DaemonSet; however, for a Federated Daemonset, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
|
||||
For example, you can do that using kubectl by running:
|
||||
|
||||
```shell
|
||||
kubectl --context=federation-cluster delete daemonset mydaemonset
|
||||
```
|
|
@ -0,0 +1,108 @@
|
|||
---
|
||||
title: Federated Deployment
|
||||
---
|
||||
|
||||
This guide explains how to use Deployments in the Federation control plane.
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes Cluster
|
||||
Federation installation. If not, then head over to the
|
||||
[federation admin guide](/docs/admin/federation/) to learn how to
|
||||
bring up a cluster federation (or have your cluster administrator do
|
||||
this for you).
|
||||
Other tutorials, such as Kelsey Hightower's
|
||||
[Federated Kubernetes Tutorial](https://github.com/kelseyhightower/kubernetes-cluster-federation),
|
||||
might also help you create a Federated Kubernetes cluster.
|
||||
|
||||
You should also have a basic
|
||||
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
|
||||
general and [Deployment](/docs/user-guide/deployments) in particular.
|
||||
|
||||
## Overview
|
||||
|
||||
Deployments in federation control plane (referred to as "Federated Deployments" in
|
||||
this guide) are very similar to the traditional [Kubernetes
|
||||
Deployment](/docs/user-guide/deployments/), and provide the same functionality.
|
||||
Creating them in the federation control plane ensures that the desired number of
|
||||
replicas exist across the registered clusters.
|
||||
|
||||
**As of Kubernetes version 1.5, Federated Deployment is an Alpha feature. The core
|
||||
functionality of Deployment is present, but some features
|
||||
(such as full rollout compatibility) are still in development.**
|
||||
|
||||
## Creating a Federated Deployment
|
||||
|
||||
The API for Federated Deployment is compatible with the
|
||||
API for traditional Kubernetes Deployment. You can create a Deployment by sending
|
||||
a request to the federation apiserver.
|
||||
|
||||
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
|
||||
|
||||
``` shell
|
||||
kubectl --context=federation-cluster create -f mydeployment.yaml
|
||||
```
|
||||
|
||||
The `--context=federation-cluster` flag tells kubectl to submit the
|
||||
request to the Federation apiserver instead of sending it to a Kubernetes
|
||||
cluster.
|
||||
|
||||
Once a Federated Deployment is created, the federation control plane will create
|
||||
a Deployment in all underlying Kubernetes clusters.
|
||||
You can verify this by checking each of the underlying clusters, for example:
|
||||
|
||||
``` shell
|
||||
kubectl --context=gce-asia-east1a get deployment mydep
|
||||
```
|
||||
|
||||
The above assumes that you have a context named 'gce-asia-east1a'
|
||||
configured in your client for your cluster in that zone.
|
||||
|
||||
These Deployments in underlying clusters will match the federation Deployment
|
||||
_except_ in the number of replicas and revision-related annotations.
|
||||
Federation control plane ensures that the
|
||||
sum of replicas in each cluster combined matches the desired number of replicas in the
|
||||
Federated Deployment.
|
||||
|
||||
### Spreading Replicas in Underlying Clusters
|
||||
|
||||
By default, replicas are spread equally in all the underlying clusters. For example:
|
||||
if you have 3 registered clusters and you create a Federated Deployment with
|
||||
`spec.replicas = 9`, then each Deployment in the 3 clusters will have
|
||||
`spec.replicas=3`.
|
||||
To modify the number of replicas in each cluster, you can specify
|
||||
[FederatedReplicaSetPreference](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/federation/apis/federation/types.go)
|
||||
as an annotation with key `federation.kubernetes.io/deployment-preferences`
|
||||
on Federated Deployment.
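A sketch of such an annotation on a Federated Deployment, assuming the JSON keys follow the `FederatedReplicaSetPreference` type and using hypothetical cluster names and weights:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: mydep
  annotations:
    federation.kubernetes.io/deployment-preferences: |
      {"rebalance": true,
       "clusters": {
         "gce-asia-east1a": {"minReplicas": 2, "weight": 1},
         "gce-us-central1b": {"weight": 2}}}
spec:
  replicas: 9
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
```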
|
||||
|
||||
|
||||
## Updating a Federated Deployment
|
||||
|
||||
You can update a Federated Deployment as you would update a Kubernetes
|
||||
Deployment; however, for a Federated Deployment, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
The federation control plane ensures that whenever the Federated Deployment is
|
||||
updated, it updates the corresponding Deployments in all underlying clusters to
|
||||
match it. So if the rolling update strategy was chosen then the underlying
|
||||
cluster will do the rolling update independently and `maxSurge` and `maxUnavailable`
|
||||
will apply only to individual clusters. This behavior may change in the future.
|
||||
|
||||
If your update includes a change in number of replicas, the federation
|
||||
control plane will change the number of replicas in underlying clusters to
|
||||
ensure that their sum remains equal to the number of desired replicas in
|
||||
Federated Deployment.
|
||||
|
||||
## Deleting a Federated Deployment
|
||||
|
||||
You can delete a Federated Deployment as you would delete a Kubernetes
|
||||
Deployment; however, for a Federated Deployment, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
|
||||
For example, you can do that using kubectl by running:
|
||||
|
||||
```shell
|
||||
kubectl --context=federation-cluster delete deployment mydep
|
||||
```
|
|
@ -0,0 +1,40 @@
|
|||
---
|
||||
title: Federated Events
|
||||
---
|
||||
|
||||
This guide explains how to use events in federation control plane to help in debugging.
|
||||
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes Cluster
|
||||
Federation installation. If not, then head over to the
|
||||
[federation admin guide](/docs/admin/federation/) to learn how to
|
||||
bring up a cluster federation (or have your cluster administrator do
|
||||
this for you). Other tutorials, for example
|
||||
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
|
||||
by Kelsey Hightower, are also available to help you.
|
||||
|
||||
You are also expected to have a basic
|
||||
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
|
||||
general.
|
||||
|
||||
## Overview
|
||||
|
||||
Events in federation control plane (referred to as "federation events" in
|
||||
this guide) are very similar to the traditional Kubernetes
|
||||
Events and provide the same functionality.
|
||||
Federation Events are stored only in federation control plane and are not passed on to the underlying Kubernetes clusters.
|
||||
|
||||
Federation controllers create events as they process API resources, to surface to the
|
||||
user the state that those resources are in.
|
||||
You can get all events from federation apiserver by running:
|
||||
|
||||
```shell
|
||||
kubectl --context=federation-cluster get events
|
||||
```
|
||||
|
||||
The standard kubectl get, update, delete commands will all work.
|
|
@ -0,0 +1,355 @@
|
|||
---
|
||||
title: Federated Ingress
|
||||
---
|
||||
|
||||
This guide explains how to use Kubernetes Federated Ingress to deploy
|
||||
a common HTTP(S) virtual IP load balancer across a federated service running in
|
||||
multiple Kubernetes clusters. As of v1.4, clusters hosted in Google
|
||||
Cloud (GKE, GCE, or both) are supported. This makes it
|
||||
easy to deploy a service that reliably serves HTTP(S) traffic
|
||||
originating from web clients around the globe on a single, static IP
|
||||
address. Low
|
||||
network latency, high fault tolerance and easy administration are
|
||||
ensured through intelligent request routing and automatic replica
|
||||
relocation (using [Federated ReplicaSets](/docs/tasks/administer-federation/replicaset/)).
|
||||
Clients are automatically routed, via the shortest network path, to
|
||||
the cluster closest to them with available capacity (despite the fact
|
||||
that all clients use exactly the same static IP address). The load balancer
|
||||
automatically checks the health of the pods comprising the service,
|
||||
and avoids sending requests to unresponsive or slow pods (or entire
|
||||
unresponsive clusters).
|
||||
|
||||
Federated Ingress is released as an alpha feature, and supports Google Cloud Platform (GKE,
|
||||
GCE and hybrid scenarios involving both) in Kubernetes v1.4. Work is under way to support other cloud
|
||||
providers such as AWS, and other hybrid cloud scenarios (e.g. services
|
||||
spanning private on-premise as well as public cloud Kubernetes
|
||||
clusters). We welcome your feedback.
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes Cluster
|
||||
Federation installation. If not, then head over to the
|
||||
[federation admin guide](/docs/admin/federation/) to learn how to
|
||||
bring up a cluster federation (or have your cluster administrator do
|
||||
this for you). Other tutorials, for example
|
||||
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
|
||||
by Kelsey Hightower, are also available to help you.
|
||||
|
||||
You are also expected to have a basic
|
||||
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
|
||||
general, and [Ingress](/docs/user-guide/ingress/) in particular.
|
||||
|
||||
## Overview
|
||||
|
||||
Federated Ingresses are created in much the same way as traditional
|
||||
[Kubernetes Ingresses](/docs/user-guide/ingress/): by making an API
|
||||
call which specifies the desired properties of your logical ingress point. In the
|
||||
case of Federated Ingress, this API call is directed to the
|
||||
Federation API endpoint, rather than a Kubernetes cluster API
|
||||
endpoint. The API for Federated Ingress is 100% compatible with the
|
||||
API for traditional Kubernetes Ingress objects.
|
||||
|
||||
Once created, the Federated Ingress automatically:
|
||||
|
||||
1. creates matching Kubernetes Ingress objects in every cluster
|
||||
underlying your Cluster Federation.
|
||||
2. ensures that all of these in-cluster ingress objects share the same
|
||||
logical global L7 (i.e. HTTP(S)) load balancer and IP address.
|
||||
3. monitors the health and capacity of the service "shards" (i.e. your
|
||||
pods) behind this ingress in each cluster.
|
||||
4. ensures that all client connections are routed to an appropriate
|
||||
healthy backend service endpoint at all times, even in the event of
|
||||
pod, cluster,
|
||||
availability zone, or regional outages.
|
||||
|
||||
Note that in the case of Google Cloud, the logical L7 load balancer is
|
||||
not a single physical device (which would present both a single point
|
||||
of failure, and a single global network routing choke point), but
|
||||
rather a
|
||||
[truly global, highly available load balancing managed service](https://cloud.google.com/load-balancing/),
|
||||
globally reachable via a single, static IP address.
|
||||
|
||||
Clients inside your federated Kubernetes clusters (i.e. Pods) will be
|
||||
automatically routed to the cluster-local shard of the Federated Service
|
||||
backing the Ingress in their
|
||||
cluster if it exists and is healthy, or the closest healthy shard in a
|
||||
different cluster if it does not. Note that this involves a network
|
||||
trip to the HTTP(S) load balancer, which resides outside your local
|
||||
Kubernetes cluster but inside the same GCP region.
|
||||
|
||||
## Creating a federated ingress
|
||||
|
||||
You can create a federated ingress in any of the usual ways, for example using kubectl:
|
||||
|
||||
``` shell
|
||||
kubectl --context=federation-cluster create -f myingress.yaml
|
||||
```
|
||||
For example ingress YAML configurations, see the [Ingress User Guide](/docs/user-guide/ingress/).
|
||||
The '--context=federation-cluster' flag tells kubectl to submit the
|
||||
request to the Federation API endpoint, with the appropriate
|
||||
credentials. If you have not yet configured such a context, visit the
|
||||
[federation admin guide](/docs/admin/federation/) or one of the
|
||||
[administration tutorials](https://github.com/kelseyhightower/kubernetes-cluster-federation)
|
||||
to find out how to do so.
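As an illustration only, a minimal `myingress.yaml` equivalent could be piped straight to the federation apiserver; the backing service name is a hypothetical placeholder (see the Ingress User Guide for complete, authoritative examples):

```shell
cat <<EOF | kubectl --context=federation-cluster create -f -
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myingress
spec:
  backend:
    # Hypothetical federated service backing this ingress; it must exist
    # (or be created later) with the same name in each cluster.
    serviceName: nginx
    servicePort: 80
EOF
```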
|
||||
|
||||
As described above, the Federated Ingress will automatically create
|
||||
and maintain matching Kubernetes ingresses in all of the clusters
|
||||
underlying your federation. These cluster-specific ingresses (and
|
||||
their associated ingress controllers) configure and manage the load
|
||||
balancing and health checking infrastructure that ensures that traffic
|
||||
is load balanced to each cluster appropriately.
|
||||
|
||||
You can verify this by checking in each of the underlying clusters, for example:
|
||||
|
||||
``` shell
|
||||
kubectl --context=gce-asia-east1a get ingress myingress
|
||||
NAME HOSTS ADDRESS PORTS AGE
|
||||
myingress * 130.211.5.194 80, 443 1m
|
||||
```
|
||||
|
||||
The above assumes that you have a context named 'gce-asia-east1a'
|
||||
configured in your client for your cluster in that zone. The name and
|
||||
namespace of the underlying ingress will automatically match those of
|
||||
the Federated Ingress that you created above (and if you happen to
|
||||
have had ingresses of the same name and namespace already existing in
|
||||
any of those clusters, they will be automatically adopted by the
|
||||
Federation and updated to conform with the specification of your
|
||||
Federated Ingress - either way, the end result will be the same).
|
||||
|
||||
The status of your Federated Ingress will automatically reflect the
|
||||
real-time status of the underlying Kubernetes ingresses, for example:
|
||||
|
||||
``` shell
|
||||
$ kubectl --context=federation-cluster describe ingress myingress
|
||||
|
||||
Name: myingress
|
||||
Namespace: default
|
||||
Address: 130.211.5.194
|
||||
TLS:
|
||||
tls-secret terminates
|
||||
Rules:
|
||||
Host Path Backends
|
||||
---- ---- --------
|
||||
* * echoheaders-https:80 (10.152.1.3:8080,10.152.2.4:8080)
|
||||
Annotations:
|
||||
https-target-proxy: k8s-tps-default-myingress--ff1107f83ed600c0
|
||||
target-proxy: k8s-tp-default-myingress--ff1107f83ed600c0
|
||||
url-map: k8s-um-default-myingress--ff1107f83ed600c0
|
||||
backends: {"k8s-be-30301--ff1107f83ed600c0":"Unknown"}
|
||||
forwarding-rule: k8s-fw-default-myingress--ff1107f83ed600c0
|
||||
https-forwarding-rule: k8s-fws-default-myingress--ff1107f83ed600c0
|
||||
Events:
|
||||
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
|
||||
--------- -------- ----- ---- ------------- -------- ------ -------
|
||||
3m 3m 1 {loadbalancer-controller } Normal ADD default/myingress
|
||||
2m 2m 1 {loadbalancer-controller } Normal CREATE ip: 130.211.5.194
|
||||
```
|
||||
|
||||
Note that:
|
||||
|
||||
1. the address of your Federated Ingress
|
||||
corresponds with the address of all of the
|
||||
underlying Kubernetes ingresses (once these have been allocated - this
|
||||
may take up to a few minutes).
|
||||
2. we have not yet provisioned any backend Pods to receive
|
||||
the network traffic directed to this ingress (i.e. 'Service
|
||||
Endpoints' behind the service backing the Ingress), so the Federated Ingress does not yet consider these to
|
||||
be healthy shards and will not direct traffic to any of these clusters.
|
||||
3. the federation control system will
|
||||
automatically reconfigure the load balancer controllers in all of the
|
||||
clusters in your federation to make them consistent, and allow
|
||||
them to share global load balancers. But this reconfiguration can
|
||||
only complete successfully if there are no pre-existing Ingresses in
|
||||
those clusters (this is a safety feature to prevent accidental
|
||||
breakage of existing ingresses). So to ensure that your federated
|
||||
ingresses function correctly, either start with new, empty clusters, or make
|
||||
sure that you delete (and recreate if necessary) all pre-existing
|
||||
Ingresses in the clusters comprising your federation.
|
||||
|
||||
## Adding backend services and pods
|
||||
|
||||
To render the underlying ingress shards healthy, we need to add
|
||||
backend Pods behind the service upon which the Ingress is based. There are several ways to achieve this, but
|
||||
the easiest is to create a Federated Service and
|
||||
Federated ReplicaSet. Details of how those
|
||||
work are covered in the aforementioned user guides; here we'll simply use them to
|
||||
create appropriately labelled pods and services in the 13 underlying clusters of
|
||||
our federation:
|
||||
|
||||
``` shell
|
||||
kubectl --context=federation-cluster create -f services/nginx.yaml
|
||||
```
|
||||
|
||||
``` shell
|
||||
kubectl --context=federation-cluster create -f myreplicaset.yaml
|
||||
```
|
||||
|
||||
Note that in order for your federated ingress to work correctly on
|
||||
Google Cloud, the node ports of all of the underlying cluster-local
|
||||
services need to be identical. If you're using a federated service
|
||||
this is easy to do. Simply pick a node port that is not already
|
||||
being used in any of your clusters, and add that to the spec of your
|
||||
federated service. If you do not specify a node port for your
|
||||
federated service, each cluster will choose its own node port for
|
||||
its cluster-local shard of the service, and these will probably end
|
||||
up being different, which is not what you want.
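For example, a federated service with an explicitly pinned node port might look like the following sketch; the port number (echoing the `k8s-be-30301` backend seen in the describe output above) and labels are illustrative only, and `services/nginx.yaml` could carry a spec of this kind:

```shell
cat <<EOF | kubectl --context=federation-cluster create -f -
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
    # Explicitly pinned so that every cluster-local shard uses the same node port.
    nodePort: 30301
EOF
```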
|
||||
|
||||
You can verify this by checking in each of the underlying clusters, for example:
|
||||
|
||||
``` shell
|
||||
kubectl --context=gce-asia-east1a get services nginx
|
||||
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
|
||||
nginx 10.63.250.98 104.199.136.89 80/TCP 9m
|
||||
```
|
||||
|
||||
|
||||
## Hybrid cloud capabilities
|
||||
|
||||
Federations of Kubernetes Clusters can include clusters running in
|
||||
different cloud providers (e.g. Google Cloud, AWS), and on-premises
|
||||
(e.g. on OpenStack). However, in Kubernetes v1.4, Federated Ingress is only
|
||||
supported across Google Cloud clusters. In future versions we intend
|
||||
to support hybrid cloud Ingress-based deployments.
|
||||
|
||||
## Discovering a federated ingress
|
||||
|
||||
Ingress objects (in both plain Kubernetes clusters, and in federations
|
||||
of clusters) expose one or more IP addresses (via
|
||||
the `Status.LoadBalancer.Ingress` field) that remain static for the lifetime
|
||||
of the Ingress object (in future, automatically managed DNS names
|
||||
might also be added). All clients (whether internal to your cluster,
|
||||
or on the external network or internet) should connect to one of these IP
|
||||
or DNS addresses. As mentioned above, all client requests are automatically
|
||||
routed, via the shortest network path, to a healthy pod in the
|
||||
closest cluster to the origin of the request. So for example, HTTP(S)
|
||||
requests from internet
|
||||
users in Europe will be routed directly to the closest cluster in
|
||||
Europe that has available capacity. If there are no such clusters in
|
||||
Europe, the request will be routed to the next closest cluster
|
||||
(typically in the U.S.).
|
||||
|
||||
## Handling failures of backend pods and whole clusters
|
||||
|
||||
Ingresses are backed by Services, which are typically (but not always)
|
||||
backed by one or more ReplicaSets. For Federated Ingresses, it is
|
||||
common practice to use the federated variants of Services and
|
||||
ReplicaSets for this purpose, as
|
||||
described above.
|
||||
|
||||
In particular, Federated ReplicaSets ensure that the desired number of
|
||||
pods are kept running in each cluster, even in the event of node
|
||||
failures. In the event of entire cluster or availability zone
|
||||
failures, Federated ReplicaSets automatically place additional
|
||||
replicas in the other available clusters in the federation to accommodate the
|
||||
traffic which was previously being served by the now unavailable
|
||||
cluster. While the Federated ReplicaSet ensures that sufficient replicas are
|
||||
kept running, the Federated Ingress ensures that user traffic is
|
||||
automatically redirected away from the failed cluster to other
|
||||
available clusters.
|
||||
|
||||
## Known issue
|
||||
|
||||
GCE L7 load balancer back-ends and health checks are known to "flap"; this is due
|
||||
to conflicting firewall rules in the federation's underlying clusters, which might override one another. To work around this problem, you can
|
||||
install the firewall rules manually to expose the targets of all the
|
||||
underlying clusters in your federation for each Federated Ingress
|
||||
object. This way, the health checks can consistently pass and the GCE L7 load balancer
|
||||
can remain stable. You install the rules using the
|
||||
[`gcloud`](https://cloud.google.com/sdk/gcloud/) command line tool,
|
||||
[Google Cloud Console](https://console.cloud.google.com) or the
|
||||
[Google Compute Engine APIs](https://cloud.google.com/compute/docs/reference/latest/).
|
||||
|
||||
You can install these rules using
|
||||
[`gcloud`](https://cloud.google.com/sdk/gcloud/) as follows:
|
||||
|
||||
```shell
|
||||
gcloud compute firewall-rules create <firewall-rule-name> \
|
||||
--source-ranges 130.211.0.0/22 --allow [<service-nodeports>] \
|
||||
--target-tags [<target-tags>] \
|
||||
--network <network-name>
|
||||
```
|
||||
|
||||
where:
|
||||
|
||||
1. `firewall-rule-name` can be any name.
|
||||
2. `[<service-nodeports>]` is the comma-separated list of node ports corresponding to the services that back the Federated Ingress.
|
||||
3. `[<target-tags>]` is the comma-separated list of the target tags assigned to the nodes in a Kubernetes cluster.
|
||||
4. `<network-name>` is the name of the network where the firewall rule must be installed.
|
||||
|
||||
Example:
|
||||
```shell
|
||||
gcloud compute firewall-rules create my-federated-ingress-firewall-rule \
|
||||
--source-ranges 130.211.0.0/22 --allow tcp:30301,tcp:30061,tcp:34564 \
|
||||
--target-tags my-cluster-1-minion,my-cluster-2-minion \
|
||||
--network default
|
||||
```
|
||||
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
#### I cannot connect to my cluster federation API
|
||||
Check that your
|
||||
|
||||
1. Client (typically kubectl) is correctly configured (including API endpoints and login credentials), and
|
||||
2. Cluster Federation API server is running and network-reachable.
|
||||
|
||||
See the [federation admin guide](/docs/admin/federation/) to learn
|
||||
how to bring up a cluster federation correctly (or have your cluster administrator do this for you), and how to correctly configure your client.
|
||||
|
||||
#### I can create a federated ingress/service/replicaset successfully against the cluster federation API, but no matching ingresses/services/replicasets are created in my underlying clusters
|
||||
|
||||
Check that:
|
||||
|
||||
1. Your clusters are correctly registered in the Cluster Federation API (`kubectl describe clusters`)
|
||||
2. Your clusters are all 'Active'. This means that the cluster
|
||||
Federation system was able to connect and authenticate against the
|
||||
clusters' endpoints. If not, consult the event logs of the federation-controller-manager pod to ascertain what the failure might be. (`kubectl --namespace=federation logs $(kubectl get pods --namespace=federation -l module=federation-controller-manager -oname)`)
|
||||
3. That the login credentials provided to the Cluster Federation API
|
||||
for the clusters have the correct authorization and quota to create
|
||||
ingresses/services/replicasets in the relevant namespace in the
|
||||
clusters. Again you should see associated error messages providing
|
||||
more detail in the above event log file if this is not the case.
|
||||
4. Whether any other error is preventing the service creation
|
||||
operation from succeeding (look for `ingress-controller`,
|
||||
`service-controller`, or `replicaset-controller`
|
||||
errors in the output of `kubectl logs federation-controller-manager --namespace federation`).
|
||||
|
||||
#### I can create a federated ingress successfully, but request load is not correctly distributed across the underlying clusters
|
||||
|
||||
Check that:
|
||||
|
||||
1. the services underlying your federated ingress in each cluster have
|
||||
identical node ports. See [above](#adding-backend-services-and-pods) for further explanation.
|
||||
2. the load balancer controllers in each of your clusters are of the
|
||||
correct type ("GLBC") and have been correctly reconfigured by the
|
||||
federation control plane to share a global GCE load balancer (this
|
||||
should happen automatically). If they are of the correct type, and
|
||||
have been correctly reconfigured, the UID data item in the GLBC
|
||||
configmap in each cluster will be identical across all clusters.
|
||||
See
|
||||
[the GLBC docs](https://github.com/kubernetes/ingress/blob/7dcb4ae17d5def23d3e9c878f3146ac6df61b09d/controllers/gce/README.md)
|
||||
for further details.
|
||||
If this is not the case, check the logs of your federation
|
||||
controller manager to determine why this automated reconfiguration
|
||||
might be failing.
|
||||
3. no ingresses have been manually created in any of your clusters before the above
|
||||
reconfiguration of the load balancer controller completed
|
||||
successfully. Ingresses created before the reconfiguration of
|
||||
your GLBC will interfere with the behavior of your federated
|
||||
ingresses created after the reconfiguration (see
|
||||
[the GLBC docs](https://github.com/kubernetes/ingress/blob/7dcb4ae17d5def23d3e9c878f3146ac6df61b09d/controllers/gce/README.md)
|
||||
for further information). To remedy this,
|
||||
delete any ingresses created before the cluster joined the
|
||||
federation (and had its GLBC reconfigured), and recreate them if
|
||||
necessary.
|
||||
|
||||
#### This troubleshooting guide did not help me solve my problem
|
||||
|
||||
Please use one of our [support channels](http://kubernetes.io/docs/troubleshooting/) to seek assistance.
|
||||
|
||||
## For more information
|
||||
|
||||
* [Federation proposal](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/proposals/federation.md) details use cases that motivated this work.
|
|
@ -0,0 +1,90 @@
|
|||
---
|
||||
title: Federated Namespaces
|
||||
---
|
||||
|
||||
This guide explains how to use namespaces in the Federation control plane.
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes Cluster
|
||||
Federation installation. If not, then head over to the
|
||||
[federation admin guide](/docs/admin/federation/) to learn how to
|
||||
bring up a cluster federation (or have your cluster administrator do
|
||||
this for you). Other tutorials, for example
|
||||
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
|
||||
by Kelsey Hightower, are also available to help you.
|
||||
|
||||
You are also expected to have a basic
|
||||
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
|
||||
general and [Namespaces](/docs/user-guide/namespaces/) in particular.
|
||||
|
||||
## Overview
|
||||
|
||||
Namespaces in the federation control plane (referred to as "federated namespaces" in
|
||||
this guide) are very similar to the traditional [Kubernetes
|
||||
Namespaces](/docs/user-guide/namespaces/), and provide the same functionality.
|
||||
Creating them in the federation control plane ensures that they are synchronized
|
||||
across all the clusters in the federation.
|
||||
|
||||
|
||||
## Creating a Federated Namespace
|
||||
|
||||
The API for Federated Namespaces is 100% compatible with the
|
||||
API for traditional Kubernetes Namespaces. You can create a namespace by sending
|
||||
a request to the federation apiserver.
|
||||
|
||||
You can do that using kubectl by running:
|
||||
|
||||
``` shell
|
||||
kubectl --context=federation-cluster create -f myns.yaml
|
||||
```
|
||||
|
||||
The '--context=federation-cluster' flag tells kubectl to submit the
|
||||
request to the Federation apiserver instead of sending it to a Kubernetes
|
||||
cluster.
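For illustration, `myns.yaml` could be as small as the following sketch, which can also be piped to kubectl directly:

```shell
cat <<EOF | kubectl --context=federation-cluster create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: myns
EOF
```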
|
||||
|
||||
Once a federated namespace is created, the federation control plane will create
|
||||
a matching namespace in all underlying Kubernetes clusters.
|
||||
You can verify this by checking each of the underlying clusters, for example:
|
||||
|
||||
``` shell
|
||||
kubectl --context=gce-asia-east1a get namespaces myns
|
||||
```
|
||||
|
||||
The above assumes that you have a context named 'gce-asia-east1a'
|
||||
configured in your client for your cluster in that zone. The name and
|
||||
spec of the underlying namespace will match those of
|
||||
the Federated Namespace that you created above.
|
||||
|
||||
|
||||
## Updating a Federated Namespace
|
||||
|
||||
You can update a federated namespace as you would update a Kubernetes
|
||||
namespace; just send the request to the federation apiserver instead of sending it
|
||||
to a specific Kubernetes cluster.
|
||||
The federation control plane ensures that whenever the federated namespace is
|
||||
updated, it updates the corresponding namespaces in all underlying clusters to
|
||||
match it.
|
||||
|
||||
## Deleting a Federated Namespace
|
||||
|
||||
You can delete a federated namespace as you would delete a Kubernetes
|
||||
namespace; just send the request to the federation apiserver instead of sending it
|
||||
to a specific Kubernetes cluster.
|
||||
|
||||
For example, you can do that using kubectl by running:
|
||||
|
||||
```shell
|
||||
kubectl --context=federation-cluster delete ns myns
|
||||
```
|
||||
|
||||
As in Kubernetes, deleting a federated namespace will delete all resources in that
|
||||
namespace from the federation control plane.
|
||||
|
||||
Note that at this point, deleting a federated namespace will not delete the
|
||||
corresponding namespaces and resources in those namespaces from underlying clusters.
|
||||
Users are expected to delete them manually.
|
||||
We intend to fix this in the future.
|
|
@ -0,0 +1,105 @@
|
|||
---
|
||||
title: Federated ReplicaSets
|
||||
---
|
||||
|
||||
This guide explains how to use replica sets in the Federation control plane.
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes Cluster
|
||||
Federation installation. If not, then head over to the
|
||||
[federation admin guide](/docs/admin/federation/) to learn how to
|
||||
bring up a cluster federation (or have your cluster administrator do
|
||||
this for you). Other tutorials, for example
|
||||
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
|
||||
by Kelsey Hightower, are also available to help you.
|
||||
|
||||
You are also expected to have a basic
|
||||
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
|
||||
general and [ReplicaSets](/docs/user-guide/replicasets/) in particular.
|
||||
|
||||
## Overview
|
||||
|
||||
Replica Sets in the federation control plane (referred to as "federated replica sets" in
|
||||
this guide) are very similar to the traditional [Kubernetes
|
||||
ReplicaSets](/docs/user-guide/replicasets/), and provide the same functionality.
|
||||
Creating them in the federation control plane ensures that the desired number of
|
||||
replicas exist across the registered clusters.
|
||||
|
||||
|
||||
## Creating a Federated Replica Set
|
||||
|
||||
The API for Federated Replica Set is 100% compatible with the
|
||||
API for traditional Kubernetes Replica Set. You can create a replica set by sending
|
||||
a request to the federation apiserver.
|
||||
|
||||
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
|
||||
|
||||
``` shell
|
||||
kubectl --context=federation-cluster create -f myrs.yaml
|
||||
```
|
||||
|
||||
The '--context=federation-cluster' flag tells kubectl to submit the
|
||||
request to the Federation apiserver instead of sending it to a Kubernetes
|
||||
cluster.
|
||||
|
||||
Once a federated replica set is created, the federation control plane will create
|
||||
a replica set in all underlying Kubernetes clusters.
|
||||
You can verify this by checking each of the underlying clusters, for example:
|
||||
|
||||
``` shell
|
||||
kubectl --context=gce-asia-east1a get rs myrs
|
||||
```
|
||||
|
||||
The above assumes that you have a context named 'gce-asia-east1a'
|
||||
configured in your client for your cluster in that zone.
|
||||
|
||||
These replica sets in underlying clusters will match the federation replica set
|
||||
except in the number of replicas. The federation control plane ensures that the
|
||||
sum of replicas across clusters matches the desired number of replicas in the
|
||||
federation replica set.
|
||||
|
||||
### Spreading Replicas in Underlying Clusters
|
||||
|
||||
By default, replicas are spread equally across all the underlying clusters. For example:
|
||||
if you have 3 registered clusters and you create a federated replica set with
|
||||
`spec.replicas = 9`, then each replica set in the 3 clusters will have
|
||||
`spec.replicas=3`.
|
||||
To modify the number of replicas in each cluster, you can specify
|
||||
[FederatedReplicaSetPreference](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/federation/apis/federation/types.go)
|
||||
as an annotation with key `federation.kubernetes.io/replica-set-preferences`
|
||||
on the federated replica set.
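For illustration, the annotation can be embedded directly in the replica set manifest. The preference fields below are a sketch only; see the FederatedReplicaSetPreference type linked above for the authoritative structure:

```shell
cat <<EOF | kubectl --context=federation-cluster create -f -
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: myrs
  annotations:
    federation.kubernetes.io/replica-set-preferences: |
        {
          "rebalance": true,
          "clusters": {
            "cluster-a": {"minReplicas": 2, "weight": 2},
            "*": {"weight": 1}
          }
        }
spec:
  replicas: 9
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
EOF
```

In this sketch, `cluster-a` (a hypothetical registered cluster name) is guaranteed at least 2 replicas and is weighted more heavily than the remaining clusters.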
|
||||
|
||||
|
||||
## Updating a Federated Replica Set
|
||||
|
||||
You can update a federated replica set as you would update a Kubernetes
|
||||
replica set; however, for a federated replica set, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
The federation control plane ensures that whenever the federated replica set is
|
||||
updated, it updates the corresponding replica sets in all underlying clusters to
|
||||
match it.
|
||||
If your update includes a change in the number of replicas, the federation
|
||||
control plane will change the number of replicas in underlying clusters to
|
||||
ensure that their sum remains equal to the number of desired replicas in
|
||||
the federated replica set.
|
||||
|
||||
## Deleting a Federated Replica Set
|
||||
|
||||
You can delete a federated replica set as you would delete a Kubernetes
|
||||
replica set; however, for a federated replica set, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
|
||||
For example, you can do that using kubectl by running:
|
||||
|
||||
```shell
|
||||
kubectl --context=federation-cluster delete rs myrs
|
||||
```
|
||||
|
||||
Note that at this point, deleting a federated replica set will not delete the
|
||||
corresponding replica sets from underlying clusters.
|
||||
You must delete the underlying Replica Sets manually.
|
||||
We intend to fix this in the future.
|
|
@ -0,0 +1,87 @@
|
|||
---
|
||||
title: Federated Secrets
|
||||
---
|
||||
|
||||
This guide explains how to use secrets in the Federation control plane.
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes Cluster
|
||||
Federation installation. If not, then head over to the
|
||||
[federation admin guide](/docs/admin/federation/) to learn how to
|
||||
bring up a cluster federation (or have your cluster administrator do
|
||||
this for you). Other tutorials, for example
|
||||
[this one](https://github.com/kelseyhightower/kubernetes-cluster-federation)
|
||||
by Kelsey Hightower, are also available to help you.
|
||||
|
||||
You are also expected to have a basic
|
||||
[working knowledge of Kubernetes](/docs/getting-started-guides/) in
|
||||
general and [Secrets](/docs/user-guide/secrets/) in particular.
|
||||
|
||||
## Overview
|
||||
|
||||
Secrets in the federation control plane (referred to as "federated secrets" in
|
||||
this guide) are very similar to the traditional [Kubernetes
|
||||
Secrets](/docs/user-guide/secrets/), and provide the same functionality.
|
||||
Creating them in the federation control plane ensures that they are synchronized
|
||||
across all the clusters in the federation.
|
||||
|
||||
|
||||
## Creating a Federated Secret
|
||||
|
||||
The API for Federated Secret is 100% compatible with the
|
||||
API for traditional Kubernetes Secret. You can create a secret by sending
|
||||
a request to the federation apiserver.
|
||||
|
||||
You can do that using [kubectl](/docs/user-guide/kubectl/) by running:
|
||||
|
||||
``` shell
|
||||
kubectl --context=federation-cluster create -f mysecret.yaml
|
||||
```
|
||||
|
||||
The '--context=federation-cluster' flag tells kubectl to submit the
|
||||
request to the Federation apiserver instead of sending it to a Kubernetes
|
||||
cluster.
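If you prefer not to write a YAML file, an imperative sketch works too, assuming the federation apiserver accepts the same `kubectl create secret` shortcut as a regular cluster (the secret name and literal are placeholders):

```shell
kubectl --context=federation-cluster create secret generic mysecret --from-literal=password=s3cr3t
```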
|
||||
|
||||
Once a federated secret is created, the federation control plane will create
|
||||
a matching secret in all underlying Kubernetes clusters.
|
||||
You can verify this by checking each of the underlying clusters, for example:
|
||||
|
||||
``` shell
|
||||
kubectl --context=gce-asia-east1a get secret mysecret
|
||||
```
|
||||
|
||||
The above assumes that you have a context named 'gce-asia-east1a'
|
||||
configured in your client for your cluster in that zone.
|
||||
|
||||
These secrets in underlying clusters will match the federated secret.
|
||||
|
||||
|
||||
## Updating a Federated Secret
|
||||
|
||||
You can update a federated secret as you would update a Kubernetes
|
||||
secret; however, for a federated secret, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
The federation control plane ensures that whenever the federated secret is
|
||||
updated, it updates the corresponding secrets in all underlying clusters to
|
||||
match it.
|
||||
|
||||
## Deleting a Federated Secret
|
||||
|
||||
You can delete a federated secret as you would delete a Kubernetes
|
||||
secret; however, for a federated secret, you must send the request to
|
||||
the federation apiserver instead of sending it to a specific Kubernetes cluster.
|
||||
|
||||
For example, you can do that using kubectl by running:
|
||||
|
||||
```shell
|
||||
kubectl --context=federation-cluster delete secret mysecret
|
||||
```
|
||||
|
||||
Note that at this point, deleting a federated secret will not delete the
|
||||
corresponding secrets from underlying clusters.
|
||||
You must delete the underlying secrets manually.
|
||||
We intend to fix this in the future.
|
|
@ -0,0 +1,366 @@
|
|||
---
|
||||
assignees:
|
||||
- derekwaynecarr
|
||||
- janetkuo
|
||||
title: Applying Resource Quotas and Limits
|
||||
---
|
||||
|
||||
This example demonstrates a typical setup to control resource usage in a namespace.
|
||||
|
||||
It demonstrates using the following resources:
|
||||
|
||||
* [Namespace](/docs/admin/namespaces)
|
||||
* [Resource Quota](/docs/admin/resourcequota/)
|
||||
* [Limit Range](/docs/admin/limitrange/)
|
||||
|
||||
This example assumes you have a functional Kubernetes setup.
|
||||
|
||||
## Scenario
|
||||
|
||||
The cluster-admin is operating a cluster on behalf of a user population and the cluster-admin
|
||||
wants to control the amount of resources that can be consumed in a particular namespace to promote
|
||||
fair sharing of the cluster and control cost.
|
||||
|
||||
The cluster-admin has the following goals:
|
||||
|
||||
* Limit the amount of compute resource for running pods
|
||||
* Limit the number of persistent volume claims to control access to storage
|
||||
* Limit the number of load balancers to control cost
|
||||
* Prevent the use of node ports to preserve scarce resources
|
||||
* Provide default compute resource requests to enable better scheduling decisions
|
||||
|
||||
## Step 1: Create a namespace
|
||||
|
||||
This example will work in a custom namespace to demonstrate the concepts involved.
|
||||
|
||||
Let's create a new namespace called quota-example:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
|
||||
namespace "quota-example" created
|
||||
$ kubectl get namespaces
|
||||
NAME STATUS AGE
|
||||
default Active 2m
|
||||
kube-system Active 2m
|
||||
quota-example Active 39s
|
||||
```
|
||||
|
||||
## Step 2: Apply an object-count quota to the namespace
|
||||
|
||||
The cluster-admin wants to control the following resources:
|
||||
|
||||
* persistent volume claims
|
||||
* load balancers
|
||||
* node ports
|
||||
|
||||
Let's create a simple quota that controls object counts for those resource types in this namespace.
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/resourcequota/object-counts.yaml --namespace=quota-example
|
||||
resourcequota "object-counts" created
|
||||
```
|
||||
|
||||
The quota system will observe that a quota has been created, and will calculate consumption
|
||||
in the namespace in response. This should happen quickly.
|
||||
|
||||
Let's describe the quota to see what is currently being consumed in this namespace:
|
||||
|
||||
```shell
|
||||
$ kubectl describe quota object-counts --namespace=quota-example
|
||||
Name: object-counts
|
||||
Namespace: quota-example
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
persistentvolumeclaims 0 2
|
||||
services.loadbalancers 0 2
|
||||
services.nodeports 0 0
|
||||
```
|
||||
|
||||
The quota system will now prevent users from creating more than the specified amount for each resource.
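For reference, the same object-counts quota can be expressed inline; this sketch is consistent with the hard limits shown in the describe output above, and the repository file docs/admin/resourcequota/object-counts.yaml plays the same role:

```shell
$ cat <<EOF | kubectl create -f - --namespace=quota-example
apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts
spec:
  hard:
    persistentvolumeclaims: "2"
    services.loadbalancers: "2"
    services.nodeports: "0"
EOF
```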
|
||||
|
||||
|
||||
## Step 3: Apply a compute-resource quota to the namespace
|
||||
|
||||
To limit the amount of compute resource that can be consumed in this namespace,
|
||||
let's create a quota that tracks compute resources.
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/resourcequota/compute-resources.yaml --namespace=quota-example
|
||||
resourcequota "compute-resources" created
|
||||
```
|
||||
|
||||
Let's describe the quota to see what is currently being consumed in this namespace:
|
||||
|
||||
```shell
|
||||
$ kubectl describe quota compute-resources --namespace=quota-example
|
||||
Name: compute-resources
|
||||
Namespace: quota-example
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
limits.cpu 0 2
|
||||
limits.memory 0 2Gi
|
||||
pods 0 4
|
||||
requests.cpu 0 1
|
||||
requests.memory 0 1Gi
|
||||
```
|
||||
|
||||
The quota system will now prevent the namespace from having more than 4 non-terminal pods. In
|
||||
addition, it will enforce that each container in a pod makes a `request` and defines a `limit` for
|
||||
`cpu` and `memory`.
|
||||
|
||||
## Step 4: Applying default resource requests and limits
|
||||
|
||||
Pod authors rarely specify resource requests and limits for their pods.
|
||||
|
||||
Since we applied a quota to our project, let's see what happens when an end-user creates a pod that has unbounded
|
||||
CPU and memory, by running an nginx container.
|
||||
|
||||
To demonstrate, let's create a deployment that runs nginx:
|
||||
|
||||
```shell
|
||||
$ kubectl run nginx --image=nginx --replicas=1 --namespace=quota-example
|
||||
deployment "nginx" created
|
||||
```
|
||||
|
||||
Now let's look at the pods that were created.
|
||||
|
||||
```shell
|
||||
$ kubectl get pods --namespace=quota-example
|
||||
```
|
||||
|
||||
What happened? I have no pods! Let's describe the deployment to get a view of what is happening.
|
||||
|
||||
```shell
|
||||
$ kubectl describe deployment nginx --namespace=quota-example
|
||||
Name: nginx
|
||||
Namespace: quota-example
|
||||
CreationTimestamp: Mon, 06 Jun 2016 16:11:37 -0400
|
||||
Labels: run=nginx
|
||||
Selector: run=nginx
|
||||
Replicas: 0 updated | 1 total | 0 available | 1 unavailable
|
||||
StrategyType: RollingUpdate
|
||||
MinReadySeconds: 0
|
||||
RollingUpdateStrategy: 1 max unavailable, 1 max surge
|
||||
OldReplicaSets: <none>
|
||||
NewReplicaSet: nginx-3137573019 (0/1 replicas created)
|
||||
...
|
||||
```
|
||||
|
||||
A deployment created a corresponding replica set and attempted to size it to create a single pod.
|
||||
|
||||
Let's look at the replica set to get more detail.
|
||||
|
||||
```shell
|
||||
$ kubectl describe rs nginx-3137573019 --namespace=quota-example
|
||||
Name: nginx-3137573019
|
||||
Namespace: quota-example
|
||||
Image(s): nginx
|
||||
Selector: pod-template-hash=3137573019,run=nginx
|
||||
Labels: pod-template-hash=3137573019
|
||||
run=nginx
|
||||
Replicas: 0 current / 1 desired
|
||||
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
|
||||
No volumes.
|
||||
Events:
|
||||
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
|
||||
--------- -------- ----- ---- ------------- -------- ------ -------
|
||||
4m 7s 11 {replicaset-controller } Warning FailedCreate Error creating: pods "nginx-3137573019-" is forbidden: Failed quota: compute-resources: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
|
||||
```
|
||||
|
||||
The Kubernetes API server is rejecting the replica set requests to create a pod because our pods
|
||||
do not specify `requests` or `limits` for `cpu` and `memory`.
|
||||
|
||||
So let's set some default values for the amount of `cpu` and `memory` a pod can consume:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/resourcequota/limits.yaml --namespace=quota-example
|
||||
limitrange "limits" created
|
||||
$ kubectl describe limits limits --namespace=quota-example
|
||||
Name: limits
|
||||
Namespace: quota-example
|
||||
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
|
||||
---- -------- --- --- --------------- ------------- -----------------------
|
||||
Container memory - - 256Mi 512Mi -
|
||||
Container cpu - - 100m 200m -
|
||||
```
|
||||
|
||||
If the Kubernetes API server observes a request to create a pod in this namespace, and the containers
|
||||
in that pod do not make any compute resource requests, a default request and default limit will be applied
|
||||
as part of admission control.
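The `limits` LimitRange used above corresponds to defaults along the lines of the following sketch, with values taken from the describe output:

```shell
$ cat <<EOF | kubectl create -f - --namespace=quota-example
apiVersion: v1
kind: LimitRange
metadata:
  name: limits
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 256Mi
    default:
      cpu: 200m
      memory: 512Mi
EOF
```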
|
||||
|
||||
In this example, each pod created will have compute resources equivalent to the following:
|
||||
|
||||
```shell
|
||||
$ kubectl run nginx \
|
||||
--image=nginx \
|
||||
--replicas=1 \
|
||||
--requests=cpu=100m,memory=256Mi \
|
||||
--limits=cpu=200m,memory=512Mi \
|
||||
--namespace=quota-example
|
||||
```
|
||||
|
||||
Now that we have applied default compute resources for our namespace, our replica set should be able to create
|
||||
its pods.
|
||||
|
||||
```shell
|
||||
$ kubectl get pods --namespace=quota-example
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
nginx-3137573019-fvrig 1/1 Running 0 6m
|
||||
```
|
||||
|
||||
And if we print out our quota usage in the namespace:
|
||||
|
||||
```shell
|
||||
$ kubectl describe quota --namespace=quota-example
|
||||
Name: compute-resources
|
||||
Namespace: quota-example
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
limits.cpu 200m 2
|
||||
limits.memory 512Mi 2Gi
|
||||
pods 1 4
|
||||
requests.cpu 100m 1
|
||||
requests.memory 256Mi 1Gi
|
||||
|
||||
|
||||
Name: object-counts
|
||||
Namespace: quota-example
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
persistentvolumeclaims 0 2
|
||||
services.loadbalancers 0 2
|
||||
services.nodeports 0 0
|
||||
```
|
||||
|
||||
As you can see, the pod that was created is consuming explicit amounts of compute resources, and the usage is being
|
||||
tracked by Kubernetes properly.
|
||||
|
||||
## Step 5: Advanced quota scopes
|
||||
|
||||
Let's imagine you did not want to specify default compute resource consumption in your namespace.
|
||||
|
||||
Instead, you want to let users run a specific number of `BestEffort` pods in their namespace to take
|
||||
advantage of slack compute resources, and then require that users make an explicit resource request for
|
||||
pods that require a higher quality of service.
|
||||
|
||||
Let's create a new namespace with two quotas to demonstrate this behavior:
|
||||
|
||||
```shell
|
||||
$ kubectl create namespace quota-scopes
|
||||
namespace "quota-scopes" created
|
||||
$ kubectl create -f docs/admin/resourcequota/best-effort.yaml --namespace=quota-scopes
|
||||
resourcequota "best-effort" created
|
||||
$ kubectl create -f docs/admin/resourcequota/not-best-effort.yaml --namespace=quota-scopes
|
||||
resourcequota "not-best-effort" created
|
||||
$ kubectl describe quota --namespace=quota-scopes
|
||||
Name: best-effort
|
||||
Namespace: quota-scopes
|
||||
Scopes: BestEffort
|
||||
* Matches all pods that have best effort quality of service.
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
pods 0 10
|
||||
|
||||
|
||||
Name: not-best-effort
|
||||
Namespace: quota-scopes
|
||||
Scopes: NotBestEffort
|
||||
* Matches all pods that do not have best effort quality of service.
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
limits.cpu 0 2
|
||||
limits.memory 0 2Gi
|
||||
pods 0 4
|
||||
requests.cpu 0 1
|
||||
requests.memory 0 1Gi
|
||||
```
|
||||
|
||||
In this scenario, a pod that makes no compute resource requests will be tracked by the `best-effort` quota.
|
||||
|
||||
A pod that does make compute resource requests will be tracked by the `not-best-effort` quota.
|
||||
|
||||
Let's demonstrate this by creating two deployments:
|
||||
|
||||
```shell
|
||||
$ kubectl run best-effort-nginx --image=nginx --replicas=8 --namespace=quota-scopes
|
||||
deployment "best-effort-nginx" created
|
||||
$ kubectl run not-best-effort-nginx \
|
||||
--image=nginx \
|
||||
--replicas=2 \
|
||||
--requests=cpu=100m,memory=256Mi \
|
||||
--limits=cpu=200m,memory=512Mi \
|
||||
--namespace=quota-scopes
|
||||
deployment "not-best-effort-nginx" created
|
||||
```
|
||||
|
||||
Even though no default limits were specified, the `best-effort-nginx` deployment will create
|
||||
all 8 pods. This is because it is tracked by the `best-effort` quota, and the `not-best-effort`
|
||||
quota will just ignore it. The `not-best-effort` quota will track the `not-best-effort-nginx`
|
||||
deployment since it creates pods with `Burstable` quality of service.
|
||||
|
||||
Let's list the pods in the namespace:
|
||||
|
||||
```shell
|
||||
$ kubectl get pods --namespace=quota-scopes
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
best-effort-nginx-3488455095-2qb41 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-3go7n 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-9o2xg 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-eyg40 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-gcs3v 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-rq8p1 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-udhhd 1/1 Running 0 51s
|
||||
best-effort-nginx-3488455095-zmk12 1/1 Running 0 51s
|
||||
not-best-effort-nginx-2204666826-7sl61 1/1 Running 0 23s
|
||||
not-best-effort-nginx-2204666826-ke746 1/1 Running 0 23s
|
||||
```
|
||||
|
||||
As you can see, all 10 pods have been allowed to be created.
|
||||
|
||||
Let's describe current quota usage in the namespace:
|
||||
|
||||
```shell
|
||||
$ kubectl describe quota --namespace=quota-scopes
|
||||
Name: best-effort
|
||||
Namespace: quota-scopes
|
||||
Scopes: BestEffort
|
||||
* Matches all pods that have best effort quality of service.
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
pods 8 10
|
||||
|
||||
|
||||
Name: not-best-effort
|
||||
Namespace: quota-scopes
|
||||
Scopes: NotBestEffort
|
||||
* Matches all pods that do not have best effort quality of service.
|
||||
Resource Used Hard
|
||||
-------- ---- ----
|
||||
limits.cpu 400m 2
|
||||
limits.memory 1Gi 2Gi
|
||||
pods 2 4
|
||||
requests.cpu 200m 1
|
||||
requests.memory 512Mi 1Gi
|
||||
```
|
||||
|
||||
As you can see, the `best-effort` quota has tracked the usage for the 8 pods we created in
|
||||
the `best-effort-nginx` deployment, and the `not-best-effort` quota has tracked the usage for
|
||||
the 2 pods we created in the `not-best-effort-nginx` deployment.
|
||||
|
||||
Scopes provide a mechanism to subdivide the set of resources that are tracked by
|
||||
any quota document to allow greater flexibility in how operators deploy and track resource
|
||||
consumption.
|
||||
|
||||
In addition to `BestEffort` and `NotBestEffort` scopes, there are scopes to restrict
|
||||
long-running versus time-bound pods. The `Terminating` scope will match any pod
|
||||
where `spec.activeDeadlineSeconds` is not nil. The `NotTerminating` scope will match any pod
|
||||
where `spec.activeDeadlineSeconds` is nil. These scopes allow you to quota pods based on their
|
||||
anticipated permanence on a node in your cluster.
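As a sketch, a quota that only counts time-bound pods (those with an active deadline set) could use the `Terminating` scope; the quota name and the pod limit here are hypothetical:

```shell
$ cat <<EOF | kubectl create -f - --namespace=quota-scopes
apiVersion: v1
kind: ResourceQuota
metadata:
  name: terminating-pods
spec:
  hard:
    pods: "5"
  scopes:
  - Terminating
EOF
```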
|
||||
|
||||
## Summary
|
||||
|
||||
Actions that consume node resources for cpu and memory can be subject to hard quota limits defined by the namespace quota.
|
||||
|
||||
Any action that consumes those resources can be tweaked, or can pick up namespace level defaults to meet your end goal.
|
||||
|
||||
Quota can be apportioned based on quality of service and anticipated permanence on a node in your cluster.
|
|
@ -0,0 +1,95 @@
|
|||
---
|
||||
assignees:
|
||||
- davidopp
|
||||
title: Configuring a Pod Disruption Budget
|
||||
---
|
||||
This guide is for anyone wishing to specify safety constraints on pods or anyone
|
||||
wishing to write software (typically automation software) that respects those
|
||||
constraints.
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Rationale
|
||||
|
||||
Various cluster management operations may voluntarily evict pods. "Voluntary"
|
||||
means an eviction can be safely delayed for a reasonable period of time. The
|
||||
principal examples today are draining a node for maintenance or upgrade
|
||||
(`kubectl drain`), and cluster autoscaling down. In the future the
|
||||
[rescheduler](https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/rescheduling.md)
|
||||
may also perform voluntary evictions. By contrast, something like evicting pods
|
||||
because a node has become unreachable or reports `NotReady`, is not "voluntary."
|
||||
|
||||
For voluntary evictions, it can be useful for applications to be able to limit
|
||||
the number of pods that are down simultaneously. For example, a quorum-based application would
|
||||
like to ensure that the number of replicas running is never brought below the
|
||||
number needed for a quorum, even temporarily. Or a web front end might want to
|
||||
ensure that the number of replicas serving load never falls below a certain
|
||||
percentage of the total, even briefly. `PodDisruptionBudget` is an API object
|
||||
that specifies the minimum number or percentage of replicas of a collection that
|
||||
must be up at a time. Components that wish to evict a pod subject to disruption
|
||||
budget use the `/eviction` subresource; unlike a regular pod deletion, this
|
||||
operation may be rejected by the API server if the eviction would cause a
|
||||
disruption budget to be violated.
|
||||
|
||||
## Specifying a PodDisruptionBudget
|
||||
|
||||
A `PodDisruptionBudget` has two components: a label selector `selector` to specify the set of
|
||||
pods to which it applies, and `minAvailable` which is a description of the number of pods from that
|
||||
set that must still be available after the eviction, i.e. even in the absence
|
||||
of the evicted pod. `minAvailable` can be either an absolute number or a percentage.
|
||||
So for example, 100% means no voluntary evictions from the set are permitted. In
|
||||
typical usage, a single budget would be used for a collection of pods managed by
|
||||
a controller—for example, the pods in a single ReplicaSet.
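Here is a minimal sketch of such a budget; the name and label selector are hypothetical, and `minAvailable` could just as well be a percentage such as `"100%"`:

```shell
$ cat <<EOF | kubectl create -f -
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
EOF
```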
|
||||
|
||||
Note that a disruption budget does not truly guarantee that the specified
|
||||
number/percentage of pods will always be up. For example, a node that hosts a
|
||||
pod from the collection may fail when the collection is at the minimum size
|
||||
specified in the budget, thus bringing the number of available pods from the
|
||||
collection below the specified size. The budget can only protect against
|
||||
voluntary evictions, not all causes of unavailability.
|
||||
|
||||
## Requesting an eviction
|
||||
|
||||
If you are writing infrastructure software that wants to produce these voluntary
|
||||
evictions, you will need to use the eviction API. The eviction subresource of a
|
||||
pod can be thought of as a kind of policy-controlled DELETE operation on the pod
|
||||
itself. To attempt an eviction (perhaps more REST-precisely, to attempt to
|
||||
*create* an eviction), you POST an attempted operation. Here's an example:
|
||||
|
||||
```json
|
||||
{
|
||||
"apiVersion": "policy/v1beta1",
|
||||
"kind": "Eviction",
|
||||
"metadata": {
|
||||
"name": "quux",
|
||||
"namespace": "default"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
You can attempt an eviction using `curl`:
|
||||
|
||||
```bash
|
||||
$ curl -v -H 'Content-type: application/json' http://127.0.0.1:8080/api/v1/namespaces/default/pods/quux/eviction -d @eviction.json
|
||||
```
|
||||
|
||||
The API can respond in one of three ways.
|
||||
|
||||
1. If the eviction is granted, then the pod is deleted just as if you had sent
|
||||
a `DELETE` request to the pod's URL and you get back `200 OK`.
|
||||
2. If the current state of affairs wouldn't allow an eviction by the rules set
|
||||
forth in the budget, you get back `429 Too Many Requests`. This is
|
||||
typically used for generic rate limiting of *any* requests, but here we mean
|
||||
that this request isn't allowed *right now* but it may be allowed later.
|
||||
Currently, callers do not get any `Retry-After` advice, but they may in
|
||||
future versions.
|
||||
3. If there is some kind of misconfiguration, like multiple budgets pointing at
|
||||
the same pod, you will get `500 Internal Server Error`.
|
||||
|
||||
For a given eviction request, there are two cases.
|
||||
|
||||
1. There is no budget that matches this pod. In this case, the server always
|
||||
returns `200 OK`.
|
||||
2. There is at least one budget. In this case, any of the three above responses may
|
||||
apply.
|
|
@ -0,0 +1,214 @@
|
|||
---
|
||||
assignees:
|
||||
- derekwaynecarr
|
||||
- janetkuo
|
||||
title: Setting Pod CPU and Memory Limits
|
||||
---
|
||||
|
||||
By default, pods run with unbounded CPU and memory limits. This means that any pod in the
|
||||
system can consume as much CPU and memory as is available on the node that executes the pod.
|
||||
|
||||
Users may want to impose restrictions on the amount of resources a single pod in the system may consume
|
||||
for a variety of reasons.
|
||||
|
||||
For example:
|
||||
|
||||
1. Each node in the cluster has 2GB of memory. The cluster operator does not want to accept pods
|
||||
that require more than 2GB of memory since no node in the cluster can support the requirement. To prevent a
|
||||
pod from being permanently unscheduled to a node, the operator instead chooses to reject pods that exceed 2GB
|
||||
of memory as part of admission control.
|
||||
2. A cluster is shared by two communities in an organization that runs production and development workloads
|
||||
respectively. Production workloads may consume up to 8GB of memory, but development workloads may consume up
|
||||
to 512MB of memory. The cluster operator creates a separate namespace for each workload, and applies limits to
|
||||
each namespace.
|
||||
3. Users may create a pod which consumes resources just below the capacity of a machine. The leftover space
|
||||
may be too small to be useful, but big enough for the waste to be costly over the entire cluster. As a result,
|
||||
the cluster operator may want to require that a pod consume at least 20% of the memory and CPU of the
|
||||
average node size in order to provide for more uniform scheduling and limit waste.
|
||||
|
||||
This example demonstrates how limits can be applied to a Kubernetes [namespace](/docs/admin/namespaces/walkthrough/) to control
|
||||
min/max resource limits per pod. In addition, this example demonstrates how you can
|
||||
apply default resource limits to pods in the absence of an end-user specified value.
|
||||
|
||||
See the [LimitRange design doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_limit_range.md) for more information. For a detailed description of the Kubernetes resource model, see [Resources](/docs/user-guide/compute-resources/).
|
||||
|
||||
## Step 0: Prerequisites
|
||||
|
||||
This example requires a running Kubernetes cluster. See the [Getting Started guides](/docs/getting-started-guides/) for how to get started.
|
||||
|
||||
Change to the `<kubernetes>` directory if you're not already there.
|
||||
|
||||
## Step 1: Create a namespace
|
||||
|
||||
This example will work in a custom namespace to demonstrate the concepts involved.
|
||||
|
||||
Let's create a new namespace called limit-example:
|
||||
|
||||
```shell
|
||||
$ kubectl create namespace limit-example
|
||||
namespace "limit-example" created
|
||||
```
|
||||
|
||||
Note that `kubectl` commands will print the type and name of the resource created or mutated, which can then be used in subsequent commands:
|
||||
|
||||
```shell
|
||||
$ kubectl get namespaces
|
||||
NAME STATUS AGE
|
||||
default Active 51s
|
||||
limit-example Active 45s
|
||||
```
|
||||
|
||||
## Step 2: Apply a limit to the namespace
|
||||
|
||||
Let's create a simple limit in our namespace.
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/limitrange/limits.yaml --namespace=limit-example
|
||||
limitrange "mylimits" created
|
||||
```
|
||||
|
||||
Let's describe the limits that we have imposed in our namespace.
|
||||
|
||||
```shell
|
||||
$ kubectl describe limits mylimits --namespace=limit-example
|
||||
Name: mylimits
|
||||
Namespace: limit-example
|
||||
Type Resource Min Max Default Request Default Limit Max Limit/Request Ratio
|
||||
---- -------- --- --- --------------- ------------- -----------------------
|
||||
Pod cpu 200m 2 - - -
|
||||
Pod memory 6Mi 1Gi - - -
|
||||
Container cpu 100m 2 200m 300m -
|
||||
Container memory 3Mi 1Gi 100Mi 200Mi -
|
||||
```
|
||||
|
||||
In this scenario, we have said the following:
|
||||
|
||||
1. If a max constraint is specified for a resource (2 CPU and 1Gi memory in this case), then a limit
|
||||
must be specified for that resource across all containers. Failure to specify a limit will result in
|
||||
a validation error when attempting to create the pod. Note that a default value of limit is set by
|
||||
*default* in file `limits.yaml` (300m CPU and 200Mi memory).
|
||||
2. If a min constraint is specified for a resource (100m CPU and 3Mi memory in this case), then a
|
||||
request must be specified for that resource across all containers. Failure to specify a request will
|
||||
result in a validation error when attempting to create the pod. Note that a default value of request is
|
||||
set by *defaultRequest* in file `limits.yaml` (200m CPU and 100Mi memory).
|
||||
3. For any pod, the sum of all containers' memory requests must be >= 6Mi and the sum of all containers'
|
||||
memory limits must be <= 1Gi; the sum of all containers' CPU requests must be >= 200m and the sum of all
|
||||
containers' CPU limits must be <= 2.
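The constraints enumerated above correspond to a `mylimits` LimitRange along the lines of the following sketch; the values mirror the describe output, and the repository file `docs/admin/limitrange/limits.yaml` serves the same purpose:

```shell
$ cat <<EOF | kubectl create -f - --namespace=limit-example
apiVersion: v1
kind: LimitRange
metadata:
  name: mylimits
spec:
  limits:
  - type: Pod
    min:
      cpu: 200m
      memory: 6Mi
    max:
      cpu: "2"
      memory: 1Gi
  - type: Container
    min:
      cpu: 100m
      memory: 3Mi
    max:
      cpu: "2"
      memory: 1Gi
    defaultRequest:
      cpu: 200m
      memory: 100Mi
    default:
      cpu: 300m
      memory: 200Mi
EOF
```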
|
||||
|
||||
## Step 3: Enforcing limits at point of creation
|
||||
|
||||
The limits enumerated in a namespace are only enforced when a pod is created or updated in
|
||||
the cluster. If you change the limits to a different value range, it does not affect pods that
|
||||
were previously created in a namespace.
|
||||
|
||||
If a resource (CPU or memory) is being restricted by a limit, the user will get an error at time
|
||||
of creation explaining why.
|
||||
|
||||
Let's first spin up a [Deployment](/docs/user-guide/deployments) that creates a single container Pod to demonstrate
|
||||
how default values are applied to each pod.
|
||||
|
||||
```shell
|
||||
$ kubectl run nginx --image=nginx --replicas=1 --namespace=limit-example
|
||||
deployment "nginx" created
|
||||
```
|
||||
|
||||
Note that `kubectl run` creates a Deployment named "nginx" on Kubernetes clusters running version 1.2 or later. If you are running older versions, it creates replication controllers instead.
|
||||
If you want to obtain the old behavior, use `--generator=run/v1` to create replication controllers. See [`kubectl run`](/docs/user-guide/kubectl/kubectl_run/) for more details.
|
||||
The Deployment manages 1 replica of a single-container Pod. Let's take a look at the Pod it manages. First, find the name of the Pod:
|
||||
|
||||
```shell
|
||||
$ kubectl get pods --namespace=limit-example
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
nginx-2040093540-s8vzu 1/1 Running 0 11s
|
||||
```
|
||||
|
||||
Let's print this Pod with yaml output format (using `-o yaml` flag), and then `grep` the `resources` field. Note that your pod name will be different.
|
||||
|
||||
```shell
|
||||
$ kubectl get pods nginx-2040093540-s8vzu --namespace=limit-example -o yaml | grep resources -C 8
|
||||
resourceVersion: "57"
|
||||
selfLink: /api/v1/namespaces/limit-example/pods/nginx-2040093540-ivimu
|
||||
uid: 67b20741-f53b-11e5-b066-64510658e388
|
||||
spec:
|
||||
containers:
|
||||
- image: nginx
|
||||
imagePullPolicy: Always
|
||||
name: nginx
|
||||
resources:
|
||||
limits:
|
||||
cpu: 300m
|
||||
memory: 200Mi
|
||||
requests:
|
||||
cpu: 200m
|
||||
memory: 100Mi
|
||||
terminationMessagePath: /dev/termination-log
|
||||
volumeMounts:
|
||||
```
|
||||
|
||||
Note that our nginx container has picked up the namespace default CPU and memory resource *limits* and *requests*.
|
||||
|
||||
Let's create a pod that exceeds our allowed limits by giving it a container that requests 3 CPU cores.
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/limitrange/invalid-pod.yaml --namespace=limit-example
|
||||
Error from server: error when creating "docs/admin/limitrange/invalid-pod.yaml": Pod "invalid-pod" is forbidden: [Maximum cpu usage per Pod is 2, but limit is 3., Maximum cpu usage per Container is 2, but limit is 3.]
|
||||
```
|
||||
|
||||
Let's create a pod that falls within the allowed limit boundaries.
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/admin/limitrange/valid-pod.yaml --namespace=limit-example
|
||||
pod "valid-pod" created
|
||||
```
|
||||
|
||||
Now look at the Pod's resources field:
|
||||
|
||||
```shell
|
||||
$ kubectl get pods valid-pod --namespace=limit-example -o yaml | grep -C 6 resources
|
||||
uid: 3b1bfd7a-f53c-11e5-b066-64510658e388
|
||||
spec:
|
||||
containers:
|
||||
- image: gcr.io/google_containers/serve_hostname
|
||||
imagePullPolicy: Always
|
||||
name: kubernetes-serve-hostname
|
||||
resources:
|
||||
limits:
|
||||
cpu: "1"
|
||||
memory: 512Mi
|
||||
requests:
|
||||
cpu: "1"
|
||||
memory: 512Mi
|
||||
```
|
||||
|
||||
Note that this pod specifies explicit resource *limits* and *requests* so it did not pick up the namespace
|
||||
default values.
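For comparison, a pod spec with such explicit values might look like this sketch of `valid-pod.yaml`, reconstructed from the output above (the file in the docs tree may differ):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: valid-pod
spec:
  containers:
  - name: kubernetes-serve-hostname
    image: gcr.io/google_containers/serve_hostname
    resources:
      limits:
        cpu: "1"
        memory: 512Mi
      requests:
        cpu: "1"
        memory: 512Mi
```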
|
||||
|
||||
Note: CPU resource *limits* are enforced in the default Kubernetes setup on the physical node
|
||||
that runs the container, unless the administrator deploys the kubelet with CPU CFS quota enforcement disabled via the following flag:
|
||||
|
||||
```shell
|
||||
$ kubelet --help
|
||||
Usage of kubelet
|
||||
....
|
||||
--cpu-cfs-quota[=true]: Enable CPU CFS quota enforcement for containers that specify CPU limits
|
||||
$ kubelet --cpu-cfs-quota=false ...
|
||||
```
|
||||
|
||||
## Step 4: Cleanup
|
||||
|
||||
To remove the resources used by this example, you can just delete the limit-example namespace.
|
||||
|
||||
```shell
|
||||
$ kubectl delete namespace limit-example
|
||||
namespace "limit-example" deleted
|
||||
$ kubectl get namespaces
|
||||
NAME STATUS AGE
|
||||
default Active 12m
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
Cluster operators that want to restrict the amount of resources a single container or pod may consume
|
||||
are able to define allowable ranges per Kubernetes namespace. In the absence of any explicit assignments,
|
||||
the Kubernetes system is able to apply default resource *limits* and *requests* if desired in order to
|
||||
constrain the amount of resources a pod consumes on a node.
|
|
@ -0,0 +1,248 @@
|
|||
---
|
||||
assignees:
|
||||
- Random-Liu
|
||||
- dchen1107
|
||||
title: Monitoring Node Health
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Node Problem Detector
|
||||
|
||||
*Node problem detector* is a [DaemonSet](/docs/admin/daemons/) monitoring the
|
||||
node health. It collects node problems from various daemons and reports them
|
||||
to the apiserver as [NodeCondition](/docs/admin/node/#node-condition) and
|
||||
[Event](/docs/api-reference/v1/definitions/#_v1_event).
|
||||
|
||||
It currently detects some known kernel issues, and will detect more and
|
||||
more node problems over time.
|
||||
|
||||
Currently Kubernetes won't take any action on the node conditions and events
|
||||
generated by node problem detector. In the future, a remedy system could be
|
||||
introduced to deal with node problems.
|
||||
|
||||
See more information
|
||||
[here](https://github.com/kubernetes/node-problem-detector).
|
||||
|
||||
## Limitations
|
||||
|
||||
* The kernel issue detection of node problem detector currently only supports file-based
|
||||
kernel logs. It doesn't support log tools such as journald.
|
||||
|
||||
* The kernel issue detection of node problem detector makes assumptions about the kernel
|
||||
log format, and currently only works on Ubuntu and Debian. However, it is easy to extend
|
||||
it to [support other log formats](/docs/admin/node-problem/#support-other-log-format).
|
||||
|
||||
## Enable/Disable in GCE cluster
|
||||
|
||||
Node problem detector [runs as a cluster addon](cluster-large.md/#addon-resources) and is enabled by default in
|
||||
GCE clusters.
|
||||
|
||||
You can enable/disable it by setting the environment variable
|
||||
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before running `kube-up.sh`.
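For example, a minimal sketch, assuming you run `kube-up.sh` from the root of a Kubernetes checkout where it lives under `cluster/`:

```shell
# Disable the node problem detector addon for the next cluster bring-up
KUBE_ENABLE_NODE_PROBLEM_DETECTOR=false cluster/kube-up.sh
```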
|
||||
|
||||
## Use in Other Environment
|
||||
|
||||
To enable node problem detector in environments outside of GCE, you can use
|
||||
either `kubectl` or an addon pod.
|
||||
|
||||
### Kubectl
|
||||
|
||||
This is the recommended way to start node problem detector outside of GCE. It
|
||||
provides more flexible management, such as overwriting the default
|
||||
configuration to fit your environment or to detect
|
||||
custom node problems.
|
||||
|
||||
* **Step 1:** Create `node-problem-detector.yaml`:
|
||||
|
||||
```yaml
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: node-problem-detector-v0.1
|
||||
namespace: kube-system
|
||||
labels:
|
||||
k8s-app: node-problem-detector
|
||||
version: v0.1
|
||||
kubernetes.io/cluster-service: "true"
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
k8s-app: node-problem-detector
|
||||
version: v0.1
|
||||
kubernetes.io/cluster-service: "true"
|
||||
spec:
|
||||
hostNetwork: true
|
||||
containers:
|
||||
- name: node-problem-detector
|
||||
image: gcr.io/google_containers/node-problem-detector:v0.1
|
||||
securityContext:
|
||||
privileged: true
|
||||
resources:
|
||||
limits:
|
||||
cpu: "200m"
|
||||
memory: "100Mi"
|
||||
requests:
|
||||
cpu: "20m"
|
||||
memory: "20Mi"
|
||||
volumeMounts:
|
||||
- name: log
|
||||
mountPath: /log
|
||||
readOnly: true
|
||||
volumes:
|
||||
- name: log
|
||||
hostPath:
|
||||
path: /var/log/
|
||||
```
|
||||
|
||||
***Notice that you should make sure the system log directory is right for your
|
||||
OS distro.***
|
||||
|
||||
* **Step 2:** Start node problem detector with `kubectl`:
|
||||
|
||||
```shell
|
||||
kubectl create -f node-problem-detector.yaml
|
||||
```
|
||||
|
||||
### Addon Pod
|
||||
|
||||
This is for those who have their own cluster bootstrap solution, and don't need
|
||||
to overwrite the default configuration. They could leverage the addon pod to
|
||||
further automate the deployment.
|
||||
|
||||
Just create `node-problem-detector.yaml`, and put it under the addon pods directory
|
||||
`/etc/kubernetes/addons/node-problem-detector` on the master node.
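For example, a minimal sketch of the copy step on the master node (paths as described above):

```shell
mkdir -p /etc/kubernetes/addons/node-problem-detector
cp node-problem-detector.yaml /etc/kubernetes/addons/node-problem-detector/
```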
|
||||
|
||||
## Overwrite the Configuration
|
||||
|
||||
The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
|
||||
is embedded when building the docker image of node problem detector.
|
||||
|
||||
However, you can use [ConfigMap](/docs/user-guide/configmap/) to overwrite it
|
||||
by following these steps:
|
||||
|
||||
* **Step 1:** Change the config files in `config/`.
|
||||
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
|
||||
node-problem-detector-config --from-file=config/`.
|
||||
* **Step 3:** Change the `node-problem-detector.yaml` to use the ConfigMap:
|
||||
|
||||
```yaml
|
||||
apiVersion: extensions/v1beta1
|
||||
kind: DaemonSet
|
||||
metadata:
|
||||
name: node-problem-detector-v0.1
|
||||
namespace: kube-system
|
||||
labels:
|
||||
k8s-app: node-problem-detector
|
||||
version: v0.1
|
||||
kubernetes.io/cluster-service: "true"
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
k8s-app: node-problem-detector
|
||||
version: v0.1
|
||||
kubernetes.io/cluster-service: "true"
|
||||
spec:
|
||||
hostNetwork: true
|
||||
containers:
|
||||
- name: node-problem-detector
|
||||
image: gcr.io/google_containers/node-problem-detector:v0.1
|
||||
securityContext:
|
||||
privileged: true
|
||||
resources:
|
||||
limits:
|
||||
cpu: "200m"
|
||||
memory: "100Mi"
|
||||
requests:
|
||||
cpu: "20m"
|
||||
memory: "20Mi"
|
||||
volumeMounts:
|
||||
- name: log
|
||||
mountPath: /log
|
||||
readOnly: true
|
||||
- name: config # Overwrite the config/ directory with ConfigMap volume
|
||||
mountPath: /config
|
||||
readOnly: true
|
||||
volumes:
|
||||
- name: log
|
||||
hostPath:
|
||||
path: /var/log/
|
||||
- name: config # Define ConfigMap volume
|
||||
configMap:
|
||||
name: node-problem-detector-config
|
||||
```
|
||||
|
||||
* **Step 4:** Re-create the node problem detector with the new yaml file:
|
||||
|
||||
```shell
|
||||
kubectl delete -f node-problem-detector.yaml # If you have a node-problem-detector running
|
||||
kubectl create -f node-problem-detector.yaml
|
||||
```
|
||||
|
||||
***Notice that this approach only applies to node problem detector started with `kubectl`.***
|
||||
|
||||
For a node problem detector running as a cluster addon, configuration overwriting is not
|
||||
currently supported, because the addon manager does not support ConfigMap.
|
||||
|
||||
## Kernel Monitor
|
||||
|
||||
*Kernel Monitor* is a problem daemon in node problem detector. It monitors the kernel log
|
||||
and detects known kernel issues by following predefined rules.
|
||||
|
||||
The Kernel Monitor matches kernel issues against a set of predefined rules in
|
||||
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
|
||||
The rule list is extensible, and you can always extend it by [overwriting the
|
||||
configuration](/docs/admin/node-problem/#overwrite-the-configuration).
|
||||
|
||||
### Add New NodeConditions
|
||||
|
||||
To support new node conditions, you can extend the `conditions` field in
|
||||
`config/kernel-monitor.json` with a new condition definition:
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "NodeConditionType",
|
||||
"reason": "CamelCaseDefaultNodeConditionReason",
|
||||
"message": "arbitrary default node condition message"
|
||||
}
|
||||
```
|
||||
|
||||
### Detect New Problems
|
||||
|
||||
To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
|
||||
with a new rule definition:
|
||||
|
||||
```json
|
||||
{
|
||||
"type": "temporary/permanent",
|
||||
"condition": "NodeConditionOfPermanentIssue",
|
||||
"reason": "CamelCaseShortReason",
|
||||
"message": "regexp matching the issue in the kernel log"
|
||||
}
|
||||
```
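
For example, a hypothetical rule that reports hung tasks as a temporary problem (an Event only, so no `condition` field is needed) might look like:

```json
{
  "type": "temporary",
  "reason": "TaskHung",
  "message": "task \\S+ blocked for more than \\w+ seconds"
}
```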
|
||||
|
||||
### Change Log Path
|
||||
|
||||
The kernel log may be located at different paths in different OS distros. The `log`
|
||||
field in `config/kernel-monitor.json` is the log path inside the container.
|
||||
You can always configure it to match your OS distro.
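For example, on Ubuntu or Debian, with the DaemonSet above mounting `/var/log` at `/log`, the field might be set as follows (a sketch; the kernel log file name depends on your distro):

```json
{
  "log": "/log/kern.log"
}
```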
|
||||
|
||||
### Support Other Log Format
|
||||
|
||||
Kernel monitor uses [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
|
||||
plugin to translate the kernel log into its internal data structure. It is easy to
|
||||
implement a new translator for a new log format.
|
||||
|
||||
## Caveats
|
||||
|
||||
It is recommended to run the node problem detector in your cluster to monitor
|
||||
the node health. However, you should be aware that this will introduce extra
|
||||
resource overhead on each node. Usually this is fine, because:
|
||||
|
||||
* The kernel log is generated relatively slowly.
|
||||
* Resource limit is set for node problem detector.
|
||||
* Even under high load, the resource usage is acceptable.
|
||||
(see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))
|
|
@ -0,0 +1,6 @@
|
|||
FROM python
|
||||
RUN pip install redis
|
||||
COPY ./worker.py /worker.py
|
||||
COPY ./rediswq.py /rediswq.py
|
||||
|
||||
CMD python worker.py
|
|
@ -0,0 +1,213 @@
|
|||
---
|
||||
title: Fine Parallel Processing Using a Work Queue
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
# Example: Job with Work Queue with Multiple Work Items Per Pod
|
||||
|
||||
In this example, we will run a Kubernetes Job with multiple parallel
|
||||
worker processes. You may want to be familiar with the basic,
|
||||
non-parallel, use of [Job](/docs/concepts/jobs/run-to-completion-finite-workloads/) first.
|
||||
|
||||
In this example, as each pod is created, it picks up one unit of work at a time
|
||||
from a task queue, completes it, deletes it from the queue, and repeats until the queue is empty, then exits.
|
||||
|
||||
|
||||
Here is an overview of the steps in this example:
|
||||
|
||||
1. **Start a storage service to hold the work queue.** In this example, we use Redis to store
|
||||
our work items. In the previous example, we used RabbitMQ. Here, we use Redis and
|
||||
a custom work-queue client library because AMQP does not provide a good way for clients to
|
||||
detect when a finite-length work queue is empty. In practice you would set up a store such
|
||||
as Redis once and reuse it for the work queues of many jobs, and other things.
|
||||
1. **Create a queue, and fill it with messages.** Each message represents one task to be done. In
|
||||
this example, a message is just an integer that we will do a lengthy computation on.
|
||||
1. **Start a Job that works on tasks from the queue**. The Job starts several pods. Each pod takes
|
||||
one task from the message queue, processes it, and repeats until the end of the queue is reached.
|
||||
|
||||
|
||||
## Starting Redis
|
||||
|
||||
For this example, for simplicity, we will start a single instance of Redis.
|
||||
See the [Redis Example](https://github.com/kubernetes/kubernetes/tree/master/examples/guestbook) for an example
|
||||
of deploying Redis scalably and redundantly.
|
||||
|
||||
Start a temporary Pod running Redis and a service so we can find it.
|
||||
|
||||
```shell
|
||||
$ kubectl create -f docs/tasks/job/fine-parallel-processing-work-queue/redis-pod.yaml
|
||||
pod "redis-master" created
|
||||
$ kubectl create -f docs/tasks/job/fine-parallel-processing-work-queue/redis-service.yaml
|
||||
service "redis" created
|
||||
```
|
||||
|
||||
If you're not working from the source tree, you could also download [`redis-pod.yaml`](redis-pod.yaml?raw=true) and [`redis-service.yaml`](redis-service.yaml?raw=true) directly.
|
||||
|
||||
## Filling the Queue with tasks
|
||||
|
||||
Now let's fill the queue with some "tasks". In our example, our tasks are just strings to be
|
||||
printed.
|
||||
|
||||
Start a temporary interactive pod for running the Redis CLI
|
||||
|
||||
```shell
|
||||
$ kubectl run -i --tty temp --image redis --command "/bin/sh"
|
||||
Waiting for pod default/redis2-c7h78 to be running, status is Pending, pod ready: false
|
||||
Hit enter for command prompt
|
||||
```
|
||||
|
||||
Now hit enter, start the redis CLI, and create a list with some work items in it.
|
||||
|
||||
```
|
||||
# redis-cli -h redis
|
||||
redis:6379> rpush job2 "apple"
|
||||
(integer) 1
|
||||
redis:6379> rpush job2 "banana"
|
||||
(integer) 2
|
||||
redis:6379> rpush job2 "cherry"
|
||||
(integer) 3
|
||||
redis:6379> rpush job2 "date"
|
||||
(integer) 4
|
||||
redis:6379> rpush job2 "fig"
|
||||
(integer) 5
|
||||
redis:6379> rpush job2 "grape"
|
||||
(integer) 6
|
||||
redis:6379> rpush job2 "lemon"
|
||||
(integer) 7
|
||||
redis:6379> rpush job2 "melon"
|
||||
(integer) 8
|
||||
redis:6379> rpush job2 "orange"
|
||||
(integer) 9
|
||||
redis:6379> lrange job2 0 -1
|
||||
1) "apple"
|
||||
2) "banana"
|
||||
3) "cherry"
|
||||
4) "date"
|
||||
5) "fig"
|
||||
6) "grape"
|
||||
7) "lemon"
|
||||
8) "melon"
|
||||
9) "orange"
|
||||
```
|
||||
|
||||
So, the list with key `job2` will be our work queue.
|
||||
|
||||
Note: if you do not have Kube DNS set up correctly, you may need to change
|
||||
the first step of the above block to `redis-cli -h $REDIS_SERVICE_HOST`.
|
||||
|
||||
|
||||
## Create an Image
|
||||
|
||||
Now we are ready to create an image that we will run.
|
||||
|
||||
We will use a python worker program with a redis client to read
|
||||
the messages from the message queue.
|
||||
|
||||
A simple Redis work queue client library is provided,
|
||||
called rediswq.py ([Download](rediswq.py?raw=true)).
|
||||
|
||||
The "worker" program in each Pod of the Job uses the work queue
|
||||
client library to get work. Here it is:
|
||||
|
||||
{% include code.html language="python" file="worker.py" ghlink="/docs/tasks/job/fine-parallel-processing-work-queue/worker.py" %}
|
||||
|
||||
If you are working from the source tree,
|
||||
change directory to the `docs/tasks/job/fine-parallel-processing-work-queue/` directory.
|
||||
Otherwise, download [`worker.py`](worker.py?raw=true), [`rediswq.py`](rediswq.py?raw=true), and [`Dockerfile`](Dockerfile?raw=true)
|
||||
using the above links. Then build the image:
|
||||
|
||||
```shell
|
||||
docker build -t job-wq-2 .
|
||||
```
|
||||
|
||||
### Push the image
|
||||
|
||||
For the [Docker Hub](https://hub.docker.com/), tag your app image with
|
||||
your username and push to the Hub with the below commands. Replace
|
||||
`<username>` with your Hub username.
|
||||
|
||||
```shell
|
||||
docker tag job-wq-2 <username>/job-wq-2
|
||||
docker push <username>/job-wq-2
|
||||
```
|
||||
|
||||
You need to push to a public repository or [configure your cluster to be able to access
|
||||
your private repository](/docs/user-guide/images).
|
||||
|
||||
If you are using [Google Container
|
||||
Registry](https://cloud.google.com/tools/container-registry/), tag
|
||||
your app image with your project ID, and push to GCR. Replace
|
||||
`<project>` with your project ID.
|
||||
|
||||
```shell
|
||||
docker tag job-wq-2 gcr.io/<project>/job-wq-2
|
||||
gcloud docker push gcr.io/<project>/job-wq-2
|
||||
```
|
||||
|
||||
## Defining a Job
|
||||
|
||||
Here is the job definition:
|
||||
|
||||
{% include code.html language="yaml" file="job.yaml" ghlink="/docs/tasks/job/fine-parallel-processing-work-queue/job.yaml" %}
|
||||
|
||||
Be sure to edit the job template to
|
||||
change `gcr.io/myproject` to your own path.
|
||||
|
||||
In this example, each pod works on several items from the queue and then exits when there are no more items.
|
||||
Since the workers themselves detect when the workqueue is empty, and the Job controller does not
|
||||
know about the workqueue, it relies on the workers to signal when they are done working.
|
||||
The workers signal that the queue is empty by exiting with success. So, as soon as any worker
|
||||
exits with success, the controller knows the work is done, and the Pods will exit soon.
|
||||
So, we leave the completion count of the Job unset; once any Pod succeeds, the work is considered done. The job controller will wait for the other pods to complete
|
||||
too.
|
||||
|
||||
|
||||
## Running the Job
|
||||
|
||||
So, now run the Job:
|
||||
|
||||
```shell
|
||||
kubectl create -f ./job.yaml
|
||||
```
|
||||
|
||||
Now wait a bit, then check on the job.
|
||||
|
||||
```shell
|
||||
$ kubectl describe jobs/job-wq-2
|
||||
Name: job-wq-2
|
||||
Namespace: default
|
||||
Image(s): gcr.io/exampleproject/job-wq-2
|
||||
Selector: app in (job-wq-2)
|
||||
Parallelism: 2
|
||||
Completions: Unset
|
||||
Start Time: Mon, 11 Jan 2016 17:07:59 -0800
|
||||
Labels: app=job-wq-2
|
||||
Pods Statuses: 1 Running / 0 Succeeded / 0 Failed
|
||||
No volumes.
|
||||
Events:
|
||||
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
|
||||
--------- -------- ----- ---- ------------- -------- ------ -------
|
||||
33s 33s 1 {job-controller } Normal SuccessfulCreate Created pod: job-wq-2-lglf8
|
||||
|
||||
|
||||
$ kubectl logs pods/job-wq-2-7r7b2
|
||||
Worker with sessionID: bbd72d0a-9e5c-4dd6-abf6-416cc267991f
|
||||
Initial queue state: empty=False
|
||||
Working on banana
|
||||
Working on date
|
||||
Working on lemon
|
||||
```
|
||||
|
||||
As you can see, one of our pods worked on several work units.
|
||||
|
||||
## Alternatives
|
||||
|
||||
If running a queue service or modifying your containers to use a work queue is inconvenient, you may
|
||||
want to consider one of the other [job patterns](/docs/concepts/jobs/run-to-completion-finite-workloads/#job-patterns).
|
||||
|
||||
If you have a continuous stream of background processing work to run, then
|
||||
consider running your background workers with a `ReplicationController` instead,
|
||||
and consider using a background processing library such as
|
||||
https://github.com/resque/resque.
|
|
@ -0,0 +1,14 @@
|
|||
apiVersion: batch/v1
|
||||
kind: Job
|
||||
metadata:
|
||||
name: job-wq-2
|
||||
spec:
|
||||
parallelism: 2
|
||||
template:
|
||||
metadata:
|
||||
name: job-wq-2
|
||||
spec:
|
||||
containers:
|
||||
- name: c
|
||||
image: gcr.io/myproject/job-wq-2
|
||||
restartPolicy: OnFailure
|
|
@ -0,0 +1,15 @@
|
|||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: redis-master
|
||||
labels:
|
||||
app: redis
|
||||
spec:
|
||||
containers:
|
||||
- name: master
|
||||
image: redis
|
||||
env:
|
||||
- name: MASTER
|
||||
value: "true"
|
||||
ports:
|
||||
- containerPort: 6379
|
|
@ -0,0 +1,10 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: redis
|
||||
spec:
|
||||
ports:
|
||||
- port: 6379
|
||||
targetPort: 6379
|
||||
selector:
|
||||
app: redis
|
|
@ -0,0 +1,130 @@
|
|||
#!/usr/bin/env python
|
||||
|
||||
# Based on http://peter-hoffmann.com/2012/python-simple-queue-redis-queue.html
|
||||
# and the suggestion in the redis documentation for RPOPLPUSH, at
|
||||
# http://redis.io/commands/rpoplpush, which suggests how to implement a work-queue.
|
||||
|
||||
|
||||
import redis
|
||||
import uuid
|
||||
import hashlib
|
||||
|
||||
class RedisWQ(object):
|
||||
"""Simple Finite Work Queue with Redis Backend
|
||||
|
||||
This work queue is finite: as long as no more work is added
|
||||
after workers start, the workers can detect when the queue
|
||||
is completely empty.
|
||||
|
||||
The items in the work queue are assumed to have unique values.
|
||||
|
||||
This object is not intended to be used by multiple threads
|
||||
concurrently.
|
||||
"""
|
||||
def __init__(self, name, **redis_kwargs):
|
||||
"""The default connection parameters are: host='localhost', port=6379, db=0
|
||||
|
||||
The work queue is identified by "name". The library may create other
|
||||
keys with "name" as a prefix.
|
||||
"""
|
||||
self._db = redis.StrictRedis(**redis_kwargs)
|
||||
# The session ID will uniquely identify this "worker".
|
||||
self._session = str(uuid.uuid4())
|
||||
# Work queue is implemented as two queues: main, and processing.
|
||||
# Work is initially in main, and moved to processing when a client picks it up.
|
||||
self._main_q_key = name
|
||||
self._processing_q_key = name + ":processing"
|
||||
self._lease_key_prefix = name + ":leased_by_session:"
|
||||
|
||||
def sessionID(self):
|
||||
"""Return the ID for this session."""
|
||||
return self._session
|
||||
|
||||
def _main_qsize(self):
|
||||
"""Return the size of the main queue."""
|
||||
return self._db.llen(self._main_q_key)
|
||||
|
||||
def _processing_qsize(self):
|
||||
"""Return the size of the main queue."""
|
||||
return self._db.llen(self._processing_q_key)
|
||||
|
||||
def empty(self):
|
||||
"""Return True if the queue is empty, including work being done, False otherwise.
|
||||
|
||||
False does not necessarily mean that there is work available to work on right now.
|
||||
"""
|
||||
return self._main_qsize() == 0 and self._processing_qsize() == 0
|
||||
|
||||
# TODO: implement this
|
||||
# def check_expired_leases(self):
|
||||
# """Return to the work queueReturn True if the queue is empty, False otherwise."""
|
||||
# # Processing list should not be _too_ long since it is approximately as long
|
||||
# # as the number of active and recently active workers.
|
||||
# processing = self._db.lrange(self._processing_q_key, 0, -1)
|
||||
# for item in processing:
|
||||
# # If the lease key is not present for an item (it expired or was
|
||||
# # never created because the client crashed before creating it)
|
||||
# # then move the item back to the main queue so others can work on it.
|
||||
# if not self._lease_exists(item):
|
||||
# TODO: transactionally move the key from processing queue to
|
||||
# to main queue, while detecting if a new lease is created
|
||||
# or if either queue is modified.
|
||||
|
||||
def _itemkey(self, item):
|
||||
"""Returns a string that uniquely identifies an item (bytes)."""
|
||||
return hashlib.sha224(item).hexdigest()
|
||||
|
||||
def _lease_exists(self, item):
|
||||
"""True if a lease on 'item' exists."""
|
||||
return self._db.exists(self._lease_key_prefix + self._itemkey(item))
|
||||
|
||||
def lease(self, lease_secs=60, block=True, timeout=None):
|
||||
"""Begin working on an item the work queue.
|
||||
|
||||
Lease the item for lease_secs. After that time, other
|
||||
workers may consider this client to have crashed or stalled
|
||||
and pick up the item instead.
|
||||
|
||||
If optional args block is true and timeout is None (the default), block
|
||||
if necessary until an item is available."""
|
||||
if block:
|
||||
item = self._db.brpoplpush(self._main_q_key, self._processing_q_key, timeout=timeout)
|
||||
else:
|
||||
item = self._db.rpoplpush(self._main_q_key, self._processing_q_key)
|
||||
if item:
|
||||
# Record that we (this session id) are working on a key. Expire that
|
||||
# note after the lease timeout.
|
||||
# Note: if we crash at this line of the program, then GC will see no lease
|
||||
# for this item and later return it to the main queue.
|
||||
itemkey = self._itemkey(item)
|
||||
self._db.setex(self._lease_key_prefix + itemkey, lease_secs, self._session)
|
||||
return item
|
||||
|
||||
def complete(self, value):
|
||||
"""Complete working on the item with 'value'.
|
||||
|
||||
If the lease expired, the item may not have completed, and some
|
||||
other worker may have picked it up. There is no indication
|
||||
of what happened.
|
||||
"""
|
||||
self._db.lrem(self._processing_q_key, 0, value)
|
||||
# If we crash here, then the GC code will try to move the value, but it will
|
||||
# not be here, which is fine. So this does not need to be a transaction.
|
||||
itemkey = self._itemkey(value)
|
||||
self._db.delete(self._lease_key_prefix + itemkey)
|
||||
|
||||
# TODO: add functions to clean up all keys associated with "name" when
|
||||
# processing is complete.
|
||||
|
||||
# TODO: add a function to add an item to the queue. Atomically
|
||||
# check if the queue is empty and if so fail to add the item
|
||||
# since other workers might think work is done and be in the process
|
||||
# of exiting.
|
||||
|
||||
# TODO(etune): move to my own github for hosting, e.g. github.com/erictune/rediswq-py and
|
||||
# make it so it can be pip installed by anyone (see
|
||||
# http://stackoverflow.com/questions/8247605/configuring-so-that-pip-install-can-work-from-github)
|
||||
|
||||
# TODO(etune): finish code to GC expired leases, and call periodically
|
||||
# e.g. each time lease times out.
|
||||
|
|
@ -0,0 +1,23 @@
|
|||
#!/usr/bin/env python
|
||||
|
||||
import time
|
||||
import rediswq
|
||||
|
||||
host="redis"
|
||||
# Uncomment next two lines if you do not have Kube-DNS working.
|
||||
# import os
|
||||
# host = os.getenv("REDIS_SERVICE_HOST")
|
||||
|
||||
q = rediswq.RedisWQ(name="job2", host=host)
|
||||
print("Worker with sessionID: " + q.sessionID())
|
||||
print("Initial queue state: empty=" + str(q.empty()))
|
||||
while not q.empty():
|
||||
item = q.lease(lease_secs=10, block=True, timeout=2)
|
||||
if item is not None:
|
||||
itemstr = item.decode("utf-8")
|
||||
print("Working on " + itemstr)
|
||||
time.sleep(10) # Put your actual work here instead of sleep.
|
||||
q.complete(item)
|
||||
else:
|
||||
print("Waiting for work")
|
||||
print("Queue empty, exiting")
|
|
@ -0,0 +1,18 @@
|
|||
apiVersion: batch/v1
|
||||
kind: Job
|
||||
metadata:
|
||||
name: process-item-$ITEM
|
||||
labels:
|
||||
jobgroup: jobexample
|
||||
spec:
|
||||
template:
|
||||
metadata:
|
||||
name: jobexample
|
||||
labels:
|
||||
jobgroup: jobexample
|
||||
spec:
|
||||
containers:
|
||||
- name: c
|
||||
image: busybox
|
||||
command: ["sh", "-c", "echo Processing item $ITEM && sleep 5"]
|
||||
restartPolicy: Never
|
|
@ -0,0 +1,195 @@
|
|||
---
|
||||
title: Parallel Processing using Expansions
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
# Example: Multiple Job Objects from Template Expansion
|
||||
|
||||
In this example, we will run multiple Kubernetes Jobs created from
|
||||
a common template. You may want to be familiar with the basic,
|
||||
non-parallel, use of [Jobs](/docs/concepts/jobs/run-to-completion-finite-workloads/) first.
|
||||
|
||||
## Basic Template Expansion
|
||||
|
||||
First, download the following template of a job to a file called `job.yaml.txt`
|
||||
|
||||
{% include code.html language="yaml" file="job.yaml" ghlink="/docs/tasks/job/parallel-processing-expansion/job.yaml" %}
|
||||
|
||||
Unlike a *pod template*, our *job template* is not a Kubernetes API type. It is just
|
||||
a yaml representation of a Job object that has some placeholders that need to be filled
|
||||
in before it can be used. The `$ITEM` syntax is not meaningful to Kubernetes.
|
||||
|
||||
In this example, the only processing the container does is to `echo` a string and sleep for a bit.
|
||||
In a real use case, the processing would be some substantial computation, such as rendering a frame
|
||||
of a movie, or processing a range of rows in a database. The "$ITEM" parameter would specify, for
|
||||
example, the frame number or the row range.
|
||||
|
||||
This Job and its Pod template have a label: `jobgroup=jobexample`. There is nothing special
|
||||
to the system about this label. This label
|
||||
makes it convenient to operate on all the jobs in this group at once.
|
||||
We also put the same label on the pod template so that we can check on all Pods of these Jobs
|
||||
with a single command.
|
||||
After the job is created, the system will add more labels that distinguish one Job's pods
|
||||
from another Job's pods.
|
||||
Note that the label key `jobgroup` is not special to Kubernetes. You can pick your own label scheme.
|
||||
|
||||
Next, expand the template into multiple files, one for each item to be processed.
|
||||
|
||||
```shell
|
||||
# Expand files into a temporary directory
|
||||
mkdir ./jobs
|
||||
for i in apple banana cherry
|
||||
do
|
||||
cat job.yaml.txt | sed "s/\$ITEM/$i/" > ./jobs/job-$i.yaml
|
||||
done
|
||||
```
|
||||
|
||||
Check if it worked:
|
||||
|
||||
```shell
|
||||
$ ls jobs/
|
||||
job-apple.yaml
|
||||
job-banana.yaml
|
||||
job-cherry.yaml
|
||||
```
|
||||
|
||||
Here, we used `sed` to replace the string `$ITEM` with the loop variable.
|
||||
You could use any type of template language (jinja2, erb) or write a program
|
||||
to generate the Job objects.
|
||||
|
||||
Next, create all the jobs with one kubectl command:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f ./jobs
|
||||
job "process-item-apple" created
|
||||
job "process-item-banana" created
|
||||
job "process-item-cherry" created
|
||||
```
|
||||
|
||||
Now, check on the jobs:
|
||||
|
||||
```shell
|
||||
$ kubectl get jobs -l jobgroup=jobexample
|
||||
JOB CONTAINER(S) IMAGE(S) SELECTOR SUCCESSFUL
|
||||
process-item-apple c busybox app in (jobexample),item in (apple) 1
|
||||
process-item-banana c busybox app in (jobexample),item in (banana) 1
|
||||
process-item-cherry c busybox app in (jobexample),item in (cherry) 1
|
||||
```
|
||||
|
||||
Here we use the `-l` option to select all jobs that are part of this
|
||||
group of jobs. (There might be other unrelated jobs in the system that we
|
||||
do not care to see.)
|
||||
|
||||
We can check on the pods as well using the same label selector:
|
||||
|
||||
```shell
|
||||
$ kubectl get pods -l jobgroup=jobexample --show-all
|
||||
NAME READY STATUS RESTARTS AGE
|
||||
process-item-apple-kixwv 0/1 Completed 0 4m
|
||||
process-item-banana-wrsf7 0/1 Completed 0 4m
|
||||
process-item-cherry-dnfu9 0/1 Completed 0 4m
|
||||
```
|
||||
|
||||
There is not a single command to check on the output of all jobs at once,
|
||||
but looping over all the pods is pretty easy:
|
||||
|
||||
```shell
|
||||
$ for p in $(kubectl get pods -l jobgroup=jobexample -o name)
|
||||
do
|
||||
kubectl logs $p
|
||||
done
|
||||
Processing item apple
|
||||
Processing item banana
|
||||
Processing item cherry
|
||||
```
|
||||
|
||||
## Multiple Template Parameters
|
||||
|
||||
In the first example, each instance of the template had one parameter, and that parameter was also
|
||||
used as a label. However, label keys are limited in [what characters they can
|
||||
contain](/docs/user-guide/labels/#syntax-and-character-set).
|
||||
|
||||
This slightly more complex example uses the jinja2 template language to generate our objects.
|
||||
We will use a one-line python script to convert the template to a file.
|
||||
|
||||
First, copy and paste the following template of a Job object into a file called `job.yaml.jinja2`:
|
||||
|
||||
|
||||
```liquid{% raw %}
|
||||
{%- set params = [{ "name": "apple", "url": "http://www.orangepippin.com/apples", },
|
||||
{ "name": "banana", "url": "https://en.wikipedia.org/wiki/Banana", },
|
||||
{ "name": "raspberry", "url": "https://www.raspberrypi.org/" }]
|
||||
%}
|
||||
{%- for p in params %}
|
||||
{%- set name = p["name"] %}
|
||||
{%- set url = p["url"] %}
|
||||
apiVersion: batch/v1
|
||||
kind: Job
|
||||
metadata:
|
||||
name: jobexample-{{ name }}
|
||||
labels:
|
||||
jobgroup: jobexample
|
||||
spec:
|
||||
template:
|
||||
name: jobexample
|
||||
labels:
|
||||
jobgroup: jobexample
|
||||
spec:
|
||||
containers:
|
||||
- name: c
|
||||
image: busybox
|
||||
command: ["sh", "-c", "echo Processing URL {{ url }} && sleep 5"]
|
||||
restartPolicy: Never
|
||||
---
|
||||
{%- endfor %}
|
||||
{% endraw %}
|
||||
```
|
||||
|
||||
The above template defines parameters for each job object using a list of
|
||||
python dicts (lines 1-4). Then a for loop emits one job yaml object
|
||||
for each set of parameters (remaining lines).
|
||||
We take advantage of the fact that multiple yaml documents can be concatenated
|
||||
with the `---` separator (second to last line).
|
||||
We can pipe the output directly to kubectl to
|
||||
create the objects.
|
||||
|
||||
You will need the jinja2 package if you do not already have it: `pip install --user jinja2`.
|
||||
Now, use this one-line python program to expand the template:
|
||||
|
||||
```shell
|
||||
alias render_template='python -c "from jinja2 import Template; import sys; print(Template(sys.stdin.read()).render());"'
|
||||
```
|
||||
|
||||
|
||||
|
||||
The output can be saved to a file, like this:
|
||||
|
||||
```shell
|
||||
cat job.yaml.jinja2 | render_template > jobs.yaml
|
||||
```
|
||||
|
||||
or sent directly to kubectl, like this:
|
||||
|
||||
```shell
|
||||
cat job.yaml.jinja2 | render_template | kubectl create -f -
|
||||
```
|
||||
|
||||
## Alternatives
|
||||
|
||||
If you have a large number of job objects, you may find that:
|
||||
|
||||
- Even using labels, managing so many Job objects is cumbersome.
|
||||
- You exceed resource quota when creating all the Jobs at once,
|
||||
and do not want to wait to create them incrementally.
|
||||
- You need a way to easily scale the number of pods running
|
||||
concurrently. One reason would be to avoid using too many
|
||||
compute resources. Another would be to limit the number of
|
||||
concurrent requests to a shared resource, such as a database,
|
||||
used by all the pods in the job.
|
||||
- Very large numbers of jobs created at once overload the
|
||||
Kubernetes apiserver, controller, or scheduler.
|
||||
|
||||
In this case, you can consider one of the
|
||||
other [job patterns](/docs/concepts/jobs/run-to-completion-finite-workloads/#job-patterns).
|
|
@ -0,0 +1,10 @@
|
|||
# Specify BROKER_URL and QUEUE when running
|
||||
FROM ubuntu:14.04
|
||||
|
||||
RUN apt-get update && \
|
||||
apt-get install -y curl ca-certificates amqp-tools python \
|
||||
--no-install-recommends \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
COPY ./worker.py /worker.py
|
||||
|
||||
CMD /usr/bin/amqp-consume --url=$BROKER_URL -q $QUEUE -c 1 /worker.py
|
|
@ -0,0 +1,284 @@
|
|||
---
|
||||
title: Coarse Parallel Processing Using a Work Queue
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
# Example: Job with Work Queue with Pod Per Work Item
|
||||
|
||||
In this example, we will run a Kubernetes Job with multiple parallel
|
||||
worker processes. You may want to be familiar with the basic,
|
||||
non-parallel, use of [Job](/docs/concepts/jobs/run-to-completion-finite-workloads/) first.
|
||||
|
||||
In this example, as each pod is created, it picks up one unit of work
|
||||
from a task queue, completes it, deletes it from the queue, and exits.
|
||||
|
||||
|
||||
Here is an overview of the steps in this example:
|
||||
|
||||
1. **Start a message queue service.** In this example, we use RabbitMQ, but you could use another
|
||||
one. In practice you would set up a message queue service once and reuse it for many jobs.
|
||||
1. **Create a queue, and fill it with messages.** Each message represents one task to be done. In
|
||||
this example, a message is just an integer that we will do a lengthy computation on.
|
||||
1. **Start a Job that works on tasks from the queue**. The Job starts several pods. Each pod takes
|
||||
one task from the message queue, processes it, and repeats until the end of the queue is reached.
|
||||
|
||||
## Starting a message queue service
|
||||
|
||||
This example uses RabbitMQ, but it should be easy to adapt to another AMQP-type message service.
|
||||
|
||||
In practice you could set up a message queue service once in a
|
||||
cluster and reuse it for many jobs, as well as for long-running services.
|
||||
|
||||
Start RabbitMQ as follows:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f examples/celery-rabbitmq/rabbitmq-service.yaml
|
||||
service "rabbitmq-service" created
|
||||
$ kubectl create -f examples/celery-rabbitmq/rabbitmq-controller.yaml
|
||||
replicationController "rabbitmq-controller" created
|
||||
```
|
||||
|
||||
We will only use the rabbitmq part from the [celery-rabbitmq example](https://github.com/kubernetes/kubernetes/tree/release-1.3/examples/celery-rabbitmq).
|
||||
|
||||
## Testing the message queue service
|
||||
|
||||
Now, we can experiment with accessing the message queue. We will
|
||||
create a temporary interactive pod, install some tools on it,
|
||||
and experiment with queues.
|
||||
|
||||
First create a temporary interactive Pod.
|
||||
|
||||
```shell
|
||||
# Create a temporary interactive container
|
||||
$ kubectl run -i --tty temp --image ubuntu:14.04
|
||||
Waiting for pod default/temp-loe07 to be running, status is Pending, pod ready: false
|
||||
... [ previous line repeats several times .. hit return when it stops ] ...
|
||||
```
|
||||
|
||||
Note that your pod name and command prompt will be different.
|
||||
|
||||
Next install the `amqp-tools` so we can work with message queues.
|
||||
|
||||
```shell
|
||||
# Install some tools
|
||||
root@temp-loe07:/# apt-get update
|
||||
.... [ lots of output ] ....
|
||||
root@temp-loe07:/# apt-get install -y curl ca-certificates amqp-tools python dnsutils
|
||||
.... [ lots of output ] ....
|
||||
```
|
||||
|
||||
Later, we will make a docker image that includes these packages.
|
||||
|
||||
Next, we will check that we can discover the rabbitmq service:
|
||||
|
||||
```
|
||||
# Note the rabbitmq-service has a DNS name, provided by Kubernetes:
|
||||
|
||||
root@temp-loe07:/# nslookup rabbitmq-service
|
||||
Server: 10.0.0.10
|
||||
Address: 10.0.0.10#53
|
||||
|
||||
Name: rabbitmq-service.default.svc.cluster.local
|
||||
Address: 10.0.147.152
|
||||
|
||||
# Your address will vary.
|
||||
```
|
||||
|
||||
If Kube-DNS is not set up correctly, the previous step may not work for you.
|
||||
You can also find the service IP in an env var:
|
||||
|
||||
```
|
||||
# env | grep RABBIT | grep HOST
|
||||
RABBITMQ_SERVICE_SERVICE_HOST=10.0.147.152
|
||||
# Your address will vary.
|
||||
```
|
||||
|
||||
Next we will verify we can create a queue, and publish and consume messages.
|
||||
|
||||
```shell
|
||||
# In the next line, rabbitmq-service is the hostname where the rabbitmq-service
|
||||
# can be reached. 5672 is the standard port for rabbitmq.
|
||||
|
||||
root@temp-loe07:/# export BROKER_URL=amqp://guest:guest@rabbitmq-service:5672
|
||||
# If you could not resolve "rabbitmq-service" in the previous step,
|
||||
# then use this command instead:
|
||||
# root@temp-loe07:/# BROKER_URL=amqp://guest:guest@$RABBITMQ_SERVICE_SERVICE_HOST:5672
|
||||
|
||||
# Now create a queue:
|
||||
|
||||
root@temp-loe07:/# /usr/bin/amqp-declare-queue --url=$BROKER_URL -q foo -d
|
||||
foo
|
||||
|
||||
# Publish one message to it:
|
||||
|
||||
root@temp-loe07:/# /usr/bin/amqp-publish --url=$BROKER_URL -r foo -p -b Hello
|
||||
|
||||
# And get it back.
|
||||
|
||||
root@temp-loe07:/# /usr/bin/amqp-consume --url=$BROKER_URL -q foo -c 1 cat && echo
|
||||
Hello
|
||||
root@temp-loe07:/#
|
||||
```
|
||||
|
||||
In the last command, the `amqp-consume` tool takes one message (`-c 1`)
|
||||
from the queue, and passes that message to the standard input of an arbitrary command. In this case, the program `cat` is just printing
|
||||
out what it gets on the standard input, and the echo is just to add a carriage
|
||||
return so the example is readable.
|
||||
|
||||
## Filling the Queue with tasks
|
||||
|
||||
Now let's fill the queue with some "tasks". In our example, our tasks are just strings to be
|
||||
printed.
|
||||
|
||||
In practice, the content of the messages might be:
|
||||
|
||||
- names of files that need to be processed
|
||||
- extra flags to the program
|
||||
- ranges of keys in a database table
|
||||
- configuration parameters to a simulation
|
||||
- frame numbers of a scene to be rendered
|
||||
|
||||
In practice, if there is a large amount of data that is needed in read-only mode by all pods
|
||||
of the Job, you will typically put that in a shared file system like NFS and mount
|
||||
that readonly on all the pods, or the program in the pod will natively read data from
|
||||
a cluster file system like HDFS.
|
||||
|
||||
For our example, we will create the queue and fill it using the amqp command line tools.
|
||||
In practice, you might write a program to fill the queue using an amqp client library.
|
||||
|
||||
```shell
|
||||
$ /usr/bin/amqp-declare-queue --url=$BROKER_URL -q job1 -d
|
||||
job1
|
||||
$ for f in apple banana cherry date fig grape lemon melon
|
||||
do
|
||||
/usr/bin/amqp-publish --url=$BROKER_URL -r job1 -p -b $f
|
||||
done
|
||||
```
|
||||
|
||||
So, we filled the queue with 8 messages.
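
Alternatively, a small program using an AMQP client library can create and fill the queue. Here is a minimal sketch using the `pika` Python library (an assumption; it is not installed in this example's image):

```python
#!/usr/bin/env python
# Sketch: fill the "job1" queue programmatically instead of using amqp-publish.
import pika

# Same broker URL as used with the amqp command line tools above.
params = pika.URLParameters("amqp://guest:guest@rabbitmq-service:5672")
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Durable queue, matching `amqp-declare-queue ... -d`.
channel.queue_declare(queue="job1", durable=True)

for f in ["apple", "banana", "cherry", "date", "fig", "grape", "lemon", "melon"]:
    # Persistent messages, matching `amqp-publish ... -p`.
    channel.basic_publish(exchange="", routing_key="job1", body=f,
                          properties=pika.BasicProperties(delivery_mode=2))

connection.close()
```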
|
||||
|
||||
## Create an Image
|
||||
|
||||
Now we are ready to create an image that we will run as a job.
|
||||
|
||||
We will use the `amqp-consume` utility to read the message
|
||||
from the queue and run our actual program. Here is a very simple
|
||||
example program:
|
||||
|
||||
{% include code.html language="python" file="worker.py" ghlink="/docs/tasks/job/work-queue-1/worker.py" %}
|
||||
|
||||
Now, build an image. If you are working in the source
|
||||
tree, then change directory to `examples/job/work-queue-1`.
|
||||
Otherwise, make a temporary directory, change to it,
|
||||
download the [Dockerfile](Dockerfile?raw=true),
|
||||
and [worker.py](worker.py?raw=true). In either case,
|
||||
build the image with this command:
|
||||
|
||||
```shell
|
||||
$ docker build -t job-wq-1 .
|
||||
```
|
||||
|
||||
For the [Docker Hub](https://hub.docker.com/), tag your app image with
|
||||
your username and push to the Hub with the below commands. Replace
|
||||
`<username>` with your Hub username.
|
||||
|
||||
```shell
|
||||
docker tag job-wq-1 <username>/job-wq-1
|
||||
docker push <username>/job-wq-1
|
||||
```
|
||||
|
||||
If you are using [Google Container
|
||||
Registry](https://cloud.google.com/tools/container-registry/), tag
|
||||
your app image with your project ID, and push to GCR. Replace
|
||||
`<project>` with your project ID.
|
||||
|
||||
```shell
|
||||
docker tag job-wq-1 gcr.io/<project>/job-wq-1
|
||||
gcloud docker push gcr.io/<project>/job-wq-1
|
||||
```
|
||||
|
||||
## Defining a Job
|
||||
|
||||
Here is a job definition. You'll need to make a copy of the Job and edit the
|
||||
image to match the name you used, and save it as `./job.yaml`.
|
||||
|
||||
|
||||
{% include code.html language="yaml" file="job.yaml" ghlink="/docs/tasks/job/work-queue-1/job.yaml" %}
|
||||
|
||||
In this example, each pod works on one item from the queue and then exits.
|
||||
So, the completion count of the Job corresponds to the number of work items
|
||||
done. So we set `.spec.completions: 8` for this example, since we put 8 items in the queue.
|
||||
|
||||
## Running the Job
|
||||
|
||||
So, now run the Job:
|
||||
|
||||
```shell
|
||||
kubectl create -f ./job.yaml
|
||||
```
|
||||
|
||||
Now wait a bit, then check on the job.
|
||||
|
||||
```shell
|
||||
$ kubectl describe jobs/job-wq-1
|
||||
Name: job-wq-1
|
||||
Namespace: default
|
||||
Image(s): gcr.io/causal-jigsaw-637/job-wq-1
|
||||
Selector: app in (job-wq-1)
|
||||
Parallelism: 2
|
||||
Completions: 8
|
||||
Labels: app=job-wq-1
|
||||
Pods Statuses: 0 Running / 8 Succeeded / 0 Failed
|
||||
No volumes.
|
||||
Events:
|
||||
FirstSeen LastSeen Count From SubobjectPath Reason Message
|
||||
───────── ──────── ───── ──── ───────────── ────── ───────
|
||||
27s 27s 1 {job } SuccessfulCreate Created pod: job-wq-1-hcobb
|
||||
27s 27s 1 {job } SuccessfulCreate Created pod: job-wq-1-weytj
|
||||
27s 27s 1 {job } SuccessfulCreate Created pod: job-wq-1-qaam5
|
||||
27s 27s 1 {job } SuccessfulCreate Created pod: job-wq-1-b67sr
|
||||
26s 26s 1 {job } SuccessfulCreate Created pod: job-wq-1-xe5hj
|
||||
15s 15s 1 {job } SuccessfulCreate Created pod: job-wq-1-w2zqe
|
||||
14s 14s 1 {job } SuccessfulCreate Created pod: job-wq-1-d6ppa
|
||||
14s 14s 1 {job } SuccessfulCreate Created pod: job-wq-1-p17e0
|
||||
```
|
||||
|
||||
All our pods succeeded. Yay.
|
||||
|
||||
|
||||
## Alternatives
|
||||
|
||||
This approach has the advantage that you
|
||||
do not need to modify your "worker" program to be aware that there is a work queue.
|
||||
|
||||
It does require that you run a message queue service.
|
||||
If running a queue service is inconvenient, you may
|
||||
want to consider one of the other [job patterns](/docs/concepts/jobs/run-to-completion-finite-workloads/#job-patterns).
|
||||
|
||||
This approach creates a pod for every work item. If your work items only take a few seconds,
|
||||
though, creating a Pod for every work item may add a lot of overhead. Consider another
|
||||
[example](/docs/tasks/job/fine-parallel-processing-work-queue/), that executes multiple work items per Pod.
|
||||
|
||||
In this example, we used the `amqp-consume` utility to read the message
|
||||
from the queue and run our actual program. This has the advantage that you
|
||||
do not need to modify your program to be aware of the queue.
|
||||
A [different example](/docs/tasks/job/fine-parallel-processing-work-queue/), shows how to
|
||||
communicate with the work queue using a client library.
|
||||
|
||||
## Caveats
|
||||
|
||||
If the number of completions is set to less than the number of items in the queue, then
|
||||
not all items will be processed.
|
||||
|
||||
If the number of completions is set to more than the number of items in the queue,
|
||||
then the Job will not appear to be completed, even though all items in the queue
|
||||
have been processed. It will start additional pods which will block waiting
|
||||
for a message.
|
||||
|
||||
There is an unlikely race with this pattern. If the container is killed in between the time
|
||||
that the message is acknowledged by the amqp-consume command and the time that the container
|
||||
exits with success, or if the node crashes before the kubelet is able to post the success of the pod
|
||||
back to the api-server, then the Job will not appear to be complete, even though all items
|
||||
in the queue have been processed.
|
|
@ -0,0 +1,20 @@
|
|||
apiVersion: batch/v1
|
||||
kind: Job
|
||||
metadata:
|
||||
name: job-wq-1
|
||||
spec:
|
||||
completions: 8
|
||||
parallelism: 2
|
||||
template:
|
||||
metadata:
|
||||
name: job-wq-1
|
||||
spec:
|
||||
containers:
|
||||
- name: c
|
||||
image: gcr.io/<project>/job-wq-1
|
||||
env:
|
||||
- name: BROKER_URL
|
||||
value: amqp://guest:guest@rabbitmq-service:5672
|
||||
- name: QUEUE
|
||||
value: job1
|
||||
restartPolicy: OnFailure
|
|
@ -0,0 +1,7 @@
|
|||
#!/usr/bin/env python
|
||||
|
||||
# Reads one message from standard input, prints it, and sleeps for 10 seconds.
|
||||
import sys
|
||||
import time
|
||||
print("Processing " + sys.stdin.lines())
|
||||
time.sleep(10)
|
|
@ -47,7 +47,7 @@ kubectl scale statefulsets <stateful-set-name> --replicas=<new-replicas>
|
|||
|
||||
### Alternative: `kubectl apply` / `kubectl edit` / `kubectl patch`
|
||||
|
||||
Alternatively, you can do [in-place updates](/docs/user-guide/managing-deployments/#in-place-updates-of-resources) on your StatefulSets.
|
||||
Alternatively, you can do [in-place updates](/docs/concepts/cluster-administration/manage-deployment/#in-place-updates-of-resources) on your StatefulSets.
|
||||
|
||||
If your StatefulSet was initially created with `kubectl apply` or `kubectl create --save-config`,
|
||||
update `.spec.replicas` of the StatefulSet manifests, and then do a `kubectl apply`:
|
||||
|
|
|
@ -0,0 +1,256 @@
|
|||
---
|
||||
assignees:
|
||||
- janetkuo
|
||||
title: Rolling Update Replication Controller
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## Overview
|
||||
|
||||
To update a service without an outage, `kubectl` supports what is called ['rolling update'](/docs/user-guide/kubectl/kubectl_rolling-update), which updates one pod at a time, rather than taking down the entire service at the same time. See the [rolling update design document](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/simple-rolling-update.md) and the [example of rolling update](/docs/tasks/run-application/rolling-update-replication-controller/) for more information.
|
||||
|
||||
Note that `kubectl rolling-update` only supports Replication Controllers. However, if you deploy applications with Replication Controllers,
|
||||
consider switching them to [Deployments](/docs/user-guide/deployments/). A Deployment is a higher-level controller that automates rolling updates
|
||||
of applications declaratively, and therefore is recommended. If you still want to keep your Replication Controllers and use `kubectl rolling-update`, keep reading:
|
||||
|
||||
A rolling update applies changes to the configuration of pods being managed by
|
||||
a replication controller. The changes can be passed as a new replication
|
||||
controller configuration file; or, if only updating the image, a new container
|
||||
image can be specified directly.
|
||||
|
||||
A rolling update works by:
|
||||
|
||||
1. Creating a new replication controller with the updated configuration.
|
||||
2. Increasing/decreasing the replica count on the new and old controllers until
|
||||
the correct number of replicas is reached.
|
||||
3. Deleting the original replication controller.
|
||||
|
||||
Rolling updates are initiated with the `kubectl rolling-update` command:
|
||||
|
||||
$ kubectl rolling-update NAME \
|
||||
([NEW_NAME] --image=IMAGE | -f FILE)
|
||||
|
||||
## Passing a configuration file
|
||||
|
||||
To initiate a rolling update using a configuration file, pass the new file to
|
||||
`kubectl rolling-update`:
|
||||
|
||||
$ kubectl rolling-update NAME -f FILE
|
||||
|
||||
The configuration file must:
|
||||
|
||||
* Specify a different `metadata.name` value.
|
||||
|
||||
* Overwrite at least one common label in its `spec.selector` field.
|
||||
|
||||
* Use the same `metadata.namespace`.
|
||||
|
||||
Replication controller configuration files are described in
|
||||
[Creating Replication Controllers](/docs/user-guide/replication-controller/operations/).
|
||||
|
||||
### Examples
|
||||
|
||||
// Update pods of frontend-v1 using new replication controller data in frontend-v2.json.
|
||||
$ kubectl rolling-update frontend-v1 -f frontend-v2.json
|
||||
|
||||
// Update pods of frontend-v1 using JSON data passed into stdin.
|
||||
$ cat frontend-v2.json | kubectl rolling-update frontend-v1 -f -
|
||||
|
||||
## Updating the container image
|
||||
|
||||
To update only the container image, pass a new image name and tag with the
|
||||
`--image` flag and (optionally) a new controller name:
|
||||
|
||||
$ kubectl rolling-update NAME [NEW_NAME] --image=IMAGE:TAG
|
||||
|
||||
The `--image` flag is only supported for single-container pods. Specifying
|
||||
`--image` with multi-container pods returns an error.
|
||||
|
||||
If no `NEW_NAME` is specified, a new replication controller is created with
|
||||
a temporary name. Once the rollout is complete, the old controller is deleted,
|
||||
and the new controller is updated to use the original name.
|
||||
|
||||
The update will fail if `IMAGE:TAG` is identical to the
|
||||
current value. For this reason, we recommend the use of versioned tags as
|
||||
opposed to values such as `:latest`. Doing a rolling update from `image:latest`
|
||||
to a new `image:latest` will fail, even if the image at that tag has changed.
|
||||
Moreover, the use of `:latest` is not recommended; see
|
||||
[Best Practices for Configuration](/docs/concepts/configuration/overview/#container-images) for more information.
|
||||
|
||||
### Examples
|
||||
|
||||
// Update the pods of frontend-v1 to frontend-v2
|
||||
$ kubectl rolling-update frontend-v1 frontend-v2 --image=image:v2
|
||||
|
||||
// Update the pods of frontend, keeping the replication controller name
|
||||
$ kubectl rolling-update frontend --image=image:v2
|
||||
|
||||
## Required and optional fields
|
||||
|
||||
Required fields are:
|
||||
|
||||
* `NAME`: The name of the replication controller to update.
|
||||
|
||||
as well as either:
|
||||
|
||||
* `-f FILE`: A replication controller configuration file, in either JSON or
|
||||
YAML format. The configuration file must specify a new `metadata.name` value
|
||||
and include at least one of the existing `spec.selector` key:value pairs.
|
||||
See the
|
||||
[Run Stateless AP Replication Controller](/docs/tutorials/stateless-application/run-stateless-ap-replication-controller/#replication-controller-configuration-file)
|
||||
page for details.
|
||||
or:
|
||||
* `--image IMAGE:TAG`: The name and tag of the image to update to. Must be
|
||||
different from the image:tag currently specified.
|
||||
|
||||
Optional fields are:
|
||||
|
||||
* `NEW_NAME`: Only used in conjunction with `--image` (not with `-f FILE`). The
|
||||
name to assign to the new replication controller.
|
||||
* `--poll-interval DURATION`: The time between polling the controller status
|
||||
after update. Valid units are `ns` (nanoseconds), `us` or `µs` (microseconds),
|
||||
`ms` (milliseconds), `s` (seconds), `m` (minutes), or `h` (hours). Units can
|
||||
be combined (e.g. `1m30s`). The default is `3s`.
|
||||
* `--timeout DURATION`: The maximum time to wait for the controller to update a
|
||||
pod before exiting. Default is `5m0s`. Valid units are as described for
|
||||
`--poll-interval` above.
|
||||
* `--update-period DURATION`: The time to wait between updating pods. Default
|
||||
is `1m0s`. Valid units are as described for `--poll-interval` above.
|
||||
|
||||
Additional information about the `kubectl rolling-update` command is available
|
||||
from the [`kubectl` reference](/docs/user-guide/kubectl/kubectl_rolling-update/).
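For example, a rolling update that polls the controller status every 5 seconds, waits 30 seconds between pod replacements, and gives up after 10 minutes might look like this sketch (it reuses the hypothetical `frontend-v2.json` from the earlier example):

```shell
$ kubectl rolling-update frontend-v1 -f frontend-v2.json \
    --poll-interval=5s --update-period=30s --timeout=10m
```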
|
||||
|
||||
## Walkthrough
|
||||
|
||||
Let's say you were running version 1.7.9 of nginx:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: ReplicationController
|
||||
metadata:
|
||||
name: my-nginx
|
||||
spec:
|
||||
replicas: 5
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: nginx
|
||||
spec:
|
||||
containers:
|
||||
- name: nginx
|
||||
image: nginx:1.7.9
|
||||
ports:
|
||||
- containerPort: 80
|
||||
```
|
||||
|
||||
To update to version 1.9.1, you can use [`kubectl rolling-update --image`](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/simple-rolling-update.md) to specify the new image:
|
||||
|
||||
```shell
|
||||
$ kubectl rolling-update my-nginx --image=nginx:1.9.1
|
||||
Created my-nginx-ccba8fbd8cc8160970f63f9a2696fc46
|
||||
```
|
||||
|
||||
In another window, you can see that `kubectl` added a `deployment` label to the pods, whose value is a hash of the configuration, to distinguish the new pods from the old:
|
||||
|
||||
```shell
|
||||
$ kubectl get pods -l app=nginx -L deployment
|
||||
NAME READY STATUS RESTARTS AGE DEPLOYMENT
|
||||
my-nginx-ccba8fbd8cc8160970f63f9a2696fc46-k156z 1/1 Running 0 1m ccba8fbd8cc8160970f63f9a2696fc46
|
||||
my-nginx-ccba8fbd8cc8160970f63f9a2696fc46-v95yh 1/1 Running 0 35s ccba8fbd8cc8160970f63f9a2696fc46
|
||||
my-nginx-divi2 1/1 Running 0 2h 2d1d7a8f682934a254002b56404b813e
|
||||
my-nginx-o0ef1 1/1 Running 0 2h 2d1d7a8f682934a254002b56404b813e
|
||||
my-nginx-q6all 1/1 Running 0 8m 2d1d7a8f682934a254002b56404b813e
|
||||
```
|
||||
|
||||
`kubectl rolling-update` reports progress as the update proceeds:
|
||||
|
||||
```
|
||||
Scaling up my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 from 0 to 3, scaling down my-nginx from 3 to 0 (keep 3 pods available, don't exceed 4 pods)
|
||||
Scaling my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 up to 1
|
||||
Scaling my-nginx down to 2
|
||||
Scaling my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 up to 2
|
||||
Scaling my-nginx down to 1
|
||||
Scaling my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 up to 3
|
||||
Scaling my-nginx down to 0
|
||||
Update succeeded. Deleting old controller: my-nginx
|
||||
Renaming my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 to my-nginx
|
||||
replicationcontroller "my-nginx" rolling updated
|
||||
```
|
||||
|
||||
If you encounter a problem, you can stop the rolling update midway and revert to the previous version using `--rollback`:
|
||||
|
||||
```shell
|
||||
$ kubectl rolling-update my-nginx --rollback
|
||||
Setting "my-nginx" replicas to 1
|
||||
Continuing update with existing controller my-nginx.
|
||||
Scaling up nginx from 1 to 1, scaling down my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
|
||||
Scaling my-nginx-ccba8fbd8cc8160970f63f9a2696fc46 down to 0
|
||||
Update succeeded. Deleting my-nginx-ccba8fbd8cc8160970f63f9a2696fc46
|
||||
replicationcontroller "my-nginx" rolling updated
|
||||
```
|
||||
|
||||
This is one example where the immutability of containers is a huge asset.
|
||||
|
||||
If you need to update more than just the image (e.g., command arguments, environment variables), you can create a new replication controller, with a new name and distinguishing label value, such as:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: ReplicationController
|
||||
metadata:
|
||||
name: my-nginx-v4
|
||||
spec:
|
||||
replicas: 5
|
||||
selector:
|
||||
app: nginx
|
||||
deployment: v4
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: nginx
|
||||
deployment: v4
|
||||
spec:
|
||||
containers:
|
||||
- name: nginx
|
||||
image: nginx:1.9.2
|
||||
args: ["nginx", "-T"]
|
||||
ports:
|
||||
- containerPort: 80
|
||||
```
|
||||
|
||||
and roll it out:
|
||||
|
||||
```shell
|
||||
$ kubectl rolling-update my-nginx -f ./nginx-rc.yaml
|
||||
Created my-nginx-v4
|
||||
Scaling up my-nginx-v4 from 0 to 5, scaling down my-nginx from 4 to 0 (keep 4 pods available, don't exceed 5 pods)
|
||||
Scaling my-nginx-v4 up to 1
|
||||
Scaling my-nginx down to 3
|
||||
Scaling my-nginx-v4 up to 2
|
||||
Scaling my-nginx down to 2
|
||||
Scaling my-nginx-v4 up to 3
|
||||
Scaling my-nginx down to 1
|
||||
Scaling my-nginx-v4 up to 4
|
||||
Scaling my-nginx down to 0
|
||||
Scaling my-nginx-v4 up to 5
|
||||
Update succeeded. Deleting old controller: my-nginx
|
||||
replicationcontroller "my-nginx-v4" rolling updated
|
||||
```
|
||||
|
||||
You can also run the [update demo](/docs/tasks/run-application/rolling-update-replication-controller/) to see a visual representation of the rolling update process.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
If the `timeout` duration is reached during a rolling update, the operation will
|
||||
fail with some pods belonging to the new replication controller, and some to the
|
||||
original controller.
|
||||
|
||||
To continue the update from where it failed, retry using the same command.
|
||||
|
||||
To roll back to the original state before the attempted update, append the
|
||||
`--rollback=true` flag to the original command. This will revert all changes.
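For example, if the walkthrough's configuration-file update above timed out partway through, you could either retry it or roll it back (a sketch, reusing the file and controller names from the walkthrough):

```shell
# Retry: the rolling update continues from its current state.
$ kubectl rolling-update my-nginx -f ./nginx-rc.yaml

# Or abandon the update and return to the original controller.
$ kubectl rolling-update my-nginx -f ./nginx-rc.yaml --rollback=true
```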
|
|
@ -0,0 +1,393 @@
|
|||
---
|
||||
assignees:
|
||||
- stclair
|
||||
title: AppArmor
|
||||
---
|
||||
|
||||
AppArmor is a Linux kernel enhancement that can reduce the potential attack surface of an
|
||||
application and provide greater defense in depth. Beta support for AppArmor was
|
||||
added in Kubernetes v1.4.
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
## What is AppArmor
|
||||
|
||||
AppArmor is a Linux kernel security module that supplements the standard Linux user and group based
|
||||
permissions to confine programs to a limited set of resources. AppArmor can be configured for any
|
||||
application to reduce its potential attack surface and provide greater defense in depth. It is
|
||||
configured through profiles tuned to whitelist the access needed by a specific program or container,
|
||||
such as Linux capabilities, network access, file permissions, etc. Each profile can be run in either
|
||||
enforcing mode, which blocks access to disallowed resources, or complain mode, which only reports
|
||||
violations.
|
||||
|
||||
AppArmor can help you to run a more secure deployment by restricting what containers are allowed to
|
||||
do, and/or by providing better auditing through system logs. However, it is important to keep in mind
|
||||
that AppArmor is not a silver bullet, and can only do so much to protect against exploits in your
|
||||
application code. It is important to provide good, restrictive profiles, and harden your
|
||||
applications and cluster from other angles as well.
|
||||
|
||||
AppArmor support in Kubernetes is currently in beta.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
1. **Kubernetes version is at least v1.4**. Kubernetes support for AppArmor was added in
|
||||
v1.4. Kubernetes components older than v1.4 are not aware of the new AppArmor annotations, and
|
||||
will **silently ignore** any AppArmor settings that are provided. To ensure that your Pods are
|
||||
receiving the expected protections, it is important to verify the Kubelet version of your nodes:
|
||||
|
||||
$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.kubeletVersion}\n{end}'
|
||||
gke-test-default-pool-239f5d02-gyn2: v1.4.0
|
||||
gke-test-default-pool-239f5d02-x1kf: v1.4.0
|
||||
gke-test-default-pool-239f5d02-xwux: v1.4.0
|
||||
|
||||
2. **AppArmor kernel module is enabled**. For the Linux kernel to enforce an AppArmor profile, the
|
||||
AppArmor kernel module must be installed and enabled. Several distributions enable the module by
|
||||
default, such as Ubuntu and SUSE, and many others provide optional support. To check whether the
|
||||
module is enabled, check the `/sys/module/apparmor/parameters/enabled` file:
|
||||
|
||||
$ cat /sys/module/apparmor/parameters/enabled
|
||||
Y
|
||||
|
||||
If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor
|
||||
options if the kernel module is not enabled.
|
||||
|
||||
*Note: Ubuntu carries many AppArmor patches that have not been merged into the upstream Linux
|
||||
kernel, including patches that add additional hooks and features. Kubernetes has only been
|
||||
tested with the upstream version, and does not promise support for other features.*
|
||||
|
||||
3. **Container runtime is Docker**. Currently the only Kubernetes-supported container runtime that
|
||||
also supports AppArmor is Docker. As more runtimes add AppArmor support, the options will be
|
||||
expanded. You can verify that your nodes are running Docker with:
|
||||
|
||||
$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {@.status.nodeInfo.containerRuntimeVersion}\n{end}'
|
||||
gke-test-default-pool-239f5d02-gyn2: docker://1.11.2
|
||||
gke-test-default-pool-239f5d02-x1kf: docker://1.11.2
|
||||
gke-test-default-pool-239f5d02-xwux: docker://1.11.2
|
||||
|
||||
If the Kubelet contains AppArmor support (>= v1.4), it will refuse to run a Pod with AppArmor
|
||||
options if the runtime is not Docker.
|
||||
|
||||
4. **Profile is loaded**. AppArmor is applied to a Pod by specifying an AppArmor profile that each
|
||||
container should be run with. If any of the specified profiles is not already loaded in the
|
||||
kernel, the Kubelet (>= v1.4) will reject the Pod. You can view which profiles are loaded on a
|
||||
node by checking the `/sys/kernel/security/apparmor/profiles` file. For example:
|
||||
|
||||
$ ssh gke-test-default-pool-239f5d02-gyn2 "sudo cat /sys/kernel/security/apparmor/profiles | sort"
|
||||
apparmor-test-deny-write (enforce)
|
||||
apparmor-test-audit-write (enforce)
|
||||
docker-default (enforce)
|
||||
k8s-nginx (enforce)
|
||||
|
||||
For more details on loading profiles on nodes, see
|
||||
[Setting up nodes with profiles](#setting-up-nodes-with-profiles).
|
||||
|
||||
As long as the Kubelet version includes AppArmor support (>= v1.4), the Kubelet will reject a Pod
|
||||
with AppArmor options if any of the prerequisites are not met. You can also verify AppArmor support
|
||||
on nodes by checking the node ready condition message (though this is likely to be removed in a
|
||||
later release):
|
||||
|
||||
$ kubectl get nodes -o=jsonpath=$'{range .items[*]}{@.metadata.name}: {.status.conditions[?(@.reason=="KubeletReady")].message}\n{end}'
|
||||
gke-test-default-pool-239f5d02-gyn2: kubelet is posting ready status. AppArmor enabled
|
||||
gke-test-default-pool-239f5d02-x1kf: kubelet is posting ready status. AppArmor enabled
|
||||
gke-test-default-pool-239f5d02-xwux: kubelet is posting ready status. AppArmor enabled
|
||||
|
||||
## Securing a Pod
|
||||
|
||||
*Note: AppArmor is currently in beta, so options are specified as annotations. Once support graduates to
|
||||
general availability, the annotations will be replaced with first-class fields (more details in
|
||||
[Upgrade path to GA](#upgrade-path-to-general-availability)).*
|
||||
|
||||
AppArmor profiles are specified *per-container*. To specify the AppArmor profile to run a Pod
|
||||
container with, add an annotation to the Pod's metadata:
|
||||
|
||||
container.apparmor.security.beta.kubernetes.io/<container_name>: <profile_ref>
|
||||
|
||||
Where `<container_name>` is the name of the container to apply the profile to, and `<profile_ref>`
|
||||
specifies the profile to apply. The `profile_ref` can be one of:
|
||||
|
||||
- `runtime/default` to apply the runtime's default profile.
|
||||
- `localhost/<profile_name>` to apply the profile loaded on the host with the name `<profile_name>`
|
||||
|
||||
See the [API Reference](#api-reference) for the full details on the annotation and profile name formats.
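For instance, a minimal sketch of a Pod that simply requests the runtime's default profile for its only container might look like this (the pod and container names are illustrative):

```shell
$ kubectl create -f /dev/stdin <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor-default
  annotations:
    # Ask for the container runtime's default AppArmor profile.
    container.apparmor.security.beta.kubernetes.io/hello: runtime/default
spec:
  containers:
  - name: hello
    image: busybox
    command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
EOF
```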
|
||||
|
||||
The Kubernetes AppArmor enforcement works by first checking that all the prerequisites have been
|
||||
met, and then forwarding the profile selection to the container runtime for enforcement. If the
|
||||
prerequisites have not been met, the Pod will be rejected, and will not run.
|
||||
|
||||
To verify that the profile was applied, look for the AppArmor security option listed in the container's Created event:
|
||||
|
||||
$ kubectl get events | grep Created
|
||||
22s 22s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet e2e-test-stclair-minion-group-31nt} Created container with docker id 269a53b202d3; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
|
||||
|
||||
You can also verify directly that the container's root process is running with the correct profile by checking its proc attr:
|
||||
|
||||
$ kubectl exec <pod_name> cat /proc/1/attr/current
|
||||
k8s-apparmor-example-deny-write (enforce)
|
||||
|
||||
## Example
|
||||
|
||||
In this example you'll see:
|
||||
|
||||
- One way to load a profile on a node
|
||||
- How to enforce the profile on a Pod
|
||||
- How to check that the profile is loaded
|
||||
- What happens when a profile is violated
|
||||
- What happens when a profile cannot be loaded
|
||||
|
||||
*This example assumes you have already set up a cluster with AppArmor support.*
|
||||
|
||||
First, we need to load the profile we want to use onto our nodes. The profile we'll use simply
|
||||
denies all file writes:
|
||||
|
||||
|
||||
{% include code.html language="text" file="deny-write.profile" ghlink="/docs/tutorials/clusters/deny-write.profile" %}
|
||||
|
||||
Since we don't know where the Pod will be scheduled, we'll need to load the profile on all our
|
||||
nodes. For this example we'll just use SSH to install the profiles, but other approaches are
|
||||
discussed in [Setting up nodes with profiles](#setting-up-nodes-with-profiles).
|
||||
|
||||
$ NODES=(
|
||||
# The SSH-accessible domain names of your nodes
|
||||
gke-test-default-pool-239f5d02-gyn2.us-central1-a.my-k8s
|
||||
gke-test-default-pool-239f5d02-x1kf.us-central1-a.my-k8s
|
||||
gke-test-default-pool-239f5d02-xwux.us-central1-a.my-k8s)
|
||||
$ for NODE in ${NODES[*]}; do ssh $NODE 'sudo apparmor_parser -q <<EOF
|
||||
#include <tunables/global>
|
||||
|
||||
profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
|
||||
#include <abstractions/base>
|
||||
|
||||
file,
|
||||
|
||||
# Deny all file writes.
|
||||
deny /** w,
|
||||
}
|
||||
EOF'
|
||||
done
|
||||
|
||||
Next, we'll run a simple "Hello AppArmor" pod with the deny-write profile:
|
||||
|
||||
{% include code.html language="yaml" file="hello-apparmor-pod.yaml" ghlink="/docs/tutorials/clusters/hello-apparmor-pod.yaml" %}
|
||||
|
||||
$ kubectl create -f /dev/stdin <<EOF
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: hello-apparmor
|
||||
annotations:
|
||||
container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
|
||||
spec:
|
||||
containers:
|
||||
- name: hello
|
||||
image: busybox
|
||||
command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
|
||||
EOF
|
||||
pod "hello-apparmor" created
|
||||
|
||||
If we look at the pod events, we can see that the Pod container was created with the AppArmor
|
||||
profile "k8s-apparmor-example-deny-write":
|
||||
|
||||
$ kubectl get events | grep hello-apparmor
|
||||
14s 14s 1 hello-apparmor Pod Normal Scheduled {default-scheduler } Successfully assigned hello-apparmor to gke-test-default-pool-239f5d02-gyn2
|
||||
14s 14s 1 hello-apparmor Pod spec.containers{hello} Normal Pulling {kubelet gke-test-default-pool-239f5d02-gyn2} pulling image "busybox"
|
||||
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Pulled {kubelet gke-test-default-pool-239f5d02-gyn2} Successfully pulled image "busybox"
|
||||
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Created {kubelet gke-test-default-pool-239f5d02-gyn2} Created container with docker id 06b6cd1c0989; Security:[seccomp=unconfined apparmor=k8s-apparmor-example-deny-write]
|
||||
13s 13s 1 hello-apparmor Pod spec.containers{hello} Normal Started {kubelet gke-test-default-pool-239f5d02-gyn2} Started container with docker id 06b6cd1c0989
|
||||
|
||||
We can verify that the container is actually running with that profile by checking its proc attr:
|
||||
|
||||
$ kubectl exec hello-apparmor cat /proc/1/attr/current
|
||||
k8s-apparmor-example-deny-write (enforce)
|
||||
|
||||
Finally, we can see what happens if we try to violate the profile by writing to a file:
|
||||
|
||||
$ kubectl exec hello-apparmor touch /tmp/test
|
||||
touch: /tmp/test: Permission denied
|
||||
error: error executing remote command: command terminated with non-zero exit code: Error executing in Docker Container: 1
|
||||
|
||||
To wrap up, let's look at what happens if we try to specify a profile that hasn't been loaded:
|
||||
|
||||
$ kubectl create -f /dev/stdin <<EOF
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: hello-apparmor-2
|
||||
annotations:
|
||||
container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-allow-write
|
||||
spec:
|
||||
containers:
|
||||
- name: hello
|
||||
image: busybox
|
||||
command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
|
||||
EOF
|
||||
pod "hello-apparmor-2" created
|
||||
|
||||
$ kubectl describe pod hello-apparmor-2
|
||||
Name: hello-apparmor-2
|
||||
Namespace: default
|
||||
Node: gke-test-default-pool-239f5d02-x1kf/
|
||||
Start Time: Tue, 30 Aug 2016 17:58:56 -0700
|
||||
Labels: <none>
|
||||
Status: Failed
|
||||
Reason: AppArmor
|
||||
Message: Pod Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
|
||||
IP:
|
||||
Controllers: <none>
|
||||
Containers:
|
||||
hello:
|
||||
Image: busybox
|
||||
Port:
|
||||
Command:
|
||||
sh
|
||||
-c
|
||||
echo 'Hello AppArmor!' && sleep 1h
|
||||
Requests:
|
||||
cpu: 100m
|
||||
Environment Variables: <none>
|
||||
Volumes:
|
||||
default-token-dnz7v:
|
||||
Type: Secret (a volume populated by a Secret)
|
||||
SecretName: default-token-dnz7v
|
||||
QoS Tier: Burstable
|
||||
Events:
|
||||
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
|
||||
--------- -------- ----- ---- ------------- -------- ------ -------
|
||||
23s 23s 1 {default-scheduler } Normal Scheduled Successfully assigned hello-apparmor-2 to e2e-test-stclair-minion-group-t1f5
|
||||
23s 23s 1 {kubelet e2e-test-stclair-minion-group-t1f5} Warning AppArmor Cannot enforce AppArmor: profile "k8s-apparmor-example-allow-write" is not loaded
|
||||
|
||||
Note the pod status is Failed, with a helpful error message: `Pod Cannot enforce AppArmor: profile
|
||||
"k8s-apparmor-example-allow-write" is not loaded`. An event was also recorded with the same message.
|
||||
|
||||
## Administration
|
||||
|
||||
### Setting up nodes with profiles
|
||||
|
||||
Kubernetes does not currently provide any native mechanisms for loading AppArmor profiles onto
|
||||
nodes. However, there are many ways to set up the profiles, such as:
|
||||
|
||||
- Through a [DaemonSet](../daemons/) that runs a Pod on each node to
|
||||
ensure the correct profiles are loaded. An example implementation can be found
|
||||
[here](https://github.com/kubernetes/contrib/tree/master/apparmor/loader).
|
||||
- At node initialization time, using your node initialization scripts (e.g. Salt, Ansible, etc.) or
|
||||
image.
|
||||
- By copying the profiles to each node and loading them through SSH, as demonstrated in the
|
||||
[Example](#example).
|
||||
|
||||
The scheduler is not aware of which profiles are loaded onto which node, so the full set of profiles
|
||||
must be loaded onto every node. An alternative approach is to add a node label for each profile (or
|
||||
class of profiles) on the node, and use a
|
||||
[node selector](../../user-guide/node-selection/) to ensure the Pod is run on a
|
||||
node with the required profile.
|
||||
|
||||
### Restricting profiles with the PodSecurityPolicy
|
||||
|
||||
If the PodSecurityPolicy extension is enabled, cluster-wide AppArmor restrictions can be applied. To
|
||||
enable the PodSecurityPolicy, two flags must be set on the `apiserver`:
|
||||
|
||||
--admission-control=PodSecurityPolicy[,others...]
|
||||
--runtime-config=extensions/v1beta1/podsecuritypolicy[,others...]
|
||||
|
||||
With the extension enabled, the AppArmor options can be specified as annotations on the PodSecurityPolicy:
|
||||
|
||||
apparmor.security.beta.kubernetes.io/defaultProfileName: <profile_ref>
|
||||
apparmor.security.beta.kubernetes.io/allowedProfileNames: <profile_ref>[,others...]
|
||||
|
||||
The default profile name option specifies the profile to apply to containers by default when none is
|
||||
specified. The allowed profile names option specifies a list of profiles that Pod containers are
|
||||
allowed to be run with. If both options are provided, the default must be allowed. The profiles are
|
||||
specified in the same format as on containers. See the [API Reference](#api-reference) for the full
|
||||
specification.
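With those flags set, a PodSecurityPolicy that defaults containers to the runtime profile and additionally allows the deny-write example profile might look roughly like the following sketch (the policy name and the non-AppArmor fields are illustrative, not a prescribed configuration):

```shell
$ kubectl create -f /dev/stdin <<EOF
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: apparmor-example
  annotations:
    apparmor.security.beta.kubernetes.io/defaultProfileName: runtime/default
    apparmor.security.beta.kubernetes.io/allowedProfileNames: runtime/default,localhost/k8s-apparmor-example-deny-write
spec:
  # The non-AppArmor fields below are only here to make the object valid;
  # adjust them to match your own policy.
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
EOF
```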
|
||||
|
||||
### Disabling AppArmor
|
||||
|
||||
If you do not want AppArmor to be available on your cluster, it can be disabled by a command-line flag:
|
||||
|
||||
--feature-gates=AppArmor=false
|
||||
|
||||
When disabled, any Pod that includes an AppArmor profile will fail validation with a "Forbidden"
|
||||
error. Note that by default Docker always enables the "docker-default" profile on non-privileged
|
||||
pods (if the AppArmor kernel module is enabled), and will continue to do so even if the feature-gate
|
||||
is disabled. The option to disable AppArmor will be removed when AppArmor graduates to general
|
||||
availability (GA).
|
||||
|
||||
### Upgrading to Kubernetes v1.4 with AppArmor
|
||||
|
||||
No action is required with respect to AppArmor to upgrade your cluster to v1.4. However, if any
|
||||
existing pods had an AppArmor annotation, they will not go through validation (or PodSecurityPolicy
|
||||
admission). If permissive profiles are loaded on the nodes, a malicious user could pre-apply a
|
||||
permissive profile to escalate the pod privileges above the docker-default. If this is a concern, it
|
||||
is recommended to scrub the cluster of any pods containing an annotation with
|
||||
`apparmor.security.beta.kubernetes.io`.
|
||||
|
||||
### Upgrade path to General Availability
|
||||
|
||||
When AppArmor is ready to be graduated to general availability (GA), the options currently specified
|
||||
through annotations will be converted to fields. Supporting all the upgrade and downgrade paths
|
||||
through the transition is very nuanced, and will be explained in detail when the transition
|
||||
occurs. We will commit to supporting both fields and annotations for at least 2 releases, and will
|
||||
explicitly reject the annotations for at least 2 releases after that.
|
||||
|
||||
## Authoring Profiles
|
||||
|
||||
Getting AppArmor profiles specified correctly can be a tricky business. Fortunately there are some
|
||||
tools to help with that:
|
||||
|
||||
- `aa-genprof` and `aa-logprof` generate profile rules by monitoring an application's activity and
|
||||
logs, and admitting the actions it takes. Further instructions are provided by the
|
||||
[AppArmor documentation](http://wiki.apparmor.net/index.php/Profiling_with_tools).
|
||||
- [bane](https://github.com/jfrazelle/bane) is an AppArmor profile generator for Docker that uses a
|
||||
simplified profile language.
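For example, to start building a profile interactively with `aa-genprof` (the binary path here is only a placeholder):

```shell
$ sudo aa-genprof /usr/bin/myapp
```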
|
||||
|
||||
It is recommended to run your application through Docker on a development workstation to generate
|
||||
the profiles, but nothing prevents you from running the tools on the Kubernetes node where your
|
||||
Pod is running.
|
||||
|
||||
To debug problems with AppArmor, you can check the system logs to see what, specifically, was
|
||||
denied. AppArmor logs verbose messages to `dmesg`, and errors can usually be found in the system
|
||||
logs or through `journalctl`. More information is provided in
|
||||
[AppArmor failures](http://wiki.apparmor.net/index.php/AppArmor_Failures).
|
||||
|
||||
Additional resources:
|
||||
|
||||
- [Quick guide to the AppArmor profile language](http://wiki.apparmor.net/index.php/QuickProfileLanguage)
|
||||
- [AppArmor core policy reference](http://wiki.apparmor.net/index.php/ProfileLanguage)
|
||||
|
||||
## API Reference
|
||||
|
||||
**Pod Annotation**:
|
||||
|
||||
Specifying the profile a container will run with:
|
||||
|
||||
- **key**: `container.apparmor.security.beta.kubernetes.io/<container_name>`
|
||||
Where `<container_name>` matches the name of a container in the Pod.
|
||||
A separate profile can be specified for each container in the Pod.
|
||||
- **value**: a profile reference, described below
|
||||
|
||||
**Profile Reference**:
|
||||
|
||||
- `runtime/default`: Refers to the default runtime profile.
|
||||
- Equivalent to not specifying a profile (without a PodSecurityPolicy default), except it still
|
||||
requires AppArmor to be enabled.
|
||||
- For Docker, this resolves to the
|
||||
[`docker-default`](https://docs.docker.com/engine/security/apparmor/) profile for non-privileged
|
||||
containers, and unconfined (no profile) for privileged containers.
|
||||
- `localhost/<profile_name>`: Refers to a profile loaded on the node (localhost) by name.
|
||||
- The possible profile names are detailed in the
|
||||
[core policy reference](http://wiki.apparmor.net/index.php/AppArmor_Core_Policy_Reference#Profile_names_and_attachment_specifications)
|
||||
|
||||
Any other profile reference format is invalid.
|
||||
|
||||
**PodSecurityPolicy Annotations**
|
||||
|
||||
Specifying the default profile to apply to containers when none is provided:
|
||||
|
||||
- **key**: `apparmor.security.beta.kubernetes.io/defaultProfileName`
|
||||
- **value**: a profile reference, described above
|
||||
|
||||
Specifying the list of profiles that Pod containers are allowed to specify:
|
||||
|
||||
- **key**: `apparmor.security.beta.kubernetes.io/allowedProfileNames`
|
||||
- **value**: a comma-separated list of profile references (described above)
|
||||
- Although an escaped comma is a legal character in a profile name, it cannot be explicitly
|
||||
allowed here
|
|
@ -0,0 +1,10 @@
|
|||
#include <tunables/global>
|
||||
|
||||
profile k8s-apparmor-example-deny-write flags=(attach_disconnected) {
|
||||
#include <abstractions/base>
|
||||
|
||||
file,
|
||||
|
||||
# Deny all file writes.
|
||||
deny /** w,
|
||||
}
|
|
@ -0,0 +1,13 @@
|
|||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: hello-apparmor
|
||||
annotations:
|
||||
# Tell Kubernetes to apply the AppArmor profile "k8s-apparmor-example-deny-write".
|
||||
# Note that this is ignored if the Kubernetes node is not running version 1.4 or greater.
|
||||
container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write
|
||||
spec:
|
||||
containers:
|
||||
- name: hello
|
||||
image: busybox
|
||||
command: [ "sh", "-c", "echo 'Hello AppArmor!' && sleep 1h" ]
|
|
@ -0,0 +1,208 @@
|
|||
---
|
||||
assignees:
|
||||
- madhusudancs
|
||||
title: Setting up Cluster Federation with Kubefed
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
Kubernetes version 1.5 includes a new command line tool called
|
||||
`kubefed` to help you administer your federated clusters.
|
||||
`kubefed` helps you to deploy a new Kubernetes cluster federation
|
||||
control plane, and to add clusters to or remove clusters from an
|
||||
existing federation control plane.
|
||||
|
||||
This guide explains how to administer a Kubernetes Cluster Federation
|
||||
using `kubefed`.
|
||||
|
||||
> Note: `kubefed` is an alpha feature in Kubernetes 1.5.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
This guide assumes that you have a running Kubernetes cluster. Please
|
||||
see one of the [getting started](/docs/getting-started-guides/) guides
|
||||
for installation instructions for your platform.
|
||||
|
||||
|
||||
## Getting `kubefed`
|
||||
|
||||
Download the client tarball corresponding to Kubernetes version 1.5
|
||||
or later
|
||||
[from the release page](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md),
|
||||
extract the binaries in the tarball to one of the directories
|
||||
in your `$PATH` and set the executable permission on those binaries.
|
||||
|
||||
Note: The URL in the curl command below downloads the binaries for
|
||||
Linux amd64. If you are on a different platform, please use the URL
|
||||
for the binaries appropriate for your platform. You can find the list
|
||||
of available binaries on the [release page](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG.md#client-binaries-1).
|
||||
|
||||
|
||||
```shell
|
||||
curl -O https://storage.googleapis.com/kubernetes-release/release/v1.5.2/kubernetes-client-linux-amd64.tar.gz
|
||||
tar -xzvf kubernetes-client-linux-amd64.tar.gz
|
||||
sudo cp kubernetes/client/bin/kubefed /usr/local/bin
|
||||
sudo chmod +x /usr/local/bin/kubefed
|
||||
sudo cp kubernetes/client/bin/kubectl /usr/local/bin
|
||||
sudo chmod +x /usr/local/bin/kubectl
|
||||
```
|
||||
|
||||
|
||||
## Choosing a host cluster
|
||||
|
||||
You'll need to choose one of your Kubernetes clusters to be the
|
||||
*host cluster*. The host cluster hosts the components that make up
|
||||
your federation control plane. Ensure that you have a `kubeconfig`
|
||||
entry in your local `kubeconfig` that corresponds to the host cluster.
|
||||
You can verify that you have the required `kubeconfig` entry by
|
||||
running:
|
||||
|
||||
```shell
|
||||
kubectl config get-contexts
|
||||
```
|
||||
|
||||
The output should contain an entry corresponding to your host cluster,
|
||||
similar to the following:
|
||||
|
||||
```
|
||||
CURRENT NAME CLUSTER AUTHINFO NAMESPACE
|
||||
gke_myproject_asia-east1-b_gce-asia-east1 gke_myproject_asia-east1-b_gce-asia-east1 gke_myproject_asia-east1-b_gce-asia-east1
|
||||
```
|
||||
|
||||
|
||||
You'll need to provide the `kubeconfig` context (shown as `NAME` in the
|
||||
entry above) for your host cluster when you deploy your federation
|
||||
control plane.
|
||||
|
||||
|
||||
## Deploying a federation control plane
|
||||
|
||||
To deploy a federation control plane on your host cluster, run the
|
||||
`kubefed init` command. When you use `kubefed init`, you must provide
|
||||
the following:
|
||||
|
||||
* Federation name
|
||||
* `--host-cluster-context`, the `kubeconfig` context for the host cluster
|
||||
* `--dns-zone-name`, a domain name suffix for your federated services
|
||||
|
||||
The following example command deploys a federation control plane with
|
||||
the name `fellowship`, a host cluster context `rivendell`, and the
|
||||
domain suffix `example.com`:
|
||||
|
||||
```shell
|
||||
kubefed init fellowship --host-cluster-context=rivendell --dns-zone-name="example.com"
|
||||
```
|
||||
|
||||
The domain suffix specified in `--dns-zone-name` must be an existing
|
||||
domain that you control, and that is programmable by your DNS provider.
|
||||
|
||||
`kubefed init` sets up the federation control plane in the host
|
||||
cluster and also adds an entry for the federation API server in your
|
||||
local kubeconfig. Note that in the alpha release in Kubernetes 1.5,
|
||||
`kubefed init` does not automatically set the current context to the
|
||||
newly deployed federation. You can set the current context manually by
|
||||
running:
|
||||
|
||||
```shell
|
||||
kubectl config use-context fellowship
|
||||
```
|
||||
|
||||
where `fellowship` is the name of your federation.
|
||||
|
||||
|
||||
## Adding a cluster to a federation
|
||||
|
||||
Once you've deployed a federation control plane, you'll need to make
|
||||
that control plane aware of the clusters it should manage. You can add
|
||||
a cluster to your federation by using the `kubefed join` command.
|
||||
|
||||
To use `kubefed join`, you'll need to provide the name of the cluster
|
||||
you want to add to the federation, and the `--host-cluster-context`
|
||||
for the federation control plane's host cluster.
|
||||
|
||||
The following example command adds the cluster `gondor` to the
|
||||
federation with host cluster `rivendell`:
|
||||
|
||||
```
|
||||
kubefed join gondor --host-cluster-context=rivendell
|
||||
```
|
||||
|
||||
> Note: Kubernetes requires that you manually join clusters to a
|
||||
federation because the federation control plane manages only those
|
||||
clusters that it is responsible for managing. Adding a cluster tells
|
||||
the federation control plane that it is responsible for managing that
|
||||
cluster.
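To confirm that the join succeeded, you can list the clusters known to the federation control plane; for example, using the `fellowship` federation context created earlier, the following should show the newly joined cluster:

```shell
kubectl --context=fellowship get clusters
```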
|
||||
|
||||
### Naming rules and customization
|
||||
|
||||
The cluster name you supply to `kubefed join` must be a valid RFC 1035
|
||||
label.
|
||||
|
||||
Furthermore, the federation control plane requires credentials for the
|
||||
joined clusters to operate on them. These credentials are obtained
|
||||
from the local kubeconfig. `kubefed join` uses the cluster name
|
||||
specified as the argument to look for the cluster's context in the
|
||||
local kubeconfig. If it fails to find a matching context, it exits
|
||||
with an error.
|
||||
|
||||
This might cause issues in cases where context names for each cluster
|
||||
in the federation don't follow
|
||||
[RFC 1035](https://www.ietf.org/rfc/rfc1035.txt) label naming rules.
|
||||
In such cases, you can specify a cluster name that conforms to the
|
||||
[RFC 1035](https://www.ietf.org/rfc/rfc1035.txt) label naming rules
|
||||
and specify the cluster context using the `--cluster-context` flag.
|
||||
For example, if the context of the cluster you are joining is
|
||||
`gondor_needs-no_king`, then you can join the cluster by running:
|
||||
|
||||
```shell
|
||||
kubefed join gondor --host-cluster-context=rivendell --cluster-context=gondor_needs-no_king
|
||||
```
|
||||
|
||||
#### Secret name
|
||||
|
||||
Cluster credentials required by the federation control plane as
|
||||
described above are stored as a secret in the host cluster. The name
|
||||
of the secret is also derived from the cluster name.
|
||||
|
||||
However, the name of a secret object in Kubernetes should conform
|
||||
to the DNS subdomain name specification described in
|
||||
[RFC 1123](https://tools.ietf.org/html/rfc1123). If this isn't the
|
||||
case, you can pass the secret name to `kubefed join` using the
|
||||
`--secret-name` flag. For example, if the cluster name is `noldor` and
|
||||
the secret name is `11kingdom`, you can join the cluster by
|
||||
running:
|
||||
|
||||
```shell
|
||||
kubefed join noldor --host-cluster-context=rivendell --secret-name=11kingdom
|
||||
```
|
||||
|
||||
Note: If your cluster name does not conform to the DNS subdomain name
|
||||
specification, all you need to do is supply the secret name via the
|
||||
`--secret-name` flag. `kubefed join` automatically creates the secret
|
||||
for you.
|
||||
|
||||
|
||||
## Removing a cluster from a federation
|
||||
|
||||
To remove a cluster from a federation, run the `kubefed unjoin`
|
||||
command with the cluster name and the federation's
|
||||
`--host-cluster-context`:
|
||||
|
||||
```
|
||||
kubefed unjoin gondor --host-cluster-context=rivendell
|
||||
```
|
||||
|
||||
|
||||
## Turning down the federation control plane
|
||||
|
||||
Proper cleanup of the federation control plane is not fully implemented in
|
||||
this alpha release of `kubefed`. However, for the time being, deleting
|
||||
the federation system namespace should remove all the resources except
|
||||
the persistent storage volume dynamically provisioned for the
|
||||
federation control plane's etcd. You can delete the federation
|
||||
namespace by running the following command:
|
||||
|
||||
```
|
||||
$ kubectl delete ns federation-system
|
||||
```
|
|
@ -1,5 +1,8 @@
|
|||
---
|
||||
title: Declarative Management of Kubernetes Objects Using Configuration Files
|
||||
redirect_from:
|
||||
- "/docs/concepts/tools/kubectl/object-management-using-declarative-config/"
|
||||
- "/docs/concepts/tools/kubectl/object-management-using-declarative-config.html"
|
||||
---
|
||||
|
||||
{% capture overview %}
|
||||
|
@ -949,8 +952,8 @@ The recommended approach for ThirdPartyResources is to use [imperative object co
|
|||
{% endcapture %}
|
||||
|
||||
{% capture whatsnext %}
|
||||
- [Managing Kubernetes Objects Using Imperative Commands](/docs/concepts/tools/kubectl/object-management-using-imperative-commands/)
|
||||
- [Imperative Management of Kubernetes Objects Using Configuration Files](/docs/concepts/tools/kubectl/object-management-using-imperative-config/)
|
||||
- [Managing Kubernetes Objects Using Imperative Commands](/docs/tutorials/object-management-kubectl/imperative-object-management-command/)
|
||||
- [Imperative Management of Kubernetes Objects Using Configuration Files](/docs/tutorials/object-management-kubectl/imperative-object-management-configuration/)
|
||||
- [Kubectl Command Reference](/docs/user-guide/kubectl/v1.5/)
|
||||
- [Kubernetes Object Schema Reference](/docs/resources-reference/v1.5/)
|
||||
{% endcapture %}
|
|
@ -1,5 +1,8 @@
|
|||
---
|
||||
title: Managing Kubernetes Objects Using Imperative Commands
|
||||
redirect_from:
|
||||
- "/docs/concepts/tools/kubectl/object-management-using-imperative-commands/"
|
||||
- "/docs/concepts/tools/kubectl/object-management-using-imperative-commands.html"
|
||||
---
|
||||
|
||||
{% capture overview %}
|
||||
|
@ -150,8 +153,8 @@ kubectl create --edit -f /tmp/srv.yaml
|
|||
{% endcapture %}
|
||||
|
||||
{% capture whatsnext %}
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Imperative)](/docs/concepts/tools/kubectl/object-management-using-imperative-config/)
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/concepts/tools/kubectl/object-management-using-declarative-config/)
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Imperative)](/docs/tutorials/object-management-kubectl/imperative-object-management-configuration/)
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/tutorials/object-management-kubectl/declarative-object-management-configuration/)
|
||||
- [Kubectl Command Reference](/docs/user-guide/kubectl/v1.5/)
|
||||
- [Kubernetes Object Schema Reference](/docs/resources-reference/v1.5/)
|
||||
{% endcapture %}
|
|
@ -1,5 +1,8 @@
|
|||
---
|
||||
title: Imperative Management of Kubernetes Objects Using Configuration Files
|
||||
redirect_from:
|
||||
- "/docs/concepts/tools/kubectl/object-management-using-imperative-config/"
|
||||
- "/docs/concepts/tools/kubectl/object-management-using-imperative-config.html"
|
||||
---
|
||||
|
||||
{% capture overview %}
|
||||
|
@ -120,8 +123,8 @@ template:
|
|||
{% endcapture %}
|
||||
|
||||
{% capture whatsnext %}
|
||||
- [Managing Kubernetes Objects Using Imperative Commands](/docs/concepts/tools/kubectl/object-management-using-imperative-commands/)
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/concepts/tools/kubectl/object-management-using-declarative-config/)
|
||||
- [Managing Kubernetes Objects Using Imperative Commands](/docs/tutorials/object-management-kubectl/imperative-object-management-command/)
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/tutorials/object-management-kubectl/declarative-object-management-configuration/)
|
||||
- [Kubectl Command Reference](/docs/user-guide/kubectl/v1.5/)
|
||||
- [Kubernetes Object Schema Reference](/docs/resources-reference/v1.5/)
|
||||
{% endcapture %}
|
|
@ -1,5 +1,8 @@
|
|||
---
|
||||
title: Kubernetes Object Management
|
||||
redirect_from:
|
||||
- "/docs/concepts/tools/kubectl/object-management-overview/"
|
||||
- "/docs/concepts/tools/kubectl/object-management-overview.html"
|
||||
---
|
||||
|
||||
{% capture overview %}
|
||||
|
@ -162,9 +165,9 @@ Disadvantages compared to imperative object configuration:
|
|||
{% endcapture %}
|
||||
|
||||
{% capture whatsnext %}
|
||||
- [Managing Kubernetes Objects Using Imperative Commands](/docs/concepts/tools/kubectl/object-management-using-imperative-commands/)
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Imperative)](/docs/concepts/tools/kubectl/object-management-using-imperative-config/)
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/concepts/tools/kubectl/object-management-using-declarative-config/)
|
||||
- [Managing Kubernetes Objects Using Imperative Commands](/docs/tutorials/object-management-kubectl/imperative-object-management-command/)
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Imperative)](/docs/tutorials/object-management-kubectl/imperative-object-management-configuration/)
|
||||
- [Managing Kubernetes Objects Using Object Configuration (Declarative)](/docs/tutorials/object-management-kubectl/declarative-object-management-configuration/)
|
||||
- [Kubectl Command Reference](/docs/user-guide/kubectl/v1.5/)
|
||||
- [Kubernetes Object Schema Reference](/docs/resources-reference/v1.5/)
|
||||
|
|
@ -0,0 +1,258 @@
|
|||
---
|
||||
assignees:
|
||||
- bprashanth
|
||||
title: Run Stateless AP Replication Controller
|
||||
---
|
||||
|
||||
* TOC
|
||||
{:toc}
|
||||
|
||||
A replication controller ensures that a specified number of pod "replicas" are
|
||||
running at any one time. If there are too many, it will kill some. If there are
|
||||
too few, it will start more.
|
||||
|
||||
## Creating a replication controller
|
||||
|
||||
Replication controllers are created with `kubectl create`:
|
||||
|
||||
```shell
|
||||
$ kubectl create -f FILE
|
||||
```
|
||||
|
||||
Where:
|
||||
|
||||
* `-f FILE` or `--filename FILE` is a relative path to a
|
||||
[configuration file](#replication_controller_configuration_file) in
|
||||
either JSON or YAML format.
|
||||
|
||||
You can use the [sample file](#sample_file) below to try a create request.
|
||||
|
||||
A successful create request returns the name of the replication controller. To
|
||||
view more details about the controller, see
|
||||
[Viewing replication controllers](#viewing_replication_controllers) below.
|
||||
|
||||
### Replication controller configuration file
|
||||
|
||||
When creating a replication controller, you must point to a configuration file
|
||||
as the value of the `-f` flag. The configuration
|
||||
file can be formatted as YAML or as JSON, and supports the following fields:
|
||||
|
||||
```json
|
||||
{
|
||||
"apiVersion": "v1",
|
||||
"kind": "ReplicationController",
|
||||
"metadata": {
|
||||
"name": "",
|
||||
"labels": "",
|
||||
"namespace": ""
|
||||
},
|
||||
"spec": {
|
||||
"replicas": int,
|
||||
"selector": {
|
||||
"":""
|
||||
},
|
||||
"template": {
|
||||
"metadata": {
|
||||
"labels": {
|
||||
"":""
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
// See 'The spec schema' below
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Required fields are:
|
||||
|
||||
* `kind`: Always `ReplicationController`.
|
||||
* `apiVersion`: Currently `v1`.
|
||||
* `metadata`: An object containing:
|
||||
* `name`: Required if `generateName` is not specified. The name of this
|
||||
replication controller. It must be an
|
||||
[RFC1035](https://www.ietf.org/rfc/rfc1035.txt) compatible value and be
|
||||
unique within the namespace.
|
||||
* `labels`: Optional. Labels are arbitrary key:value pairs that can be used
|
||||
for grouping and targeting by other resources and services.
|
||||
* `generateName`: Required if `name` is not set. A prefix to use to generate
|
||||
a unique name. Has the same validation rules as `name`.
|
||||
* `namespace`: Optional. The namespace of the replication controller.
|
||||
* `annotations`: Optional. A map of string keys and values that can be used
|
||||
by external tooling to store and retrieve arbitrary metadata about
|
||||
objects.
|
||||
* `spec`: The configuration for this replication controller. It must
|
||||
contain:
|
||||
* `replicas`: The number of pods to create and maintain.
|
||||
* `selector`: A map of key:value pairs assigned to the set of pods that
|
||||
this replication controller is responsible for managing. **This must**
|
||||
**match the key:value pairs in the `template`'s `labels` field**.
|
||||
* `template` contains:
|
||||
* A `metadata` object with `labels` for the pod.
|
||||
* The [`spec` schema](#the_spec_schema) that defines the pod
|
||||
configuration.
|
||||
|
||||
### The `spec` schema
|
||||
|
||||
The `spec` schema (which is a child of `template`) is described in the locations
|
||||
below:
|
||||
|
||||
* The [`spec` schema](/docs/user-guide/pods/multi-container/#the_spec_schema)
|
||||
section of the Creating Multi-Container Pods page covers required and
|
||||
frequently-used fields.
|
||||
* The entire `spec` schema is documented in the
|
||||
[Kubernetes API reference](/docs/api-reference/v1/definitions/#_v1_podspec).
|
||||
|
||||
### Sample file
|
||||
|
||||
The following sample file creates 2 pods, each containing a single container
|
||||
using the `redis` image. Port 80 on each container is opened. The replication
|
||||
controller carries the label `state: serving`. The pods are given the label
|
||||
`app: frontend`, and the `selector` is set to `app: frontend` to indicate that the
|
||||
controller should manage pods with that label.
|
||||
|
||||
```json
|
||||
{
|
||||
"kind": "ReplicationController",
|
||||
"apiVersion": "v1",
|
||||
"metadata": {
|
||||
"name": "frontend-controller",
|
||||
"labels": {
|
||||
"state": "serving"
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
"replicas": 2,
|
||||
"selector": {
|
||||
"app": "frontend"
|
||||
},
|
||||
"template": {
|
||||
"metadata": {
|
||||
"labels": {
|
||||
"app": "frontend"
|
||||
}
|
||||
},
|
||||
"spec": {
|
||||
"volumes": null,
|
||||
"containers": [
|
||||
{
|
||||
"name": "php-redis",
|
||||
"image": "redis",
|
||||
"ports": [
|
||||
{
|
||||
"containerPort": 80,
|
||||
"protocol": "TCP"
|
||||
}
|
||||
],
|
||||
"imagePullPolicy": "IfNotPresent"
|
||||
}
|
||||
],
|
||||
"restartPolicy": "Always",
|
||||
"dnsPolicy": "ClusterFirst"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Updating replication controller pods
|
||||
|
||||
See [Rolling Updates](/docs/tasks/run-application/rolling-update-replication-controller/).
|
||||
|
||||
## Resizing a replication controller
|
||||
|
||||
To increase or decrease the number of pods under a replication controller's
|
||||
control, use the `kubectl scale` command:
|
||||
|
||||
$ kubectl scale rc NAME --replicas=COUNT \
|
||||
[--current-replicas=COUNT] \
|
||||
[--resource-version=VERSION]
|
||||
|
||||
Tip: You can use the `rc` alias in your commands in place of
|
||||
`replicationcontroller`.
|
||||
|
||||
Required fields are:
|
||||
|
||||
* `NAME`: The name of the replication controller to update.
|
||||
* `--replicas=COUNT`: The desired number of replicas.
|
||||
|
||||
Optional fields are:
|
||||
|
||||
* `--current-replicas=COUNT`: A precondition for current size. If specified,
|
||||
the resize will only take place if the current number of replicas matches
|
||||
this value.
|
||||
* `--resource-version=VERSION`: A precondition for resource version. If
|
||||
specified, the resize will only take place if the current replication
|
||||
controller version matches this value. Versions are specified in the
|
||||
`labels` field of the replication controller's configuration file, as a
|
||||
key:value pair with a key of `version`. For example,
|
||||
`--resource-version=test` matches:
|
||||
|
||||
"labels": {
|
||||
"version": "test"
|
||||
}
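For example, to grow the sample `frontend-controller` defined above from 2 to 3 replicas, but only if it currently has exactly 2:

```shell
$ kubectl scale rc frontend-controller --replicas=3 --current-replicas=2
```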
|
||||
|
||||
## Viewing replication controllers
|
||||
|
||||
To list replication controllers on a cluster, use the `kubectl get` command:
|
||||
|
||||
```shell
|
||||
$ kubectl get rc
|
||||
```
|
||||
|
||||
A successful get command returns all replication controllers on the cluster in
|
||||
the specified or default namespace. For example:
|
||||
|
||||
```shell
|
||||
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
|
||||
frontend php-redis redis name=frontend 2
|
||||
```
|
||||
|
||||
You can also use `get rc NAME` to return information about a specific
|
||||
replication controller.
|
||||
|
||||
To view detailed information about a specific replication controller, use the
|
||||
`kubectl describe` command:
|
||||
|
||||
```shell
|
||||
$ kubectl describe rc NAME
|
||||
```
|
||||
|
||||
A successful describe request returns details about the replication controller
|
||||
including number and status of pods managed, and recent events:
|
||||
|
||||
```conf
|
||||
Name: frontend
|
||||
Namespace: default
|
||||
Image(s): gcr.io/google_samples/gb-frontend:v3
|
||||
Selector: name=frontend
|
||||
Labels: name=frontend
|
||||
Replicas: 2 current / 2 desired
|
||||
Pods Status: 2 Running / 0 Waiting / 0 Succeeded / 0 Failed
|
||||
Events:
|
||||
FirstSeen LastSeen Count From SubobjectPath Reason Message
|
||||
Fri, 06 Nov 2015 16:52:50 -0800 Fri, 06 Nov 2015 16:52:50 -0800 1 {replication-controller } SuccessfulCreate Created pod: frontend-gyx2h
|
||||
Fri, 06 Nov 2015 16:52:50 -0800 Fri, 06 Nov 2015 16:52:50 -0800 1 {replication-controller } SuccessfulCreate Created pod: frontend-vc9w4
|
||||
```
|
||||
|
||||
## Deleting replication controllers
|
||||
|
||||
To delete a replication controller as well as the pods that it controls, use
|
||||
`kubectl delete`:
|
||||
|
||||
```shell
|
||||
$ kubectl delete rc NAME
|
||||
```
|
||||
|
||||
By default, `kubectl delete rc` will resize the controller to zero (effectively
|
||||
deleting all pods) before deleting it.
|
||||
|
||||
To delete a replication controller without deleting its pods, use
|
||||
`kubectl delete` and specify `--cascade=false`:
|
||||
|
||||
```shell
|
||||
$ kubectl delete rc NAME --cascade=false
|
||||
```
|
||||
|
||||
A successful delete request returns the name of the deleted resource.
|
|
@ -1,119 +1,7 @@
|
|||
---
|
||||
assignees:
|
||||
- mikedanese
|
||||
title: Best Practices for Configuration
|
||||
---
|
||||
|
||||
This document is meant to highlight and consolidate in one place configuration best practices that are introduced throughout the user-guide and getting-started documentation and examples. This is a living document, so if you think of something that is not on this list but might be useful to others, please don't hesitate to file an issue or submit a PR.
|
||||
|
||||
## General Config Tips
|
||||
|
||||
- When defining configurations, specify the latest stable API version (currently v1).
|
||||
|
||||
- Configuration files should be stored in version control before being pushed to the cluster. This allows a configuration to be quickly rolled back if needed, and will aid with cluster re-creation and restoration if necessary.
|
||||
|
||||
- Write your configuration files using YAML rather than JSON. They can be used interchangeably in almost all scenarios, but YAML tends to be more user-friendly for config.
|
||||
|
||||
- Group related objects together in a single file where this makes sense. This format is often easier to manage than separate files. See the [guestbook-all-in-one.yaml](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/all-in-one/guestbook-all-in-one.yaml) file as an example of this syntax.
|
||||
(Note also that many `kubectl` commands can be called on a directory, and so you can also call
|
||||
`kubectl create` on a directory of config files— see below for more detail).
|
||||
|
||||
- Don't specify default values unnecessarily, in order to simplify and minimize configs, and to
|
||||
reduce error. For example, omit the selector and labels in a `ReplicationController` if you want
|
||||
them to be the same as the labels in its `podTemplate`, since those fields are populated from the
|
||||
`podTemplate` labels by default. See the [guestbook app's](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/) .yaml files for some [examples](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/frontend-deployment.yaml) of this.
|
||||
|
||||
- Put an object description in an annotation to allow better introspection.
|
||||
|
||||
|
||||
## "Naked" Pods vs Replication Controllers and Jobs
|
||||
|
||||
- If there is a viable alternative to naked pods (i.e., pods not bound to a [replication controller
|
||||
](/docs/user-guide/replication-controller)), go with the alternative. Naked pods will not be rescheduled in the
|
||||
event of node failure.
|
||||
|
||||
Replication controllers are almost always preferable to creating pods, except for some explicit
|
||||
[`restartPolicy: Never`](/docs/user-guide/pod-states/#restartpolicy) scenarios. A
|
||||
[Job](/docs/user-guide/jobs/) object (currently in Beta), may also be appropriate.
|
||||
|
||||
|
||||
## Services
|
||||
|
||||
- It's typically best to create a [service](/docs/user-guide/services/) before corresponding [replication
|
||||
controllers](/docs/user-guide/replication-controller/), so that the scheduler can spread the pods comprising the
|
||||
service. You can also create a replication controller without specifying replicas (this will set
|
||||
replicas=1), create a service, then scale up the replication controller. This can be useful in
|
||||
ensuring that one replica works before creating lots of them.
|
||||
|
||||
- Don't use `hostPort` (which specifies the port number to expose on the host) unless absolutely
|
||||
necessary, e.g., for a node daemon. When you bind a Pod to a `hostPort`, there are a limited
|
||||
number of places that pod can be scheduled, due to port conflicts— you can only schedule as many
|
||||
such Pods as there are nodes in your Kubernetes cluster.
|
||||
|
||||
If you only need access to the port for debugging purposes, you can use the [kubectl proxy and apiserver proxy](/docs/user-guide/connecting-to-applications-proxy/) or [kubectl port-forward](/docs/user-guide/connecting-to-applications-port-forward/).
|
||||
You can use a [Service](/docs/user-guide/services/) object for external service access.
|
||||
If you do need to expose a pod's port on the host machine, consider using a [NodePort](/docs/user-guide/services/#type-nodeport) service before resorting to `hostPort`.
|
||||
|
||||
- Avoid using `hostNetwork`, for the same reasons as `hostPort`.
|
||||
|
||||
- Use _headless services_ for easy service discovery when you don't need kube-proxy load balancing.
|
||||
See [headless services](/docs/user-guide/services/#headless-services).
|
||||
|
||||
## Using Labels
|
||||
|
||||
- Define and use [labels](/docs/user-guide/labels/) that identify __semantic attributes__ of your application or
|
||||
deployment. For example, instead of attaching a label to a set of pods to explicitly represent
|
||||
some service (e.g., `service: myservice`), or explicitly representing the replication
|
||||
controller managing the pods (e.g., `controller: mycontroller`), attach labels that identify
|
||||
semantic attributes, such as `{ app: myapp, tier: frontend, phase: test, deployment: v3 }`. This
|
||||
will let you select the object groups appropriate to the context— e.g., a service for all "tier:
|
||||
frontend" pods, or all "test" phase components of app "myapp". See the
|
||||
[guestbook](https://github.com/kubernetes/kubernetes/tree/{{page.githubbranch}}/examples/guestbook/) app for an example of this approach.
A service can be made to span multiple deployments, such as across [rolling updates](/docs/user-guide/kubectl/kubectl_rolling-update/), by omitting release-specific labels from its selector, rather than updating the service's selector to fully match the replication controller's selector.
- To facilitate rolling updates, include version info in replication controller names, e.g. as a suffix to the name. It is useful to set a 'version' label as well. The rolling update creates a new controller rather than modifying the existing controller, so version-agnostic controller names cause problems. See the [documentation](/docs/user-guide/kubectl/kubectl_rolling-update/) on the rolling-update command for more detail.
Note that the [Deployment](/docs/user-guide/deployments/) object obviates the need to manage replication controller 'version names'. A Deployment describes the desired state of an object, and if changes to that spec are _applied_, the deployment controller changes the actual state to the desired state at a controlled rate. (Deployment objects are currently part of the [`extensions` API Group](/docs/api/#api-groups).) A combined sketch appears after this list.
- You can manipulate labels for debugging. Because Kubernetes replication controllers and services select pods by their labels, removing the relevant labels from a pod takes it out of consideration by its controller and out of a service's traffic rotation. If you remove the labels of an existing pod, its controller will create a new pod to take its place. This is a useful way to debug a previously "live" pod in a quarantine environment. See the [`kubectl label`](/docs/user-guide/kubectl/kubectl_label/) command.
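Putting the label advice together, here is a rough sketch built around a hypothetical app `myapp`: the Deployment's pod template carries the semantic labels from the example above, including the release-specific `deployment: v3` label, while the service selector omits that label so the service spans releases. The Deployment is shown with the `extensions/v1beta1` API version current at the time of writing; the image name is a placeholder.

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: myapp-frontend-v3           # version suffix in the name, as suggested above
spec:
  replicas: 3
  template:
    metadata:
      labels:                       # semantic labels
        app: myapp
        tier: frontend
        phase: test
        deployment: v3              # release-specific label
    spec:
      containers:
      - name: frontend
        image: myimage:v3           # placeholder image
---
apiVersion: v1
kind: Service
metadata:
  name: myapp-frontend
spec:
  selector:                         # omits `deployment: v3`, so the service spans releases
    app: myapp
    tier: frontend
  ports:
  - port: 80
```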
## Container Images
- The [default container image pull policy](/docs/user-guide/images/) is `IfNotPresent`, which causes the [Kubelet](/docs/admin/kubelet/) to skip pulling an image that is already present on the node. If you want a pull to be forced every time, specify a pull policy of `Always` in your .yaml file (`imagePullPolicy: Always`) or use the `:latest` tag on your image; a sketch follows below.

That is, if you're specifying an image with a tag other than `:latest`, e.g. `myimage:v1`, and an updated image is pushed to that same tag, the Kubelet won't pull the updated image. You can address this by ensuring that any update to an image also bumps the image tag (e.g. `myimage:v2`) and that your configs point to the correct version.
**Note:** You should avoid using the `:latest` tag when deploying containers in production, because it makes it hard to track which version of the image is running and hard to roll back.
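As a small sketch (the image name is a placeholder), forcing a pull looks like this; as noted above, the more robust practice is to bump a versioned tag such as `myimage:v1` to `myimage:v2` on each release rather than relying on `:latest`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod                   # placeholder name
spec:
  containers:
  - name: myapp
    image: myimage:v1               # prefer a versioned tag over :latest
    imagePullPolicy: Always         # force a pull every time the container starts
```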
## Using kubectl
- Use `kubectl create -f <directory>` where possible. This looks for config objects in all `.yaml`, `.yml`, and `.json` files in `<directory>` and passes them to `create`.
- Use `kubectl delete` rather than `stop`. `Delete` has a superset of the functionality of `stop`, and `stop` is deprecated.
- Use kubectl bulk operations (via files and/or labels) for get and delete. See [label selectors](/docs/user-guide/labels/#label-selectors) and [using labels effectively](/docs/user-guide/managing-deployments/#using-labels-effectively).
- Use `kubectl run` and `kubectl expose` to quickly create and expose single-container Deployments. See the [quick start guide](/docs/user-guide/quick-start/) for an example.
{% include user-guide-content-moved.md %}
[Configuration Overview](/docs/concepts/configuration/overview/)